
Arbitrage Free Neural Network for Option Pricing

Tech stack: Python, PyTorch, NumPy, Pandas, Scikit-Learn

Why read this?

This project builds on the Anderson & Ulrych result that machine‑learned pricers can match traditional American option models with far greater speed. The real contribution here is showing that by penalizing the loss function with financial‑theory constraints, we can enforce arbitrage‑free behavior and produce a more stable, reliable, and economically consistent pricing model.

Introduction

This project was inspired by the paper “Accelerated American option pricing with deep neural networks” by Anderson and Ulrych, published in 2023. Given the extremely competitive nature of market-making, the ability to quote option prices near-instantaneously is essential to avoid being arbitraged. Traditional numerical methods are often too slow for this environment.

My goal was to replicate and extend this work by fitting an arbitrage-free Neural Network to the pricing task, prioritizing computational speed and model interpretability.
https://www.aimspress.com/article/doi/10.3934/QFE.2023011

Core concept: Deep Explainable Pricer

The paper proposes using a Deep Neural Network (DNN) as a function approximator to replace computationally slow numerical option pricing algorithms.

The Model Architecture:

  • The Problem: Pricing American-style options is complex because the holder has the right to exercise the option at any time before maturity, which requires solving an optimal exercise problem.
  • The Approach: The proposed DNN pricer is structured in three key steps:
    • Training Data Generation (Offline): Option prices are calculated using a known, accepted standard pricing method (such as the Finite Difference PDE method) across a dense grid of model parameters (e.g., Heston stochastic volatility parameters).
    • Learning (Offline): A feed-forward neural network is trained (calibrated) on this large, arbitrage-free parameter-price dataset. This allows the NN to learn the pricing map offline.
    • Pricing (Real-Time): The trained neural network is then applied in the market-making environment, offering near-instantaneous price evaluation.
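
As a toy illustration of this offline/online split (not the paper's actual setup), the sketch below tabulates prices from a deliberately slow stand-in pricer, fits a small scikit-learn regressor to the resulting grid, and then times both. The pricer formula, parameter ranges, and the 1 ms "numerical work" delay are all invented for illustration.

```python
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

def slow_reference_pricer(k, t, v):
    # Toy stand-in for a slow numerical pricer (NOT a real option model):
    # the 1 ms sleep simulates expensive grid-based computation.
    time.sleep(0.001)
    return max(100.0 - k, 0.0) + 0.4 * 100.0 * np.sqrt(v * t)

# Step 1 (offline): tabulate prices on a random parameter grid
rng = np.random.default_rng(0)
params = rng.uniform([90.0, 0.1, 0.02], [110.0, 1.0, 0.5], size=(200, 3))
prices = np.array([slow_reference_pricer(*p) for p in params])

# Step 2 (offline): learn the parameter -> price map
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
surrogate.fit(params, prices)

# Step 3 (online): near-instant batch evaluation vs. the slow pricer
query = params[:50]
t0 = time.perf_counter(); pred = surrogate.predict(query); t_nn = time.perf_counter() - t0
t0 = time.perf_counter(); ref = [slow_reference_pricer(*p) for p in query]; t_ref = time.perf_counter() - t0
print(f"Surrogate evaluated {len(pred)} inputs ~{t_ref / max(t_nn, 1e-9):.0f}x faster than the toy pricer")
```

The accuracy of the surrogate here is beside the point; the structure is what matters: all the expensive work happens in steps 1 and 2, so step 3 is just a cheap function evaluation.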

The Alpha Edge

By training the network on a well-known model (like the Heston model) and its parameter space, the resulting NN is not a “black box”. It inherits the parametric interpretability of the underlying financial model, which is necessary for compliance with regulatory guidelines (e.g., MiFID). This places the method in the area of eXplainable Artificial Intelligence (XAI) for finance.

Speed Accuracy Trade-off

The primary contribution of the paper is demonstrating a massive improvement in the speed-accuracy trade-off. While conventional methods calculate prices from scratch for every input, the NN simply evaluates a pre-trained function, leading to immense speed gains.

The paper compared the average evaluation speed of the deep explainable pricer to standard American option methods over 14,366 input sets. Even when compared at the same level of accuracy as the Finite Difference Method, the Neural Network pricer was found to be 58 times faster.

Experimental Setup

I tested four models in my experiment.

  • Random Forest Model
  • Gradient Boosting Model
  • Neural Network
  • Arbitrage Free Neural Network

Objective

My experiment tested which model was fastest and most accurate for the task of pricing American options.

Data Generation and Pre-processing

The dataset was created using the QuantLib library, with the Heston model as the pricing engine. I generated 20,000 data points, some of which were invalid. The invalid data was filtered out using the Feller condition and anomaly-detection techniques.

The Feller condition is a mathematical criterion used in stochastic volatility models, most notably the Heston model, that determines whether the instantaneous variance process can ever reach zero.

The Feller condition is: 2\kappa\theta > \sigma^2
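
As a quick sanity check, the condition can be evaluated directly (the parameter values below are made up for illustration):

```python
def feller_satisfied(kappa, theta, sigma):
    """Feller condition: variance stays strictly positive when 2*kappa*theta > sigma**2."""
    return 2 * kappa * theta > sigma**2

# Hypothetical parameter sets for illustration:
print(feller_satisfied(kappa=2.0, theta=0.04, sigma=0.25))  # 0.16 > 0.0625 -> True
print(feller_satisfied(kappa=0.5, theta=0.04, sigma=0.50))  # 0.04 > 0.25   -> False
```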

During model fitting on the raw dataset, I encountered a problem where certain call option prices became unrealistically large. Further investigation revealed that this behavior stems from parameter combinations such as a very high volatility-of-volatility (σ), a large long-term variance (θ), and a low mean-reversion speed (κ). Together, these values generate an “explosive” variance process that is extremely difficult to stabilize and accurately capture within the model.

To address this, I dropped any row where the call price exceeded the stock price plus the strike.

import QuantLib as ql
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm  # For the progress bar
import time

def price_american_heston_fdm(S0, K, T, r, q, v0, kappa, theta, sigma, rho):
    """
    Prices an American Call Option using the FDHestonVanillaEngine.
    
    Parameters:
    v0: Initial Variance
    kappa: Mean Reversion Speed
    theta: Mean Reversion Level
    sigma: Volatility of Variance
    rho: Correlation
    """
    
    # We set a fixed date for consistency in our calculations
    today = ql.Date.todaysDate()
    ql.Settings.instance().evaluationDate = today
    
    day_count = ql.Actual365Fixed()
    # Interest rate Handle
    r_ts = ql.YieldTermStructureHandle(ql.FlatForward(today, r, day_count))
    # Dividend yield Handle
    q_ts = ql.YieldTermStructureHandle(ql.FlatForward(today, q, day_count))
    # Spot Price Handle
    spot_handle = ql.QuoteHandle(ql.SimpleQuote(S0))

    # Setup Option Contract
    # QuantLib needs an integer number of days
    expiry_date = today + ql.Period(int(T * 365), ql.Days)
    payoff = ql.PlainVanillaPayoff(ql.Option.Call, K)
    exercise = ql.AmericanExercise(today, expiry_date)
    # Combining the payoff and exercise to create the option
    option = ql.VanillaOption(payoff, exercise)

    # Setup Heston Model
    # Note the correct parameter order: (v0, kappa, theta, sigma, rho)
    heston_process = ql.HestonProcess(r_ts, q_ts, spot_handle, v0, kappa, theta, sigma, rho)
    heston_model = ql.HestonModel(heston_process)

    # Setup the FDM Engine
    # These grid sizes are kept fairly small for speed
    # For a more accurate price, you might use (tGrid=100, xGrid=200, vGrid=50)
    tGrid = 50  # Time steps
    xGrid = 50  # Asset (spot) steps
    vGrid = 50  # Variance steps
    
    # Create the engine
    heston_engine = ql.FdHestonVanillaEngine(heston_model, tGrid, xGrid, vGrid)
    option.setPricingEngine(heston_engine)

    # Price the option
    # We use try/except to catch convergence errors, because some parameter
    # sets fail to converge even if they are "valid"
    try:
        price = option.NPV()
        return price
    except RuntimeError:
        return np.nan  # Return NaN if the C++ solver fails

def batch_price_american_heston(df, pricing_function, S0):
    """
    Batch-prices American call options using a Heston FDM engine.
    
    Parameters:
    - df: DataFrame with columns ['K', 'T', 'r', 'q', 'v0', 'kappa', 'theta', 'sigma', 'rho']
    - pricing_function: function that takes S0 and Heston parameters and returns a price
    - S0: initial stock price
    
    Returns:
    - df: original DataFrame with an added 'Call_Price' column
    """
    prices = []
    start_time = time.time()

    for row in tqdm(df.itertuples(), total=len(df), desc="Pricing Options"):
        c_price = pricing_function(
            S0=S0,
            K=row.K,
            T=row.T,
            r=row.r,
            q=row.q,
            v0=row.v0,
            kappa=row.kappa,
            theta=row.theta,
            sigma=row.sigma,
            rho=row.rho
        )
        prices.append(c_price)

    df['Call_Price'] = prices

    end_time = time.time()
    print(f"\n--- Generation Complete ---")
    print(f"Total time taken: {end_time - start_time:.2f} seconds")
    print(f"Average time per sample: {(end_time - start_time) / len(df) * 1000:.2f} milliseconds")

    return df

# Generating Data
N = 20_000    # Number of Samples
N_large = int(N * 1.5)  # Generate 50% more to filter because some may not converge
print(f"Generating {N_large} initial parameter sets...")

# Market Parameters
S = 100.00 # Stock Price
K = np.random.uniform(90, 110, N_large) # Strike Prices
m = K/S  # Moneyness
q = np.random.uniform(0.01, 0.03, N_large)  # Dividend Yield
r = np.random.uniform(0.01, 0.06, N_large) # Risk free rate
t = np.random.uniform(0.1, 1, N_large)  # Assuming t is in years, so 0.1 to 1 year

# Heston Parameters 
v0    = np.random.uniform(0.02, 0.5, N_large)
theta = np.random.uniform(0.01, 2.0, N_large)
kappa = np.random.uniform(0.01, 2.0, N_large)
sigma = np.random.uniform(0.01, 1.0, N_large)
rho   = np.random.uniform(-0.5, 0.0, N_large)

# Create a DataFrame
df = pd.DataFrame({
    'S': S,
    'K': K, 'T': t, 'r': r, 'q': q,
    'v0': v0, 'kappa': kappa, 'theta': theta, 'sigma': sigma, 'rho': rho
})

# Test for Feller Condition
# Feller condition: 2 * kappa * theta > sigma^2
print("Testing for Feller condition...")
feller_condition = 2 * df['kappa'] * df['theta'] > df['sigma']**2

df_valid = df[feller_condition].copy()

# Check if we have enough valid samples and take the first N
if len(df_valid) >= N:
    df_final = df_valid.head(N).reset_index(drop=True)
    print(f"Successfully filtered {len(df_final)} samples satisfying the Feller condition.")
else:
    print(f"Warning: Only found {len(df_valid)} valid samples. Re-run or increase N_large.")
    df_final = df_valid.reset_index(drop=True) # Continue with what we have

df_final = batch_price_american_heston(df_final, price_american_heston_fdm, S0=S)

If the call price is larger than the stock price plus the strike, the price violates a basic no-arbitrage bound (a call can never even be worth more than the underlying stock itself, so S + K is a very loose sanity ceiling).

S = 100.0
df_final.replace([np.inf, -np.inf], np.nan, inplace=True)

# You can also filter out any price that is clearly nonsensical
max_reasonable_price =  df_final['K']+S # An option can't be worth more than the strike + stock
df_final.loc[df_final['Call_Price'] > max_reasonable_price, 'Call_Price'] = np.nan

# Now drop all NaNs (from this or from the solver)
df_clean = df_final.dropna()

print(f"Original samples: {len(df_final)}")
print(f"Clean samples: {len(df_clean)}")

In words: If the call price is larger than the strike plus stock price, we are potentially looking at an outlier.

from sklearn.model_selection import train_test_split
# Create features and target
X = df_final.drop(columns=['Call_Price'])
y = df_final['Call_Price']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Evaluation was done on two aspects:

  • Accuracy : Mean Absolute Error (MAE) – Quantifies how closely our model’s prices match the Quantlib price
  • Speed : Average Evaluation Time (in μs or ms) – Crucial for the market-making argument; demonstrates the advantage of the NN.

The architecture of the feed-forward Neural Network is as follows.

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class NeuralNetwork(nn.Module):
    def __init__(self, input_size):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Training

Training was done for 100 epochs with a batch size of 32.

from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.tensor(X_train.values, dtype=torch.float32), 
                        torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

nn_model = NeuralNetwork(input_size=X_train.shape[1])
criterion = nn.MSELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001)

# Training Loop
for epoch in range(100):
    epoch_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = nn_model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Avg Loss: {epoch_loss / len(train_loader):.4f}')

Evaluation

Evaluation was done on four metrics: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-squared.

val_dataset = TensorDataset(torch.tensor(X_test.values, dtype=torch.float32), 
                            torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1))
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

from torchmetrics import MeanAbsoluteError, MeanSquaredError, R2Score
import torch

# 1. Initialize metrics
mae_metric = MeanAbsoluteError().to(device)
mse_metric = MeanSquaredError(squared=True).to(device)  # For MSE
rmse_metric = MeanSquaredError(squared=False).to(device) # For RMSE
r2_metric = R2Score().to(device)

# 2. Evaluation loop
nn_model.to(device)
nn_model.eval()
with torch.no_grad():
    for inputs, targets in val_loader:
        # 3. Move data to device
        inputs = inputs.to(device)
        targets = targets.to(device)

        outputs = nn_model(inputs)
        
        # 4. Fix potential shape mismatch
        targets = targets.view_as(outputs)

        # 5. Update metrics
        mae_metric.update(outputs, targets)
        mse_metric.update(outputs, targets)
        rmse_metric.update(outputs, targets)
        r2_metric.update(outputs, targets)

# 6. Compute final values
mae = mae_metric.compute().item()
mse = mse_metric.compute().item()
rmse = rmse_metric.compute().item()  # already the square root of MSE
r2 = r2_metric.compute().item()

# 7. Reset metrics
mae_metric.reset()
mse_metric.reset()
rmse_metric.reset()
r2_metric.reset()

# 8. Print results
print("\n--- Neural Network Model Evaluation ---")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2 ): {r2:.4f}")

Arbitrage Free Neural Network

Description

A No-Arbitrage Neural Network is a standard NN trained with a custom loss function that enforces financial theory.

The total loss is a weighted sum:  L_{total} = L_{MSE} + \lambda \cdot L_{arbitrage\_penalty}

L_{MSE} (Accuracy): The normal MSELoss(V_pred, Y_true) to match our data.

L_{arbitrage\_penalty} (Rationality): This is a “financial” regularization. Using autograd, we penalize the model if it violates key no-arbitrage rules.


The Neural Network Architecture

class ArbitrageFreeHestonNN(nn.Module):
    def __init__(self, in_features, hidden_dim=256, out_features=1):
        super(ArbitrageFreeHestonNN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_features)
        )

    def forward(self, x):
        # x is now UN-SCALED
        return self.net(x)
    

The Loss Function

This hybrid loss creates a model that is both accurate and financially trustworthy.

Intrinsic Value Constraint: The option price (V) can never be less than its immediate exercise value (intrinsic value). If it were, you could buy it, exercise it, and make free money.

    – Rule:  V \ge \max(S-K, 0)

    – Penalty (L_{intrinsic}): We penalize the network if V_pred < intrinsic_value.

Strike Convexity Constraint: The price of a call option must be “convex” with respect to the strike price (K).

    – This rule prevents a simple arbitrage called a butterfly spread.

    – Rule: \frac{\partial^2 V}{\partial K^2} \ge 0 (The price curve must be “smiley-faced”).

    – Penalty (L_{convexity}): We use autograd to compute the second derivative \frac{\partial^2 V}{\partial K^2} and heavily penalize the network if this value is negative.
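
The double call to torch.autograd.grad used for this penalty can be demystified on a one-variable toy function (y = x^3, so the true second derivative is 6x). This is just a sketch of the mechanics, not the pricing model:

```python
import torch

# y = x**3  =>  dy/dx = 3x^2,  d2y/dx2 = 6x
x = torch.tensor([[3.0]], requires_grad=True)
y = x ** 3

# create_graph=True keeps the graph of the first derivative so that
# it can itself be differentiated a second time
dy_dx = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]
d2y_dx2 = torch.autograd.grad(dy_dx, x, grad_outputs=torch.ones_like(dy_dx), create_graph=True)[0]

# Convexity-style penalty: only negative curvature is penalized
penalty = torch.mean(torch.relu(-d2y_dx2) ** 2)

print(dy_dx.item(), d2y_dx2.item(), penalty.item())  # 27.0 18.0 0.0
```

In the training loop the same pattern is applied to V_pred with respect to the strike column of the input batch, which is why X_batch must have requires_grad enabled.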

Final Loss Function (No-Arbitrage NN Total Loss Function)

The total loss, L_{total}, is a weighted sum of three separate loss components:

L_{total} = L_{MSE} + \lambda_{int} \cdot L_{intrinsic} + \lambda_{conv} \cdot L_{convexity}

Training

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

LR = 1e-3
# Loss function weights
# These are the most important hyperparameters to tune
lambda_intrinsic = 1.0   # Weight for intrinsic value constraint
lambda_convexity = 0.01  # Weight for strike convexity constraint

N_EPOCHS = 200
model = ArbitrageFreeHestonNN(in_features=X_train.shape[1]).to(device)
optimizer = optim.Adam(model.parameters(), lr=LR)
mse_loss_fn = nn.MSELoss()

for epoch in range(N_EPOCHS):
    model.train()
    total_loss_epoch = 0
    
    avg_raw_mse = 0
    avg_raw_intrinsic = 0
    avg_raw_convexity = 0
    for X_batch, Y_batch in train_loader:
        X_batch = X_batch.to(device)
        Y_batch = Y_batch.to(device)
        
        # Enable grad on the unscaled batch
        X_batch.requires_grad = True
        
        # Forward Pass (on unscaled data)
        optimizer.zero_grad()
        V_pred = model(X_batch)
        
        # Calculate Loss Components (on unscaled, real values)
        
        Y_batch = Y_batch.view(-1, 1)
        
        # 1. L_MSE: Standard price-matching loss
        loss_mse = mse_loss_fn(V_pred, Y_batch)
        
        # 2. L_intrinsic: American Early Exercise Constraint
        # V(t) must be >= max(S-K, 0) for a Call
        S = X_batch[:, 0].view(-1, 1) # Assumes S is col 0
        K = X_batch[:, 1].view(-1, 1) # Assumes K is col 1


        intrinsic_value = torch.relu(S - K)
        loss_intrinsic = torch.mean(torch.relu(intrinsic_value - V_pred)**2)
        
        # 3. L_convexity: Penalize d2V/dK2 < 0
        V_X = torch.autograd.grad(
            V_pred, X_batch, 
            grad_outputs=torch.ones_like(V_pred), 
            create_graph=True
        )[0]
        V_K = V_X[:, 1].view(-1, 1) # K is col 1
        
        V_KX = torch.autograd.grad(
            V_K, X_batch, 
            grad_outputs=torch.ones_like(V_K), 
            create_graph=True
        )[0]
        V_KK = V_KX[:, 1].view(-1, 1) # K is col 1
        
        loss_convexity = torch.mean(torch.relu(-V_KK)**2)

        # Total Loss and Backward Pass
        
        avg_raw_mse += loss_mse.item()
        avg_raw_intrinsic += loss_intrinsic.item()
        avg_raw_convexity += loss_convexity.item()

        total_loss = (
            loss_mse 
            + lambda_intrinsic * loss_intrinsic
            + lambda_convexity * loss_convexity
        )
        
        total_loss.backward()
        optimizer.step()
        total_loss_epoch += total_loss.item()
    
    if (epoch+1) % 10 == 0:
        n_batches = len(train_loader)
        print(f"Epoch {epoch+1}/{N_EPOCHS}, "
              f"Avg Train Loss: {total_loss_epoch / n_batches:.4e}, "
              f"MSE: {avg_raw_mse / n_batches:.4e}, "
              f"Intrinsic: {avg_raw_intrinsic / n_batches:.4e}, "
              f"Convexity: {avg_raw_convexity / n_batches:.4e}")

print("Training finished.")

Evaluation

# Initialize metrics from torchmetrics
val_mae = MeanAbsoluteError().to(device)
val_mse = MeanSquaredError().to(device)
val_rmse = MeanSquaredError(squared=False).to(device)
val_r2 = R2Score().to(device)

model.eval()

with torch.no_grad():
    for X_val_batch, Y_val_batch in val_loader:
        X_val_batch = X_val_batch.to(device)
        Y_val_batch = Y_val_batch.to(device)

        # --- No scaling ---
        V_val_pred = model(X_val_batch)

        Y_val_batch = Y_val_batch.view(-1, 1)

        # Update metrics
        val_mae.update(V_val_pred, Y_val_batch)
        val_mse.update(V_val_pred, Y_val_batch)
        val_rmse.update(V_val_pred, Y_val_batch)
        val_r2.update(V_val_pred, Y_val_batch)

# Compute final metrics
mae_value = val_mae.compute().item()
mse_value = val_mse.compute().item()
rmse_value = val_rmse.compute().item()
r2_value = val_r2.compute().item()

print("\n--- Arbitrage-Free Heston NN Model Evaluation ---")
print(f"Mean Absolute Error (MAE):    {mae_value:.4f}")
print(f"Mean Squared Error (MSE):     {mse_value:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_value:.4f}")
print(f"R-squared (R2):               {r2_value:.4f}")

Accuracy Results

Model                         | MAE    | MSE    | R2
Random Forest                 | 0.7734 | 2.5483 | 0.9585
Gradient Boosting             | 0.6044 | 2.4298 | 0.9604
Neural Network                | 0.7130 | 2.3253 | 0.9621
Arbitrage Free Neural Network | 1.3302 | 4.4461 | 0.9276

The higher MAE for the Arbitrage Free Neural Network is likely due to the strict constraints imposed on the loss function. Adjusting the weights given to the financial-constraint loss components could help the model fit the data better.

This can be achieved by hyper-parameter tuning.
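
One simple way to do this is a grid search over the two penalty weights. The sketch below is purely illustrative: train_and_eval is a hypothetical stand-in (here a mock scoring surface) for retraining the arbitrage-free NN with the given weights and returning its validation MAE.

```python
import itertools

def train_and_eval(lambda_intrinsic, lambda_convexity):
    # Hypothetical stand-in: in the real experiment this would rerun the
    # training loop above with these weights and return validation MAE.
    # A mock surface is used here so the sketch is runnable.
    return abs(lambda_intrinsic - 1.0) + abs(lambda_convexity - 0.01)

grid_intrinsic = [0.1, 1.0, 10.0]
grid_convexity = [0.001, 0.01, 0.1]

best = None
for li, lc in itertools.product(grid_intrinsic, grid_convexity):
    mae = train_and_eval(li, lc)
    if best is None or mae < best[0]:
        best = (mae, li, lc)

print(f"Best (mock) MAE {best[0]:.4f} at lambda_intrinsic={best[1]}, lambda_convexity={best[2]}")
```

Because each grid point requires a full retraining run, a coarse grid (or random search) over log-spaced weights is the practical starting point.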

Prediction Speed

To carry out the prediction-speed analysis, the following code was used. Instead of timing the prediction on a single sample, I took 100 samples and calculated the average prediction time over them.

def benchmark_prediction_speed(model_name, model, X_test_sample, device):
    """
    Measures the average prediction speed for a single data point.

    Args:
        model_name (str): The name of the model (for printing).
        model (torch.nn.Module or sklearn model): The model to test.
        X_test_sample (torch.Tensor or np.ndarray): 100 data points to test.
        device (torch.device): The device (cpu or cuda).
    """
    
    # --- PyTorch Model Handling ---
    if isinstance(model, torch.nn.Module):
        model.eval()
        model.to(device)
        
        # Ensure data is on the correct device
        if not isinstance(X_test_sample, torch.Tensor):
            X_test_sample_torch = torch.tensor(X_test_sample, dtype=torch.float32).to(device)
        else:
            X_test_sample_torch = X_test_sample.to(device)

        times = []
        with torch.no_grad():
            for i in range(len(X_test_sample_torch)):
                # Get a single data point, but keep it as a [1, N_features] batch
                single_point = X_test_sample_torch[i].unsqueeze(0)
                
                # GPU Synchronization for accurate timing
                if device.type == 'cuda':
                    torch.cuda.synchronize()
                
                start_time = time.perf_counter()
                
                # --- Predict ---
                _ = model(single_point)
                
                if device.type == 'cuda':
                    torch.cuda.synchronize()
                
                end_time = time.perf_counter()
                times.append(end_time - start_time)
                
    # --- Sklearn Model Handling (RF, GB) ---
    else:
        
        times = []
        for i in range(len(X_test_sample)):
            # Get a single data point
            # print(X_test_sample)
            single_point = X_test_sample.iloc[[i]]
            
            start_time = time.perf_counter()
            
            # --- Predict ---
            _ = model.predict(single_point)
            
            end_time = time.perf_counter()
            times.append(end_time - start_time)

    # Calculate average time in milliseconds
    avg_time_ms = (np.mean(times) * 1000)
    print(f"--- {model_name} Benchmark ---")
    print(f"Average prediction time: {avg_time_ms:.6f} ms (per single data point)")
    return avg_time_ms

Model                         | Mean time elapsed (s) | Relative speed vs. Neural Network
Neural Network                | 0.00020               | baseline
Arbitrage Free Neural Network | 0.00021               | 1.05 times slower
Gradient Boosting             | 0.00086               | 4.43 times slower
Random Forest                 | 0.01380               | 68.2 times slower

The Gradient Boosting model achieved the lowest Mean Absolute Error (MAE), confirming its superior predictive accuracy on this dataset. However, the standard Neural Network (NN) provides a far more practical solution for market making. By tuning the Arbitrage Free Neural Network further, we might achieve better accuracy at a similar level of efficiency.

By accepting a minor increase in MAE (from 0.6044 to 0.7130), the standard Neural Network achieves an evaluation speed 4.43 times faster than the Gradient Boosting model and 68.2 times faster than the Random Forest.

Final Conclusion

For a quantitative firm aiming for a low-latency market-making edge, the speed advantage offered by the Neural Network is decisive. Its ability to achieve competitive accuracy while reducing evaluation time to 0.00020 seconds validates the core hypothesis of deep learning in finance: computational speed unlocks new trading opportunities where traditional methods are simply too slow. Further optimization through architecture reduction (fewer layers) promises to enhance this speed advantage even further.
