Why read this?
This project builds on the Anderson & Ulrych result that machine‑learned pricers can match traditional American option models with far greater speed. The real contribution here is showing that by penalizing the loss function with financial‑theory constraints, we can enforce arbitrage‑free behavior and produce a more stable, reliable, and economically consistent pricing model.
Introduction
This project was inspired by the paper “Accelerated American option pricing with deep neural networks” by Anderson and Ulrych published in 2023. Given the extremely competitive nature of market-making, the ability to quote option prices near-instantaneously is essential to avoid arbitrage. Traditional numerical methods are often too slow for this environment.
My goal was to replicate and extend this work by fitting an arbitrage-free Neural Network to the pricing task, prioritizing computational speed and model interpretability.
https://www.aimspress.com/article/doi/10.3934/QFE.2023011
Core concept: Deep Explainable Pricer
The paper proposes using a Deep Neural Network (DNN) as a function approximator to replace computationally slow numerical option pricing algorithms.
The Model Architecture:
- The Problem: Pricing American-style options is complex because the holder has the right to exercise the option at any time before maturity, which requires solving an optimal exercise decision.
- The Approach: The proposed DNN pricer is structured in three key steps:
- Training Data Generation (Offline): Option prices are calculated using a known, accepted standard pricing method (like the Finite Difference PDE method) across a dense grid of model parameters (e.g., Heston Stochastic Volatility parameters).
- Learning (Offline): A feed-forward neural network is trained (calibrated) over this large, arbitrage-free parameter-price dataset. This allows the NN to learn the pricing map offline.
- Pricing (Real-Time): The trained neural network is applied in the market-making environment, offering near-instantaneous price evaluation.
The Alpha Edge
By training the network on a well-known model (like the Heston model) and its parameter space, the resulting NN is not a “black box”. It inherits the parametric interpretability of the underlying financial model, which is necessary for compliance with regulatory guidelines (e.g., MiFID). This places the method in the area of eXplainable Artificial Intelligence (XAI) for finance.
Speed-Accuracy Trade-off
The primary contribution of the paper is demonstrating a massive improvement in the speed-accuracy trade-off. While conventional methods calculate prices from scratch for every input, the NN simply evaluates a pre-trained function, leading to immense speed gains.
The paper compared the average evaluation speed of the deep explainable pricer to standard American option methods, based on a calculation over 14,366 input sets.
Even when compared at the same level of accuracy as the Finite Difference Method, the Neural Network pricer was found to be 58 times faster.
Experimental Setup
I tested four models in my experiment.
- Random Forest Model
- Gradient Boosting Model
- Neural Network
- Arbitrage Free Neural Network
Objective
My experiment was to test which model was fastest and most accurate for the task of pricing American options.
Data Generation and Pre-processing
The dataset was created using the QuantLib library, with the Heston model as the pricing engine. I generated 20,000 data points, some of which were invalid. The invalid data was filtered out using the Feller condition and anomaly-detection techniques.
The Feller condition is a mathematical criterion used in stochastic volatility models, most notably the Heston model, that determines whether the instantaneous variance process can ever reach zero.
The Feller condition is: 2κθ > σ², where κ is the mean-reversion speed, θ the long-run variance, and σ the volatility of variance.
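As a quick sanity check, the condition can be tested in vectorized form. This is a minimal sketch with made-up parameter values; the names mirror those used in the data-generation code later in this post.

```python
import numpy as np

# Heston variance parameters for three candidate samples (illustrative values)
kappa = np.array([2.0, 0.5, 1.0])    # mean-reversion speed
theta = np.array([0.04, 0.04, 0.5])  # long-run variance
sigma = np.array([0.3, 0.4, 0.9])    # volatility of variance

# Feller condition: 2*kappa*theta > sigma**2 keeps the variance process
# strictly positive (it can never hit zero)
feller_ok = 2 * kappa * theta > sigma ** 2
print(feller_ok.tolist())  # [True, False, True]
```

Only the middle sample fails: 2 · 0.5 · 0.04 = 0.04, which is below σ² = 0.16, so its variance process could reach zero.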
During model fitting on the raw dataset, I encountered a problem where certain call option prices became unrealistically large. Further investigation revealed that this behavior stems from parameter combinations such as a very high volatility-of-volatility (σ), a large long-term variance (θ), and a low mean-reversion speed (κ). Together, these values generate an “explosive” variance process that is extremely difficult to stabilize and accurately capture within the model.
To address this, I dropped any sample whose call price exceeded the strike plus the stock price (Call_Price > K + S).
import QuantLib as ql
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm  # For the progress bar
import time

def price_american_heston_fdm(S0, K, T, r, q, v0, kappa, theta, sigma, rho):
    """
    Prices an American Call Option using the FdHestonVanillaEngine.

    Parameters:
        v0: Initial Variance
        kappa: Mean Reversion Speed
        theta: Mean Reversion Level
        sigma: Volatility of Variance
        rho: Correlation
    """
    # We set a fixed date for consistency in our calculations
    today = ql.Date.todaysDate()
    ql.Settings.instance().evaluationDate = today
    day_count = ql.Actual365Fixed()

    # Interest rate handle
    r_ts = ql.YieldTermStructureHandle(ql.FlatForward(today, r, day_count))
    # Dividend yield handle
    q_ts = ql.YieldTermStructureHandle(ql.FlatForward(today, q, day_count))
    # Spot price handle
    spot_handle = ql.QuoteHandle(ql.SimpleQuote(S0))

    # Set up the option contract
    # QuantLib needs an integer number of days
    expiry_date = today + ql.Period(int(T * 365), ql.Days)
    payoff = ql.PlainVanillaPayoff(ql.Option.Call, K)
    exercise = ql.AmericanExercise(today, expiry_date)
    # Combine the payoff and exercise to create the option
    option = ql.VanillaOption(payoff, exercise)

    # Set up the Heston model
    # Note the correct parameter order: (v0, kappa, theta, sigma, rho)
    heston_process = ql.HestonProcess(r_ts, q_ts, spot_handle, v0, kappa, theta, sigma, rho)
    heston_model = ql.HestonModel(heston_process)

    # Set up the FDM engine
    # These grid sizes are VERY SMALL for speed.
    # For a "real" accurate price, you might use (tGrid=100, xGrid=200, vGrid=50)
    tGrid = 50  # Time steps
    xGrid = 50  # Asset (spot) steps
    vGrid = 50  # Variance steps
    heston_engine = ql.FdHestonVanillaEngine(heston_model, tGrid, xGrid, vGrid)
    option.setPricingEngine(heston_engine)

    # Price the option
    # We use try/except to catch convergence errors, because some parameter
    # sets fail to converge even when they are "valid"
    try:
        return option.NPV()
    except RuntimeError:
        return np.nan  # Return NaN if the C++ solver fails
def batch_price_american_heston(df, pricing_function, S0):
    """
    Batch-prices American call options using a Heston FDM engine.

    Parameters:
        df: DataFrame with columns ['K', 'T', 'r', 'q', 'v0', 'kappa', 'theta', 'sigma', 'rho']
        pricing_function: function that takes S0 and Heston parameters and returns a price
        S0: initial stock price

    Returns:
        df: original DataFrame with an added 'Call_Price' column
    """
    prices = []
    start_time = time.time()
    for row in tqdm(df.itertuples(), total=len(df), desc="Pricing Options"):
        c_price = pricing_function(
            S0=S0,
            K=row.K,
            T=row.T,
            r=row.r,
            q=row.q,
            v0=row.v0,
            kappa=row.kappa,
            theta=row.theta,
            sigma=row.sigma,
            rho=row.rho
        )
        prices.append(c_price)
    df['Call_Price'] = prices
    end_time = time.time()
    print("\n--- Generation Complete ---")
    print(f"Total time taken: {end_time - start_time:.2f} seconds")
    print(f"Average time per sample: {(end_time - start_time) / len(df) * 1000:.2f} milliseconds")
    return df

Generating Data
N = 20_000  # Number of samples
N_large = int(N * 1.5)  # Generate 50% more to allow for filtering, since some may not converge
print(f"Generating {N_large} initial parameter sets...")

# Market parameters
S = 100.00  # Stock price
K = np.random.uniform(90, 110, N_large)  # Strike prices
m = K / S  # Moneyness
q = np.random.uniform(0.01, 0.03, N_large)  # Dividend yield
r = np.random.uniform(0.01, 0.06, N_large)  # Risk-free rate
t = np.random.uniform(0.1, 1, N_large)  # Maturity in years, 0.1 to 1 year

# Heston parameters
v0 = np.random.uniform(0.02, 0.5, N_large)
theta = np.random.uniform(0.01, 2.0, N_large)
kappa = np.random.uniform(0.01, 2.0, N_large)
sigma = np.random.uniform(0.01, 1.0, N_large)
rho = np.random.uniform(-0.5, 0.0, N_large)

# Create a DataFrame
df = pd.DataFrame({
    'S': S,
    'K': K, 'T': t, 'r': r, 'q': q,
    'v0': v0, 'kappa': kappa, 'theta': theta, 'sigma': sigma, 'rho': rho
})

# Test for the Feller condition: 2 * kappa * theta > sigma^2
print("Testing for Feller condition...")
feller_condition = 2 * df['kappa'] * df['theta'] > df['sigma']**2
df_valid = df[feller_condition].copy()

# Check whether we have enough valid samples, and take the first N
if len(df_valid) >= N:
    df_final = df_valid.head(N).reset_index(drop=True)
    print(f"Successfully filtered {len(df_final)} samples satisfying the Feller condition.")
else:
    print(f"Warning: Only found {len(df_valid)} valid samples. Re-run or increase N_large.")
    df_final = df_valid.reset_index(drop=True)  # Continue with what we have

df_final = batch_price_american_heston(df_final, price_american_heston_fdm, S0=S)

If the call price is larger than the strike plus the stock price, we are violating the most basic, fundamental arbitrage condition (specifically, the maximum possible value boundary for a call option).
S = 100.0
df_final.replace([np.inf, -np.inf], np.nan, inplace=True)

# We can also filter out any price that is clearly nonsensical
max_reasonable_price = df_final['K'] + S  # An option can't be worth more than the strike + stock
df_final.loc[df_final['Call_Price'] > max_reasonable_price, 'Call_Price'] = np.nan

# Now drop all NaNs (from this filter or from the solver)
df_clean = df_final.dropna()
print(f"Original samples: {len(df_final)}")
print(f"Clean samples: {len(df_clean)}")

In words: if the call price is larger than the strike plus the stock price, we are potentially looking at an outlier.
from sklearn.model_selection import train_test_split

# Create features and target (using the cleaned dataset, not the raw one)
X = df_clean.drop(columns=['Call_Price'])
y = df_clean['Call_Price']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Evaluation was done on two aspects:
- Accuracy : Mean Absolute Error (MAE) – Quantifies how closely our model’s prices match the Quantlib price
- Speed : Average Evaluation Time (in μs or ms) – Crucial for the market-making argument; demonstrates the advantage of the NN.
The architecture of the feed-forward Neural Network is as follows.
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class NeuralNetwork(nn.Module):
    def __init__(self, input_size):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Training
The training was done for 100 epochs with a batch size of 32.
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.tensor(X_train.values, dtype=torch.float32),
                        torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

nn_model = NeuralNetwork(input_size=X_train.shape[1])
criterion = nn.MSELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    epoch_loss = 0.0  # Reset the running loss each epoch
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = nn_model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Avg Loss: {epoch_loss / len(train_loader):.4f}')

Evaluation
Evaluation was done on four metrics: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-squared.
val_dataset = TensorDataset(torch.tensor(X_test.values, dtype=torch.float32),
                            torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1))
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

from torchmetrics import MeanAbsoluteError, MeanSquaredError, R2Score

# 1. Initialize metrics
mae_metric = MeanAbsoluteError().to(device)
mse_metric = MeanSquaredError(squared=True).to(device)    # MSE
rmse_metric = MeanSquaredError(squared=False).to(device)  # RMSE
r2_metric = R2Score().to(device)

# 2. Evaluation loop
nn_model.to(device)
nn_model.eval()
with torch.no_grad():
    for inputs, targets in val_loader:
        # 3. Move data to the device
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = nn_model(inputs)
        # 4. Fix potential shape mismatch
        targets = targets.view_as(outputs)
        # 5. Update metrics
        mae_metric.update(outputs, targets)
        mse_metric.update(outputs, targets)
        rmse_metric.update(outputs, targets)
        r2_metric.update(outputs, targets)

# 6. Compute final values
mae = mae_metric.compute().item()
mse = mse_metric.compute().item()
rmse = rmse_metric.compute().item()  # No manual np.sqrt() needed
r2 = r2_metric.compute().item()

# 7. Reset metrics
mae_metric.reset()
mse_metric.reset()
rmse_metric.reset()
r2_metric.reset()

# 8. Print results
print("\n--- Neural Network Model Evaluation ---")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2): {r2:.4f}")

Arbitrage Free Neural Network
Description
A No-Arbitrage Neural Network is a standard NN trained with a custom loss function that enforces financial theory.

The total loss is a weighted sum: L_total = L_MSE + L_no-arb

- L_MSE (Accuracy): The normal MSELoss(V_pred, Y_true) to match our data.
- L_no-arb (Rationality): This is a "financial" regularization. Using autograd, we penalize the model if it violates key no-arbitrage rules.
The Neural Network Architecture
class ArbitrageFreeHestonNN(nn.Module):
    def __init__(self, in_features, hidden_dim=256, out_features=1):
        super(ArbitrageFreeHestonNN, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_features)
        )

    def forward(self, x):
        # x is UN-scaled
        return self.net(x)
The Loss Function
This hybrid loss creates a model that is both accurate and financially trustworthy.
- Intrinsic Value Constraint: The option price V can never be less than its immediate exercise value (intrinsic value). If it were, you could buy it, exercise it, and make free money.
  - Rule: V ≥ max(S − K, 0)
  - Penalty: We penalize the network if V_pred < intrinsic_value.
- Strike Convexity Constraint: The price of a call option must be convex with respect to the strike price K.
  - This rule prevents a simple arbitrage called a butterfly spread.
  - Rule: ∂²V/∂K² ≥ 0 (the price curve must be "smiley-faced").
  - Penalty: We use autograd to compute the second derivative ∂²V/∂K² and heavily penalize the network if it is negative.
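A quick way to see the convexity rule in action is the discrete butterfly test: a butterfly spread priced at C(K−h) − 2·C(K) + C(K+h) must be non-negative, or selling it would be free money. The sketch below uses a softplus curve as a hypothetical stand-in for a real pricer's output (not the Heston model), purely to illustrate the check:

```python
import numpy as np

# Toy convex call-price curve over a strike grid. In practice C(K) would come
# from the pricer (or the NN); softplus(S0 - K) is an illustrative stand-in
# that is smooth, decreasing, and convex in K.
S0 = 100.0
K = np.linspace(80, 120, 41)  # strike grid, step h = 1
C = np.log1p(np.exp(S0 - K))  # softplus: convex in K

# Butterfly spread: long C(K-h), short 2*C(K), long C(K+h).
# Its price must be non-negative for every interior strike.
butterfly = C[:-2] - 2 * C[1:-1] + C[2:]
print(bool((butterfly >= 0).all()))  # True: no butterfly arbitrage on this curve
```

A non-convex kink anywhere in the curve would make at least one entry of `butterfly` negative, which is exactly what the autograd penalty below punishes in continuous form.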
Final Loss Function (No-Arbitrage NN Total Loss Function)

The total loss, L_total, is a weighted sum of three separate loss components:

L_total = L_MSE + λ_intrinsic · L_intrinsic + λ_convexity · L_convexity
Training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

LR = 1e-3
# Loss function weights
# These are the most important hyperparameters to tune
lambda_intrinsic = 1     # Weight for the intrinsic value constraint
lambda_convexity = 0.01  # Weight for the strike convexity constraint
N_EPOCHS = 200

model = ArbitrageFreeHestonNN(in_features=X_train.shape[1]).to(device)
optimizer = optim.Adam(model.parameters(), lr=LR)
mse_loss_fn = nn.MSELoss()

for epoch in range(N_EPOCHS):
    model.train()
    total_loss_epoch = 0
    avg_raw_mse = 0
    avg_raw_intrinsic = 0
    avg_raw_convexity = 0
    for X_batch, Y_batch in train_loader:
        X_batch = X_batch.to(device)
        Y_batch = Y_batch.to(device)
        # Enable grad on the unscaled batch
        X_batch.requires_grad = True

        # Forward pass (on unscaled data)
        optimizer.zero_grad()
        V_pred = model(X_batch)

        # Calculate the loss components (on unscaled, real values)
        Y_batch = Y_batch.view(-1, 1)

        # 1. L_MSE: Standard price-matching loss
        loss_mse = mse_loss_fn(V_pred, Y_batch)

        # 2. L_intrinsic: American early exercise constraint
        # V must be >= max(S - K, 0) for a call
        S = X_batch[:, 0].view(-1, 1)  # Assumes S is column 0
        K = X_batch[:, 1].view(-1, 1)  # Assumes K is column 1
        intrinsic_value = torch.relu(S - K)
        loss_intrinsic = torch.mean(torch.relu(intrinsic_value - V_pred)**2)

        # 3. L_convexity: Penalize d2V/dK2 < 0
        V_X = torch.autograd.grad(
            V_pred, X_batch,
            grad_outputs=torch.ones_like(V_pred),
            create_graph=True
        )[0]
        V_K = V_X[:, 1].view(-1, 1)  # K is column 1
        V_KX = torch.autograd.grad(
            V_K, X_batch,
            grad_outputs=torch.ones_like(V_K),
            create_graph=True
        )[0]
        V_KK = V_KX[:, 1].view(-1, 1)  # K is column 1
        loss_convexity = torch.mean(torch.relu(-V_KK)**2)

        # Total loss and backward pass
        avg_raw_mse += loss_mse.item()
        avg_raw_intrinsic += loss_intrinsic.item()
        avg_raw_convexity += loss_convexity.item()
        total_loss = (
            loss_mse
            + lambda_intrinsic * loss_intrinsic
            + lambda_convexity * loss_convexity
        )
        total_loss.backward()
        optimizer.step()
        total_loss_epoch += total_loss.item()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{N_EPOCHS}, "
              f"Avg Train Loss: {total_loss_epoch / len(train_loader):.4e}")

print("Training finished.")

Evaluation
# Initialize metrics from torchmetrics
val_mae = MeanAbsoluteError().to(device)
val_mse = MeanSquaredError().to(device)
val_rmse = MeanSquaredError(squared=False).to(device)
val_r2 = R2Score().to(device)

model.eval()
with torch.no_grad():
    for X_val_batch, Y_val_batch in val_loader:
        X_val_batch = X_val_batch.to(device)
        Y_val_batch = Y_val_batch.to(device)
        # No scaling is applied
        V_val_pred = model(X_val_batch)
        Y_val_batch = Y_val_batch.view(-1, 1)
        # Update metrics
        val_mae.update(V_val_pred, Y_val_batch)
        val_mse.update(V_val_pred, Y_val_batch)
        val_rmse.update(V_val_pred, Y_val_batch)
        val_r2.update(V_val_pred, Y_val_batch)

# Compute final metrics
mae_value = val_mae.compute().item()
mse_value = val_mse.compute().item()
rmse_value = val_rmse.compute().item()
r2_value = val_r2.compute().item()

print("\n--- Arbitrage-Free Heston NN Model Evaluation ---")
print(f"Mean Absolute Error (MAE): {mae_value:.4f}")
print(f"Mean Squared Error (MSE): {mse_value:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_value:.4f}")
print(f"R-squared (R2): {r2_value:.4f}")
Accuracy Results
| Model | MAE | MSE | R2 |
| --- | --- | --- | --- |
| Random Forest | 0.7734 | 2.5483 | 0.9585 |
| Gradient Boosting | 0.6044 | 2.4298 | 0.9604 |
| Neural Network | 0.7130 | 2.3253 | 0.9621 |
| Arbitrage Free Neural Network | 1.3302 | 4.4461 | 0.9276 |
The high MAE for the arbitrage-free neural network is likely due to the strict constraints imposed on the loss function. Adjusting the weights given to the financial-constraint loss components can help the model fit the data better. This can be achieved through hyperparameter tuning.
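One simple way to tune the two penalty weights is a small grid search. The sketch below is illustrative only: `train_and_eval` is a hypothetical placeholder that would wrap the training loop above and return the validation MAE; here it is stubbed with a dummy objective so the loop structure itself is runnable.

```python
from itertools import product

# Candidate weights for the two financial penalty terms
intrinsic_grid = [0.1, 1.0, 10.0]
convexity_grid = [0.001, 0.01, 0.1]

def train_and_eval(lam_intrinsic, lam_convexity):
    """Hypothetical stand-in: in the real workflow this would retrain the
    arbitrage-free NN with the given weights and return validation MAE.
    Here it is a dummy objective, minimized at (1.0, 0.01)."""
    return abs(lam_intrinsic - 1.0) + abs(lam_convexity - 0.01)

# Evaluate every weight combination and keep the best
results = {
    (li, lc): train_and_eval(li, lc)
    for li, lc in product(intrinsic_grid, convexity_grid)
}
best = min(results, key=results.get)
print(best)  # (1.0, 0.01) for this dummy objective
```

In practice each cell of the grid means one full retraining run, so a coarse grid (or a few rounds of manual bisection) is usually the pragmatic choice.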
Prediction Speed
To carry out the prediction-speed analysis, the following code was used. Instead of timing the prediction on a single sample, I took one hundred samples and calculated the average prediction time across them.
def benchmark_prediction_speed(model_name, model, X_test_sample, device):
    """
    Measures the average prediction speed for a single data point.

    Args:
        model_name (str): The name of the model (for printing).
        model (torch.nn.Module or sklearn model): The model to test.
        X_test_sample (torch.Tensor or np.ndarray): 100 data points to test.
        device (torch.device): The device (cpu or cuda).
    """
    # --- PyTorch model handling ---
    if isinstance(model, torch.nn.Module):
        model.eval()
        model.to(device)
        # Ensure the data is on the correct device
        if not isinstance(X_test_sample, torch.Tensor):
            X_test_sample_torch = torch.tensor(X_test_sample, dtype=torch.float32).to(device)
        else:
            X_test_sample_torch = X_test_sample.to(device)
        times = []
        with torch.no_grad():
            for i in range(len(X_test_sample_torch)):
                # Get a single data point, but keep it as a [1, N_features] batch
                single_point = X_test_sample_torch[i].unsqueeze(0)
                # GPU synchronization for accurate timing
                if device.type == 'cuda':
                    torch.cuda.synchronize()
                start_time = time.perf_counter()
                _ = model(single_point)  # Predict
                if device.type == 'cuda':
                    torch.cuda.synchronize()
                end_time = time.perf_counter()
                times.append(end_time - start_time)
    # --- Sklearn model handling (RF, GB) ---
    else:
        times = []
        for i in range(len(X_test_sample)):
            # Get a single data point as a one-row DataFrame
            single_point = X_test_sample.iloc[[i]]
            start_time = time.perf_counter()
            _ = model.predict(single_point)  # Predict
            end_time = time.perf_counter()
            times.append(end_time - start_time)

    # Calculate the average time in milliseconds
    avg_time_ms = np.mean(times) * 1000
    print(f"--- {model_name} Benchmark ---")
    print(f"Average prediction time: {avg_time_ms:.6f} ms (per single data point)")
    return avg_time_ms

| Model | Mean time elapsed (s) | Relative Speed Compared to the Neural Network |
| --- | --- | --- |
| Neural Network | 0.00020 | – |
| Arbitrage Free Neural Network | 0.00021 | 1.05 times slower |
| Gradient Boosting | 0.00086 | 4.43 times slower |
| Random Forest | 0.01380 | 68.2 times slower |
The Gradient Boosting model achieved the lowest Mean Absolute Error (MAE), confirming its superior predictive accuracy on the test set. However, the standard Neural Network (NN) provides a far more practical solution for market making. By tuning the Arbitrage-Free Neural Network further, we might achieve better accuracy at comparable efficiency.
By accepting a minor increase in MAE (from 0.6044 to 0.7130), the standard Neural Network achieves an evaluation speed that is 4.43 times faster than the Gradient Boosting model and 68.2 times faster than the Random Forest.
Final Conclusion
For a quantitative firm aiming for a low-latency market-making edge, the speed advantage offered by the Neural Network is decisive. Its ability to achieve competitive accuracy while reducing evaluation time to 0.00020 seconds validates the core hypothesis of deep learning in finance: computational speed unlocks new trading opportunities where traditional methods are simply too slow. Further optimization through architecture reduction (fewer layers) could enhance this speed advantage even more.