A single neural network often struggles to price options accurately across every regime. Why ask a LEAPS expert to price a 0DTE call? k-Sparse Gating solves this by acting as an intelligent ‘triage’ system. By activating only the Top-k specialists from a committee of 9 regime-trained experts, we eliminate ‘model noise’ and slash GPU latency. This tutorial moves beyond standard averaging to build a faster, specialized intelligence for complex market regimes.
Introduction
In the world of derivative pricing, a single “master” model often fails because it tries to be everything to everyone. An option’s “psychology” changes completely depending on its Moneyness and Maturity. A 0DTE (zero days to expiration) Out-of-the-Money (OTM) call behaves nothing like a 2-year Deep In-the-Money (ITM) LEAP.
One way to handle this is to train 9 expert neural networks on specific data regimes (e.g., ITM-Short, ITM-Medium, ITM-Long, …) and average their outputs through a Gated Neural Network. However, running all 9 experts for every prediction is inefficient.
The solution is efficiency through sparsity. We will build a system that manages 9 specialized experts but only calls the “Top-k” specialists needed for the specific trade at hand.
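To make the 3×3 regime grid concrete, here is a hypothetical bucketing helper. The moneyness and maturity thresholds below are illustrative assumptions, not the exact regime boundaries used to train the experts.

```python
# Hypothetical regime bucketing for a call option.
# Thresholds are illustrative assumptions, not the tutorial's exact cutoffs.
def regime(moneyness: float, maturity_days: int) -> str:
    """Map (S/K, days to expiry) to one of 9 regime labels."""
    if moneyness < 0.97:
        m = "OTM"
    elif moneyness <= 1.03:
        m = "ATM"
    else:
        m = "ITM"
    if maturity_days <= 30:
        t = "Short"
    elif maturity_days <= 180:
        t = "Medium"
    else:
        t = "Long"
    return f"{m}-{t}"

print(regime(1.10, 5))    # ITM-Short: a deep ITM weekly call
print(regime(0.90, 730))  # OTM-Long: a far-OTM LEAPS
```

Each expert would then be trained only on the samples that fall in its bucket.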
The Concept: Why “k-Sparse” for Options?
If you have 9 experts trained on different data regimes (e.g., OTM-Short, ATM-Medium, ITM-Long), running all 9 for every single price calculation is a waste of VRAM and compute power.
k-Sparse Gating acts as a triage desk. It analyzes the input, specifically the moneyness and maturity features, and “activates” only the k most relevant experts. By setting k=2, we get the specialized knowledge of the best two regimes without the “noise” of the other seven.
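The “activate only k” trick reduces to a top-k mask followed by a softmax. Here is a minimal sketch with toy gating logits (the numbers are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy gating logits: a batch of 2 inputs scored against 9 experts.
logits = torch.tensor([[2.0, 0.1, -1.0, 0.5, 3.0, -0.3, 0.2, 1.0, -2.0],
                       [0.0, 4.0, 1.0, -0.5, 0.3, 2.5, -1.0, 0.7, 0.1]])

k = 2
top_k_vals, top_k_idx = torch.topk(logits, k, dim=1)

# Everything outside the top-k is set to -inf, which softmax maps to exactly 0.
mask = torch.full_like(logits, float('-inf'))
mask.scatter_(1, top_k_idx, top_k_vals)
weights = F.softmax(mask, dim=1)

print((weights > 0).sum(dim=1))  # exactly 2 nonzero weights per row
print(weights.sum(dim=1))        # each row still sums to 1
```

Because the masked logits are -inf rather than merely small, the discarded experts receive weight exactly zero, not a tiny residual.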
The Implementation Tutorial
Step A: The Sparse Gating Logic
We want the gate to output nonzero weights for the top-k experts and exactly zero for everyone else.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionsMoELayer(nn.Module):
    def __init__(self, experts, input_dim=9, k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # Your 9 pre-trained .pth models
        self.n_experts = len(experts)
        self.k = k

        # Gating network: the "Regime Classifier"
        self.gate = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, self.n_experts)
        )

        # Set up CUDA streams for parallel expert execution
        self.use_cuda = torch.cuda.is_available()
        if self.use_cuda:
            self.streams = [torch.cuda.Stream() for _ in range(self.n_experts)]
```

Step B: The Forward Pass with Sparsity
```python
    def forward(self, x, return_weights=False):
        # Gating logic: score all experts, keep only the top-k
        logits = self.gate(x)
        top_k_logits, top_k_indices = torch.topk(logits, self.k, dim=1)

        # Sparsity masking: -inf logits become exactly zero after softmax
        mask = torch.full_like(logits, float('-inf'))
        mask.scatter_(1, top_k_indices, top_k_logits)
        weights = F.softmax(mask, dim=1)

        # Parallel expert execution: run only the experts selected
        # for at least one sample in the batch
        active_indices = torch.unique(top_k_indices)
        expert_outputs = [None] * self.n_experts
        for idx in active_indices:
            i = idx.item()
            if self.use_cuda:
                with torch.cuda.stream(self.streams[i]):
                    expert_outputs[i] = self.experts[i](x)
            else:
                expert_outputs[i] = self.experts[i](x)

        # Sync: make the default stream wait on every active expert stream
        if self.use_cuda:
            for idx in active_indices:
                torch.cuda.current_stream().wait_stream(self.streams[idx.item()])

        # Convert None results to zeros so we can stack them
        processed_outputs = []
        for out in expert_outputs:
            if out is not None:
                processed_outputs.append(out)
            else:
                # Dummy zero tensor on the same device as x
                processed_outputs.append(torch.zeros(x.shape[0], 1, device=x.device))

        # Stack to [batch, 9, 1] then squeeze to [batch, 9]
        stacked_outputs = torch.stack(processed_outputs, dim=1).squeeze(-1)

        # Final weighted sum: inactive experts carry zero weight anyway
        final_prediction = (weights * stacked_outputs).sum(dim=1, keepdim=True)

        if return_weights:
            return final_prediction, weights
        return final_prediction
```

Why This Matters for Performance
By implementing this tutorial, you solve three major problems in high-speed financial modeling:
- Latency: In your “benchmark_prediction_speed_cuda” tests, you’ll see a massive drop in latency because the GPU only processes k expert kernels instead of all 9.
- Specialization: Experts no longer have to “compromise” their internal weights to accommodate every regime. They become hyper-accurate in their specific niche (e.g., Deep OTM volatility).
- Signal vs. Noise: You avoid the “averaging effect” where a LEAPS expert’s prediction might negatively influence a 0DTE price.
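You can verify the sparsity behavior end to end. The sketch below condenses the layer to its CPU path and substitutes tiny untrained MLPs for the 9 trained checkpoints (both simplifications are mine, to keep the demo self-contained); the gating and weighting logic match the implementation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseGateDemo(nn.Module):
    """Condensed, CPU-only version of the k-sparse gating layer."""
    def __init__(self, experts, input_dim=9, k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.k = k
        self.gate = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                  nn.Linear(64, len(experts)))

    def forward(self, x):
        logits = self.gate(x)
        top_v, top_i = torch.topk(logits, self.k, dim=1)
        # -inf mask -> softmax gives exactly zero to non-selected experts
        weights = F.softmax(
            torch.full_like(logits, float('-inf')).scatter(1, top_i, top_v),
            dim=1)
        outs = torch.stack([e(x) for e in self.experts], dim=1).squeeze(-1)
        return (weights * outs).sum(dim=1, keepdim=True), weights

# Stand-in experts: tiny untrained MLPs instead of the 9 .pth checkpoints.
experts = [nn.Sequential(nn.Linear(9, 16), nn.ReLU(), nn.Linear(16, 1))
           for _ in range(9)]
moe = SparseGateDemo(experts)

x = torch.randn(4, 9)  # batch of 4 option feature vectors
with torch.no_grad():
    price, weights = moe(x)

print(price.shape)               # torch.Size([4, 1])
print((weights > 0).sum(dim=1))  # 2 active experts per sample
```

Note that this demo runs all experts and lets the zero weights cancel the inactive ones; the full layer above additionally skips the inactive forward passes, which is where the compute savings come from.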
Bottom Line
The true power of a k-Sparse Gated Neural Network for option pricing lies in its ability to respect the unique “physics” of different market regimes. In the world of derivatives, a 0DTE OTM call and a LEAPS ITM put operate under completely different Greeks and volatility profiles. By utilizing a mixture of 9 specialized experts and only activating the Top-2 relevant specialists, you achieve:
- Superior Accuracy: Each expert focuses on its niche (e.g., Deep OTM volatility), preventing the “dilution” of signals that occurs in a single monolithic model.
- Hardware Efficiency: Using CUDA Streams ensures that the chosen experts run in parallel, while k-sparsity ensures you don’t waste GPU cycles on irrelevant computations.
- Operational Speed: Your benchmarks will show that this “Smart Committee” provides faster inference than a dense network, crucial for high-frequency trading environments.
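As a rough way to check the speed claim on your own machine, here is a hedged CPU timing sketch. Expert size, batch size, and the active-expert indices are illustrative assumptions; on GPU you would time with torch.cuda.Event rather than time.perf_counter.

```python
import time
import torch
import torch.nn as nn

# Illustrative experts and batch; sizes are assumptions, not the tutorial's.
experts = [nn.Sequential(nn.Linear(9, 256), nn.ReLU(), nn.Linear(256, 1))
           for _ in range(9)]
x = torch.randn(64, 9)

def run_dense():
    # Dense committee: every expert prices every batch, then average.
    return torch.stack([e(x) for e in experts], dim=1).mean(dim=1)

def run_sparse(active=(2, 5)):
    # k-sparse: only the selected experts run (indices chosen arbitrarily).
    return torch.stack([experts[i](x) for i in active], dim=1).mean(dim=1)

with torch.no_grad():
    for name, fn in [("dense (9 experts)", run_dense),
                     ("sparse (k=2)", run_sparse)]:
        fn()  # warm-up
        t0 = time.perf_counter()
        for _ in range(200):
            fn()
        print(f"{name}: {(time.perf_counter() - t0) / 200 * 1e3:.3f} ms/batch")
```

On CPU the gap roughly tracks the 9-vs-2 ratio of expert forward passes; on GPU the streams add overlap on top of that.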