Outlier GPT

Built with: Python · PyPI · OpenAI · NumPy · Pandas · Scikit-Learn · PyTorch · Git

Why read this?

This project addresses a major gap in modern analytics: detecting anomalies is easy, but explaining them is slow, manual, and often inconclusive. Outlier‑GPT automates that missing step by using an agentic LLM to merge internal data context with external real‑world signals, producing structured, defensible explanations for any anomaly. The result is a system that dramatically reduces investigation time and turns raw outliers into actionable intelligence.

Introduction

In the field of data analysis, while Large Language Models (LLMs) have become ubiquitous for predictive modeling, a major bottleneck remains: explaining why an anomaly occurred. Traditional anomaly detection flags the what, but leaves analysts to manually research the why—a time-consuming and often ambiguous task.

This project tackles that underexplored use case by leveraging the Agentic Framework and LLMs’ powerful reasoning and web-browsing capabilities.

We introduce ‘outlier-GPT’, a Python package designed to transition anomaly investigation from a manual research chore to an automated, systematic process. This tool creates an agent that performs three key steps:

  1. Contextualization: Gathers rich data context (nearby points, related features) surrounding the anomaly.
  2. External Validation: Scans the internet (web-browsing) for real-world events or external data that correlate with the anomaly’s index/timestamp.
  3. Reasoning: Uses the LLM to synthesize the internal data context with external findings to generate structured, plausible hypotheses (e.g., ‘Data Error,’ ‘Market Event,’ or ‘Structural Pattern’).

The result is an automated anomaly explanation system that provides actionable insights, drastically reducing the time required for data validation and incident response.
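The three steps above can be sketched as a minimal pipeline. This is an illustrative outline, not outlier-GPT's actual internals: the function names (`gather_context`, `search_external_signals`, `build_prompt`) and the stubbed web search are assumptions made for the sketch.

```python
import pandas as pd

def gather_context(df: pd.DataFrame, idx, column: str, window: int = 3) -> pd.DataFrame:
    """Step 1 (Contextualization): collect rows surrounding the anomaly."""
    pos = df.index.get_loc(idx)
    lo, hi = max(0, pos - window), min(len(df), pos + window + 1)
    return df.iloc[lo:hi]

def search_external_signals(idx) -> list:
    """Step 2 (External Validation) - stub: a real agent would call a
    web-search tool here to find events near the anomaly's timestamp."""
    return [f"headline near {idx}"]

def build_prompt(context: pd.DataFrame, column: str, signals: list) -> str:
    """Step 3 (Reasoning): merge internal and external context into one prompt."""
    return (
        f"Data around the anomaly:\n{context[column].to_string()}\n\n"
        "External findings:\n" + "\n".join(signals) +
        "\n\nClassify the anomaly as 'Data Error', 'Market Event', "
        "or 'Structural Pattern'."
    )
```

The final prompt would then be sent to the LLM, which returns the structured hypothesis.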

Anomaly Explanation Proof of Concept (The Stock Data Example)

  • Input Data: AAPL stock data from yfinance.
  • Execution: Outlier-GPT ships several detection methods. Once an outlier is detected, its index can be fed to the agent, which then executes the required tools to produce an explanation.

Download data

import pandas as pd
# Install yfinance: pip install yfinance
import yfinance as yf
from outlier_gpt.agents import OutlierAgent

# 1. Fetch Time Series Data (AAPL stock)
TICKER = 'AAPL'
data_column = 'Close'

# Fetch a few months of data; the index will be a DatetimeIndex
df = yf.download(TICKER, start='2024-01-01', end='2024-05-01')

Data preprocessing

# yfinance returns MultiIndex columns (price field, ticker); flatten to one level
df.columns = df.columns.droplevel(1)
# Standardize the column names the agent will work with
df.reset_index(inplace=True)
df.rename(columns={"Date": "timestamp", "Close": "value"}, inplace=True)
df.set_index('timestamp', inplace=True)

Initializing the agent

# 2. Initialize the agent
# Note: Use a model that supports web browsing/tools (e.g., gpt-4o or gpt-5) for deep search
agent = OutlierAgent(api_key="your-openai-api-key", model="gpt-5")

Optional: You can set the global context to the agent

# Optional: Set a global context (accessible via the property setter)
agent.context = f"The primary focus of this analysis is extreme volatility in {TICKER} stock."

Detecting Outliers

# 3. Detect Outliers
# Flag points whose rolling-window Z-score exceeds the threshold.
# (`outlier_detection` is outlier-gpt's detection module; see the package
# docs for the exact import path.)
outlier_indices = outlier_detection.rolling_window_method(
    df, 'value', window_size=5, threshold=3)

# Visualizing
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(df.index, df['value'], label='Value')
plt.scatter(outlier_indices, df.loc[outlier_indices, 'value'], color='red', label='Outliers')
plt.title("Rolling Window Outlier Detection")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

print(f"Detected {len(outlier_indices)} outlier(s) in {TICKER} stock.")
print(f"Indices (Dates): {list(outlier_indices)}")

Explain Outliers

You have two options:

  • You can ask the agent to explain one outlier.
# 4. Explain Outliers

# --- Option A (Recommended): Explain a Single Outlier with Deep Search ---
# Analyze the first detected outlier in depth.
# The `deep_search=True` flag tells the LLM to use external tools 
# to find news matching the date/ticker, leading to a much stronger "Market Event" justification.

if len(outlier_indices) > 0:  # avoids ambiguous truth-value checks on a pandas Index
    first_outlier_index = outlier_indices[0]
    
    explanation_single = agent.explain_outlier(
        df,
        data_column=data_column,
        outlier_index=first_outlier_index,
        context_columns=['Volume', 'High', 'Low'], # Additional columns for LLM context
        deep_search=True # Uses web search/tools for external context
    )
    
    print("\n--- Single Outlier Explanation (Deep Search) ---")
    print(f"Index {first_outlier_index}:\n{explanation_single}")
  • You can ask the agent to explain multiple outliers.
# --- Option B: Explain Multiple Outliers (Batch Processing) ---
# Use explain_outliers for quick, generalized explanations of all detected anomalies.
# This typically does NOT use deep search for cost/speed reasons.

explanations_batch = agent.explain_outliers(
    df,
    data_column=data_column,
    outlier_indices=outlier_indices,
    context_columns=['Volume']
)

print("\n--- Batch Outlier Explanations (Standard) ---")
for idx, explanation in explanations_batch.items():
    print(f"Index {idx} ({TICKER} Price):\n{explanation}\n")

Results: Single-Outlier Case

Hypothesis: External Event
Classification: External Event
Justification: The examination of the full row data around the identified outlier value in the AAPL stock on December 1st shows a significant increase in stock price compared to the previous trading day with the price at 189.45, higher than the surrounding days. This spike aligns with the higher price and could potentially be explained by a significant company announcement, product launch, major acquisition, or other strategic corporate events that might have occurred on or around December 1st. No clear evidence of a data quality issue is detectable since only a single value stands out without associated anomalies in related columns such as Volume or extreme values in High, Low, or Open prices. Considering AAPL’s market sensitivity to news and its impact on stock volatility, an external event influencing investor behavior is the most plausible explanation.
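If you need those labelled fields programmatically, the text above can be split with a small parser. This is a sketch written against the output format shown here; the function name and the assumption that outlier-gpt returns plain labelled text (rather than an already-structured object) are ours:

```python
import re

def parse_explanation(text: str) -> dict:
    """Split a 'Hypothesis / Classification / Justification' block into a dict."""
    fields = {}
    pattern = r"^(Hypothesis|Classification|Justification):\s*(.*?)(?=^\w+:|\Z)"
    for key, value in re.findall(pattern, text, flags=re.MULTILINE | re.DOTALL):
        fields[key.lower()] = value.strip()
    return fields
```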

Documentation

https://test.pypi.org/project/outlier-gpt/

Conclusion

Outlier-GPT is a transformative tool that bridges the gap between anomaly detection and actionable intelligence. It provides immediate clarity for critical data points, drastically reducing the time required for manual investigation.

The package delivers a complete, systematic explanation for any anomaly, structuring the findings into three key elements:

  • Hypothesis: The most plausible underlying reason for the anomaly.
  • Classification: A definitive categorization (e.g., External Event / Data Quality Issue).
  • Justification: Concrete reasoning, leveraging data patterns and external web search evidence to support the claim.

This approach ensures that users can quickly identify, classify, and respond to anomalies with high confidence, turning data confusion into strategic clarity.

Future Work

1. Enhanced Outlier Detection Automation

Currently, users must run a detection method (like Z-score) separately and then feed the indices to the agent. Future work will focus on integrating the detection step directly into the agent’s pipeline:

  • Continuous Monitoring: Developing a listener framework that allows the agent to continuously monitor a live data stream (e.g., streaming API or database) and automatically trigger the explanation pipeline the moment an anomaly is detected.
  • Hierarchical Clustering: Implementing logic to group multiple adjacent outliers (e.g., a volatility spike spanning three days) and generating a single, unified explanation rather than requiring multiple API calls, improving both efficiency and contextual coherence.
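The grouping idea above can be prototyped by merging outlier timestamps that fall within a gap threshold of each other, so each run needs only one explanation call. A minimal sketch (not part of the current package; the function name and `max_gap_days` parameter are illustrative):

```python
import pandas as pd

def group_adjacent_outliers(indices, max_gap_days: int = 1) -> list:
    """Merge outlier timestamps into runs whose consecutive members are at
    most `max_gap_days` apart; each run can share one explanation call."""
    ordered = sorted(pd.to_datetime(indices))
    groups = []
    for ts in ordered:
        if groups and (ts - groups[-1][-1]).days <= max_gap_days:
            groups[-1].append(ts)   # extend the current run
        else:
            groups.append([ts])     # start a new run
    return groups
```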
2. Reducing Query Costs and Latency

API costs, particularly from web-browsing and token usage, are a primary operational expense. We will introduce strategies to reduce these costs:

  • Intelligent Caching: Implementing a caching layer for external searches. If two anomalies occur on the same day for the same entity (e.g., AAPL on December 1st), the agent will reuse the existing web search results rather than running redundant queries.
  • Prompt Minimization: Fine-tuning the prompt templates to be more concise and token-efficient. This involves selectively including only the most relevant surrounding data (e.g., only the High and Low values, not the full row) and optimizing the LLM’s system instruction for brevity without sacrificing reasoning quality.
  • Local Search Integration: Exploring the integration of local, curated knowledge bases (e.g., indexed financial news archives) to reduce reliance on expensive, slower, real-time web browsing for known historical events.
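The caching strategy above could start as simply as memoising search results on a (ticker, date) key. A sketch under stated assumptions (the search function is a stub standing in for the agent's web-search tool; a production version would add TTLs and persistence):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_web_search(ticker: str, date: str) -> tuple:
    """Return (stub) search results for a ticker/date pair, computed once.
    A real implementation would invoke the agent's web-search tool here."""
    return (f"results for {ticker} on {date}",)

# Two anomalies on the same day for the same entity reuse one lookup:
first = cached_web_search("AAPL", "2024-02-01")
second = cached_web_search("AAPL", "2024-02-01")
```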
