Inside the Engine: How a Python Metaclass Automatically Registers Data Schemas

Python Data Engineering Data Contracts Open Source Metaclasses Schema Evolution
plugin-data-quality-lab

This is Part 2 of the data-contracts series. Part 1 introduced the problem and the project; read it first if you haven’t already. This post goes inside the engine room: how a class registers itself automatically using a metaclass, how fields are extracted from type hints, how incoming data is validated, and how the schema registry keeps everything together.

Where we left off

In Part 1 we saw that writing this:

class TradeSchema(ContractBase):
    symbol: str
    price: float
    volume: int

automatically registers the schema, attaches a validate() method, and makes it available to the diff engine and CLI — all without a single manual call.

This post answers the obvious follow-up question: how?

Step 1: understanding what Python does with a class definition

Before we look at any framework code, you need to understand what Python actually does when it reads a class statement.

Most people think of a class definition as a declaration — you describe something and Python stores it. But that’s not quite right. A class definition is executable code. Python runs it, top to bottom, and then calls something to assemble the result into a class object.

That “something” is called a metaclass. By default it’s Python’s built-in type — the factory that builds every class you’ve ever written. A metaclass is simply your own replacement for type, one where you get to intercept construction and do extra work.

Think of it like a city building department. Every new building has to go through the department before it opens — permits are filed, codes are checked, the address is registered in the city database. The building doesn’t choose to do this. It happens automatically because of the rules the department enforces on every new building. A metaclass is that department, and your class is the building.

Here is the sequence Python follows when it reads class TradeSchema(ContractBase):

  1. Look up the metaclass. Because ContractBase declares metaclass=ContractMeta, every subclass inherits it automatically.
  2. Execute the class body. Annotations, methods, and assignments land in a fresh dict called namespace.
  3. Call ContractMeta("TradeSchema", (ContractBase,), namespace), which runs ContractMeta.__new__(mcs, "TradeSchema", (ContractBase,), namespace). This is where your code runs.
  4. The result is bound to the name TradeSchema in the enclosing scope.

Step 3 happens before any instance is created — before you even reach the line after the class definition. That’s the key. By the time Python moves on, the schema is already registered.
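To see that ordering for yourself, here is a minimal sketch using a throwaway metaclass. The names LoggingMeta, Base, Trade, and events are made up for this illustration and are not part of the framework.

```python
events = []  # records what happened, in order


class LoggingMeta(type):
    """Hypothetical metaclass that only logs when it builds a class."""

    def __new__(mcs, name, bases, namespace):
        events.append(f"building {name}")
        return super().__new__(mcs, name, bases, namespace)


class Base(metaclass=LoggingMeta):
    pass


events.append("before Trade")


class Trade(Base):          # the metaclass is inherited from Base
    symbol: str


events.append("after Trade")

print(events)
# ['building Base', 'before Trade', 'building Trade', 'after Trade']
print(type(Trade) is LoggingMeta)   # True
```

No instance of Trade is ever created here, yet the metaclass has already run by the time the line after the class statement executes.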

Step 2: the metaclass code, line by line

Here is ContractMeta in full. We’ll walk through every part:

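from typing import get_type_hints

# SchemaRegistry, SchemaVersion, and validate_data are defined elsewhere in the framework.

class ContractMeta(type):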

    def __new__(mcs, name, bases, namespace):

        # 1. build the class normally first
        cls = super().__new__(mcs, name, bases, namespace)

        # 2. skip ContractBase itself
        if name == "ContractBase":
            return cls

        # 3. collect annotations from the full inheritance chain
        annotations: dict = {}
        for base in reversed(cls.__mro__):
            annotations.update(getattr(base, "__annotations__", {}))

        # 4. resolve string annotations into real types
        try:
            resolved = get_type_hints(cls)
        except Exception:
            resolved = annotations

        # 5. filter out private fields
        schema_fields = {
            k: v for k, v in resolved.items()
            if not k.startswith("_")
        }

        # 6. register in the global schema registry
        SchemaRegistry.register(SchemaVersion(
            name=name,
            version=getattr(cls, "__version__", "1.0.0"),
            fields=schema_fields,
        ))

        # 7. inject validate() using a closure
        def _validate(data: dict):
            return validate_data(schema_fields, data)

        cls.validate = staticmethod(_validate)
        return cls


class ContractBase(metaclass=ContractMeta):
    """Inherit from this to opt into the framework."""
    pass

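Seven numbered comments. Let’s go through each one. Note that the walkthrough below handles comments 3 and 4 (the MRO walk and get_type_hints()) together under Step 3, so its headings run from Step 1 to Step 6.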

Step 1 — build the class normally first

super().__new__(mcs, name, bases, namespace) calls Python’s built-in type.__new__ to do the standard class construction. We do this first and then modify the result. This gives us a real, working class to attach things to.

Step 2 — the guard clause

When Python reads the definition of ContractBase itself, it also runs through ContractMeta.__new__ — because ContractBase declares metaclass=ContractMeta. But ContractBase has no fields and shouldn’t be registered. Without this guard, you’d get an empty schema called ContractBase in your registry every time the module loads.
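The name check works, but it would also skip any user class that happens to be called ContractBase. One common alternative, shown here as a sketch rather than as what the project does, is to skip any class that has no parent built by the metaclass. The registered list is a stand-in for the real registry.

```python
registered = []  # stand-in for SchemaRegistry, for illustration only


class ContractMeta(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)

        # The root class is the only one with no ContractMeta-built parent.
        is_root = not any(isinstance(base, ContractMeta) for base in bases)
        if is_root:
            return cls

        registered.append(name)
        return cls


class ContractBase(metaclass=ContractMeta):
    pass


class TradeSchema(ContractBase):
    symbol: str


print(registered)   # ['TradeSchema']
```

The trade-off is readability: the string comparison says exactly which class is being skipped, while the structural check keeps working if the base class is ever renamed.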

Step 3 — the MRO walk (an important detour)

This is the most interesting part, and it’s worth understanding properly.

annotations: dict = {}
for base in reversed(cls.__mro__):
    annotations.update(getattr(base, "__annotations__", {}))

__mro__ stands for Method Resolution Order. It’s the ordered list Python uses to look up attributes and methods through the inheritance chain. For TradeSchema, it looks like this:

TradeSchema.__mro__
# (TradeSchema, ContractBase, object)

Now here’s the problem this loop is solving. Imagine you have a base schema with shared fields, and a child that adds its own:

class TimestampedSchema(ContractBase):
    created_at: str   # a field every schema should have

class TradeSchema(TimestampedSchema):
    symbol: str
    price: float

If you just read TradeSchema.__annotations__, you only get {"symbol": str, "price": float}. The created_at field lives on TimestampedSchema.__annotations__ — it’s invisible from the child.

The MRO loop fixes this by walking every class in the inheritance chain and merging all their __annotations__ dicts together:

# reversed(__mro__) walks from the top (object) downward
# so child fields overwrite parent fields if names clash

# iteration 1: object         -> {} (no annotations)
# iteration 2: ContractBase   -> {} (no annotations)
# iteration 3: TimestampedSchema -> {"created_at": str}
# iteration 4: TradeSchema    -> {"symbol": str, "price": float}

# final result: {"created_at": str, "symbol": str, "price": float}

Walking in reverse MRO order (root first, child last) means that if a child redefines a field with a different type, the child’s version wins — because it’s the last one written to the dict.
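The override behaviour is easy to check in isolation. This sketch uses plain classes (no metaclass) and a helper called merged_annotations, a name invented here; the price: int on the parent is added only to force a clash.

```python
class TimestampedSchema:
    created_at: str
    price: int            # parent declares int


class TradeSchema(TimestampedSchema):
    symbol: str
    price: float          # child redefines the same field as float


def merged_annotations(cls):
    """Merge __annotations__ from the root of the MRO down to cls."""
    merged = {}
    for base in reversed(cls.__mro__):
        merged.update(getattr(base, "__annotations__", {}))
    return merged


print(merged_annotations(TradeSchema))
# {'created_at': <class 'str'>, 'price': <class 'float'>, 'symbol': <class 'str'>}
```

The child’s float replaces the parent’s int, but price keeps the position it was first inserted at, which is why it appears before symbol.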

Why get_type_hints() is still better

The MRO loop is good, but it has a blind spot: string annotations.

When a type isn’t defined yet at the point of annotation — a common pattern called a forward reference — Python lets you write it as a string:

from __future__ import annotations   # makes ALL annotations strings lazily

class TradeSchema(ContractBase):
    symbol: str
    price: float

With from __future__ import annotations at the top of the file (which is common in modern Python codebases), every annotation is stored as a string. So __annotations__ holds:

{"symbol": "str", "price": "float"}
# strings! not the actual type objects

That’s a problem. Our validation engine needs real type objects to call isinstance(value, float). Calling isinstance(182.5, "float") raises a TypeError.

get_type_hints() resolves this. It takes the class, finds all the string annotations, evaluates them in the right namespace, and returns a dict of actual type objects:

from typing import get_type_hints

# __annotations__ might give you strings:
TradeSchema.__annotations__
# {"symbol": "str", "price": "float"}  <- useless for isinstance()

# get_type_hints() gives you real types:
get_type_hints(TradeSchema)
# {"symbol": <class 'str'>, "price": <class 'float'>}  <- what we need

It also walks the inheritance chain automatically, so it does the MRO walk’s job and resolves string annotations. We keep the manual MRO loop in the code as a fallback: get_type_hints() can fail in rare edge cases (circular imports, unusual generics), so we catch the exception and fall back to the manually merged dict. Be aware that the fallback dict can still contain unresolved strings, which isinstance() cannot check, so it is a last resort rather than an equivalent.

Think of it this way: the MRO loop is the safety net. get_type_hints() is the primary tool.
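Here is a small runnable comparison. It uses explicit string annotations instead of the __future__ import so the snippet can be pasted anywhere, and plain classes instead of ContractBase; both choices are simplifications for the demo.

```python
from typing import get_type_hints


class TimestampedSchema:
    created_at: "str"         # explicit string annotation


class TradeSchema(TimestampedSchema):
    symbol: "str"
    price: "float"


# The raw dict holds strings, and only the class's own fields.
print(TradeSchema.__annotations__)
# {'symbol': 'str', 'price': 'float'}

# get_type_hints() resolves the strings and includes inherited fields.
hints = get_type_hints(TradeSchema)
print(hints)
# {'created_at': <class 'str'>, 'symbol': <class 'str'>, 'price': <class 'float'>}
```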

Step 4 — filtering private fields

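Any annotated field whose name starts with _ is stripped. This means users can annotate internal attributes on their schema class, such as _source: str, without them showing up as required data fields in validation. A plain assignment like __version__ = "2.0.0" carries no annotation, so it never appears among the fields in the first place.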
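A quick check of what the filter actually sees, using a plain class and a hypothetical _source_system field. Only annotated names appear in __annotations__, so an unannotated assignment never reaches the filter at all.

```python
class TradeSchema:                    # plain class; the metaclass is not needed to see this
    __version__ = "2.0.0"             # assignment without an annotation
    _source_system: str = "feed"      # annotated but private (hypothetical field)
    symbol: str
    price: float


print(sorted(TradeSchema.__annotations__))
# ['_source_system', 'price', 'symbol']

# Same comprehension as in the metaclass, applied to the raw annotations.
schema_fields = {
    k: v for k, v in TradeSchema.__annotations__.items()
    if not k.startswith("_")
}
print(schema_fields)
# {'symbol': <class 'str'>, 'price': <class 'float'>}
```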

Step 5 — registering the schema

A SchemaVersion snapshot is created and stored in the SchemaRegistry. We’ll look at the registry in detail shortly, but the key point is that this happens at class definition time (usually when the module is imported), not when data is first validated. By the time any pipeline tries to validate data against this schema, it’s already in the registry.

Step 6 — injecting validate() with a closure

This is the subtlest step. We define a function _validate inside __new__. That means it captures a reference to schema_fields from the enclosing scope — a Python feature called a closure.

Months later, when someone calls TradeSchema.validate(row), that function still knows exactly which fields TradeSchema declared. It’s holding onto them permanently, even though ContractMeta.__new__ finished running long ago.

# Each schema gets its OWN closure with its OWN fields

class TradeSchema(ContractBase):
    symbol: str
    price: float

class OrderSchema(ContractBase):
    order_id: str
    amount: float

# These call different closures remembering different fields
TradeSchema.validate({"symbol": "AAPL", "price": 182.5})
OrderSchema.validate({"order_id": "X1", "amount": 500.0})

We wrap _validate in staticmethod() so Python never binds an instance as the first argument. The usual call is TradeSchema.validate(data), with no instance needed, and calling it on an instance behaves the same way.
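The closure and the staticmethod wrapper can both be observed without the metaclass. In this sketch, make_validator is an invented helper that plays the role of the code inside __new__, and the check is reduced to reporting missing keys.

```python
def make_validator(schema_fields):
    """Return a function that remembers schema_fields (a closure)."""
    def _validate(data):
        # Simplified for illustration: report missing keys only.
        return sorted(set(schema_fields) - set(data))
    return _validate


class TradeSchema:
    pass


class OrderSchema:
    pass


TradeSchema.validate = staticmethod(make_validator({"symbol": str, "price": float}))
OrderSchema.validate = staticmethod(make_validator({"order_id": str, "amount": float}))

# Same input, different answers: each closure remembers its own fields.
print(TradeSchema.validate({"symbol": "AAPL"}))     # ['price']
print(OrderSchema.validate({"symbol": "AAPL"}))     # ['amount', 'order_id']

# staticmethod means an instance is not passed as the first argument.
print(TradeSchema().validate({"symbol": "AAPL"}))   # ['price']

# The captured dict lives in the function's closure cell.
print(TradeSchema.validate.__closure__[0].cell_contents)
# {'symbol': <class 'str'>, 'price': <class 'float'>}
```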

Step 3: the validation engine

The validate_data function is deliberately kept separate from the metaclass. It’s a pure function: takes a schema definition and a data dict, returns a result. No side effects. This makes it independently testable and easy to swap out later.

def validate_data(schema_fields: dict, data: dict) -> ValidationResult:

    errors   = []
    warnings = []

    schema_keys = set(schema_fields.keys())
    data_keys   = set(data.keys())

    # missing required fields -> ERROR
    for f in schema_keys - data_keys:
        errors.append(ValidationError(
            field_name=f,
            message=f"Missing required field: '{f}'"
        ))

    # extra fields not in schema -> WARNING (additive changes are safe)
    for f in data_keys - schema_keys:
        warnings.append(ValidationError(
            field_name=f,
            severity=Severity.WARNING,
            message=f"Unknown field '{f}' — may be a new schema addition"
        ))

    # wrong type -> ERROR
    for f in schema_keys & data_keys:
        expected = schema_fields[f]
        value    = data[f]
        if expected is float and isinstance(value, int):
            continue   # allow int -> float coercion
        if not isinstance(value, expected):
            errors.append(ValidationError(
                field_name=f,
                expected=expected.__name__,
                received=value,
            ))

    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
    )

Three things worth noting about these design decisions:

Set arithmetic for field comparison. schema_keys - data_keys gives you fields that should be there but aren’t (missing). data_keys - schema_keys gives you fields that are there but shouldn’t be (extras). schema_keys & data_keys gives the overlap — the fields to type-check. This is cleaner and faster than nested loops.

Extras are warnings, not errors. If the incoming data has a field your schema doesn’t know about, it produces a warning — not a validation failure. This is deliberate. In distributed systems, upstream producers often add new fields before consumers are updated to use them. Treating that as an error would cause false alarms constantly. Only removals, renames, and type changes actually break consumers.

int fills a float field. If your schema declares price: float and the data sends 182 (an integer), that passes. The integer is numerically valid where a float is expected. Without this coercion, you’d get spurious failures from APIs that return whole numbers without a decimal point.
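The three set operations and the int rule can be tried on their own with sample values. The variable names missing, extra, and shared are chosen here for readability and do not appear in validate_data.

```python
schema_fields = {"symbol": str, "price": float, "volume": int}
data = {"symbol": "AAPL", "price": 182, "new_field": "surprise"}

schema_keys = set(schema_fields)
data_keys = set(data)

missing = sorted(schema_keys - data_keys)   # in the schema, absent from the data
extra = sorted(data_keys - schema_keys)     # in the data, unknown to the schema
shared = sorted(schema_keys & data_keys)    # present in both: type-check these

print(missing)   # ['volume']
print(extra)     # ['new_field']
print(shared)    # ['price', 'symbol']

# The int rule: 182 is accepted where a float is declared.
value, expected = data["price"], schema_fields["price"]
print(expected is float and isinstance(value, int))   # True

# Caveat: bool is a subclass of int, so isinstance(True, int) is also True.
print(isinstance(True, int))   # True
```

One consequence of that last line: with the check as written, True would pass for both an int field and a float field. Whether that matters depends on your data, but it is worth knowing before relying on the rule.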

Let’s see it in action:

# Good data
result = TradeSchema.validate({
    "symbol": "AAPL", "price": 182.5, "volume": 1000
})
print(result.summary())
# ✅ Validation passed

# Wrong type + missing field
result = TradeSchema.validate({
    "symbol": "AAPL", "price": "one-eighty"
})
print(result.summary())
# ❌ Validation failed (2 errors):
#    • Missing required field: 'volume'
#    • Field 'price': expected float, got str = 'one-eighty'

# Extra field — warning, not error
result = TradeSchema.validate({
    "symbol": "AAPL", "price": 182.5,
    "volume": 1000, "new_field": "surprise"
})
print(result.is_valid)    # True — still valid
print(result.warnings)   # [Warning: unknown field 'new_field']
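The examples above rely on Severity, ValidationError, and ValidationResult, which this post does not show. Below is one possible shape for them, inferred only from how validate_data constructs them and from the printed summaries. Treat every default, the describe() helper, and the summary() formatting as assumptions; the classes in the repository may look different.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"


@dataclass
class ValidationError:
    field_name: str
    message: str = ""
    severity: Severity = Severity.ERROR    # assumed default
    expected: Optional[str] = None
    received: Any = None

    def describe(self) -> str:
        """Hypothetical helper: use the message, or build one for type errors."""
        if self.message:
            return self.message
        return (f"Field '{self.field_name}': expected {self.expected}, "
                f"got {type(self.received).__name__} = {self.received!r}")


@dataclass
class ValidationResult:
    is_valid: bool
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def summary(self) -> str:
        if self.is_valid:
            return "✅ Validation passed"
        lines = [f"❌ Validation failed ({len(self.errors)} errors):"]
        lines += [f"   • {e.describe()}" for e in self.errors]
        return "\n".join(lines)


result = ValidationResult(
    is_valid=False,
    errors=[
        ValidationError(field_name="volume", message="Missing required field: 'volume'"),
        ValidationError(field_name="price", expected="float", received="one-eighty"),
    ],
)
print(result.summary())
```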

Step 4: the schema registry

The SchemaRegistry is a central store — a dict keyed by schema name, holding every version of every schema the framework has seen.

class SchemaRegistry:
    _store: dict[str, list[SchemaVersion]] = {}

    @classmethod
    def register(cls, version: SchemaVersion) -> None:
        if version.name not in cls._store:
            cls._store[version.name] = []
        cls._store[version.name].append(version)

    @classmethod
    def get_latest(cls, name: str) -> SchemaVersion | None:
        versions = cls._store.get(name, [])
        return versions[-1] if versions else None

    @classmethod
    def get_history(cls, name: str) -> list[SchemaVersion]:
        return cls._store.get(name, [])

    @classmethod
    def list_schemas(cls) -> list[str]:
        return list(cls._store.keys())

    @classmethod
    def clear(cls) -> None:
        cls._store.clear()   # used in tests to reset between runs

A few design decisions worth understanding here.

Why a class-level dict instead of a module-level variable? Both are global state. But a class-level dict gives you a clean interface, such as SchemaRegistry.register() and SchemaRegistry.get_latest(), instead of raw dict access scattered through the codebase. It’s also easier to mock in tests and easier to swap for a database-backed implementation later.

Why store a list per schema name? Because you want version history, not just the latest. When the diff engine compares V1 to V2, it needs access to both. Storing a list means you can call get_history("TradeSchema") and get every version that’s ever been registered, which is useful for auditing which version introduced a breaking change. Because the metaclass registers each schema under its class name, a history builds up only when the same class name is registered more than once.

Why is there a clear() method? Because the registry is global state, and global state bleeds between tests. If TradeSchema gets registered in one test, it’s still there when the next test runs. clear() exists specifically to be called in a pytest fixture that resets the registry before and after every test. We’ll cover that in detail in Part 4.
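The history and clear() behaviour can be seen with a trimmed copy of the registry. The SchemaVersion dataclass below is an assumed shape, based only on the three arguments the metaclass passes to it, and the versions are registered by hand to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaVersion:            # assumed shape, for illustration
    name: str
    version: str
    fields: dict = field(default_factory=dict)


class SchemaRegistry:           # trimmed copy of the registry above
    _store: dict = {}

    @classmethod
    def register(cls, version):
        cls._store.setdefault(version.name, []).append(version)

    @classmethod
    def get_history(cls, name):
        return cls._store.get(name, [])

    @classmethod
    def clear(cls):
        cls._store.clear()


# Two versions registered under the same name build up a history.
SchemaRegistry.register(SchemaVersion("TradeSchema", "1.0.0", {"price": float}))
SchemaRegistry.register(SchemaVersion("TradeSchema", "2.0.0", {"close_price": float}))

history = [v.version for v in SchemaRegistry.get_history("TradeSchema")]
print(history)   # ['1.0.0', '2.0.0']

# Without clear(), that state would still be there for the next test.
SchemaRegistry.clear()
print(SchemaRegistry.get_history("TradeSchema"))   # []
```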

Here’s the registry in action:

class TradeSchemaV1(ContractBase):
    __version__ = "1.0.0"
    symbol: str
    price: float

class TradeSchemaV2(ContractBase):
    __version__ = "2.0.0"
    symbol: str
    close_price: float   # renamed

# Both are now in the registry — automatically
SchemaRegistry.list_schemas()
# ["TradeSchemaV1", "TradeSchemaV2"]

SchemaRegistry.get_latest("TradeSchemaV1").fields
# {"symbol": str, "price": float}

SchemaRegistry.get_latest("TradeSchemaV2").fields
# {"symbol": str, "close_price": float}

This is the foundation the rest of the framework stands on. The diff engine in Part 3 will pull these two versions out of the registry and compare them. The CLI will look up schemas by name from here. The validation endpoint will fetch the latest version and run incoming rows against it.

Nobody passes schema objects around. Everyone asks the registry.

How it all fits together so far

Let’s trace through exactly what happens from the moment you write a class definition to the moment you validate a row of data:

  1. You write class TradeSchema(ContractBase): ...
  2. Python calls ContractMeta.__new__ immediately, before the next line of your code runs.
  3. The metaclass walks the MRO to collect annotations, calls get_type_hints() to resolve any strings into real types, strips private fields, and registers a SchemaVersion snapshot.
  4. A closure is created over schema_fields and attached to the class as validate().
  5. Later, when data arrives, TradeSchema.validate(row) calls the closure, which passes schema_fields and row to validate_data().
  6. validate_data() uses set arithmetic to find missing fields, extra fields, and type mismatches, then returns a ValidationResult.

The metaclass runs once at definition time. Validation runs at runtime. The registry connects them.

What’s next

You now understand the core engine. In Part 3 we build on top of it — the diff engine that compares two schema versions, the rename heuristic that detects price → close_price as a rename rather than a removal, and the notification bus that alerts downstream consumers when a breaking change lands.

git clone https://github.com/devminda/data-contracts
cd data-contracts
pip install -e ".[dev]"
python examples/trade_pipeline.py

Questions about the MRO walk or the closure behaviour? Drop a comment below. The most common follow-up I get is “why not just use Pydantic?” — short answer: Pydantic validates instances, not schema evolution between versions. That’s the topic for Part 3.
