Introduction
Most beginner Python projects start simple. You write one script. Then you add another function. Then another. Before long, your code becomes a large file full of hardcoded logic.
This is fine at the start, but it becomes a problem when you want your system to grow. For example, imagine we are building a data quality tool. We want to check a dataset for:
- missing values
- duplicate rows
- negative values
- outliers
- invalid dates
- custom business rules
The beginner approach would be to write everything in one script:
check_missing_values(df)
check_duplicates(df)
check_negative_values(df)
At first, this works perfectly well. The problem only starts once the project begins to grow. New rules are added, different datasets need different checks, and suddenly the codebase becomes a collection of disconnected functions scattered across multiple files.
The deeper issue is not validation itself. The real issue is architecture.
How do we design systems that can grow without rewriting the core engine every time we need new functionality?
This question led me to explore concepts such as:
- abstract base classes
- runtime introspection
- AST parsing
- dynamic code loading
- plugin architectures
To experiment with these ideas, I built a small project called:
plugin-data-quality-lab
The goal of the project was not to create the most advanced validation framework. Instead, the focus was understanding how extensible systems are designed internally.
Understanding the Core Problem
Imagine an airport security system. Every passenger goes through the same pipeline:
Passenger → Security Checks → Result
But internally, the checks themselves are completely different:
- passport verification
- baggage scanning
- visa checks
- metal detection
Each check has different logic, but from the airport’s perspective every check behaves similarly:
- Receive input
- Validate something
- Return a result
This is exactly the idea behind plugin architectures.
Instead of hardcoding every validation rule into the engine itself, we can think of each rule as a plugin that follows a common interface.
The engine does not need to know how the plugin works internally. It only needs to know:
“Every plugin can validate data.”
This is where abstract base classes become useful.
Abstract Base Classes and Why They Matter
In Python, an abstract base class lets us define a contract that every plugin must follow. In my project, every validation rule extends a common Rule class:
from abc import ABC, abstractmethod
class Rule(ABC):
    @abstractmethod
    def validate(self, data):
        pass
The important part here is the @abstractmethod.
This forces every child class to implement a validate() method. If a developer forgets to implement it, Python raises a TypeError as soon as the incomplete class is instantiated, rather than failing later when validate() is first called. Without this structure, every rule could behave differently:
- one rule might return a dataframe
- another might return a boolean
- another might print directly to console
That becomes chaos very quickly. By enforcing a common interface, the engine can treat every rule identically. For example:
issues = rule.validate(df)
The engine does not care whether the rule checks missing values, duplicates, or outliers. It simply knows:
“Every rule has a validate() method.”
This is one of the biggest ideas behind extensible software design:
build around contracts, not implementations.
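To see the contract being enforced, here is a minimal sketch. IncompleteRule and CompleteRule are hypothetical classes made up for this illustration, and plain lists stand in for DataFrames. Note that the error appears when the incomplete class is instantiated, not when it is defined.

```python
from abc import ABC, abstractmethod


class Rule(ABC):
    @abstractmethod
    def validate(self, data):
        """Return the items that fail this rule."""


class IncompleteRule(Rule):
    # Hypothetical rule that forgets to implement validate().
    pass


class CompleteRule(Rule):
    # Hypothetical rule that honours the contract.
    def validate(self, data):
        return [item for item in data if item is None]


def try_instantiate(rule_class):
    """Return "ok" if the class can be instantiated, else the error text."""
    try:
        rule_class()
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


incomplete_result = try_instantiate(IncompleteRule)
complete_result = try_instantiate(CompleteRule)
print(incomplete_result)
print(complete_result)
```

Defining IncompleteRule succeeds; only creating an instance fails, which is still early enough for the engine to reject a broken plugin before any data is processed.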
Building Rules as Plugins
Once the base class exists, individual rules become very simple. For example:
class MissingValueRule(Rule):
    def validate(self, data):
        failed_rows = data[data.isnull().any(axis=1)].copy()
        failed_rows["rule_name"] = "MissingValueRule"
        failed_rows["issue"] = "Missing value detected"
        return failed_rows
This rule only focuses on one responsibility: detecting missing values.
The engine itself does not contain missing value logic anymore. That logic lives inside the plugin.
This separation becomes extremely powerful because new rules can now be added independently without changing the engine itself.
The Rule Runner
Once every rule follows the same interface, we can build a generic execution engine.
import pandas as pd
class RuleRunner:
    def __init__(self, rules):
        self.rules = rules
    def run(self, data):
        all_issues = []
        for rule in self.rules:
            issues = rule.validate(data)
            if not issues.empty:
                all_issues.append(issues)
        # pd.concat raises ValueError on an empty list, so handle the clean case.
        if not all_issues:
            return pd.DataFrame()
        return pd.concat(all_issues)
Notice how the RuleRunner has no idea what each rule actually does.
It does not know:
- how missing values are detected
- how duplicates are detected
- how negative values are detected
It only knows:
“Every plugin follows the Rule contract.”
This is the heart of plugin architecture.
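The same idea can be sketched without pandas to keep the example self-contained. This is an illustration under stated assumptions rather than the project's code: rows are plain dictionaries instead of a DataFrame, and NegativeValueRule and DuplicateRowRule are hypothetical rules invented here. The point is that the runner never changes when a rule is added.

```python
from abc import ABC, abstractmethod


class Rule(ABC):
    @abstractmethod
    def validate(self, rows):
        """Return a list of issue dicts for rows that fail the rule."""


class NegativeValueRule(Rule):
    # Hypothetical rule: flags rows whose "amount" is below zero.
    def validate(self, rows):
        return [
            {"row": i, "rule_name": "NegativeValueRule", "issue": "Negative amount"}
            for i, row in enumerate(rows)
            if row["amount"] < 0
        ]


class DuplicateRowRule(Rule):
    # Hypothetical rule: flags rows identical to an earlier row.
    def validate(self, rows):
        seen, issues = set(), []
        for i, row in enumerate(rows):
            key = tuple(sorted(row.items()))
            if key in seen:
                issues.append(
                    {"row": i, "rule_name": "DuplicateRowRule", "issue": "Duplicate row"}
                )
            seen.add(key)
        return issues


class RuleRunner:
    def __init__(self, rules):
        self.rules = rules

    def run(self, rows):
        all_issues = []
        for rule in self.rules:
            # The same call for every rule: the runner only relies on the contract.
            all_issues.extend(rule.validate(rows))
        return all_issues


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": 1, "amount": 10}]
issues = RuleRunner([NegativeValueRule(), DuplicateRowRule()]).run(rows)
for issue in issues:
    print(issue)
```

Adding a third rule means writing one more class and passing an instance to RuleRunner; the run method stays untouched.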
Runtime Introspection with inspect
At this point, the system works, but another interesting problem appears. Suppose a frontend or API wants to dynamically display the parameters required by each rule.
For example:
class MissingValueRule(Rule):
    def __init__(self, columns=None):
        ...
How does the frontend know this rule accepts a columns parameter? One option would be to manually document every parameter for every rule, but that becomes difficult to maintain. This is where Python’s inspect module becomes incredibly useful.
The inspect module allows Python to examine itself at runtime.
A good mental model is:
inspect is Python looking into a mirror.
For example:
import inspect
sig = inspect.signature(rule_class.__init__)
This lets us dynamically discover:
- parameter names
- default values
- type hints
without hardcoding anything.
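As a small illustration of what a signature exposes, the sketch below inspects the constructor of OutlierRule, a hypothetical class invented for this example. Each parameter carries its name, its annotation, and its default.

```python
import inspect


class OutlierRule:
    # Hypothetical rule used only to have a constructor to inspect.
    def __init__(self, column: str, z_threshold: float = 3.0, columns=None):
        self.column = column
        self.z_threshold = z_threshold
        self.columns = columns


sig = inspect.signature(OutlierRule.__init__)

described = {}
for name, param in sig.parameters.items():
    if name == "self":
        continue
    # annotation and default are inspect.Parameter.empty when not provided
    described[name] = (param.annotation, param.default)
    print(name, param.annotation, param.default)
```

Parameters without an annotation or without a default report inspect.Parameter.empty in that slot, which is how missing information is distinguished from a real value such as None.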
In my project, this allows the engine to automatically generate parameter schemas for plugins. For example, the system can inspect:
class MissingValueRule:
    def __init__(self, columns=None):
        ...
which has a single parameter called columns. The following function extracts the constructor parameters of a rule class such as MissingValueRule:
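import inspect
@staticmethod
def introspect_params(rule_class: type[Rule]) -> list[dict]:
    sig = inspect.signature(rule_class.__init__)
    schema = []
    for name, param in sig.parameters.items():
        if name == "self":
            continue
        # Parameters without a default are reported with default=None.
        default = (
            param.default
            if param.default is not inspect.Parameter.empty
            else None
        )
        # Map the annotation to a simple type name the frontend understands.
        annotation = param.annotation
        if annotation is int:
            type_str = "int"
        elif annotation is float:
            type_str = "float"
        elif annotation is str:
            type_str = "str"
        elif annotation is bool:
            type_str = "bool"
        elif "list" in str(annotation):
            type_str = "list"
        else:
            # Unannotated or unrecognised parameters fall back to "str".
            type_str = "str"
        schema.append({
            "name": name,
            "type": type_str,
            "default": default,
        })
    return schema
Note that the type comes from the annotation: for the unannotated columns=None shown above, the final else branch reports "str". Given a constructor annotated as columns: list[str] | None = None, the function dynamically produces: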
[
    {
        "name": "columns",
        "type": "list",
        "default": None
    }
]
This is a very powerful idea because it means the engine can adapt to new plugins automatically. The plugin describes itself.
A frontend could automatically generate input fields based on the rule’s constructor.
One thing worth understanding here is inspect.Parameter.empty. This is not None — it is a special sentinel object Python uses specifically to mean “this parameter has no default value defined.” The check param.default is not inspect.Parameter.empty is the correct way to test whether a default exists. If you checked param.default is not None instead, you would get the wrong answer for parameters whose default actually is None.
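The difference between the two checks is easy to demonstrate. In this sketch, configure is a hypothetical function with one required parameter, one parameter defaulting to None, and one with an ordinary default.

```python
import inspect


def configure(required, optional=None, limit=10):
    # Hypothetical function: one required and two defaulted parameters.
    return required, optional, limit


def has_default(param):
    # Correct check: compare against the sentinel.
    return param.default is not inspect.Parameter.empty


def naive_has_default(param):
    # Buggy check: confuses "no default" with "default is None".
    return param.default is not None


params = inspect.signature(configure).parameters
correct = {name: has_default(p) for name, p in params.items()}
naive = {name: naive_has_default(p) for name, p in params.items()}
print(correct)
print(naive)
```

The naive check is wrong twice here: it reports that required has a default (because the sentinel is not None) and that optional has none.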
Understanding AST Parsing
The next problem is even more interesting. Suppose users want to submit their own custom validation rules dynamically.
For example:
class HighSalaryRule(Rule):
    def __init__(self, threshold=100000):
        super().__init__(name="HighSalaryRule")
        self.threshold = threshold
    def validate(self, data):
        self.validate_input(data)
        df = data.copy()
        failed_rows = df[df["salary"] > self.threshold].copy()
        failed_rows["rule_name"] = self.name
        failed_rows["issue"] = "Salary above threshold"
        return failed_rows
Before executing this code, we need to validate it.
For example, we need to answer questions such as:
- Is this valid Python?
- Does it extend Rule?
- Does it implement validate()?
- Is it trying to import dangerous modules?
This is where AST becomes useful.
AST stands for: Abstract Syntax Tree
When Python reads code, it first converts the code into a tree-like structure internally. Instead of immediately executing the code, Python first understands its structure. A useful mental model is:
AST is an X-ray of your code.
You are examining the structure of the code before running it. For example:
tree = ast.parse(code)
This converts raw Python code into a navigable tree structure. Then we can walk through that tree:
ast.walk(tree)
and inspect:
- class definitions
- imports
- function definitions
- inheritance relationships
This allows us to validate plugins safely before execution.
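A short sketch of what walking the tree looks like in practice. The source string is a made-up example; nothing in it is executed, so it does not matter that Rule is undefined here.

```python
import ast

source = '''
import os
from json import loads

class HighSalaryRule(Rule):
    def validate(self, data):
        return data
'''

tree = ast.parse(source)  # parses only; nothing in `source` is executed

found = {"classes": [], "functions": [], "imports": []}
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        # Base classes written as plain names, e.g. Rule
        bases = [b.id for b in node.bases if isinstance(b, ast.Name)]
        found["classes"].append((node.name, bases))
    elif isinstance(node, ast.FunctionDef):
        found["functions"].append(node.name)
    elif isinstance(node, ast.Import):
        found["imports"].extend(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom):
        found["imports"].append(node.module)

print(found)
```

Every fact collected here (class name, base classes, method names, imported modules) comes from the structure of the text alone.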
Using AST for Validation
In the project, AST is used to enforce plugin rules. For example, the engine checks:
- whether the submitted class extends Rule
- whether it contains a validate() method
- whether it adds required columns like rule_name
- whether it attempts unsafe imports
These are the steps we follow.
Step 1: ast.parse(code) converts the string into a tree. If the syntax is broken, it raises a SyntaxError here, before anything runs.
tree = ast.parse(code)
Step 2: ast.walk(tree) visits every node in the tree. We filter for ast.ClassDef nodes whose bases contain a Name with id == "Rule" (or an Attribute ending in .Rule). This is how you check inheritance without executing the code.
ast.walk(tree)
Step 3: We narrow the walk to just the class node we found and look for a FunctionDef named validate. If it exists, that part of the contract is met.
Step 4: String checks for "rule_name" and "issue" confirm the output contract, since the method must produce the columns the engine expects. These are substring checks on the source text rather than AST checks, so they are a heuristic: a quoted "issue" inside a comment would also pass.
Together, these steps ensure that the submitted code follows our plugin contract.
import ast
@staticmethod
def validate_rule_code(code: str) -> tuple[bool, str | None]:
    # Step 1: parse without executing; broken syntax is rejected here.
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return False, f"Syntax error: {e}"
    # Step 2: find classes whose bases include Rule or something.Rule.
    rule_classes = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
        and any(
            (isinstance(base, ast.Name) and base.id == "Rule") or
            (isinstance(base, ast.Attribute) and base.attr == "Rule")
            for base in node.bases
        )
    ]
    if not rule_classes:
        return False, "Class must extend Rule."
    # Step 3: look for a validate method inside the first matching class.
    class_node = rule_classes[0]
    method_names = [
        node.name for node in ast.walk(class_node)
        if isinstance(node, ast.FunctionDef)
    ]
    if "validate" not in method_names:
        return False, "Missing required method: validate(self, data)"
    # Step 4: substring checks on the raw source for the output contract.
    if '"rule_name"' not in code and "'rule_name'" not in code:
        return False, 'validate() must add a "rule_name" column.'
    if '"issue"' not in code and "'issue'" not in code:
        return False, 'validate() must add an "issue" column.'
    if "return" not in code:
        return False, "validate() must return a DataFrame."
    return True, None
This creates a validation pipeline before execution even happens. Without AST, we would be blindly executing user-submitted code. With AST, we can inspect the structure first.
This idea appears everywhere in modern software systems such as:
- compilers
- linters
- static analysis tools
- code formatters
- IDEs
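The validator shown above does not include the unsafe-import check mentioned earlier. One way it could be sketched is below. This is an assumption about how such a check might look, not the project's implementation: find_disallowed_imports and the ALLOWED_MODULES allow-list are hypothetical names. A static check like this only sees literal import statements, so it would not catch tricks such as calling __import__ directly.

```python
import ast

ALLOWED_MODULES = {"math", "re"}  # hypothetical allow-list for this sketch


def find_disallowed_imports(code, allowed=ALLOWED_MODULES):
    """Return sorted top-level module names imported outside the allow-list."""
    tree = ast.parse(code)
    disallowed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are rejected outright in this sketch.
            names = [node.module] if node.level == 0 else ["<relative import>"]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]  # "os.path" -> "os"
            if top_level not in allowed:
                disallowed.add(top_level)
    return sorted(disallowed)


sample = "import os.path\nimport math\nfrom subprocess import run\nfrom re import compile\n"
result = find_disallowed_imports(sample)
print(result)
```

An allow-list is used rather than a block-list because it fails closed: any module nobody thought about is rejected by default.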
Dynamic Loading with exec
Once the AST validation passes, we know the code is structurally correct. Now we need to actually run it and get a usable Python class out of it.
The challenge is that the code exists as a plain string, not a file on disk, not an imported module. Python’s normal import system can’t help here. We need a different approach.
This is where exec and types.ModuleType come in.
First, what is a namespace?
A namespace is simply a dictionary that maps names to objects. When Python runs code, every variable, class, and function you define gets stored in a namespace. For example, when you write:
x = 10
class MyRule(Rule):
    ...
Python stores x and MyRule as keys in a dictionary behind the scenes. That dictionary is the namespace. When code runs, Python looks up names in that dictionary. If a name isn’t there (or among the built-ins), you get a NameError.
This is the key insight: if you control the dictionary, you control what names the code can see.
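A minimal sketch of that insight, using exec with a plain dictionary as the namespace. The variable names are made up for the illustration.

```python
# The code string refers to a name it never defines itself.
snippet = "doubled = threshold * 2"

# Case 1: we seed the namespace with the name the code needs.
namespace = {"threshold": 100}
exec(snippet, namespace)
print(namespace["doubled"])

# Case 2: the name is missing, so the lookup fails.
empty_namespace = {}
try:
    exec(snippet, empty_namespace)
    outcome = "ran"
except NameError as exc:
    outcome = f"NameError: {exc}"
print(outcome)
```

The code string is identical in both cases; only the dictionary handed to exec decides whether it can run.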
What types.ModuleType does
Normally, a Python module is a file. When you do import pandas, Python finds pandas.py (or a package folder), runs it, and stores the result as a module object.
types.ModuleType lets you create that same module object without a file. It is a blank Python module that exists only in memory, with its own namespace that is separate from your application’s globals.
module = types.ModuleType("user_rule")
This creates an empty module named "user_rule". Its namespace, module.__dict__, holds only a few standard module attributes such as __name__ and __doc__ at this point; none of your application’s names are in it.
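It can help to look at a freshly created module directly. The sketch below lists what its namespace holds and shows that adding a key to module.__dict__ is the same as setting an attribute on the module. The name answer is just a placeholder.

```python
import types

module = types.ModuleType("user_rule")

# Names present right after creation (standard module attributes only).
initial_names = sorted(module.__dict__)
print(initial_names)

# The module's attributes and its __dict__ are two views of the same data.
module.__dict__["answer"] = 42
print(module.answer)
```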
Seeding the namespace
The user’s code contains class HighSalaryRule(Rule). For that to work, the name Rule must exist in the namespace when the code runs. Same for pd (pandas) and np (numpy).
So before running anything, we place exactly those three objects into the module’s namespace:
module.__dict__["Rule"] = Rule
module.__dict__["pd"] = pd
module.__dict__["np"] = np
Now the namespace contains those three names. The user’s code can reference Rule, pd, and np freely, not because they imported them, but because we put them there. No other names from our application are predefined. Keep in mind that this is not a security sandbox: exec adds Python’s built-ins to the namespace, so the code can still run its own import statements unless the earlier validation rejects them.
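One caveat is worth checking for yourself: seeding a namespace controls which names are predefined, but exec also inserts a __builtins__ entry, so the executed code can still import modules. The sketch below shows this, and then shows that an empty __builtins__ makes import fail. That second step is only an illustration of the mechanism and should not be read as a complete sandbox.

```python
import types

module = types.ModuleType("user_rule")
keys_before = set(module.__dict__)

# The executed code imports a module we never placed in the namespace.
exec("import os\ncwd = os.getcwd()", module.__dict__)
added = set(module.__dict__) - keys_before
print(sorted(added))

# With an explicit, empty __builtins__, the import machinery is unavailable.
restricted = {"__builtins__": {}}
try:
    exec("import os", restricted)
    import_blocked = False
except ImportError:
    import_blocked = True
print(import_blocked)
```

This is why structural validation before exec matters: the namespace alone does not stop submitted code from reaching the rest of the system.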
Running the code
exec(textwrap.dedent(code), module.__dict__)
exec runs the code string as Python, using module.__dict__ as the namespace. After this line executes, the user’s class definition has run and the class now exists as a key inside that dictionary.
textwrap.dedent first strips the indentation that is common to every line of the code string. Without it, code that was pasted from an indented context would raise an IndentationError.
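The effect of dedenting can be checked without executing anything, by parsing the same string before and after. The pasted string and the parses helper are made up for this sketch.

```python
import ast
import textwrap

# Code as it might arrive when pasted from an indented context.
pasted = """
    class MyRule:
        def validate(self, data):
            return data
"""


def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


raw_ok = parses(pasted)
dedented_ok = parses(textwrap.dedent(pasted))
print(raw_ok, dedented_ok)
```

Because an indented submission fails at parse time, any AST validation should see the dedented text as well.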
Pulling the class out
rule_class = module.__dict__.get(class_name)
We retrieve the class by name from the same dictionary. Then we verify it actually extends Rule before returning it:
if not issubclass(rule_class, Rule):
    raise ValueError(f"{class_name} does not extend Rule.")
The full function in context:
import types
import textwrap
import pandas as pd
import numpy as np
@staticmethod
def load_rule_class_from_code(code: str, class_name: str) -> type[Rule]:
    # Dedent first so the AST validation sees the same source that exec runs.
    code = textwrap.dedent(code)
    valid, error = RuleService.validate_rule_code(code)
    if not valid:
        raise ValueError(error)
    # In-memory module seeded with the names user code is expected to use.
    module = types.ModuleType("user_rule")
    module.__dict__["Rule"] = Rule
    module.__dict__["pd"] = pd
    module.__dict__["np"] = np
    exec(code, module.__dict__)
    rule_class = module.__dict__.get(class_name)
    if rule_class is None:
        raise ValueError(f"Class '{class_name}' not found in submitted code.")
    # issubclass raises TypeError for non-classes, so check the type first.
    if not isinstance(rule_class, type) or not issubclass(rule_class, Rule):
        raise ValueError(f"{class_name} does not extend Rule.")
    return rule_class
Why this pattern matters
The alternative would be writing the user’s code to a .py file and importing it. That works, but it requires filesystem access, leaves files behind, and is harder to control. The types.ModuleType + exec approach keeps everything in memory, lets you choose exactly which names are predefined for the code, and leaves nothing behind once the module is garbage collected.
The three steps always follow the same order:
- Validate with AST: check structure before running anything
- Create a seeded namespace: decide which names the code starts with
- Execute and extract: run the code, pull out what you need
By the time exec runs, the code has already passed four structural checks. The namespace contains only what it needs. The class comes out the other side ready to use.
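The three steps can be put together in a compact, pandas-free sketch. It is a simplified stand-in for the project's loader, written under stated assumptions: extends_rule and load_rule are hypothetical helper names, the structural check is reduced to the inheritance test, and rows are plain dictionaries.

```python
import ast
import textwrap
import types
from abc import ABC, abstractmethod


class Rule(ABC):
    @abstractmethod
    def validate(self, rows):
        """Return the rows that fail the rule."""


def extends_rule(code):
    """Structural check only: does any class list Rule as a base?"""
    tree = ast.parse(code)
    return any(
        isinstance(node, ast.ClassDef)
        and any(isinstance(b, ast.Name) and b.id == "Rule" for b in node.bases)
        for node in ast.walk(tree)
    )


def load_rule(code, class_name):
    code = textwrap.dedent(code)
    if not extends_rule(code):                  # 1. validate with AST
        raise ValueError("Class must extend Rule.")
    module = types.ModuleType("user_rule")      # 2. create a seeded namespace
    module.__dict__["Rule"] = Rule
    exec(code, module.__dict__)                 # 3. execute and extract
    rule_class = module.__dict__.get(class_name)
    if not isinstance(rule_class, type) or not issubclass(rule_class, Rule):
        raise ValueError(f"{class_name} is not a Rule subclass.")
    return rule_class


submitted = """
    class HighSalaryRule(Rule):
        def __init__(self, threshold=100000):
            self.threshold = threshold

        def validate(self, rows):
            return [row for row in rows if row["salary"] > self.threshold]
"""

rule_class = load_rule(submitted, "HighSalaryRule")
failed = rule_class(threshold=50000).validate(
    [{"name": "a", "salary": 40000}, {"name": "b", "salary": 90000}]
)
print(failed)
```

The class that comes out is an ordinary Python class: it can be instantiated with parameters and handed to the same runner as the built-in rules.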
A useful analogy is:
creating a temporary room for plugin code.
Mental model:
types.ModuleType creates a temporary plugin room. exec runs the plugin code inside that room.
Then we pull out the class we need.
The Bigger Lesson
Although this project focuses on data validation, the architecture itself is much more general. The exact same ideas appear in:
- workflow engines
- fraud detection systems
- grading systems
- ETL pipelines
- monitoring frameworks
- strategy engines
- ML validation systems
The domain changes, but the architecture remains similar. The deeper lesson is this:
build the engine once, then let plugins extend it.
That shift in mindset changes how you think about software design entirely.