
Simpson’s Paradox and LLM Token Efficiency
- Mark Kendall
Why Aggregated Context Hurts Accuracy — and How to Fix It in Microservices
Most engineers have heard of Simpson’s Paradox.
Fewer engineers realize it’s happening every day inside their LLM calls.
And even fewer realize it’s quietly costing them money.
This article isn’t about hype.
It’s about architecture discipline.
What Is Simpson’s Paradox?
Simpson’s Paradox is a statistical phenomenon where:
A trend appears within separate groups of data — but reverses or disappears when those groups are combined.
In other words:
Aggregation changes meaning.
When you merge different contexts, signals distort.
The “whole” can tell a different story than the “parts.”
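To make that concrete, here is a minimal sketch in Python. The numbers are purely illustrative, chosen so the reversal is visible: option A wins inside each group, yet option B wins once the groups are merged.

# Illustrative success counts only -- (successes, trials) per option, per group.
groups = {
    "group_1": {"A": (81, 87), "B": (234, 270)},
    "group_2": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for name, counts in groups.items():
    print(name, f"A={rate(*counts['A']):.0%}", f"B={rate(*counts['B']):.0%}")
# group_1: A=93% B=87%   group_2: A=73% B=69%  -> A looks better in every group

totals = {
    option: tuple(sum(values) for values in zip(*(g[option] for g in groups.values())))
    for option in ("A", "B")
}
print("combined", f"A={rate(*totals['A']):.0%}", f"B={rate(*totals['B']):.0%}")
# combined: A=78% B=83%  -> B looks better once the groups are merged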
That lesson applies directly to LLM usage.
The LLM Version of Simpson’s Paradox
When calling a large language model, we often:
Send entire chat histories
Include logs, failed attempts, brainstorming
Mix planning, building, and reviewing into one thread
Let context grow indefinitely
Over time, the model starts reasoning over:
Summarized history
Mixed intents
Conflicting constraints
Irrelevant tokens
And output quality drops.
Not because the model is bad.
Because we aggregated semantic contexts that shouldn’t have been merged.
Just like Simpson’s Paradox.
Why This Matters Technically
When you call an LLM through an API, you are billed by the token:
Input tokens
Output tokens
If you send 15,000 tokens of prior context with every call, you pay for that every time.
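A quick back-of-the-envelope sketch makes that visible. The price and call volume below are placeholder assumptions, not any provider's actual rates:

# Placeholder numbers -- substitute your provider's real input-token price.
CONTEXT_TOKENS = 15_000      # prior history resent with every call
PRICE_PER_1M = 0.15          # assumed USD per 1M input tokens
CALLS_PER_DAY = 1_000

daily = CONTEXT_TOKENS * CALLS_PER_DAY * PRICE_PER_1M / 1_000_000
print(f"Resent context alone: ${daily:.2f}/day, ${daily * 30:.2f}/month")
# -> Resent context alone: $2.25/day, $67.50/month, before a single useful token is generated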
But cost is only half the issue.
The deeper issue is semantic dilution:
The model cannot distinguish critical constraints from conversational noise.
It may overweight irrelevant information.
It may generalize across contexts that should be separated.
This leads to subtle quality degradation — often mistaken for “the model getting worse.”
In reality, it’s context collapse.
The Solution: Scoped LLM Calls
Instead of treating an LLM like a memory container, treat it like a compute node.
Break workflows into structured phases:
Planning
Implementation
Review
Testing
Each phase receives only the necessary information.
Nothing more.
This:
Reduces token usage
Improves determinism
Prevents semantic mixing
Lowers cost at scale
It is separation of concerns applied to AI.
A Practical Python Microservice Example
Below is a simplified FastAPI microservice that demonstrates scoped LLM calls.
Each stage sends only what it needs — no accumulated history.
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()

# Initialize client (replace with your provider if needed)
client = OpenAI(api_key="YOUR_API_KEY")

PROMPTS = {
    "plan": """You are a planning agent.
Goal:
{goal}
Provide a structured plan with clear steps.
""",
    "implement": """You are an implementation agent.
Here is the approved plan:
{plan}
Produce the implementation based strictly on this plan.
""",
    "review": """You are a reviewer.
Here is the produced implementation:
{implementation}
Evaluate correctness, risks, and improvements.
""",
}

def llm_call(prompt: str, model="gpt-4o-mini", max_tokens=800):
    """
    Stateless LLM call.
    Only the prompt for this specific phase is sent.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Be precise and concise."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

@app.post("/execute-task")
def execute_task(goal: str):
    # Phase 1: Planning
    plan_prompt = PROMPTS["plan"].format(goal=goal)
    plan_output = llm_call(plan_prompt)

    # Phase 2: Implementation
    implement_prompt = PROMPTS["implement"].format(plan=plan_output)
    implementation_output = llm_call(implement_prompt)

    # Phase 3: Review
    review_prompt = PROMPTS["review"].format(implementation=implementation_output)
    review_output = llm_call(review_prompt)

    return {
        "goal": goal,
        "plan": plan_output,
        "implementation": implementation_output,
        "review": review_output,
    }
Why This Saves Tokens
Each call:
Sends only a single phase’s input.
Does not resend earlier conversational clutter.
Avoids accumulating irrelevant tokens.
If your planning phase needs 400 tokens and your implementation phase needs 1,000, each call pays only for its own input; you are not re-sending the entire history every time.
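As a rough sketch of the difference, compare scoped calls with a chat-style flow that resends everything. The per-phase token counts below are illustrative assumptions (the review figure is invented for the example):

# Illustrative per-phase input sizes -- real numbers depend on your prompts.
phase_inputs = {"plan": 400, "implement": 1_000, "review": 900}

# Scoped: each phase sends only its own input.
scoped_total = sum(phase_inputs.values())

# Accumulated: each phase resends everything sent before it.
accumulated_total, history = 0, 0
for tokens in phase_inputs.values():
    accumulated_total += history + tokens
    history += tokens

print(f"scoped input tokens:      {scoped_total}")       # 2300
print(f"accumulated input tokens: {accumulated_total}")  # 4100
# And this ignores the model outputs that a chat thread also resends,
# which widens the gap further.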
At scale, that becomes real cost savings.
More importantly, it improves clarity.
Where This Matters Most
This approach becomes critical when:
Running agents inside CI pipelines
Automating refactoring
Orchestrating multiple AI microservices
Operating at enterprise scale
Calling models thousands of times per day
It is less important for casual chat use.
But it is essential for production systems.
The Architectural Lesson
Simpson’s Paradox teaches:
Aggregation can distort truth.
In LLM systems:
Aggregated context can distort reasoning.
The fix is not a bigger context window.
The fix is disciplined context boundaries.
Final Thought
The next generation of AI architecture will not be defined by:
Larger prompts
Longer chat threads
“Smarter” magic sessions
It will be defined by:
Scoped reasoning
Separation of concerns
Token discipline
Deterministic orchestration
Just like good engineering always has been.
If nothing else, remember this:
Don’t let aggregated context fool your model.
Keep it scoped.
