
Simpson’s Paradox and LLM Token Efficiency
- Mark Kendall
Why Aggregated Context Hurts Accuracy — and How to Fix It in Microservices
Most engineers have heard of Simpson’s Paradox.
Fewer engineers realize it’s happening every day inside their LLM calls.
And even fewer realize it’s quietly costing them money.
This article isn’t about hype.
It’s about architecture discipline.
What Is Simpson’s Paradox?
Simpson’s Paradox is a statistical phenomenon where:
A trend appears within separate groups of data — but reverses or disappears when those groups are combined.
In other words:
Aggregation changes meaning.
When you merge different contexts, signals distort.
The “whole” can tell a different story than the “parts.”
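To make that concrete, here is a minimal sketch in Python. The numbers are purely illustrative, chosen so the reversal is visible: option A wins inside each group, yet option B wins once the groups are merged.

# Illustrative success counts only -- (successes, trials) per option, per group.
groups = {
    "group_1": {"A": (81, 87), "B": (234, 270)},
    "group_2": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for name, counts in groups.items():
    print(name, f"A={rate(*counts['A']):.0%}", f"B={rate(*counts['B']):.0%}")
# group_1: A=93% B=87%   group_2: A=73% B=69%  -> A looks better in every group

totals = {
    option: tuple(sum(values) for values in zip(*(g[option] for g in groups.values())))
    for option in ("A", "B")
}
print("combined", f"A={rate(*totals['A']):.0%}", f"B={rate(*totals['B']):.0%}")
# combined: A=78% B=83%  -> B looks better once the groups are merged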
That lesson applies directly to LLM usage.
The LLM Version of Simpson’s Paradox
When calling a large language model, we often:
Send entire chat histories
Include logs, failed attempts, brainstorming
Mix planning, building, and reviewing into one thread
Let context grow indefinitely
Over time, the model starts reasoning over:
Summarized history
Mixed intents
Conflicting constraints
Irrelevant tokens
And output quality drops.
Not because the model is bad.
Because we aggregated semantic contexts that shouldn’t have been merged.
Just like Simpson’s Paradox.
Why This Matters Technically
When you call an LLM through an API, you are billed by the token:
Input tokens
Output tokens
If you send 15,000 tokens of prior context with every call, you pay for that every time.
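A quick back-of-the-envelope sketch makes that visible. The price and call volume below are placeholder assumptions, not any provider's actual rates:

# Placeholder numbers -- substitute your provider's real input-token price.
CONTEXT_TOKENS = 15_000      # prior history resent with every call
PRICE_PER_1M = 0.15          # assumed USD per 1M input tokens
CALLS_PER_DAY = 1_000

daily = CONTEXT_TOKENS * CALLS_PER_DAY * PRICE_PER_1M / 1_000_000
print(f"Resent context alone: ${daily:.2f}/day, ${daily * 30:.2f}/month")
# -> Resent context alone: $2.25/day, $67.50/month, before a single useful token is generated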
But cost is only half the issue.
The deeper issue is semantic dilution:
The model cannot distinguish critical constraints from conversational noise.
It may overweight irrelevant information.
It may generalize across contexts that should be separated.
This leads to subtle quality degradation — often mistaken for “the model getting worse.”
In reality, it’s context collapse.
The Solution: Scoped LLM Calls
Instead of treating an LLM like a memory container, treat it like a compute node.
Break workflows into structured phases:
Planning
Implementation
Review
Testing
Each phase receives only the necessary information.
Nothing more.
This:
Reduces token usage
Improves determinism
Prevents semantic mixing
Lowers cost at scale
It is separation of concerns applied to AI.
A Practical Python Microservice Example
Below is a simplified FastAPI microservice that demonstrates scoped LLM calls.
Each stage sends only what it needs — no accumulated history.
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()

# Initialize client (replace with your provider if needed)
client = OpenAI(api_key="YOUR_API_KEY")

PROMPTS = {
    "plan": """You are a planning agent.
Goal:
{goal}
Provide a structured plan with clear steps.
""",
    "implement": """You are an implementation agent.
Here is the approved plan:
{plan}
Produce the implementation based strictly on this plan.
""",
    "review": """You are a reviewer.
Here is the produced implementation:
{implementation}
Evaluate correctness, risks, and improvements.
""",
}

def llm_call(prompt: str, model="gpt-4o-mini", max_tokens=800):
    """
    Stateless LLM call.
    Only the prompt for this specific phase is sent.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Be precise and concise."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

@app.post("/execute-task")
def execute_task(goal: str):
    # Phase 1: Planning
    plan_prompt = PROMPTS["plan"].format(goal=goal)
    plan_output = llm_call(plan_prompt)

    # Phase 2: Implementation
    implement_prompt = PROMPTS["implement"].format(plan=plan_output)
    implementation_output = llm_call(implement_prompt)

    # Phase 3: Review
    review_prompt = PROMPTS["review"].format(implementation=implementation_output)
    review_output = llm_call(review_prompt)

    return {
        "goal": goal,
        "plan": plan_output,
        "implementation": implementation_output,
        "review": review_output,
    }
Why This Saves Tokens
Each call:
Sends only a single phase’s input.
Does not resend earlier conversational clutter.
Avoids accumulating irrelevant tokens.
If your planning phase needs 400 tokens and your implementation phase needs 1,000, each call pays only for its own input; you are not re-sending the entire history every time.
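As a rough sketch of the difference, compare scoped calls with a chat-style flow that resends everything. The per-phase token counts below are illustrative assumptions (the review figure is invented for the example):

# Illustrative per-phase input sizes -- real numbers depend on your prompts.
phase_inputs = {"plan": 400, "implement": 1_000, "review": 900}

# Scoped: each phase sends only its own input.
scoped_total = sum(phase_inputs.values())

# Accumulated: each phase resends everything sent before it.
accumulated_total, history = 0, 0
for tokens in phase_inputs.values():
    accumulated_total += history + tokens
    history += tokens

print(f"scoped input tokens:      {scoped_total}")       # 2300
print(f"accumulated input tokens: {accumulated_total}")  # 4100
# And this ignores the model outputs that a chat thread also resends,
# which widens the gap further.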
At scale, that becomes real cost savings.
More importantly, it improves clarity.
Where This Matters Most
This approach becomes critical when:
Running agents inside CI pipelines
Automating refactoring
Orchestrating multiple AI microservices
Operating at enterprise scale
Calling models thousands of times per day
It is less important for casual chat use.
But it is essential for production systems.
The Architectural Lesson
Simpson’s Paradox teaches:
Aggregation can distort truth.
In LLM systems:
Aggregated context can distort reasoning.
The fix is not a bigger context window.
The fix is disciplined context boundaries.
Final Thought
The next generation of AI architecture will not be defined by:
Larger prompts
Longer chat threads
“Smarter” magic sessions
It will be defined by:
Scoped reasoning
Separation of concerns
Token discipline
Deterministic orchestration
Just like good engineering always has been.
If nothing else, remember this:
Don’t let aggregated context fool your model.
Keep it scoped.
