
Minimal-Memory RAG with OpenAI-Powered Memory (Production Pattern)
- Mark Kendall
- Feb 10
- 4 min read
Most teams make the same mistake with RAG:
They treat memory as “more context”.
That’s backwards.
In production systems, memory is compression, not accumulation. This article shows a battle-tested pattern where OpenAI is used only to compress memory, while the system stays deterministic, cheap, and Kubernetes-safe.
⸻
The Core Idea
OpenAI does not store memory.
OpenAI compresses memory.
All real memory lives outside the LLM.
What this design guarantees
• No growing context windows
• No replaying chat history
• Hard token caps
• Stateless pods
• Horizontal scalability
• Auditable behavior
⸻
Architecture Overview
User Request
↓
Load Conversation Summary (external store)
↓
OpenAI → Update Summary (≤ 3 sentences)
↓
Persist Summary
↓
Fusion Retrieval (vector + keyword)
↓
Top 2–4 chunks only
↓
Answer
This is Fusion RAG + Lite Conversational RAG + Light CRAG, optimized for limited memory.
⸻
Folder Structure
rag-service/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── settings.py
│   ├── models.py
│   ├── rag/
│   │   ├── pipeline.py
│   │   ├── retrieval.py
│   │   ├── rerank.py
│   │   └── crag.py
│   ├── memory/
│   │   ├── store.py
│   │   └── service.py
│   └── llm/
│       └── openai_client.py
├── requirements.txt
└── Dockerfile
⸻
OpenAI Memory Compression (The Important Part)
app/llm/openai_client.py
from openai import OpenAI
client = OpenAI()
def compress_memory(previous_summary: str, user_input: str) -> str:
    prompt = f"""
You are a memory compression system.

Existing summary:
{previous_summary or "None"}

New user input:
{user_input}

Update the summary so it captures:
- The user's goal
- Key systems or entities mentioned
- Constraints or preferences

Rules:
- Max 3 sentences
- No speculation
- No filler
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=120,
    )
    return response.choices[0].message.content.strip()
Why this model
• Cheap
• Near-deterministic at temperature 0
• Excellent summarizer
• Perfect for memory compression
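One production detail the client above leaves out: if the OpenAI call fails mid-request, the session should not lose its summary. A sketch of a fallback wrapper (a hypothetical helper, not part of the article's code; it takes the compression function as a parameter so it can be tested without network access):

```python
# Hypothetical wrapper: degrade gracefully when the compression call fails,
# instead of dropping the session's memory for this request.
def safe_compress(previous_summary: str, user_input: str, compress) -> str:
    """Call a compression function; on any error, keep the old summary."""
    try:
        return compress(previous_summary, user_input)
    except Exception:
        # Stale memory beats no memory: reuse the last good summary.
        return previous_summary
```

In service.py this would wrap compress_memory, so a transient API outage leaves the summary stale rather than empty.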
⸻
External Memory Store (Redis example)
app/memory/store.py
import redis
redis_client = redis.Redis(
    host="redis",
    port=6379,
    decode_responses=True,
)

def load_summary(session_id: str) -> str:
    return redis_client.get(session_id) or ""

def save_summary(session_id: str, summary: str):
    redis_client.set(session_id, summary)
Memory is:
• Small (≤ 512 chars)
• Persistent
• Independent of pods
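The ≤ 512-char bound is a promise the store above does not yet enforce. A sketch of a clamp helper plus a TTL-backed save (the helper and the retention window are assumptions, not in the article's code; `setex` is redis-py's set-with-expiry):

```python
MAX_SUMMARY_CHARS = 512               # the hard cap promised above
SUMMARY_TTL_SECONDS = 7 * 24 * 3600   # example retention window (assumption)

def clamp_summary(summary: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Truncate at the last sentence boundary that fits under the cap."""
    if len(summary) <= limit:
        return summary
    cut = summary[:limit]
    dot = cut.rfind(". ")
    return cut[: dot + 1] if dot > 0 else cut

# With the redis_client from store.py, a TTL-backed save could look like:
# redis_client.setex(session_id, SUMMARY_TTL_SECONDS, clamp_summary(summary))
```

The TTL means abandoned sessions evict themselves, keeping the store small without a cleanup job.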
⸻
Memory Update Service
app/memory/service.py
from app.memory.store import load_summary, save_summary
from app.llm.openai_client import compress_memory
def update_memory(session_id: str, user_input: str) -> str:
    previous = load_summary(session_id)
    updated = compress_memory(previous, user_input)
    save_summary(session_id, updated)
    return updated
This runs once per request, not in a loop.
⸻
Fusion Retrieval (Minimal Example)
app/rag/retrieval.py
def vector_search(query: str) -> list[dict]:
    return [{"text": "API Gateway timeout is 29 seconds.", "score": 0.82}]

def keyword_search(query: str) -> list[dict]:
    return [{"text": "AWS API Gateway default timeout is 29 seconds.", "score": 0.76}]

def fusion_retrieve(query: str) -> list[dict]:
    return vector_search(query) + keyword_search(query)
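Plain concatenation can surface the same fact twice and compares scores from two different scales. A common alternative for this step is Reciprocal Rank Fusion, which merges on rank instead of raw score; a minimal sketch (not the article's code), deduplicating on chunk text:

```python
def rrf_fuse(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per chunk."""
    fused: dict[str, dict] = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            entry = fused.setdefault(
                chunk["text"], {"text": chunk["text"], "score": 0.0}
            )
            entry["score"] += 1.0 / (k + rank)  # standard RRF contribution
    return sorted(fused.values(), key=lambda c: c["score"], reverse=True)
```

fusion_retrieve could then return rrf_fuse([vector_search(query), keyword_search(query)]), so a chunk found by both retrievers outranks one found by either alone.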
⸻
Re-ranking and CRAG Guardrail
app/rag/rerank.py
def rerank(chunks: list[dict]) -> list[dict]:
    return sorted(chunks, key=lambda c: c["score"], reverse=True)
app/rag/crag.py
CONFIDENCE_THRESHOLD = 0.35

def is_confident(chunks: list[dict]) -> bool:
    if not chunks:
        return False
    avg = sum(c["score"] for c in chunks) / len(chunks)
    return avg >= CONFIDENCE_THRESHOLD
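Plugging the sample chunks from the retrieval stubs into this guardrail shows why they pass:

```python
chunks = [
    {"text": "API Gateway timeout is 29 seconds.", "score": 0.82},
    {"text": "AWS API Gateway default timeout is 29 seconds.", "score": 0.76},
]
avg = sum(c["score"] for c in chunks) / len(chunks)  # (0.82 + 0.76) / 2 = 0.79
assert avg >= 0.35  # well above the threshold, so the pipeline answers
```

An empty result list short-circuits to False, which is what triggers the "low confidence" refusal in the pipeline below.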
⸻
RAG Pipeline with OpenAI Memory
app/rag/pipeline.py
from app.memory.service import update_memory
from app.rag.retrieval import fusion_retrieve
from app.rag.rerank import rerank
from app.rag.crag import is_confident
MAX_CHUNKS = 4
def run_rag(query: str, session_id: str | None):
    memory = ""
    if session_id:
        memory = update_memory(session_id, query)

    chunks = rerank(fusion_retrieve(query))[:MAX_CHUNKS]

    if not is_confident(chunks):
        return {
            "answer": "I don’t have enough information. Can you clarify?",
            "confidence": "low",
            "memory": memory,
        }

    context = "\n".join(c["text"] for c in chunks)
    return {
        "answer": context,
        "confidence": "high",
        "memory": memory,
    }
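MAX_CHUNKS caps how many chunks are joined, but not their size, so the "hard token caps" guarantee needs one more guard. A sketch of a budget-aware context builder (a hypothetical helper, not in the repo layout above; character counts stand in for real token counts):

```python
MAX_CONTEXT_CHARS = 2000  # rough stand-in for a token budget (assumption)

def build_context(chunks: list[dict], budget: int = MAX_CONTEXT_CHARS) -> str:
    """Join chunk texts best-first, stopping before the budget is exceeded."""
    parts: list[str] = []
    used = 0
    for c in chunks:
        cost = len(c["text"]) + 1  # +1 for the joining newline
        if used + cost > budget:
            break
        parts.append(c["text"])
        used += cost
    return "\n".join(parts)
```

Because chunks arrive already reranked, truncation always drops the lowest-scoring material first.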
⸻
API Layer
app/models.py
from pydantic import BaseModel

class RAGRequest(BaseModel):
    query: str
    session_id: str | None = None

class RAGResponse(BaseModel):
    answer: str
    confidence: str
    memory: str
app/api.py
from fastapi import APIRouter
from app.models import RAGRequest, RAGResponse
from app.rag.pipeline import run_rag

router = APIRouter()

@router.post("/rag", response_model=RAGResponse)
def rag(req: RAGRequest):
    return run_rag(req.query, req.session_id)
⸻
Why This Pattern Works in the Real World
• Memory never grows
• Tokens are predictable
• No hidden agent loops
• No context window roulette
• Easy to observe and debug
• Safe for regulated environments
This is how you build trustworthy RAG, not demo-ware.
⸻
The Principle to Remember
LLMs are terrible memory stores.
They are excellent memory compressors.
Once you internalize that, everything else gets easier.
⸻
If readers want a follow-up, I can:
• Add DynamoDB instead of Redis
• Add TTL + decay rules
• Turn this into a shared platform service
• Zip this as a starter repo
• Add Bedrock instead of OpenAI
This is production-grade RAG, not hype.