
Minimal-Memory RAG with OpenAI-Powered Memory (Production Pattern)
- Mark Kendall
- Feb 10
- 4 min read
Most teams make the same mistake with RAG:
They treat memory as “more context”.
That’s backwards.
In production systems, memory is compression, not accumulation. This article shows a battle-tested pattern where OpenAI is used only to compress memory, while the system stays deterministic, cheap, and Kubernetes-safe.
⸻
The Core Idea
OpenAI does not store memory.
OpenAI compresses memory.
All real memory lives outside the LLM.
What this design guarantees
• No growing context windows
• No replaying chat history
• Hard token caps
• Stateless pods
• Horizontal scalability
• Auditable behavior
⸻
Architecture Overview
User Request
↓
Load Conversation Summary (external store)
↓
OpenAI → Update Summary (≤ 3 sentences)
↓
Persist Summary
↓
Fusion Retrieval (vector + keyword)
↓
Top 2–4 chunks only
↓
Answer
This is Fusion RAG + Lite Conversational RAG + Light CRAG, optimized for limited memory.
⸻
Folder Structure
rag-service/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── settings.py
│   ├── models.py
│   ├── rag/
│   │   ├── pipeline.py
│   │   ├── retrieval.py
│   │   ├── rerank.py
│   │   └── crag.py
│   ├── memory/
│   │   ├── store.py
│   │   └── service.py
│   └── llm/
│       └── openai_client.py
├── requirements.txt
└── Dockerfile
⸻
OpenAI Memory Compression (The Important Part)
app/llm/openai_client.py
from openai import OpenAI
client = OpenAI()
def compress_memory(previous_summary: str, user_input: str) -> str:
    prompt = f"""
You are a memory compression system.

Existing summary:
{previous_summary or "None"}

New user input:
{user_input}

Update the summary so it captures:
- The user's goal
- Key systems or entities mentioned
- Constraints or preferences

Rules:
- Max 3 sentences
- No speculation
- No filler
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=120,
    )
    return response.choices[0].message.content.strip()
Why this model
• Cheap
• Near-deterministic at temperature 0
• Excellent summarizer
• Perfect for memory compression
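One production detail the client above leaves out: if the OpenAI call fails mid-request, the session should not lose its summary. A sketch of a fallback wrapper (a hypothetical helper, not part of the article's code; it takes the compression function as a parameter so it can be tested without network access):

```python
# Hypothetical wrapper: degrade gracefully when the compression call fails,
# instead of dropping the session's memory for this request.
def safe_compress(previous_summary: str, user_input: str, compress) -> str:
    """Call a compression function; on any error, keep the old summary."""
    try:
        return compress(previous_summary, user_input)
    except Exception:
        # Stale memory beats no memory: reuse the last good summary.
        return previous_summary
```

In service.py this would wrap compress_memory, so a transient API outage leaves the summary stale rather than empty.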
⸻
External Memory Store (Redis example)
app/memory/store.py
import redis
redis_client = redis.Redis(
    host="redis",
    port=6379,
    decode_responses=True,
)

def load_summary(session_id: str) -> str:
    return redis_client.get(session_id) or ""

def save_summary(session_id: str, summary: str):
    redis_client.set(session_id, summary)
Memory is:
• Small (≤ 512 chars)
• Persistent
• Independent of pods
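The ≤ 512-char bound is a promise the store above does not yet enforce. A sketch of a clamp helper plus a TTL-backed save (the helper and the retention window are assumptions, not in the article's code; `setex` is redis-py's set-with-expiry):

```python
MAX_SUMMARY_CHARS = 512               # the hard cap promised above
SUMMARY_TTL_SECONDS = 7 * 24 * 3600   # example retention window (assumption)

def clamp_summary(summary: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Truncate at the last sentence boundary that fits under the cap."""
    if len(summary) <= limit:
        return summary
    cut = summary[:limit]
    dot = cut.rfind(". ")
    return cut[: dot + 1] if dot > 0 else cut

# With the redis_client from store.py, a TTL-backed save could look like:
# redis_client.setex(session_id, SUMMARY_TTL_SECONDS, clamp_summary(summary))
```

The TTL means abandoned sessions evict themselves, keeping the store small without a cleanup job.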
⸻
Memory Update Service
app/memory/service.py
from app.memory.store import load_summary, save_summary
from app.llm.openai_client import compress_memory
def update_memory(session_id: str, user_input: str) -> str:
    previous = load_summary(session_id)
    updated = compress_memory(previous, user_input)
    save_summary(session_id, updated)
    return updated
This runs once per request, not in a loop.
⸻
Fusion Retrieval (Minimal Example)
app/rag/retrieval.py
def vector_search(query: str) -> list[dict]:
    return [{"text": "API Gateway timeout is 29 seconds.", "score": 0.82}]

def keyword_search(query: str) -> list[dict]:
    return [{"text": "AWS API Gateway default timeout is 29 seconds.", "score": 0.76}]

def fusion_retrieve(query: str) -> list[dict]:
    return vector_search(query) + keyword_search(query)
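Plain concatenation can surface the same fact twice and compares scores from two different scales. A common alternative for this step is Reciprocal Rank Fusion, which merges on rank instead of raw score; a minimal sketch (not the article's code), deduplicating on chunk text:

```python
def rrf_fuse(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per chunk."""
    fused: dict[str, dict] = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            entry = fused.setdefault(
                chunk["text"], {"text": chunk["text"], "score": 0.0}
            )
            entry["score"] += 1.0 / (k + rank)  # standard RRF contribution
    return sorted(fused.values(), key=lambda c: c["score"], reverse=True)
```

fusion_retrieve could then return rrf_fuse([vector_search(query), keyword_search(query)]), so a chunk found by both retrievers outranks one found by either alone.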
⸻
Re-ranking and CRAG Guardrail
app/rag/rerank.py
def rerank(chunks: list[dict]) -> list[dict]:
    return sorted(chunks, key=lambda c: c["score"], reverse=True)
app/rag/crag.py
CONFIDENCE_THRESHOLD = 0.35

def is_confident(chunks: list[dict]) -> bool:
    if not chunks:
        return False
    avg = sum(c["score"] for c in chunks) / len(chunks)
    return avg >= CONFIDENCE_THRESHOLD
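Plugging the sample chunks from the retrieval stubs into this guardrail shows why they pass:

```python
chunks = [
    {"text": "API Gateway timeout is 29 seconds.", "score": 0.82},
    {"text": "AWS API Gateway default timeout is 29 seconds.", "score": 0.76},
]
avg = sum(c["score"] for c in chunks) / len(chunks)  # (0.82 + 0.76) / 2 = 0.79
assert avg >= 0.35  # well above the threshold, so the pipeline answers
```

An empty result list short-circuits to False, which is what triggers the "low confidence" refusal in the pipeline below.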
⸻
RAG Pipeline with OpenAI Memory
app/rag/pipeline.py
from app.memory.service import update_memory
from app.rag.retrieval import fusion_retrieve
from app.rag.rerank import rerank
from app.rag.crag import is_confident
MAX_CHUNKS = 4
def run_rag(query: str, session_id: str | None):
    memory = ""
    if session_id:
        memory = update_memory(session_id, query)

    chunks = rerank(fusion_retrieve(query))[:MAX_CHUNKS]

    if not is_confident(chunks):
        return {
            "answer": "I don’t have enough information. Can you clarify?",
            "confidence": "low",
            "memory": memory,
        }

    context = "\n".join(c["text"] for c in chunks)
    return {
        "answer": context,
        "confidence": "high",
        "memory": memory,
    }
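MAX_CHUNKS caps how many chunks are joined, but not their size, so the "hard token caps" guarantee needs one more guard. A sketch of a budget-aware context builder (a hypothetical helper, not in the repo layout above; character counts stand in for real token counts):

```python
MAX_CONTEXT_CHARS = 2000  # rough stand-in for a token budget (assumption)

def build_context(chunks: list[dict], budget: int = MAX_CONTEXT_CHARS) -> str:
    """Join chunk texts best-first, stopping before the budget is exceeded."""
    parts: list[str] = []
    used = 0
    for c in chunks:
        cost = len(c["text"]) + 1  # +1 for the joining newline
        if used + cost > budget:
            break
        parts.append(c["text"])
        used += cost
    return "\n".join(parts)
```

Because chunks arrive already reranked, truncation always drops the lowest-scoring material first.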
⸻
API Layer
app/models.py
from pydantic import BaseModel

class RAGRequest(BaseModel):
    query: str
    session_id: str | None = None

class RAGResponse(BaseModel):
    answer: str
    confidence: str
    memory: str
app/api.py
from fastapi import APIRouter
from app.models import RAGRequest, RAGResponse
from app.rag.pipeline import run_rag

router = APIRouter()

@router.post("/rag", response_model=RAGResponse)
def rag(req: RAGRequest):
    return run_rag(req.query, req.session_id)
⸻
Why This Pattern Works in the Real World
• Memory never grows
• Tokens are predictable
• No hidden agent loops
• No context window roulette
• Easy to observe and debug
• Safe for regulated environments
This is how you build trustworthy RAG, not demo-ware.
⸻
The Principle to Remember
LLMs are terrible memory stores.
They are excellent memory compressors.
Once you internalize that, everything else gets easier.
⸻
If readers want a follow-up, I can:
• Add DynamoDB instead of Redis
• Add TTL + decay rules
• Turn this into a shared platform service
• Zip this as a starter repo
• Add Bedrock instead of OpenAI
This is production-grade RAG, not hype.