
Minimal-Memory RAG with OpenAI-Powered Memory (Production Pattern)

  • Writer: Mark Kendall
  • Feb 10
  • 4 min read


Most teams make the same mistake with RAG:


They treat memory as “more context”.


That’s backwards.


In production systems, memory is compression, not accumulation. This article shows a battle-tested pattern where OpenAI is used only to compress memory, while the system stays deterministic, cheap, and Kubernetes-safe.
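To see what "compression, not accumulation" buys you, compare a fixed per-request budget against replayed chat history. Every number below is an illustrative assumption, not a measurement:

```python
# Illustrative token budget; all figures are assumptions for the comparison.
SUMMARY_TOKENS = 120   # hard cap, enforced via max_tokens on the compressor
CHUNK_TOKENS = 200     # rough size of one retrieved chunk
MAX_CHUNKS = 4
QUERY_TOKENS = 80

# Compressed-memory design: the context cost is constant per request.
per_request = SUMMARY_TOKENS + MAX_CHUNKS * CHUNK_TOKENS + QUERY_TOKENS
print(per_request)  # 1000, no matter how long the conversation runs

def replay_cost(turns: int, tokens_per_turn: int = 150) -> int:
    """Context cost of replaying full chat history: grows with every turn."""
    return turns * tokens_per_turn

print(replay_cost(40))  # 6000, and still climbing
```

The point is not the exact numbers; it is that one curve is flat and the other is linear in conversation length.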



The Core Idea


OpenAI does not store memory.

OpenAI compresses memory.


All real memory lives outside the LLM.


What this design guarantees

• No growing context windows

• No replaying chat history

• Hard token caps

• Stateless pods

• Horizontal scalability

• Auditable behavior



Architecture Overview


User Request
   ↓
Load Conversation Summary (external store)
   ↓
OpenAI → Update Summary (≤ 3 sentences)
   ↓
Persist Summary
   ↓
Fusion Retrieval (vector + keyword)
   ↓
Top 2–4 chunks only
   ↓
Answer


This is Fusion RAG + Lite Conversational RAG + Light CRAG, optimized for limited memory.



Folder Structure


rag-service/
├── app/
│   ├── main.py
│   ├── api.py
│   ├── settings.py
│   ├── models.py
│   ├── rag/
│   │   ├── pipeline.py
│   │   ├── retrieval.py
│   │   ├── rerank.py
│   │   └── crag.py
│   ├── memory/
│   │   ├── store.py
│   │   └── service.py
│   └── llm/
│       └── openai_client.py
├── requirements.txt
└── Dockerfile




OpenAI Memory Compression (The Important Part)


app/llm/openai_client.py


from openai import OpenAI

client = OpenAI()

def compress_memory(previous_summary: str, user_input: str) -> str:
    prompt = f"""
You are a memory compression system.

Existing summary:
{previous_summary or "None"}

New user input:
{user_input}

Update the summary so it captures:
- The user's goal
- Key systems or entities mentioned
- Constraints or preferences

Rules:
- Max 3 sentences
- No speculation
- No filler
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=120,
    )
    return response.choices[0].message.content.strip()


Why this model

• Cheap

• Near-deterministic at temperature 0

• Strong summarizer

• Well suited to memory compression



External Memory Store (Redis example)


app/memory/store.py


import redis

redis_client = redis.Redis(
    host="redis",
    port=6379,
    decode_responses=True,
)

def load_summary(session_id: str) -> str:
    return redis_client.get(session_id) or ""

def save_summary(session_id: str, summary: str) -> None:
    # Enforce the size guarantee: memory stays small no matter what the model returns.
    redis_client.set(session_id, summary[:512])


Memory is:

• Small (≤ 512 chars)

• Persistent

• Independent of pods



Memory Update Service


app/memory/service.py


from app.memory.store import load_summary, save_summary
from app.llm.openai_client import compress_memory

def update_memory(session_id: str, user_input: str) -> str:
    previous = load_summary(session_id)
    updated = compress_memory(previous, user_input)
    save_summary(session_id, updated)
    return updated


This runs once per request, not in a loop.



Fusion Retrieval (Minimal Example)


app/rag/retrieval.py


def vector_search(query: str) -> list[dict]:
    return [{"text": "API Gateway timeout is 29 seconds.", "score": 0.82}]

def keyword_search(query: str) -> list[dict]:
    return [{"text": "AWS API Gateway default timeout is 29 seconds.", "score": 0.76}]

def fusion_retrieve(query: str) -> list[dict]:
    return vector_search(query) + keyword_search(query)
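Concatenation is the simplest possible fusion. A common refinement, Reciprocal Rank Fusion (RRF), merges the two ranked lists by rank instead of raw score, so incomparable vector and keyword scores never get mixed directly. This is a general technique, not part of the article's code:

```python
def rrf_fuse(vector_hits: list[dict], keyword_hits: list[dict], k: int = 60) -> list[dict]:
    """Merge two ranked lists with Reciprocal Rank Fusion.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so documents found by both retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, chunk in enumerate(hits, start=1):
            scores[chunk["text"]] = scores.get(chunk["text"], 0.0) + 1.0 / (k + rank)
    return [
        {"text": text, "score": score}
        for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ]
```

Deduplicating by text, as done here, also prevents the same chunk from being counted twice downstream.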




Re-ranking and CRAG Guardrail



app/rag/rerank.py

def rerank(chunks: list[dict]) -> list[dict]:
    return sorted(chunks, key=lambda c: c["score"], reverse=True)

app/rag/crag.py

CONFIDENCE_THRESHOLD = 0.35

def is_confident(chunks: list[dict]) -> bool:
    if not chunks:
        return False
    avg = sum(c["score"] for c in chunks) / len(chunks)
    return avg >= CONFIDENCE_THRESHOLD
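A quick sanity check of the guardrail, with the function repeated so the snippet runs on its own (the scores are made up):

```python
CONFIDENCE_THRESHOLD = 0.35

def is_confident(chunks: list[dict]) -> bool:
    if not chunks:
        return False
    avg = sum(c["score"] for c in chunks) / len(chunks)
    return avg >= CONFIDENCE_THRESHOLD

assert is_confident([{"text": "ok", "score": 0.82}])        # avg 0.82 passes
assert not is_confident([{"text": "weak", "score": 0.10}])  # avg 0.10 fails
assert not is_confident([])                                 # empty retrieval never passes
```

Averaging means one strong chunk can carry a weak one past the threshold; a stricter variant would gate on the minimum score instead.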




RAG Pipeline with OpenAI Memory


app/rag/pipeline.py


from app.memory.service import update_memory
from app.rag.retrieval import fusion_retrieve
from app.rag.rerank import rerank
from app.rag.crag import is_confident

MAX_CHUNKS = 4

def run_rag(query: str, session_id: str | None):
    memory = ""
    if session_id:
        memory = update_memory(session_id, query)

    chunks = rerank(fusion_retrieve(query))[:MAX_CHUNKS]

    if not is_confident(chunks):
        return {
            "answer": "I don’t have enough information. Can you clarify?",
            "confidence": "low",
            "memory": memory,
        }

    context = "\n".join(c["text"] for c in chunks)
    return {
        "answer": context,
        "confidence": "high",
        "memory": memory,
    }




API Layer



app/models.py

from pydantic import BaseModel

class RAGRequest(BaseModel):
    query: str
    session_id: str | None = None

class RAGResponse(BaseModel):
    answer: str
    confidence: str
    memory: str

app/api.py

from fastapi import APIRouter
from app.models import RAGRequest, RAGResponse
from app.rag.pipeline import run_rag

router = APIRouter()

@router.post("/rag", response_model=RAGResponse)
def rag(req: RAGRequest):
    return run_rag(req.query, req.session_id)




Why This Pattern Works in the Real World

• Memory never grows

• Tokens are predictable

• No hidden agent loops

• No context window roulette

• Easy to observe and debug

• Safe for regulated environments


This is how you build trustworthy RAG, not demo-ware.



The Principle to Remember


LLMs are terrible memory stores.

They are excellent memory compressors.


Once you internalize that, everything else gets easier.



If readers want to go further, I can:

• Add DynamoDB instead of Redis

• Add TTL + decay rules

• Turn this into a shared platform service

• Zip this as a starter repo

• Add Bedrock instead of OpenAI


This is production-grade RAG, not hype.


©2020 by LearnTeachMaster DevOps. Proudly created with Wix.com
