
How to Build a Real Multi-Agent System: Visual Inventory, Replenishment, and Intent-Driven Architecture
- Mark Kendall
Intro
After watching a full round of AI demos, one thing became clear:
Most teams understand that “agents” are important.
But many teams are still unclear on why an agent exists, what each agent owns, and when an LLM should actually be used.
That confusion is normal. The terminology around AI agents has become crowded: orchestrators, tools, workflows, RAG, LLMs, vector databases, model calls, autonomy, multi-agent systems, and human-in-the-loop approval. It can all sound brand new.
But when you strip away the buzzwords, the architecture becomes very familiar.
A multi-agent system is not magic.
It is a governed workflow made up of specialized capabilities.
The difference is that some of those capabilities can now reason, interpret images, summarize ambiguity, retrieve knowledge, and make recommendations in ways traditional microservices never could.
That is the power.
The mistake is thinking every step needs an LLM.
It does not.
The better approach is Intent-Driven Engineering:
Define the outcome first.
Define the constraints.
Define the responsibilities.
Define the data.
Define the approval points.
Then decide which parts need AI.
That is how you move beyond prompting and start designing real AI-enabled systems.
What Is a Multi-Agent System?
A multi-agent system is a software architecture where multiple specialized agents cooperate to complete a larger goal.
Each agent should have a clear responsibility.
Each agent should have a defined input and output.
Each agent should have a reason to exist.
A true multi-agent application is not simply:
“Agent 1 calls Gemini, Agent 2 calls Gemini, Agent 3 calls Gemini, and Agent 4 calls Gemini.”
That is not architecture.
That is just multiple endpoints calling the same model.
A stronger definition is this:
A multi-agent system is a governed workflow of specialized capabilities, where each agent owns a distinct responsibility boundary and the orchestrator manages the overall process.
That distinction matters.
The orchestrator controls the flow.
The agents perform bounded tasks.
The LLM is used only where it adds value.
The business rules remain deterministic.
The human approves high-impact decisions.
That is the difference between AI theater and enterprise architecture.
The Use Case: Visual Inventory and Replenishment
Let’s use a practical warehouse example.
Imagine a warehouse associate walking to a bin, shelf, pallet, or storage location.
Instead of manually counting every box, they open a mobile app, scan or select the location, and take a picture.
The system analyzes the image and says:
Location: Aisle 4 / Bin B12
Detected Item: Medium shipping box
Visual Count: 10
Expected Count: 22
Minimum Stock Level: 20
Status: Below minimum and inventory mismatch
Recommended Action: Create replenishment request
Confidence: 87%
That is useful.
This is not about replacing the warehouse system of record.
It is about comparing physical reality against the system of record.
The warehouse system may say there are 22 boxes in a bin. The photo may show only 10. That discrepancy matters.
It could mean:
Inventory depletion
Missed scan
Unrecorded movement
Shrinkage
Theft
Damaged goods
Misplaced inventory
Receiving error
The app becomes a visual audit assistant.
It helps warehouse teams count faster, detect exceptions earlier, and trigger replenishment with evidence.
The strongest framing is this:
Don’t make humans count everything. Make humans verify exceptions.
That is the business value.
The Wrong Way to Build It
The weak version of the architecture looks like this:
Mobile App
↓
Orchestrator
↓
Agent 1 → LLM
↓
Agent 2 → LLM
↓
Agent 3 → LLM
↓
Agent 4 → LLM
↓
Order placed
That sounds impressive in a demo, but it leaves too many unanswered questions.
What does each agent actually own?
Why does every step need an LLM?
What if the model miscounts?
What if the confidence is low?
What if the supplier is wrong?
What if the inventory system disagrees?
Who approves the order?
Where is the audit trail?
What prevents a hallucinated supplier from becoming a real purchase order?
That is where many AI demos break down.
They show the flow, but they do not explain the architecture.
They say “multi-agent,” but what they really built is a chain of model calls.
The Better Way: Intent-Driven Agent Architecture
The better version starts with intent.
The system intent is:
Enable warehouse workers to visually verify inventory levels, detect discrepancies, recommend replenishment actions, and route high-impact decisions for human approval.
That one sentence changes the architecture.
Now the goal is not:
“Use AI to count boxes.”
The goal is:
“Improve inventory accuracy and replenishment decisions using visual evidence.”
That means each agent gets a specific intent of its own.
For example:
Image Intake Agent Intent:
Validate the submitted image, location, and session before any AI analysis occurs.
Vision Agent Intent:
Detect visible inventory units, read available labels, estimate quantity, and return confidence.
Inventory Reconciliation Agent Intent:
Compare the visual result against the warehouse system of record and minimum stock rules.
Supplier Matching Agent Intent:
Map the detected item to the correct SKU, vendor, reorder policy, and procurement path.
Recommendation Agent Intent:
Explain the discrepancy, summarize the evidence, and recommend the next action.
Approval Agent Intent:
Route the recommendation to a human or system workflow and record the decision.
That is the shift.
You are not throwing agents at a problem.
You are defining responsibility boundaries.
High-Level Architecture
A strong architecture would look like this:
Mobile App
↓
FastAPI Orchestrator
↓
Image Intake Agent
↓
Vision Agent
↓
Inventory Reconciliation Agent
↓
Supplier Matching Agent
↓
Recommendation Agent
↓
Approval and Action Agent
↓
WMS / ERP / Procurement System
The key point:
The orchestrator controls the process.
The agents perform specialized work.
The models and tools are selected based on the agent’s responsibility.
FastAPI is not the intelligence.
FastAPI is just the service boundary.
Inside each service, you decide whether the work requires:
Normal Python code
A database query
A rules engine
A barcode/OCR library
A vision model
A retrieval framework
An LLM
A workflow engine
A human approval step
That is how you keep the system clean, explainable, and cost-controlled.
Agent 1: Image Intake Agent
The Image Intake Agent should not call an LLM.
Its job is basic but important.
It validates the request before the expensive AI work begins.
Input:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "imageUrl": "s3://warehouse-images/bin-a12.jpg",
  "submittedBy": "worker-456"
}
Responsibilities:
Validate the session
Validate the bin or location ID
Check that the image exists
Check image size and format
Store the raw image
Create an audit record
Reject unusable submissions
Output:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "imageQuality": "usable",
  "imageStored": true,
  "nextStep": "VISION_ANALYSIS"
}
This is not AI work.
This is standard application logic.
Using an LLM here would add cost and risk for no reason.
That is one of the most important architecture lessons:
Not every agent needs an LLM.
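To make that concrete, the intake check can be plain Python. A minimal sketch, assuming the field names from the example payloads above; the size and format limits are illustrative, not prescribed.

```python
# Image Intake Agent core check: deterministic validation, no model call.
# ALLOWED_FORMATS and MAX_IMAGE_BYTES are assumed limits for illustration.

ALLOWED_FORMATS = {"jpg", "jpeg", "png"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB

def validate_intake(request: dict, image_bytes: int, image_format: str) -> dict:
    """Return an intake result, or a rejection with a reason."""
    for field in ("sessionId", "locationId", "imageUrl", "submittedBy"):
        if not request.get(field):
            return {"imageQuality": "rejected", "reason": f"missing {field}"}
    if image_format.lower() not in ALLOWED_FORMATS:
        return {"imageQuality": "rejected", "reason": "unsupported format"}
    if image_bytes == 0 or image_bytes > MAX_IMAGE_BYTES:
        return {"imageQuality": "rejected", "reason": "bad image size"}
    return {
        "sessionId": request["sessionId"],
        "locationId": request["locationId"],
        "imageQuality": "usable",
        "imageStored": True,
        "nextStep": "VISION_ANALYSIS",
    }
```

Every rejection happens before any model cost is incurred, and every result carries a reason the audit record can store.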
Agent 2: Vision Agent
The Vision Agent is where AI starts to make sense.
Its job is to interpret the image.
It may use a multimodal model, a computer vision model, an OCR library, or a barcode reader.
Responsibilities:
Detect boxes or inventory units
Estimate quantity
Read visible labels
Read barcodes if visible
Generate bounding boxes
Produce an annotated image
Return confidence score
Output:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "detectedItem": "medium_shipping_box",
  "estimatedQuantity": 10,
  "confidence": 0.87,
  "visibleLabels": ["ACME-BOX-MED"],
  "annotatedImageUrl": "s3://warehouse-images/bin-a12-annotated.jpg"
}
This agent may use a model like Gemini Vision, Claude's vision capabilities, a specialized CV model, or another approved enterprise model.
But the decision should be based on capability, not hype.
The question is not:
“Which model do we like?”
The question is:
“Which capability is required here?”
For this agent, the capability is visual interpretation.
That means a multimodal vision model or specialized computer vision model is appropriate.
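Whatever model is chosen, its output should be parsed into a typed contract before downstream agents see it. A sketch using a plain dataclass, assuming the field names from the example output above; a schema library like Pydantic would work equally well.

```python
# Enforce the Vision Agent's output contract: reject malformed or
# out-of-range model output before it reaches downstream agents.

from dataclasses import dataclass, field

@dataclass
class VisionResult:
    sessionId: str
    locationId: str
    detectedItem: str
    estimatedQuantity: int
    confidence: float
    visibleLabels: list = field(default_factory=list)
    annotatedImageUrl: str = ""

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.estimatedQuantity < 0:
            raise ValueError("quantity cannot be negative")

def parse_vision_output(raw: dict) -> VisionResult:
    """Raise if the model returned a malformed or out-of-range result."""
    return VisionResult(**raw)
```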
Agent 3: Inventory Reconciliation Agent
This agent compares the visual result against the warehouse system.
This is where many demos go wrong.
The LLM should not be the system of record.
The warehouse system, ERP, inventory database, or WMS owns the truth.
The reconciliation agent should call those systems.
Input:
{
  "locationId": "BIN-A12",
  "detectedItem": "medium_shipping_box",
  "estimatedQuantity": 10,
  "confidence": 0.87
}
Responsibilities:
Look up expected quantity
Look up minimum stock threshold
Look up reorder policy
Compare visual count to system count
Classify the discrepancy
Determine whether action is needed
Output:
{
  "systemExpectedQuantity": 22,
  "minimumStock": 20,
  "visualQuantity": 10,
  "discrepancy": -12,
  "status": "below_minimum_and_mismatch",
  "recommendedAction": "replenishment_review"
}
This agent mostly needs deterministic logic.
It may call:
WMS API
ERP API
Inventory database
Rules engine
Policy configuration
It does not need an LLM to calculate whether 10 is less than 20.
That is business logic.
This is where cost control begins.
Use AI where interpretation is required.
Use code where precision is required.
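The reconciliation step itself is a few lines of deterministic logic. A sketch using the status and action values from the example output above; the extra status names for partial cases are illustrative.

```python
# Inventory Reconciliation Agent core logic: pure business rules,
# no model call. Compare, classify, decide.

def reconcile(visual_qty: int, expected_qty: int, minimum_stock: int) -> dict:
    discrepancy = visual_qty - expected_qty
    below_minimum = visual_qty < minimum_stock
    mismatch = discrepancy != 0
    if below_minimum and mismatch:
        status = "below_minimum_and_mismatch"
    elif below_minimum:
        status = "below_minimum"
    elif mismatch:
        status = "count_mismatch"
    else:
        status = "healthy"
    if below_minimum:
        action = "replenishment_review"
    elif mismatch:
        action = "discrepancy_review"
    else:
        action = "close_session"
    return {
        "visualQuantity": visual_qty,
        "systemExpectedQuantity": expected_qty,
        "minimumStock": minimum_stock,
        "discrepancy": discrepancy,
        "status": status,
        "recommendedAction": action,
    }
```

Running the example from earlier, `reconcile(10, 22, 20)` classifies the bin as below minimum with a mismatch and routes it to replenishment review.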
Agent 4: Supplier Matching Agent
The Supplier Matching Agent maps the item to the correct SKU, vendor, reorder rule, and procurement path.
This may require a mix of database lookup, retrieval, and occasional reasoning.
Responsibilities:
Identify the SKU
Find preferred supplier
Check reorder quantity
Check vendor rules
Check contract or purchasing constraints
Prepare purchase request draft
Output:
{
  "sku": "BOX-MED-001",
  "supplier": "Acme Packaging",
  "recommendedOrderQuantity": 25,
  "estimatedCost": 112.50,
  "requiresApproval": true,
  "purchaseRequestDraft": "PR-789"
}
This is a good place for retrieval-augmented generation if supplier data is spread across documents, PDFs, procurement policies, or vendor catalogs.
But if the supplier is already in a database, then use the database.
Do not use an LLM to guess something that already exists in a system of record.
A mature design might use:
SQL for known SKU/vendor mappings
ERP API for official supplier records
RAG for procurement documents
LLM only to interpret ambiguous supplier text or summarize policy constraints
Again, the agent intent determines the tool.
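The "database first" rule can be sketched as a simple fallback chain. `SKU_TABLE` stands in for the real ERP or procurement database, and `llm_fallback` is a hypothetical stand-in for the RAG/LLM path; both are assumptions for illustration.

```python
# Supplier Matching Agent lookup order: system of record first,
# expensive interpretation only when the lookup fails.

SKU_TABLE = {  # stand-in for an ERP or procurement database
    "ACME-BOX-MED": {"sku": "BOX-MED-001", "supplier": "Acme Packaging"},
}

def match_supplier(label: str, llm_fallback=None) -> dict:
    record = SKU_TABLE.get(label)
    if record is not None:
        return {**record, "source": "database"}
    if llm_fallback is not None:
        # Only now pay for interpretation, and flag the result for review.
        return {**llm_fallback(label), "source": "llm", "requiresReview": True}
    return {"source": "unresolved", "nextStep": "PROCUREMENT_REVIEW"}
```

A known label never touches a model; an unknown label either gets interpreted and flagged, or routed to procurement review.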
Agent 5: Recommendation Agent
This is a good place for an LLM.
The system now has structured evidence:
Visual count
Confidence score
Expected inventory
Minimum stock level
Supplier information
Reorder recommendation
Annotated image
Audit history
The Recommendation Agent can turn that into a human-readable explanation.
Responsibilities:
Summarize the discrepancy
Explain confidence and risk
Generate recommended action
Prepare supervisor message
Produce audit-friendly language
Example output:
The system expected 22 medium shipping boxes in BIN-A12, with a minimum required stock level of 20. The visual analysis detected approximately 10 boxes with 87% confidence. This creates a discrepancy of 12 units and places the bin below the minimum stock threshold. Recommended action: create a replenishment request for 25 units from the preferred supplier, pending supervisor approval.
That is where an LLM adds real value.
It explains the situation.
It helps the human understand the decision.
It does not become the decision-maker.
That distinction matters.
Agent 6: Approval and Action Agent
The final agent handles workflow action.
This agent should be tightly governed.
It should not simply place orders because an AI model said so.
Responsibilities:
Create approval task
Notify supervisor
Attach evidence
Record approval or rejection
Create purchase request after approval
Update audit log
Trigger order only when allowed
Output:
{
  "approvalStatus": "pending",
  "notificationSent": true,
  "purchaseRequestDraft": "PR-789",
  "evidenceAttached": true
}
The human approval step is critical.
For low-risk items, the company may eventually allow automatic replenishment under strict rules.
For example:
Only if confidence is above 95%
Only if item is approved for auto-replenishment
Only if cost is below $250
Only if supplier is already approved
Only if discrepancy is confirmed by system history
Only if no policy exception exists
Until then, the safe pattern is:
AI recommends. Humans approve. Systems transact.
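The auto-replenishment rules above translate directly into a deterministic policy gate. A sketch using the illustrative thresholds from the list; real values would live in policy configuration, not code.

```python
# Auto-approval policy gate: every condition must hold, or the
# recommendation goes to a human. No model output can override this.

def may_auto_approve(confidence: float, item_auto_approved: bool,
                     cost: float, supplier_approved: bool,
                     history_confirms: bool, policy_exception: bool) -> bool:
    return (confidence > 0.95
            and item_auto_approved
            and cost < 250.0
            and supplier_approved
            and history_confirms
            and not policy_exception)
```

The 87%-confidence example from earlier fails this gate on the first condition alone, so it routes to human approval.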
The Orchestrator: The Real Control Layer
The orchestrator is the workflow controller.
It should not be just a pass-through API.
It owns the state machine.
It knows what step runs next.
It knows what happens when something fails.
It knows when to stop.
It knows when to ask for a human.
A simple workflow could look like this:
START
↓
IMAGE_INTAKE
↓
If image usable → VISION_ANALYSIS
If image unusable → REQUEST_NEW_IMAGE
↓
If confidence >= 0.75 → INVENTORY_RECONCILIATION
If confidence < 0.75 → MANUAL_REVIEW
↓
If below minimum → SUPPLIER_MATCHING
If inventory healthy → CLOSE_SESSION
↓
If supplier found → RECOMMENDATION
If supplier unknown → PROCUREMENT_REVIEW
↓
If approval required → HUMAN_APPROVAL
If auto-approved by policy → CREATE_ORDER
↓
END
That is architecture.
That is not a loose chain of prompts.
The orchestrator should track:
Session state
Agent outputs
Confidence scores
Errors
Retries
Fallbacks
Human approvals
Audit events
Final outcome
This is where enterprise systems become serious.
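The branching workflow above can be expressed as a small explicit state machine: each state maps the session context to the next state, and the orchestrator steps until a terminal state, recording the trail as it goes. A minimal sketch; state names follow the diagram, thresholds are the ones shown, and the context keys are illustrative.

```python
# Orchestrator as an explicit state machine. The trail it returns
# doubles as the audit record of which path the session took.

TERMINAL = {"REQUEST_NEW_IMAGE", "MANUAL_REVIEW", "CLOSE_SESSION",
            "PROCUREMENT_REVIEW", "HUMAN_APPROVAL", "CREATE_ORDER"}

def next_state(state: str, ctx: dict) -> str:
    if state == "IMAGE_INTAKE":
        return "VISION_ANALYSIS" if ctx["imageUsable"] else "REQUEST_NEW_IMAGE"
    if state == "VISION_ANALYSIS":
        return ("INVENTORY_RECONCILIATION"
                if ctx["confidence"] >= 0.75 else "MANUAL_REVIEW")
    if state == "INVENTORY_RECONCILIATION":
        return "SUPPLIER_MATCHING" if ctx["belowMinimum"] else "CLOSE_SESSION"
    if state == "SUPPLIER_MATCHING":
        return "RECOMMENDATION" if ctx["supplierFound"] else "PROCUREMENT_REVIEW"
    if state == "RECOMMENDATION":
        return "CREATE_ORDER" if ctx["autoApproved"] else "HUMAN_APPROVAL"
    return "END"

def run(ctx: dict) -> list:
    """Step through the workflow, returning the trail of states visited."""
    trail, state = [], "IMAGE_INTAKE"
    while True:
        trail.append(state)
        if state in TERMINAL or state == "END":
            return trail
        state = next_state(state, ctx)
```

A session with a usable image, 87% confidence, a below-minimum bin, and a known supplier walks the full path and stops at HUMAN_APPROVAL, exactly as the diagram prescribes.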
Why Agent Count Is the Wrong Question
One of the common mistakes in multi-agent demos is treating the number of agents as the achievement.
More agents does not mean better architecture.
The better question is:
Why does this agent exist?
An agent should exist because it has a distinct:
Responsibility
Input/output contract
Tool requirement
Data source
Failure mode
Security boundary
Scaling profile
Approval requirement
Ownership boundary
If two agents do the same thing, combine them.
If one agent has too many unrelated responsibilities, split it.
The right number of agents is the number that makes the workflow understandable, governable, testable, and maintainable.
A good reviewer question is:
“Why could this not just be a normal function?”
If the answer is:
“Because we needed more agents,”
that is weak.
If the answer is:
“Because this step requires a separate capability, model, data source, failure path, or approval boundary,”
that is strong.
Choosing the Right Tool for Each Agent
A mature AI architecture does not start with the model.
It starts with the task.
Here is a practical decision guide:
If the task is image interpretation:
Use a multimodal model or computer vision model.
If the task is barcode reading:
Use a barcode/OCR library first.
If the task is inventory lookup:
Use the WMS, ERP, or database.
If the task is threshold comparison:
Use deterministic business rules.
If the task is supplier lookup:
Use procurement database or RAG over trusted supplier documents.
If the task is explanation:
Use an LLM.
If the task is transaction execution:
Use system APIs with approval and audit controls.
This is how you prevent runaway cost.
This is how you prevent over-engineering.
This is how you prevent model hallucination from becoming business action.
The principle is simple:
Use the cheapest, safest, most deterministic tool that can correctly perform the task.
That may be an LLM.
It may also be a database query.
Why This Is Still Familiar Architecture
This is where the whole thing starts to click.
A lot of this looks like the same architecture we have been building for years:
APIs
Microservices
Workflow engines
Databases
Event logs
Rules engines
System integrations
Human approval flows
Observability
Audit trails
The difference is that some endpoints are now smarter.
Some services can reason over messy inputs.
Some services can interpret images.
Some services can summarize ambiguity.
Some services can retrieve context from unstructured documents.
Some services can recommend actions instead of only executing fixed logic.
That is the real shift.
Traditional microservices were deterministic.
They did what we explicitly coded them to do.
AI-enabled agents can operate inside a bounded intent.
They can interpret, classify, summarize, and recommend.
But that does not remove the need for architecture.
It increases the need for architecture.
Because now the system can be wrong in new ways.
That means we need:
Confidence thresholds
Fallback paths
Human approval
Model evaluation
Audit logging
Prompt/version control
Structured outputs
Policy gates
Cost controls
Security boundaries
The old architecture skills still matter.
They matter even more.
Intent at the System Level
The system-level intent defines the business outcome.
For this use case:
system_intent:
  name: "Visual Inventory Replenishment Assistant"
  goal: "Help warehouse teams visually verify inventory, detect discrepancies, and recommend replenishment actions."
  users:
    - warehouse associates
    - inventory managers
    - supervisors
    - procurement teams
  constraints:
    - "Do not automatically place orders without policy approval."
    - "Use warehouse system of record for official inventory."
    - "Use image analysis only as supporting evidence."
    - "Require human approval for high-impact transactions."
    - "Capture audit trail for every decision."
  success_metrics:
    - "Reduce manual cycle count time."
    - "Improve inventory discrepancy detection."
    - "Reduce stockout events."
    - "Improve replenishment accuracy."
    - "Increase auditability."
This is the foundation.
The system is not being told to “build an AI app.”
It is being told to solve a specific operational problem within specific constraints.
That is Intent-Driven Engineering.
Intent at the Agent Level
Each agent should also have its own intent file or configuration.
Example:
agent:
  name: "VisionInventoryAgent"
  intent: "Estimate visible inventory quantity from a warehouse image."
  inputs:
    - imageUrl
    - locationId
    - sessionId
  outputs:
    - detectedItem
    - estimatedQuantity
    - confidence
    - visibleLabels
    - annotatedImageUrl
  tools:
    - multimodalVisionModel
    - ocrReader
    - barcodeReader
  constraints:
    - "Return structured JSON only."
    - "Do not infer supplier unless label evidence exists."
    - "Return low confidence when image is unclear."
    - "Never trigger purchase action directly."
  failure_modes:
    - unreadable_image
    - low_confidence
    - multiple_item_types_detected
    - no_visible_labels
That is powerful.
Now the agent is not just a prompt.
It has a job description, contract, tools, constraints, and failure rules.
That is how you make agents governable.
Suggested Technical Stack
One possible implementation stack:
Frontend:
React web app or mobile app
Eventually native iOS/Android if needed
Backend:
FastAPI orchestrator
Python agent services
PostgreSQL for operational data
Object storage for images
Redis or queue for async jobs
AI / Model Layer:
Vision model for image interpretation
OCR/barcode library for label extraction
LLM for explanation and recommendation
LlamaIndex or similar framework for retrieval if supplier docs are unstructured
Enterprise Integrations:
WMS API
ERP API
Procurement API
Notification service
Ticketing system
Approval workflow
Observability:
Structured logs
Metrics
Trace IDs
Audit log
Model output history
Cost tracking
The exact tools can change.
The architecture pattern stays the same.
Data Model
A basic data model might include:
InventorySession
  sessionId
  locationId
  submittedBy
  submittedAt
  status

ImageEvidence
  imageId
  sessionId
  rawImageUrl
  annotatedImageUrl
  qualityStatus

VisionResult
  resultId
  sessionId
  detectedItem
  estimatedQuantity
  confidence
  visibleLabels

InventorySnapshot
  locationId
  systemExpectedQuantity
  minimumStock
  reorderThreshold
  lastUpdated

ReconciliationResult
  sessionId
  discrepancy
  status
  recommendedAction

SupplierRecommendation
  sessionId
  sku
  supplier
  recommendedOrderQuantity
  estimatedCost

ApprovalDecision
  sessionId
  approver
  decision
  decisionTime
  notes
This matters because the system needs memory and accountability.
A real enterprise system cannot just show a pretty answer and forget what happened.
It has to remember:
What image was submitted
What the model saw
What the system expected
What recommendation was made
Who approved it
What action was taken
That is auditability.
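Two of the entities above, sketched as dataclasses to show the pattern: sessionId is the linking key that makes the trail reconstructable, and approval records are frozen because audit records should never change. In production these would be database tables (e.g. PostgreSQL); the field names follow the data model above.

```python
# Minimal sketch of audit-friendly entities. sessionId ties every
# record in a session together; frozen=True makes approvals immutable.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ApprovalDecision:
    sessionId: str
    approver: str
    decision: str        # "approved" | "rejected"
    decisionTime: datetime
    notes: str = ""

@dataclass
class ReconciliationResult:
    sessionId: str
    discrepancy: int
    status: str
    recommendedAction: str
```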
Error Handling and Fallbacks
A serious architecture must define what happens when things go wrong.
Examples:
Image is blurry:
Ask user to retake photo.
Vision confidence is low:
Route to manual count.
Multiple product types detected:
Ask user to isolate bin or select product.
Supplier cannot be determined:
Route to procurement review.
Inventory system unavailable:
Hold session and retry.
Reorder amount exceeds limit:
Require manager approval.
Model output malformed:
Retry with stricter structured output or fail safely.
Human rejects recommendation:
Close as rejected and store reason.
This is where many demos are incomplete.
Happy path is easy.
Enterprise architecture is about the unhappy paths.
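The "retry with stricter structured output or fail safely" fallback from the list above can be sketched in a few lines. `call_model` is a hypothetical stand-in for the real model client; the retry instruction and the MANUAL_REVIEW routing are illustrative.

```python
# Malformed-output fallback: parse, re-ask once with a stricter
# instruction, then fail safely to manual review instead of crashing.

import json

def parse_or_fallback(call_model, prompt: str, required_keys: set,
                      max_retries: int = 1) -> dict:
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if required_keys <= data.keys():
                return {"ok": True, "data": data}
        except (json.JSONDecodeError, AttributeError):
            pass  # not JSON at all, or JSON that is not an object
        # Stricter retry: demand JSON only, with the required keys.
        attempt_prompt = (prompt + "\nReturn ONLY valid JSON with keys: "
                          + ", ".join(sorted(required_keys)))
    return {"ok": False, "nextStep": "MANUAL_REVIEW"}  # fail safely
```

The session never dies on bad model output; it either recovers a valid structure or lands in a human queue with the evidence intact.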
Security and Governance
This type of system touches operational and financial workflows.
That means governance matters.
Important controls include:
Authentication
Role-based access control
Location-level permissions
Approved vendor lists
Purchase limits
Human approval thresholds
Immutable audit logs
Model output retention
Prompt versioning
Data privacy controls
Cost monitoring
API rate limits
The AI should never become an uncontrolled actor inside the business.
It should operate within policy.
The safest mindset is:
The AI can recommend. The system governs. The human approves where required.
The Real Business Value
This use case could create real value because inventory is money.
If a warehouse thinks it has 500 units but physically has 350, that gap affects:
Finance
Procurement
Fulfillment
Customer orders
Loss prevention
Operations
Warehouse planning
Supplier management
The app could help with:
Faster cycle counts
Reduced manual counting
Better replenishment timing
Fewer stockouts
Better discrepancy detection
Shrinkage and theft investigation
Evidence-based audit trails
Improved warehouse visibility
That is why this kind of system could be valuable.
Not because it is “AI.”
Because it reduces operational friction around inventory accuracy.
Why It Matters
The bigger lesson is not just about warehouses.
The bigger lesson is about how we design AI systems.
The old way was:
Write requirements
Build services
Connect APIs
Manually encode every decision
The new way is:
Define intent
Break it into bounded capabilities
Choose the right model or tool for each capability
Use orchestration to control workflow
Use policy to govern decisions
Use humans for high-impact approvals
Continuously improve from feedback
This is the practical meaning of Intent-Driven Engineering.
It does not throw away traditional architecture.
It upgrades it.
The agents are not magic creatures.
They are intelligent service boundaries.
Some use models.
Some use rules.
Some use APIs.
Some use retrieval.
Some use humans.
The power comes from combining them into a governed workflow that can act on intent.
Key Takeaways
A true multi-agent system is not measured by how many agents it has.
It is measured by whether each agent has a clear reason to exist.
The orchestrator should control the workflow.
The agents should perform bounded capabilities.
The LLM should be used only where reasoning, interpretation, retrieval, or summarization is needed.
Business rules should remain deterministic.
Systems of record should remain authoritative.
Humans should approve high-impact actions.
For the visual inventory use case, the strongest architecture is:
Phone captures evidence
Vision estimates inventory
System compares against expected quantity
Rules determine discrepancy
Supplier data drives replenishment
LLM explains the recommendation
Human approves the action
Audit trail records everything
That is not AI theater.
That is enterprise architecture.
And that is the real point:
Intent-Driven Engineering is not about giving the model more control.
It is about giving the system clearer intent, stronger boundaries, better tools, and governed execution.
That is how we move beyond prompts.

