
How to Build a Real Multi-Agent System: Visual Inventory, Replenishment, and Intent-Driven Architecture
- Mark Kendall
Intro
After watching a full round of AI demos, one thing became clear:
Most teams understand that “agents” are important.
But many teams are still unclear on why an agent exists, what each agent owns, and when an LLM should actually be used.
That confusion is normal. The terminology around AI agents has become crowded: orchestrators, tools, workflows, RAG, LLMs, vector databases, model calls, autonomy, multi-agent systems, and human-in-the-loop approval. It can all sound brand new.
But when you strip away the buzzwords, the architecture becomes very familiar.
A multi-agent system is not magic.
It is a governed workflow made up of specialized capabilities.
The difference is that some of those capabilities can now reason, interpret images, summarize ambiguity, retrieve knowledge, and make recommendations in ways traditional microservices never could.
That is the power.
The mistake is thinking every step needs an LLM.
It does not.
The better approach is Intent-Driven Engineering:
Define the outcome first.
Define the constraints.
Define the responsibilities.
Define the data.
Define the approval points.
Then decide which parts need AI.
That is how you move beyond prompting and start designing real AI-enabled systems.
What Is a Multi-Agent System?
A multi-agent system is a software architecture where multiple specialized agents cooperate to complete a larger goal.
Each agent should have a clear responsibility.
Each agent should have a defined input and output.
Each agent should have a reason to exist.
A true multi-agent application is not simply:
“Agent 1 calls Gemini, Agent 2 calls Gemini, Agent 3 calls Gemini, and Agent 4 calls Gemini.”
That is not architecture.
That is just multiple endpoints calling the same model.
A stronger definition is this:
A multi-agent system is a governed workflow of specialized capabilities, where each agent owns a distinct responsibility boundary and the orchestrator manages the overall process.
That distinction matters.
The orchestrator controls the flow.
The agents perform bounded tasks.
The LLM is used only where it adds value.
The business rules remain deterministic.
The human approves high-impact decisions.
That is the difference between AI theater and enterprise architecture.
The Use Case: Visual Inventory and Replenishment
Let’s use a practical warehouse example.
Imagine a warehouse associate walking to a bin, shelf, pallet, or storage location.
Instead of manually counting every box, they open a mobile app, scan or select the location, and take a picture.
The system analyzes the image and says:
Location: Aisle 4 / Bin B12
Detected Item: Medium shipping box
Visual Count: 10
Expected Count: 22
Minimum Stock Level: 20
Status: Below minimum and inventory mismatch
Recommended Action: Create replenishment request
Confidence: 87%
That is useful.
This is not about replacing the warehouse system of record.
It is about comparing physical reality against the system of record.
The warehouse system may say there are 22 boxes in a bin. The photo may show only 10. That discrepancy matters.
It could mean:
Inventory depletion
Missed scan
Unrecorded movement
Shrinkage
Theft
Damaged goods
Misplaced inventory
Receiving error
The app becomes a visual audit assistant.
It helps warehouse teams count faster, detect exceptions earlier, and trigger replenishment with evidence.
The strongest framing is this:
Don’t make humans count everything. Make humans verify exceptions.
That is the business value.
The Wrong Way to Build It
The weak version of the architecture looks like this:
Mobile App
↓
Orchestrator
↓
Agent 1 → LLM
↓
Agent 2 → LLM
↓
Agent 3 → LLM
↓
Agent 4 → LLM
↓
Order placed
That sounds impressive in a demo, but it leaves too many unanswered questions.
What does each agent actually own?
Why does every step need an LLM?
What if the model miscounts?
What if the confidence is low?
What if the supplier is wrong?
What if the inventory system disagrees?
Who approves the order?
Where is the audit trail?
What prevents a hallucinated supplier from becoming a real purchase order?
That is where many AI demos break down.
They show the flow, but they do not explain the architecture.
They say “multi-agent,” but what they really built is a chain of model calls.
The Better Way: Intent-Driven Agent Architecture
The better version starts with intent.
The system intent is:
Enable warehouse workers to visually verify inventory levels, detect discrepancies, recommend replenishment actions, and route high-impact decisions for human approval.
That one sentence changes the architecture.
Now the goal is not:
“Use AI to count boxes.”
The goal is:
“Improve inventory accuracy and replenishment decisions using visual evidence.”
That means each agent gets a specific intent of its own.
For example:
Image Intake Agent Intent:
Validate the submitted image, location, and session before any AI analysis occurs.
Vision Agent Intent:
Detect visible inventory units, read available labels, estimate quantity, and return confidence.
Inventory Reconciliation Agent Intent:
Compare the visual result against the warehouse system of record and minimum stock rules.
Supplier Matching Agent Intent:
Map the detected item to the correct SKU, vendor, reorder policy, and procurement path.
Recommendation Agent Intent:
Explain the discrepancy, summarize the evidence, and recommend the next action.
Approval Agent Intent:
Route the recommendation to a human or system workflow and record the decision.
That is the shift.
You are not throwing agents at a problem.
You are defining responsibility boundaries.
High-Level Architecture
A strong architecture would look like this:
Mobile App
↓
FastAPI Orchestrator
↓
Image Intake Agent
↓
Vision Agent
↓
Inventory Reconciliation Agent
↓
Supplier Matching Agent
↓
Recommendation Agent
↓
Approval and Action Agent
↓
WMS / ERP / Procurement System
The key point:
The orchestrator controls the process.
The agents perform specialized work.
The models and tools are selected based on the agent’s responsibility.
FastAPI is not the intelligence.
FastAPI is just the service boundary.
Inside each service, you decide whether the work requires:
Normal Python code
A database query
A rules engine
A barcode/OCR library
A vision model
A retrieval framework
An LLM
A workflow engine
A human approval step
That is how you keep the system clean, explainable, and cost-controlled.
Agent 1: Image Intake Agent
The Image Intake Agent should not call an LLM.
Its job is basic but important.
It validates the request before the expensive AI work begins.
Input:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "imageUrl": "s3://warehouse-images/bin-a12.jpg",
  "submittedBy": "worker-456"
}
Responsibilities:
Validate the session
Validate the bin or location ID
Check that the image exists
Check image size and format
Store the raw image
Create an audit record
Reject unusable submissions
Output:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "imageQuality": "usable",
  "imageStored": true,
  "nextStep": "VISION_ANALYSIS"
}
This is not AI work.
This is standard application logic.
Using an LLM here would add cost and risk for no reason.
That is one of the most important architecture lessons:
Not every agent needs an LLM.
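To make that concrete, the intake check can be plain Python. A minimal sketch, assuming the field names from the example payloads above; the size and format limits are illustrative, not prescribed.

```python
# Image Intake Agent core check: deterministic validation, no model call.
# ALLOWED_FORMATS and MAX_IMAGE_BYTES are assumed limits for illustration.

ALLOWED_FORMATS = {"jpg", "jpeg", "png"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB

def validate_intake(request: dict, image_bytes: int, image_format: str) -> dict:
    """Return an intake result, or a rejection with a reason."""
    for field in ("sessionId", "locationId", "imageUrl", "submittedBy"):
        if not request.get(field):
            return {"imageQuality": "rejected", "reason": f"missing {field}"}
    if image_format.lower() not in ALLOWED_FORMATS:
        return {"imageQuality": "rejected", "reason": "unsupported format"}
    if image_bytes == 0 or image_bytes > MAX_IMAGE_BYTES:
        return {"imageQuality": "rejected", "reason": "bad image size"}
    return {
        "sessionId": request["sessionId"],
        "locationId": request["locationId"],
        "imageQuality": "usable",
        "imageStored": True,
        "nextStep": "VISION_ANALYSIS",
    }
```

Every rejection happens before any model cost is incurred, and every result carries a reason the audit record can store.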
Agent 2: Vision Agent
The Vision Agent is where AI starts to make sense.
Its job is to interpret the image.
It may use a multimodal model, a computer vision model, an OCR library, or a barcode reader.
Responsibilities:
Detect boxes or inventory units
Estimate quantity
Read visible labels
Read barcodes if visible
Generate bounding boxes
Produce an annotated image
Return confidence score
Output:
{
  "sessionId": "abc123",
  "locationId": "BIN-A12",
  "detectedItem": "medium_shipping_box",
  "estimatedQuantity": 10,
  "confidence": 0.87,
  "visibleLabels": ["ACME-BOX-MED"],
  "annotatedImageUrl": "s3://warehouse-images/bin-a12-annotated.jpg"
}
This agent may use a model like Gemini Vision, Claude's vision capabilities, a specialized CV model, or another approved enterprise model.
But the decision should be based on capability, not hype.
The question is not:
“Which model do we like?”
The question is:
“Which capability is required here?”
For this agent, the capability is visual interpretation.
That means a multimodal vision model or specialized computer vision model is appropriate.
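Whatever model is chosen, its output should be parsed into a typed contract before downstream agents see it. A sketch using a plain dataclass, assuming the field names from the example output above; a schema library like Pydantic would work equally well.

```python
# Enforce the Vision Agent's output contract: reject malformed or
# out-of-range model output before it reaches downstream agents.

from dataclasses import dataclass, field

@dataclass
class VisionResult:
    sessionId: str
    locationId: str
    detectedItem: str
    estimatedQuantity: int
    confidence: float
    visibleLabels: list = field(default_factory=list)
    annotatedImageUrl: str = ""

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.estimatedQuantity < 0:
            raise ValueError("quantity cannot be negative")

def parse_vision_output(raw: dict) -> VisionResult:
    """Raise if the model returned a malformed or out-of-range result."""
    return VisionResult(**raw)
```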
Agent 3: Inventory Reconciliation Agent
This agent compares the visual result against the warehouse system.
This is where many demos go wrong.
The LLM should not be the system of record.
The warehouse system, ERP, inventory database, or WMS owns the truth.
The reconciliation agent should call those systems.
Input:
{
  "locationId": "BIN-A12",
  "detectedItem": "medium_shipping_box",
  "estimatedQuantity": 10,
  "confidence": 0.87
}
Responsibilities:
Look up expected quantity
Look up minimum stock threshold
Look up reorder policy
Compare visual count to system count
Classify the discrepancy
Determine whether action is needed
Output:
{
  "systemExpectedQuantity": 22,
  "minimumStock": 20,
  "visualQuantity": 10,
  "discrepancy": -12,
  "status": "below_minimum_and_mismatch",
  "recommendedAction": "replenishment_review"
}
This agent mostly needs deterministic logic.
It may call:
WMS API
ERP API
Inventory database
Rules engine
Policy configuration
It does not need an LLM to calculate whether 10 is less than 20.
That is business logic.
This is where cost control begins.
Use AI where interpretation is required.
Use code where precision is required.
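The reconciliation step itself is a few lines of deterministic logic. A sketch using the status and action values from the example output above; the extra status names for partial cases are illustrative.

```python
# Inventory Reconciliation Agent core logic: pure business rules,
# no model call. Compare, classify, decide.

def reconcile(visual_qty: int, expected_qty: int, minimum_stock: int) -> dict:
    discrepancy = visual_qty - expected_qty
    below_minimum = visual_qty < minimum_stock
    mismatch = discrepancy != 0
    if below_minimum and mismatch:
        status = "below_minimum_and_mismatch"
    elif below_minimum:
        status = "below_minimum"
    elif mismatch:
        status = "count_mismatch"
    else:
        status = "healthy"
    if below_minimum:
        action = "replenishment_review"
    elif mismatch:
        action = "discrepancy_review"
    else:
        action = "close_session"
    return {
        "visualQuantity": visual_qty,
        "systemExpectedQuantity": expected_qty,
        "minimumStock": minimum_stock,
        "discrepancy": discrepancy,
        "status": status,
        "recommendedAction": action,
    }
```

Running the example from earlier, `reconcile(10, 22, 20)` classifies the bin as below minimum with a mismatch and routes it to replenishment review.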
Agent 4: Supplier Matching Agent
The Supplier Matching Agent maps the item to the correct SKU, vendor, reorder rule, and procurement path.
This may require a mix of database lookup, retrieval, and occasional reasoning.
Responsibilities:
Identify the SKU
Find preferred supplier
Check reorder quantity
Check vendor rules
Check contract or purchasing constraints
Prepare purchase request draft
Output:
{
  "sku": "BOX-MED-001",
  "supplier": "Acme Packaging",
  "recommendedOrderQuantity": 25,
  "estimatedCost": 112.50,
  "requiresApproval": true,
  "purchaseRequestDraft": "PR-789"
}
This is a good place for retrieval-augmented generation if supplier data is spread across documents, PDFs, procurement policies, or vendor catalogs.
But if the supplier is already in a database, then use the database.
Do not use an LLM to guess something that already exists in a system of record.
A mature design might use:
SQL for known SKU/vendor mappings
ERP API for official supplier records
RAG for procurement documents
LLM only to interpret ambiguous supplier text or summarize policy constraints
Again, the agent intent determines the tool.
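The "database first" rule can be sketched as a simple fallback chain. `SKU_TABLE` stands in for the real ERP or procurement database, and `llm_fallback` is a hypothetical stand-in for the RAG/LLM path; both are assumptions for illustration.

```python
# Supplier Matching Agent lookup order: system of record first,
# expensive interpretation only when the lookup fails.

SKU_TABLE = {  # stand-in for an ERP or procurement database
    "ACME-BOX-MED": {"sku": "BOX-MED-001", "supplier": "Acme Packaging"},
}

def match_supplier(label: str, llm_fallback=None) -> dict:
    record = SKU_TABLE.get(label)
    if record is not None:
        return {**record, "source": "database"}
    if llm_fallback is not None:
        # Only now pay for interpretation, and flag the result for review.
        return {**llm_fallback(label), "source": "llm", "requiresReview": True}
    return {"source": "unresolved", "nextStep": "PROCUREMENT_REVIEW"}
```

A known label never touches a model; an unknown label either gets interpreted and flagged, or routed to procurement review.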
Agent 5: Recommendation Agent
This is a good place for an LLM.
The system now has structured evidence:
Visual count
Confidence score
Expected inventory
Minimum stock level
Supplier information
Reorder recommendation
Annotated image
Audit history
The Recommendation Agent can turn that into a human-readable explanation.
Responsibilities:
Summarize the discrepancy
Explain confidence and risk
Generate recommended action
Prepare supervisor message
Produce audit-friendly language
Example output:
The system expected 22 medium shipping boxes in BIN-A12, with a minimum required stock level of 20. The visual analysis detected approximately 10 boxes with 87% confidence. This creates a discrepancy of 12 units and places the bin below the minimum stock threshold. Recommended action: create a replenishment request for 25 units from the preferred supplier, pending supervisor approval.
That is where an LLM adds real value.
It explains the situation.
It helps the human understand the decision.
It does not become the decision-maker.
That distinction matters.
Agent 6: Approval and Action Agent
The final agent handles workflow action.
This agent should be tightly governed.
It should not simply place orders because an AI model said so.
Responsibilities:
Create approval task
Notify supervisor
Attach evidence
Record approval or rejection
Create purchase request after approval
Update audit log
Trigger order only when allowed
Output:
{
  "approvalStatus": "pending",
  "notificationSent": true,
  "purchaseRequestDraft": "PR-789",
  "evidenceAttached": true
}
The human approval step is critical.
For low-risk items, the company may eventually allow automatic replenishment under strict rules.
For example:
Only if confidence is above 95%
Only if item is approved for auto-replenishment
Only if cost is below $250
Only if supplier is already approved
Only if discrepancy is confirmed by system history
Only if no policy exception exists
Until then, the safe pattern is:
AI recommends. Humans approve. Systems transact.
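The auto-replenishment rules above translate directly into a deterministic policy gate. A sketch using the illustrative thresholds from the list; real values would live in policy configuration, not code.

```python
# Auto-approval policy gate: every condition must hold, or the
# recommendation goes to a human. No model output can override this.

def may_auto_approve(confidence: float, item_auto_approved: bool,
                     cost: float, supplier_approved: bool,
                     history_confirms: bool, policy_exception: bool) -> bool:
    return (confidence > 0.95
            and item_auto_approved
            and cost < 250.0
            and supplier_approved
            and history_confirms
            and not policy_exception)
```

The 87%-confidence example from earlier fails this gate on the first condition alone, so it routes to human approval.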
The Orchestrator: The Real Control Layer
The orchestrator is the workflow controller.
It should not be just a pass-through API.
It owns the state machine.
It knows what step runs next.
It knows what happens when something fails.
It knows when to stop.
It knows when to ask for a human.
A simple workflow could look like this:
START
↓
IMAGE_INTAKE
↓
If image usable → VISION_ANALYSIS
If image unusable → REQUEST_NEW_IMAGE
↓
If confidence >= 0.75 → INVENTORY_RECONCILIATION
If confidence < 0.75 → MANUAL_REVIEW
↓
If below minimum → SUPPLIER_MATCHING
If inventory healthy → CLOSE_SESSION
↓
If supplier found → RECOMMENDATION
If supplier unknown → PROCUREMENT_REVIEW
↓
If approval required → HUMAN_APPROVAL
If auto-approved by policy → CREATE_ORDER
↓
END
That is architecture.
That is not a loose chain of prompts.
The orchestrator should track:
Session state
Agent outputs
Confidence scores
Errors
Retries
Fallbacks
Human approvals
Audit events
Final outcome
This is where enterprise systems become serious.
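The branching workflow above can be expressed as a small explicit state machine: each state maps the session context to the next state, and the orchestrator steps until a terminal state, recording the trail as it goes. A minimal sketch; state names follow the diagram, thresholds are the ones shown, and the context keys are illustrative.

```python
# Orchestrator as an explicit state machine. The trail it returns
# doubles as the audit record of which path the session took.

TERMINAL = {"REQUEST_NEW_IMAGE", "MANUAL_REVIEW", "CLOSE_SESSION",
            "PROCUREMENT_REVIEW", "HUMAN_APPROVAL", "CREATE_ORDER"}

def next_state(state: str, ctx: dict) -> str:
    if state == "IMAGE_INTAKE":
        return "VISION_ANALYSIS" if ctx["imageUsable"] else "REQUEST_NEW_IMAGE"
    if state == "VISION_ANALYSIS":
        return ("INVENTORY_RECONCILIATION"
                if ctx["confidence"] >= 0.75 else "MANUAL_REVIEW")
    if state == "INVENTORY_RECONCILIATION":
        return "SUPPLIER_MATCHING" if ctx["belowMinimum"] else "CLOSE_SESSION"
    if state == "SUPPLIER_MATCHING":
        return "RECOMMENDATION" if ctx["supplierFound"] else "PROCUREMENT_REVIEW"
    if state == "RECOMMENDATION":
        return "CREATE_ORDER" if ctx["autoApproved"] else "HUMAN_APPROVAL"
    return "END"

def run(ctx: dict) -> list:
    """Step through the workflow, returning the trail of states visited."""
    trail, state = [], "IMAGE_INTAKE"
    while True:
        trail.append(state)
        if state in TERMINAL or state == "END":
            return trail
        state = next_state(state, ctx)
```

A session with a usable image, 87% confidence, a below-minimum bin, and a known supplier walks the full path and stops at HUMAN_APPROVAL, exactly as the diagram prescribes.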
Why Agent Count Is the Wrong Question
One of the common mistakes in multi-agent demos is treating the number of agents as the achievement.
More agents does not mean better architecture.
The better question is:
Why does this agent exist?
An agent should exist because it has a distinct:
Responsibility
Input/output contract
Tool requirement
Data source
Failure mode
Security boundary
Scaling profile
Approval requirement
Ownership boundary
If two agents do the same thing, combine them.
If one agent has too many unrelated responsibilities, split it.
The right number of agents is the number that makes the workflow understandable, governable, testable, and maintainable.
A good reviewer question is:
“Why could this not just be a normal function?”
If the answer is:
“Because we needed more agents,”
that is weak.
If the answer is:
“Because this step requires a separate capability, model, data source, failure path, or approval boundary,”
that is strong.
Choosing the Right Tool for Each Agent
A mature AI architecture does not start with the model.
It starts with the task.
Here is a practical decision guide:
If the task is image interpretation:
Use a multimodal model or computer vision model.
If the task is barcode reading:
Use a barcode/OCR library first.
If the task is inventory lookup:
Use the WMS, ERP, or database.
If the task is threshold comparison:
Use deterministic business rules.
If the task is supplier lookup:
Use procurement database or RAG over trusted supplier documents.
If the task is explanation:
Use an LLM.
If the task is transaction execution:
Use system APIs with approval and audit controls.
This is how you prevent runaway cost.
This is how you prevent over-engineering.
This is how you prevent model hallucination from becoming business action.
The principle is simple:
Use the cheapest, safest, most deterministic tool that can correctly perform the task.
That may be an LLM.
It may also be a database query.
Why This Is Still Familiar Architecture
This is where the whole thing starts to click.
A lot of this looks like the same architecture we have been building for years:
APIs
Microservices
Workflow engines
Databases
Event logs
Rules engines
System integrations
Human approval flows
Observability
Audit trails
The difference is that some endpoints are now smarter.
Some services can reason over messy inputs.
Some services can interpret images.
Some services can summarize ambiguity.
Some services can retrieve context from unstructured documents.
Some services can recommend actions instead of only executing fixed logic.
That is the real shift.
Traditional microservices were deterministic.
They did what we explicitly coded them to do.
AI-enabled agents can operate inside a bounded intent.
They can interpret, classify, summarize, and recommend.
But that does not remove the need for architecture.
It increases the need for architecture.
Because now the system can be wrong in new ways.
That means we need:
Confidence thresholds
Fallback paths
Human approval
Model evaluation
Audit logging
Prompt/version control
Structured outputs
Policy gates
Cost controls
Security boundaries
The old architecture skills still matter.
They matter even more.
Intent at the System Level
The system-level intent defines the business outcome.
For this use case:
system_intent:
  name: "Visual Inventory Replenishment Assistant"
  goal: "Help warehouse teams visually verify inventory, detect discrepancies, and recommend replenishment actions."
  users:
    - warehouse associates
    - inventory managers
    - supervisors
    - procurement teams
  constraints:
    - "Do not automatically place orders without policy approval."
    - "Use warehouse system of record for official inventory."
    - "Use image analysis only as supporting evidence."
    - "Require human approval for high-impact transactions."
    - "Capture audit trail for every decision."
  success_metrics:
    - "Reduce manual cycle count time."
    - "Improve inventory discrepancy detection."
    - "Reduce stockout events."
    - "Improve replenishment accuracy."
    - "Increase auditability."
This is the foundation.
The system is not being told to “build an AI app.”
It is being told to solve a specific operational problem within specific constraints.
That is Intent-Driven Engineering.
Intent at the Agent Level
Each agent should also have its own intent file or configuration.
Example:
agent:
  name: "VisionInventoryAgent"
  intent: "Estimate visible inventory quantity from a warehouse image."
  inputs:
    - imageUrl
    - locationId
    - sessionId
  outputs:
    - detectedItem
    - estimatedQuantity
    - confidence
    - visibleLabels
    - annotatedImageUrl
  tools:
    - multimodalVisionModel
    - ocrReader
    - barcodeReader
  constraints:
    - "Return structured JSON only."
    - "Do not infer supplier unless label evidence exists."
    - "Return low confidence when image is unclear."
    - "Never trigger purchase action directly."
  failure_modes:
    - unreadable_image
    - low_confidence
    - multiple_item_types_detected
    - no_visible_labels
That is powerful.
Now the agent is not just a prompt.
It has a job description, contract, tools, constraints, and failure rules.
That is how you make agents governable.
Suggested Technical Stack
One possible implementation stack:
Frontend:
React web app or mobile app
Eventually native iOS/Android if needed
Backend:
FastAPI orchestrator
Python agent services
PostgreSQL for operational data
Object storage for images
Redis or queue for async jobs
AI / Model Layer:
Vision model for image interpretation
OCR/barcode library for label extraction
LLM for explanation and recommendation
LlamaIndex or similar framework for retrieval if supplier docs are unstructured
Enterprise Integrations:
WMS API
ERP API
Procurement API
Notification service
Ticketing system
Approval workflow
Observability:
Structured logs
Metrics
Trace IDs
Audit log
Model output history
Cost tracking
The exact tools can change.
The architecture pattern stays the same.
Data Model
A basic data model might include:
InventorySession
  sessionId
  locationId
  submittedBy
  submittedAt
  status

ImageEvidence
  imageId
  sessionId
  rawImageUrl
  annotatedImageUrl
  qualityStatus

VisionResult
  resultId
  sessionId
  detectedItem
  estimatedQuantity
  confidence
  visibleLabels

InventorySnapshot
  locationId
  systemExpectedQuantity
  minimumStock
  reorderThreshold
  lastUpdated

ReconciliationResult
  sessionId
  discrepancy
  status
  recommendedAction

SupplierRecommendation
  sessionId
  sku
  supplier
  recommendedOrderQuantity
  estimatedCost

ApprovalDecision
  sessionId
  approver
  decision
  decisionTime
  notes
This matters because the system needs memory and accountability.
A real enterprise system cannot just show a pretty answer and forget what happened.
It has to remember:
What image was submitted
What the model saw
What the system expected
What recommendation was made
Who approved it
What action was taken
That is auditability.
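Two of the entities above, sketched as dataclasses to show the pattern: sessionId is the linking key that makes the trail reconstructable, and approval records are frozen because audit records should never change. In production these would be database tables (e.g. PostgreSQL); the field names follow the data model above.

```python
# Minimal sketch of audit-friendly entities. sessionId ties every
# record in a session together; frozen=True makes approvals immutable.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ApprovalDecision:
    sessionId: str
    approver: str
    decision: str        # "approved" | "rejected"
    decisionTime: datetime
    notes: str = ""

@dataclass
class ReconciliationResult:
    sessionId: str
    discrepancy: int
    status: str
    recommendedAction: str
```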
Error Handling and Fallbacks
A serious architecture must define what happens when things go wrong.
Examples:
Image is blurry:
Ask user to retake photo.
Vision confidence is low:
Route to manual count.
Multiple product types detected:
Ask user to isolate bin or select product.
Supplier cannot be determined:
Route to procurement review.
Inventory system unavailable:
Hold session and retry.
Reorder amount exceeds limit:
Require manager approval.
Model output malformed:
Retry with stricter structured output or fail safely.
Human rejects recommendation:
Close as rejected and store reason.
This is where many demos are incomplete.
Happy path is easy.
Enterprise architecture is about the unhappy paths.
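The "retry with stricter structured output or fail safely" fallback from the list above can be sketched in a few lines. `call_model` is a hypothetical stand-in for the real model client; the retry instruction and the MANUAL_REVIEW routing are illustrative.

```python
# Malformed-output fallback: parse, re-ask once with a stricter
# instruction, then fail safely to manual review instead of crashing.

import json

def parse_or_fallback(call_model, prompt: str, required_keys: set,
                      max_retries: int = 1) -> dict:
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if required_keys <= data.keys():
                return {"ok": True, "data": data}
        except (json.JSONDecodeError, AttributeError):
            pass  # not JSON at all, or JSON that is not an object
        # Stricter retry: demand JSON only, with the required keys.
        attempt_prompt = (prompt + "\nReturn ONLY valid JSON with keys: "
                          + ", ".join(sorted(required_keys)))
    return {"ok": False, "nextStep": "MANUAL_REVIEW"}  # fail safely
```

The session never dies on bad model output; it either recovers a valid structure or lands in a human queue with the evidence intact.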
Security and Governance
This type of system touches operational and financial workflows.
That means governance matters.
Important controls include:
Authentication
Role-based access control
Location-level permissions
Approved vendor lists
Purchase limits
Human approval thresholds
Immutable audit logs
Model output retention
Prompt versioning
Data privacy controls
Cost monitoring
API rate limits
The AI should never become an uncontrolled actor inside the business.
It should operate within policy.
The safest mindset is:
The AI can recommend. The system governs. The human approves where required.
The Real Business Value
This use case could create real value because inventory is money.
If a warehouse thinks it has 500 units but physically has 350, that gap affects:
Finance
Procurement
Fulfillment
Customer orders
Loss prevention
Operations
Warehouse planning
Supplier management
The app could help with:
Faster cycle counts
Reduced manual counting
Better replenishment timing
Fewer stockouts
Better discrepancy detection
Shrinkage and theft investigation
Evidence-based audit trails
Improved warehouse visibility
That is why this kind of system could be valuable.
Not because it is “AI.”
Because it reduces operational friction around inventory accuracy.
Why It Matters
The bigger lesson is not just about warehouses.
The bigger lesson is about how we design AI systems.
The old way was:
Write requirements
Build services
Connect APIs
Manually encode every decision
The new way is:
Define intent
Break it into bounded capabilities
Choose the right model or tool for each capability
Use orchestration to control workflow
Use policy to govern decisions
Use humans for high-impact approvals
Continuously improve from feedback
This is the practical meaning of Intent-Driven Engineering.
It does not throw away traditional architecture.
It upgrades it.
The agents are not magic creatures.
They are intelligent service boundaries.
Some use models.
Some use rules.
Some use APIs.
Some use retrieval.
Some use humans.
The power comes from combining them into a governed workflow that can act on intent.
Key Takeaways
A true multi-agent system is not measured by how many agents it has.
It is measured by whether each agent has a clear reason to exist.
The orchestrator should control the workflow.
The agents should perform bounded capabilities.
The LLM should be used only where reasoning, interpretation, retrieval, or summarization is needed.
Business rules should remain deterministic.
Systems of record should remain authoritative.
Humans should approve high-impact actions.
For the visual inventory use case, the strongest architecture is:
Phone captures evidence
Vision estimates inventory
System compares against expected quantity
Rules determine discrepancy
Supplier data drives replenishment
LLM explains the recommendation
Human approves the action
Audit trail records everything
That is not AI theater.
That is enterprise architecture.
And that is the real point:
Intent-Driven Engineering is not about giving the model more control.
It is about giving the system clearer intent, stronger boundaries, better tools, and governed execution.
That is how we move beyond prompts.

