How It Works
This page explains how Agent Red processes a customer conversation from first message to delivered response. Understanding the pipeline helps you configure agents, tune escalation rules, and interpret analytics data.
End-to-end conversation flow
A single customer message passes through multiple agents before a response is delivered. The diagram below shows the complete path, including the feedback loop when the Critic rejects a response.
What happens at each step
1. API Gateway receives the message. The customer's message arrives over HTTPS. The API Gateway authenticates the request using the tenant's API key, attaches tenant context, and forwards the message into the agent pipeline.
2. Intent Classifier determines the customer's need. The classifier analyzes the message text and assigns one of 18 intent categories using GPT-4o-mini. Seventeen are customer-facing intents, and one (admin_assistance) is reserved for admin-authenticated flows. The classified intent determines which knowledge sources the retrieval agent searches and how the response generator frames its reply.
3. Escalation Detection runs in parallel. While the main pipeline processes the message, the escalation agent independently evaluates whether the conversation requires a human. It assesses customer sentiment, issue complexity, account value, and conversation history. If escalation triggers, the system routes the conversation to a human agent in your help desk (Zendesk, or another connected platform) and notifies the customer that a person is taking over.
Escalation rules are configurable per tenant — you control which situations trigger a handoff to a human agent.
4. Knowledge Retrieval searches your data. The retrieval agent takes the classified intent and customer message and runs a hybrid search against your knowledge base — combining semantic vector similarity with keyword matching for maximum recall. The knowledge base includes:
- Product catalog — synced from Shopify (names, descriptions, prices, availability)
- FAQ database — your custom question-and-answer pairs
- Policy documents — return policies, shipping rules, warranty terms
Knowledge retrieval technical detail
This section covers how articles are vectorized, indexed, and searched. Understanding these details helps you write knowledge base content that retrieves well.
Embedding and indexing
When you publish a knowledge base article, Agent Red immediately generates a vector embedding:
| Parameter | Value |
|---|---|
| Embedding model | OpenAI text-embedding-3-large |
| Vector dimensions | 3,072 |
| Similarity metric | Cosine distance |
| Index type | Cosmos DB DiskANN (approximate nearest neighbor) |
| Data type | float32 |
How articles are prepared for embedding:
- The article's entry type label, title, tags, and content are combined into a single text block.
- The title appears first because positional importance affects semantic encoding — the embedding model gives more weight to earlier text.
- Content is truncated at 8,000 characters to stay within the model's token budget.
- The resulting text is sent to OpenAI and the 3,072-dimension vector is stored directly on the Cosmos DB document alongside the article content.
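A minimal sketch of that preparation step — the function name, field labels, and separator are assumptions; only the title-first ordering and the 8,000-character truncation come from the list above:

```python
MAX_CONTENT_CHARS = 8_000  # truncation limit from the list above

def build_embedding_text(entry_type: str, title: str, tags: list[str], content: str) -> str:
    """Combine article fields into one text block (layout is illustrative)."""
    parts = [
        title,                            # title first: earlier text carries more weight
        f"Type: {entry_type}",
        f"Tags: {', '.join(tags)}" if tags else "",
        content[:MAX_CONTENT_CHARS],      # truncate rather than fail on long articles
    ]
    return "\n".join(p for p in parts if p)

text = build_embedding_text(
    "FAQ", "Return Policy (Physical Products)",
    ["returns", "refund", "30-day"],
    "Items may be returned within 30 days of delivery.",
)
```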
Each article produces one embedding (not chunked). This differs from conversation memory (Layer 2), which chunks transcripts into ~250-token segments with 30-token overlap.
Change detection: Agent Red hashes the article content (SHA-256 of title + content) and skips re-embedding if the hash matches the previous version. This avoids unnecessary API calls when you save an article without changing its text.
Hybrid search with Reciprocal Rank Fusion
Agent Red does not rely on vector similarity alone. Every search uses a hybrid strategy that fuses two ranking signals:
| Signal | Weight | What it captures |
|---|---|---|
| Vector similarity | 70% | Semantic meaning — understands that "Where's my package?" and "shipping status" are related even though they share no keywords |
| BM25 keyword score | 30% | Exact term matching — ensures that a search for "SKU-4521" finds the article containing that exact string |
Why hybrid matters: Pure vector search can miss exact identifiers (order numbers, SKUs, policy names). Pure keyword search misses paraphrased questions. The hybrid approach captures both — and the title receives a 3x keyword boost so articles with relevant titles rank higher.
Reciprocal Rank Fusion (RRF) merges the two ranked lists using the formula score(d) = Σ(weight / (k + rank)), with smoothing constant k = 60. The fused scores are then normalized to the 0–1 range: results scoring below 0.1 are excluded, and a score of 0.7 or above indicates a strong match.
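A compact sketch of weighted RRF with the parameters described above (70/30 weights, k = 60, final normalization to 0–1) — the function shape is illustrative, not Agent Red's actual code:

```python
def rrf_fuse(vector_ranking: list[str], keyword_ranking: list[str],
             w_vector: float = 0.7, w_keyword: float = 0.3, k: int = 60) -> dict[str, float]:
    """Weighted RRF: score(d) = sum over lists of weight / (k + rank)."""
    scores: dict[str, float] = {}
    for weight, ranking in ((w_vector, vector_ranking), (w_keyword, keyword_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    top = max(scores.values())  # normalize so thresholds like 0.1 / 0.7 apply
    return {d: s / top for d, s in sorted(scores.items(), key=lambda kv: -kv[1])}

fused = rrf_fuse(["faq-returns", "policy-shipping"], ["policy-shipping", "faq-returns"])
```

Note that raw RRF scores with k = 60 are tiny fractions; the normalization step is what makes the 0.1 and 0.7 thresholds meaningful.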
Retrieval parameters
| Parameter | Value | Description |
|---|---|---|
| Default results returned | 5 | Top 5 highest-scoring articles |
| Maximum results | 20 | Hard ceiling to control context size |
| Candidate pool | 3× top-k | Wider initial retrieval improves fusion quality |
| Minimum relevance score | 0.1 | Results below this threshold are excluded |
| High relevance threshold | 0.7 | Tracked in analytics as "strong match" |
| Maximum context budget | 4,000 characters | Total text sent to the response generator |
| BM25 k1 | 1.5 | Term frequency saturation |
| BM25 b | 0.75 | Document length normalization |
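To make the BM25 parameters concrete, here is a toy single-document scorer using k1 = 1.5, b = 0.75, and a 3x title boost. This is an illustrative sketch, not the production ranking code — the real system may apply the title boost differently:

```python
import math
from collections import Counter

K1, B = 1.5, 0.75       # saturation / length normalization (from the table above)
TITLE_BOOST = 3.0       # title terms count 3x (illustrative implementation)

def bm25_score(query: str, title: str, body: str, corpus: list[str]) -> float:
    """Toy BM25 for one document, with a 3x weight on title terms."""
    doc_terms = Counter(t.lower() for t in body.split())
    for t in title.lower().split():
        doc_terms[t] += TITLE_BOOST           # triple-weight title terms
    doc_len = sum(doc_terms.values())
    avg_len = sum(len(d.split()) for d in corpus) / len(corpus)
    score = 0.0
    for term in query.lower().split():
        tf = doc_terms.get(term, 0)
        if not tf:
            continue
        df = sum(term in d.lower().split() for d in corpus)
        idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
        score += idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * doc_len / avg_len))
    return score
```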
What this means for your knowledge base
- Write clear, specific titles. Titles are boosted 3x in keyword scoring. "Return Policy — Physical Products" retrieves better than "Policy Document #3."
- Use the exact terms your customers use. If customers ask about "shipping times," include that phrase in your article — BM25 rewards exact matches.
- Keep articles focused on one topic. A single embedding per article means a sprawling article covering returns, shipping, AND warranties produces a diluted vector. Three focused articles retrieve more precisely.
- Tags help retrieval. Tags are included in the embedding text, so adding tags like "returns," "refund," "30-day" gives the vector model more signal.
Caching
Repeated or similar queries are accelerated by three caching layers:
- Exact query cache — identical queries return cached results instantly.
- Semantic cache — queries that are semantically similar to recent queries reuse the same embedding, skipping the OpenAI API call.
- Embedding cache — prevents re-embedding the same query text within a session.
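The exact-query and embedding layers can be sketched as simple dictionaries; the semantic layer additionally compares query embeddings by similarity. Here `embed` is a stand-in for the real OpenAI call, and all names are illustrative:

```python
result_cache: dict[str, list[str]] = {}       # layer 1: exact query text -> results
embedding_cache: dict[str, list[float]] = {}  # layer 3: query text -> vector

def embed(text: str) -> list[float]:
    # Stand-in for the OpenAI embedding call.
    return [float(len(text)), float(sum(map(ord, text)))]

def cached_embed(text: str) -> list[float]:
    if text not in embedding_cache:           # skip re-embedding within a session
        embedding_cache[text] = embed(text)
    return embedding_cache[text]

def search(query: str) -> list[str]:
    if query in result_cache:                 # exact-match hit: instant return
        return result_cache[query]
    _vector = cached_embed(query)             # would feed the hybrid search
    results = [f"doc-for:{query}"]            # stand-in for the real hybrid search
    result_cache[query] = results
    return results
```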
Fallback behavior
If the primary hybrid search is unavailable (for example, if the embedding API is temporarily unreachable), the system degrades gracefully:
- Hybrid (default) — vector + BM25 with RRF fusion
- Vector-only fallback — if BM25 index is unavailable
- BM25-only fallback — if embedding generation fails
- Empty result — if all search paths fail, the response generator works without retrieved context and is more likely to escalate
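That degradation ladder amounts to trying strategies in order. A minimal sketch with stand-in search functions (the real search clients are internal):

```python
def search_with_fallback(query: str, strategies) -> list[str]:
    """Try each search path in order; degrade gracefully when one fails."""
    for strategy in strategies:
        try:
            return strategy(query)
        except RuntimeError:        # e.g. embedding API unreachable
            continue
    return []                       # all paths failed: generate without context

# Stand-in strategies mirroring the ladder above.
def hybrid(q: str) -> list[str]: raise RuntimeError("embedding API down")
def vector(q: str) -> list[str]: raise RuntimeError("embedding API down")
def bm25(q: str) -> list[str]: return [f"keyword match for {q!r}"]

results = search_with_fallback("return policy", [hybrid, vector, bm25])
```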
5. Response Generator composes the reply. The response generator receives the classified intent, retrieved knowledge, full conversation history, and Persistent Customer Memory context. It uses GPT-4o to compose a natural-language reply that:
- Answers the customer's question using retrieved facts (not hallucinated information)
- Maintains your brand's tone and voice
- Follows your configured response policies (greeting style, sign-off, escalation language)
- Handles multi-turn context (remembers what was discussed earlier in the conversation)
- Personalizes the response using the customer's profile, prior interactions, and learned preferences
Response generation is usually the largest portion of per-conversation AI cost because it uses the more capable GPT-4o model.
6. Critic / Supervisor validates before delivery. The critic agent is the final gate before the customer sees a response. It checks:
- Factual accuracy — Does the response match the retrieved knowledge? Are product names, prices, and policies correct?
- Policy compliance — Does the response follow your configured business rules?
- Content safety — Does the response contain inappropriate, harmful, or off-brand content?
If validation fails, the critic returns the response to the generator with a specific rejection reason, and the generator revises it. This feedback loop runs until the response passes or reaches a maximum retry count (default: 2), at which point the system escalates to a human agent.
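The feedback loop can be sketched as follows. `generate` and `critique` are hypothetical stand-ins for the response generator and critic agents; only the default retry count of 2 comes from the text above:

```python
MAX_RETRIES = 2  # default retry ceiling from the text above

def deliver_with_validation(generate, critique, max_retries: int = MAX_RETRIES):
    """Run the generate -> validate loop; escalate when retries run out.

    `generate(feedback)` and `critique(response)` are hypothetical stand-ins
    for the response generator and critic agents.
    """
    feedback = None
    for _ in range(max_retries + 1):      # initial attempt plus retries
        response = generate(feedback)     # feedback carries the rejection reason
        approved, feedback = critique(response)
        if approved:
            return ("DELIVER", response)  # all checks passed
    return ("ESCALATE", None)             # fail closed: hand off to a human agent
```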
The critic applies a fail-closed policy: responses are blocked unless all checks pass. This conservative approach prioritizes safety over throughput.
7. Analytics records the interaction. The analytics agent captures structured data from every conversation: intent distribution, response quality scores, escalation rates, latency, and customer satisfaction signals. This data powers the analytics dashboard and feeds continuous improvement cycles.
Communication protocols
The six agents run in-process within a single API Gateway container. They communicate through synchronous HTTP endpoints — the main pipeline calls agents sequentially and all processing completes within a single request lifecycle.
HTTP endpoints (synchronous pipeline)
Each agent exposes a POST endpoint within the API Gateway process, and the pipeline invokes them in order (intent → knowledge → response → critic) via internal HTTP calls. Analytics data is captured synchronously during pipeline execution and persisted to Cosmos DB. Health check endpoints are exposed for Azure Container Apps readiness probes.
Internal message format
Agents exchange messages as JSON payloads over internal HTTP endpoints. Every message carries conversation context and tenant isolation:
```json
{
  "conversation_id": "conv-abc123",
  "tenant_id": "tenant-acme-corp",
  "message": "Where is my order #12345?",
  "intent": "order_status",
  "context": {
    "history": [...],
    "customer_profile": {...},
    "retrieved_knowledge": [...]
  },
  "metadata": {
    "language": "en",
    "timestamp": "2026-01-15T14:32:00Z"
  }
}
```
| Field | Purpose |
|---|---|
| conversation_id | Threads messages into a conversation (maintained across turns) |
| tenant_id | Ensures tenant isolation throughout the pipeline |
| intent | Classified intent from the Intent Classifier |
| context | Accumulated pipeline context (history, profile, knowledge) |
| metadata | Language, timestamps, and routing information |
The conversation_id persists across an entire customer conversation, allowing agents to reference previous messages. End-to-end traceability is available through OpenTelemetry and Application Insights.
PII protection
Agent Red provides PII protection at three levels:
Pipeline PII tokenization
Before any customer message reaches the AI models, Agent Red's PII tokenizer scans the text and replaces detected email addresses and phone numbers with reversible UUID tokens. The AI processes the tokenized text, and after the Critic validates the response, detected tokens are replaced with the original values before delivery to the customer. This means the AI models never see raw PII during processing.
Token mappings are stored in an isolated Cosmos DB container with a 7-day TTL, and are automatically purged when a customer exercises their GDPR right to erasure.
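A minimal sketch of reversible tokenization — the regex detectors shown here are simplified illustrations, and production PII detection is more thorough:

```python
import re
import uuid

# Illustrative detector patterns; real detectors cover far more cases.
PII_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}"   # email addresses
    r"|\+?\d[\d\s().-]{7,}\d"          # phone numbers
)

def tokenize_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with reversible UUID tokens (a minimal sketch)."""
    mapping: dict[str, str] = {}
    def repl(match: re.Match) -> str:
        token = f"<pii:{uuid.uuid4()}>"
        mapping[token] = match.group(0)   # remember original for detokenization
        return token
    return PII_PATTERN.sub(repl, text), mapping

def detokenize_pii(text: str, mapping: dict[str, str]) -> str:
    # Runs after the Critic approves the response, before delivery.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The single-pass substitution matters: tokens inserted during the scan are never re-scanned, so a UUID's digits can't be mistaken for a phone number.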
Storage-layer PII scrubbing
When PII scrubbing is enabled in the Memory & Privacy settings, Agent Red automatically redacts email addresses and phone numbers from conversation transcripts before storing them. This protects customer data at rest while leaving the live conversation experience unchanged.
Azure security perimeter
All AI processing uses Azure OpenAI Service, which means customer data stays within the Azure security perimeter. Data does not leave Azure infrastructure during conversation processing.
Content safety pipeline
The Critic / Supervisor agent runs a multi-check validation pipeline on every generated response before delivery.
| Check | What it validates | Failure action |
|---|---|---|
| Factual accuracy | Response matches retrieved knowledge; no hallucinated data | Regenerate with stricter grounding |
| Policy compliance | Response follows business rules (refund limits, warranty terms) | Regenerate with policy context |
| Content safety | No inappropriate, harmful, or off-brand content | Regenerate or escalate |
The safety pipeline catches issues before they reach customers. The system uses a fail-closed policy — responses are blocked unless the critic explicitly approves them.
Scaling behavior
Agent Red runs as a unified API Gateway on Azure Container Apps with native auto-scaling. The six agents run in-process within the gateway container, so scaling is at the container level rather than per-agent.
| Behavior | Description |
|---|---|
| Scale-to-zero | Container stops when idle, restarts on first request |
| Auto-scale up | Azure Container Apps scales replicas based on HTTP concurrency |
| Serverless database | Cosmos DB Serverless charges only for consumed RUs — no idle cost |
| Design target | 680 concurrent merchant tenants (SPEC-1516) |
Persistent Customer Memory
Most support platforms treat every conversation as a blank slate. Agent Red maintains a layered memory system that builds context over the lifetime of each customer relationship. The response generator draws on this memory to personalize every interaction — greeting returning customers by name, referencing prior issues, and adapting to individual communication preferences.
Memory architecture
How each layer works
Layer 1: Customer Context (all tiers) — A structured profile assembled from Shopify data, integration sources, and plan metadata. Injected into every conversation automatically. The response generator knows the customer's name, plan tier, active integrations, and communication preferences from the first message.
Layer 2: Conversation Memory (all tiers) — After each conversation, the transcript is cleansed of PII and transient data (session tokens, temporary URLs), chunked, and embedded into Cosmos DB's vector store. When a customer returns, the response generator retrieves semantically relevant prior conversations — no need for the customer to repeat themselves.
Layer 3: Cross-Session Learning (Professional and Enterprise) — A memory framework analyzes accumulated conversations to extract durable patterns: preferred communication style, recurring issues, escalation triggers, and product preferences. These learned insights are injected alongside the customer profile, enabling the AI to adapt its tone and proactively address known issues.
Layer 4: Dedicated Model Training (Enterprise add-on) — After a customer accumulates 1,000+ interactions, Agent Red can create a fine-tuned AI model specifically for that customer. The fine-tuning pipeline trains on the customer's historical data via Azure OpenAI, producing a per-customer model that delivers maximum personalization. Models are periodically re-trained as new interactions accumulate.
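Layer 2's chunking scheme (~250-token chunks with 30-token overlap) can be sketched as follows, with whitespace-split words standing in for the real tokenizer:

```python
CHUNK_TOKENS, OVERLAP_TOKENS = 250, 30  # sizes from the Layer 2 description above

def chunk_transcript(text: str) -> list[str]:
    """Split a transcript into overlapping chunks.

    Whitespace 'tokens' are a stand-in for the real tokenizer; the overlap
    preserves context that would otherwise be cut at a chunk boundary.
    """
    tokens = text.split()
    step = CHUNK_TOKENS - OVERLAP_TOKENS
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + CHUNK_TOKENS]))
        if start + CHUNK_TOKENS >= len(tokens):
            break                       # last chunk reached the end
    return chunks
```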
Memory by tier
| Layer | Starter | Professional | Enterprise |
|---|---|---|---|
| Customer Context (L1) | Included | Included | Included |
| Conversation Memory (L2) | Included | Included | Included |
| Cross-Session Learning (L3) | — | Included | Included |
| Dedicated Model Training (L4) | — | — | Add-on |
Privacy and data handling
- Layers 1–3 operate under GDPR/CCPA legitimate interest — no additional consent required
- All memory data is tenant-isolated (customer A's memory never appears in customer B's context)
- Customers can request deletion of their memory profile and all associated data
- Conversation transcripts are cleansed of PII before vectorization
See the Privacy Policy for full details on data handling and retention.
Next steps
- Initial Setup — What you need to get Agent Red running for your store.
- Shopify Integration — Connect your product catalog and order data.
© 2026 Remaker Digital, a DBA of VanDusen & Palmeter, LLC. All rights reserved.