We Built an AI Agent That Triages 500 Support Tickets/Day. Here's the Architecture.
A B2B SaaS burning $12K/mo on manual triage. We replaced 80% of the work with an AI agent in 4 weeks. $340/mo in API costs. 94% accuracy.
A B2B SaaS company came to us with a support problem. Not an unusual one — in fact, the most common kind.
They were processing 500+ support tickets per day across email, Intercom, and their API status page. Three full-time support agents spent roughly 60% of their time on triage: reading tickets, categorizing them (billing, bug, feature request, account access, integration issue), assigning priority, pulling up customer context, and routing to the right person.
The numbers: $12K/month in support labor allocated to triage alone. Average first-response time of 4.2 hours. Customer satisfaction scores declining because responses were slow, not because they were wrong.
Four weeks later, we had an AI agent handling triage. $340/month in API costs. 94% classification accuracy over 30 days. First-response time dropped from 4.2 hours to 12 minutes for the 80% of tickets the agent handles autonomously. The support agents now spend their time on the 20% that requires human judgment.
Here's exactly how we built it.
The problem, in detail
Before building anything, we spent 3 days understanding the workflow. This is the part most teams skip — and it's the reason most AI agents fail in production.
The existing workflow
1. Ticket arrives via email, Intercom, or API webhook
2. A support agent reads the ticket
3. Agent checks the customer's account: plan tier, recent tickets, feature flags, billing status
4. Agent categorizes: billing (22%), bug report (31%), feature request (18%), account access (12%), integration issue (11%), other (6%)
5. Agent assigns priority: P1 (outage/data loss), P2 (blocking issue), P3 (non-blocking), P4 (question/request)
6. Agent drafts a first response (or routes to engineering for P1/P2 bugs)
7. For 60% of tickets, the response is a variation of something that's been sent before
The key insight: steps 2-6 are pattern matching. The agent reads text, looks up context, applies rules, and outputs a classification plus a routing decision. This is exactly what LLMs do well.
The other key insight: step 7 means the majority of responses are templates with customization. The AI doesn't need to compose novel prose. It needs to pick the right template and fill in the customer-specific details.
What we measured before building
- Classification accuracy of existing agents: We manually scored 200 random tickets. Human agents were 91% accurate on categorization and 87% on priority. Those became our benchmarks. The AI needs to match or exceed these, not hit 100%.
- Response pattern frequency: 47% of responses were nearly identical to a previous response for the same category. Another 28% were variations with customer-specific details swapped in.
- Time allocation: 60% triage and routing, 25% responding to routine tickets, 15% handling complex issues requiring investigation.
The architecture
Overview
Ticket Source (Email/Intercom/API)
│
▼
Webhook Receiver (Next.js API route)
│
▼
Preprocessing (normalize + extract metadata)
│
▼
LLM Classification (Claude Haiku)
│
├──→ Tool Call: Fetch customer context
├──→ Tool Call: Search recent tickets
├──→ Tool Call: Search knowledge base
│
▼
Classification Output (category + priority + reasoning)
│
├──[P1/P2 Bug]──→ Page engineering on-call
├──[Routine]──→ Draft response + auto-send queue
└──[Complex]──→ Human review queue with context
│
▼
Monitoring Dashboard (accuracy + latency + cost)
Component 1: Webhook receiver
Every ticket source feeds into a single webhook endpoint. We normalize the format upfront so the rest of the pipeline doesn't care where the ticket came from.
// Normalized ticket format
interface TriageTicket {
  id: string;
  source: "email" | "intercom" | "api";
  subject: string;
  body: string;
  senderEmail: string;
  senderName: string;
  customerId: string | null; // Resolved from email lookup
  timestamp: string;
  attachments: { name: string; type: string; url: string }[];
  metadata: Record<string, unknown>;
}
The normalization step handles:
- Email parsing (strip signatures, quoted replies, HTML to plain text)
- Intercom webhook payload extraction
- API status page alert formatting
- Customer ID resolution from email address (database lookup)
This took about 2 days of the 4-week build. Not glamorous, but getting clean input to the LLM is non-negotiable.
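Most of that preprocessing effort went into email cleanup. Here's a minimal sketch of the quoted-reply and signature stripping — the heuristics and function name are illustrative, and the production parser handles many more edge cases:

```typescript
// Illustrative sketch of email-body cleanup. Real-world parsing needs far
// more heuristics (HTML stripping, forwarded headers, locale-specific
// reply markers); this shows only the basic shape.
function cleanEmailBody(raw: string): string {
  const lines = raw.split("\n");
  const kept: string[] = [];
  for (const line of lines) {
    // Stop at quoted-reply markers ("On ... wrote:", "> quoted text")
    if (/^On .+ wrote:$/.test(line.trim()) || line.trimStart().startsWith(">")) break;
    // Stop at the conventional signature delimiter ("-- ")
    if (line.trim() === "--") break;
    kept.push(line);
  }
  return kept.join("\n").trim();
}
```

Everything below the first quoted-reply marker is discarded, so only the customer's newest message reaches the LLM.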
Component 2: LLM classification with tool calls
This is the core of the agent. We use Claude Haiku (now Claude 3.5 Haiku) for the classification step. The choice was deliberate:
- Speed: Haiku responds in 300-800ms. For triage, speed matters more than the absolute best reasoning.
- Cost: roughly $0.009 per ticket (an average of 1,200 input tokens and 400 output tokens per call, multiplied across the classification call and its tool-call round trips). At 500 tickets/day, that's about $4.50/day for classification alone.
- Accuracy: For categorization tasks with clear rubrics, Haiku's accuracy is within 2% of Sonnet. We tested both extensively.
The system prompt is structured:
const systemPrompt = `You are a support ticket triage agent for [CompanyName],
a B2B SaaS platform for [domain].
Your job:
1. Classify the ticket into exactly one category
2. Assign a priority level
3. Determine if this can be auto-responded or needs human review
4. If auto-respond, draft the response
## Categories
- billing: Payment issues, plan changes, invoices, refunds
- bug: Something is broken or not working as expected
- feature_request: Asking for new functionality
- account_access: Login issues, password resets, permission problems
- integration: API issues, webhook failures, third-party connections
- other: Doesn't fit above categories
## Priority Levels
- P1: Service outage, data loss, security issue. Page engineering.
- P2: Blocking issue for customer, degraded functionality.
- P3: Non-blocking issue, workaround available.
- P4: Question, feedback, or non-urgent request.
## Rules
- When unsure between two categories, pick the one with higher impact
- P1 tickets ALWAYS go to human review, regardless of category
- Billing tickets from Enterprise plan customers are minimum P2
- If the customer has filed 3+ tickets in the past week, escalate priority by one level
- NEVER auto-respond to tickets mentioning "data loss", "breach", or "legal"
Use the available tools to look up customer context before classifying.`;
The agent has access to three tools:
const tools = [
  {
    name: "get_customer_context",
    description: "Fetch customer account details: plan, MRR, company size, account health score",
    input_schema: {
      type: "object",
      properties: {
        customer_id: { type: "string" },
      },
      required: ["customer_id"],
    },
  },
  {
    name: "search_recent_tickets",
    description: "Search customer's recent support history for patterns",
    input_schema: {
      type: "object",
      properties: {
        customer_id: { type: "string" },
        days: { type: "number", default: 14 },
      },
      required: ["customer_id"],
    },
  },
  {
    name: "search_knowledge_base",
    description: "Search help articles and documentation for relevant solutions",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string" },
      },
      required: ["query"],
    },
  },
];
The tool calls are critical. Without customer context, the agent can't apply rules like "Enterprise customers get P2 minimum" or "escalate if 3+ tickets this week." Without the knowledge base search, the auto-responses would be generic instead of linking to specific help articles.
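Classification with tools is a loop: the model asks for a tool, we execute it, feed the result back, and repeat until the model returns a final answer. Here's a minimal sketch shaped after the Anthropic Messages API tool-use protocol (`stop_reason: "tool_use"`, `tool_result` blocks). The model call is injected as a parameter so the loop logic is testable without the API, and the tool implementations are hypothetical stand-ins for the production ones:

```typescript
// Sketch of the classify-with-tools loop. The message/block shapes follow
// the Anthropic Messages API; callModel and the tool implementations are
// injected, illustrative stand-ins for the real API client and services.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

interface ModelTurn {
  stop_reason: "tool_use" | "end_turn";
  content: ContentBlock[];
}

type ToolImpl = (input: Record<string, unknown>) => Promise<string>;

async function classifyWithTools(
  ticketText: string,
  callModel: (messages: unknown[]) => Promise<ModelTurn>,
  toolImpls: Record<string, ToolImpl>,
): Promise<string> {
  const messages: unknown[] = [{ role: "user", content: ticketText }];
  for (let round = 0; round < 5; round++) { // cap tool-call rounds
    const turn = await callModel(messages);
    if (turn.stop_reason !== "tool_use") {
      // Final answer: return the model's text block (the classification)
      const text = turn.content.find((b): b is { type: "text"; text: string } => b.type === "text");
      return text ? text.text : "";
    }
    // Echo the assistant turn, then answer each tool_use with a tool_result
    messages.push({ role: "assistant", content: turn.content });
    const results: unknown[] = [];
    for (const block of turn.content) {
      if (block.type !== "tool_use") continue;
      results.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: await toolImpls[block.name](block.input),
      });
    }
    messages.push({ role: "user", content: results });
  }
  throw new Error("tool-call loop exceeded round limit");
}
```

The round cap matters in production: a misbehaving prompt or flaky tool should fail loudly, not loop forever burning tokens.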
Component 3: Response generation
For tickets classified as "routine" (P3/P4 with a clear category), the agent drafts a response. We don't use free-form generation. Instead, we have a library of response templates and the agent selects + customizes:
const responseTemplates = {
  billing_refund: {
    template: `Hi {{name}},

Thanks for reaching out about your refund request.

{{#if eligible}}
I've initiated a refund of {{amount}} to your {{payment_method}}.
You should see it reflected in 5-10 business days.
{{else}}
I've reviewed your account and unfortunately this falls outside
our refund policy ({{reason}}). Here's what I can offer instead:
{{alternative}}
{{/if}}

Is there anything else I can help with?`,
    requires: ["refund_eligibility_check"],
  },
  // ... 40+ templates covering common scenarios
};
The agent's job is to:
- Select the correct template
- Fill in the variables using customer context
- Decide whether to auto-send or queue for human review
We set a confidence threshold: if the agent's classification confidence is above 0.85, it auto-sends for P3/P4 tickets. Below 0.85, it queues for human review with the draft pre-filled.
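Put together, the fill-and-gate step looks roughly like this. The 0.85 threshold and the P3/P4 rule come from above; the simple `{{var}}` substitution is illustrative and omits the Handlebars-style `{{#if}}` conditionals shown in the template:

```typescript
// Illustrative sketch: simple {{var}} substitution (the production system
// uses a Handlebars-style engine with conditionals) plus the auto-send gate.
function fillTemplate(template: string, vars: Record<string, string>): string {
  // Leave unknown variables in place so a human reviewer spots them
  return template.replace(/\{\{(\w+)\}\}/g, (_m, key: string) => vars[key] ?? `{{${key}}}`);
}

function shouldAutoSend(confidence: number, priority: "P1" | "P2" | "P3" | "P4"): boolean {
  // Auto-send only routine tickets the model is confident about;
  // everything else queues for human review with the draft pre-filled
  return confidence > 0.85 && (priority === "P3" || priority === "P4");
}
```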
Component 4: Routing and escalation
Different classifications trigger different actions:
| Classification | Priority | Action |
|---|---|---|
| Bug | P1 | Page engineering on-call via PagerDuty |
| Bug | P2 | Create Linear ticket, assign to engineering queue |
| Bug | P3-P4 | Draft response with workaround, human review |
| Billing | Any | Route to billing team with account context |
| Feature request | Any | Log to feature request tracker, auto-respond |
| Account access | P1-P2 | Route to security team |
| Account access | P3-P4 | Auto-respond with self-service password reset |
| Integration | Any | Route to developer support with API logs |
The routing is implemented as a simple decision tree — not another LLM call. Once classification is done, routing is deterministic. This is important: you don't want probabilistic behavior in your escalation logic.
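As a sketch, the table above collapses into a plain lookup function. The action names are illustrative labels for the integrations mentioned (PagerDuty, Linear), not the production identifiers:

```typescript
// Deterministic routing: classification in, action out. No LLM call.
// Action strings are hypothetical labels for the article's integrations.
type Category = "bug" | "billing" | "feature_request" | "account_access" | "integration" | "other";
type Priority = "P1" | "P2" | "P3" | "P4";

function route(category: Category, priority: Priority): string {
  if (category === "bug") {
    if (priority === "P1") return "page_engineering_oncall";     // PagerDuty
    if (priority === "P2") return "create_linear_ticket";        // Linear queue
    return "draft_workaround_human_review";
  }
  if (category === "billing") return "route_billing_team";
  if (category === "feature_request") return "log_feature_tracker_autorespond";
  if (category === "account_access") {
    return priority === "P1" || priority === "P2"
      ? "route_security_team"
      : "autorespond_self_service_reset";
  }
  if (category === "integration") return "route_developer_support";
  return "human_review";
}
```

Because this is a pure function over two enums, it can be exhaustively unit-tested — every cell of the routing table gets a test case.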
Component 5: Monitoring and evaluation
This is the component that keeps the system working over time. We track:
Classification accuracy:
- Every auto-responded ticket has a "Was this helpful?" prompt
- Human reviewers mark whether the pre-filled classification was correct
- Weekly accuracy report: category accuracy, priority accuracy, false positive rate for auto-responses
- Alert if accuracy drops below 90% on any metric
Operational metrics:
- Tickets processed per hour
- Average classification latency (target: < 2 seconds)
- Tool call success rates (API failures, timeouts)
- Auto-response rate (target: 60-70% of total tickets)
- Human review queue depth
Cost tracking:
- Per-ticket cost breakdown: classification, tool calls, response generation
- Daily/weekly/monthly totals
- Cost per category (bug tickets are more expensive — they require more tool calls)
- Alert if daily cost exceeds 2x the running average
// Example: per-ticket cost tracking
interface TicketCost {
  ticketId: string;
  classification: {
    inputTokens: number;
    outputTokens: number;
    cost: number; // USD
    latencyMs: number;
  };
  toolCalls: {
    tool: string;
    latencyMs: number;
    cost: number;
  }[];
  responseGeneration: {
    inputTokens: number;
    outputTokens: number;
    cost: number;
    latencyMs: number;
  } | null;
  totalCost: number;
  totalLatencyMs: number;
}
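A sketch of the cost rollup and the 2x-running-average alert. The per-million-token prices here are placeholder assumptions for illustration, not figures from this engagement:

```typescript
// Assumed per-million-token prices (placeholders -- check current pricing)
const INPUT_PER_MTOK = 0.80;  // USD per 1M input tokens
const OUTPUT_PER_MTOK = 4.00; // USD per 1M output tokens

function llmCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_PER_MTOK + outputTokens * OUTPUT_PER_MTOK) / 1_000_000;
}

function dailyCostAlert(todayUsd: number, trailingDailyUsd: number[]): boolean {
  // Alert if today's spend exceeds 2x the running average of prior days
  const avg = trailingDailyUsd.reduce((a, b) => a + b, 0) / trailingDailyUsd.length;
  return todayUsd > 2 * avg;
}
```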
The results: 30 days in production
Accuracy
| Metric | Human Agents | AI Agent |
|---|---|---|
| Category accuracy | 91% | 94% |
| Priority accuracy | 87% | 92% |
| False auto-responses | N/A | 3.2% |
The AI agent is more accurate than the human agents were. This surprised us initially, but it makes sense: human agents get tired, distracted, and inconsistent across a shift. The AI applies the same rubric to every ticket.
The 3.2% false auto-response rate means roughly 16 tickets per day get an incorrect auto-response. We mitigate this with the "Was this helpful?" feedback loop — if a customer marks it unhelpful, the ticket immediately routes to a human.
Speed
| Metric | Before | After |
|---|---|---|
| First response time (overall) | 4.2 hours | 47 minutes |
| First response time (auto-responded) | 4.2 hours | 12 minutes |
| First response time (human reviewed) | 4.2 hours | 1.8 hours |
| Triage time per ticket | 6-8 minutes | 1.4 seconds |
The 12-minute auto-response time includes a deliberate 10-minute delay. We don't send instant responses because it feels robotic to customers. The 10-minute delay makes the response feel like a fast human, not a bot. This was a product decision, not a technical limitation.
Cost
| Component | Monthly Cost |
|---|---|
| Claude Haiku (classification) | $135 |
| Claude Haiku (response generation) | $95 |
| Tool call APIs (Supabase, Linear) | Included in existing plans |
| Knowledge base search (embedding queries) | $45 |
| Monitoring infrastructure | $65 |
| Total | $340/mo |
Previous cost: $12K/month in support labor for triage. The three support agents now spend their time on complex tickets, customer success calls, and documentation improvement — work that actually requires human judgment.
ROI: $11,660/month in savings, minus the one-time build cost (paid back in month 2).
Tech decisions and why we made them
Claude Haiku over GPT-4o-mini
Both are fast and cheap. We tested both on a 500-ticket evaluation set. Haiku was 2% more accurate on our specific classification rubric and had more consistent latency (less variance in response time). For other rubrics, GPT-4o-mini might win. The point: benchmark on your data, not on general benchmarks.
Template-based responses over free-form generation
Free-form generation is more flexible but less controllable. A template with variables gives you:
- Consistent tone and formatting
- Guaranteed inclusion of required information (refund policy links, etc.)
- Easy updates when policies change (edit the template, not retrain the model)
- Lower hallucination risk (the LLM fills blanks, doesn't compose)
The tradeoff: templates are rigid. For the 20% of tickets that don't fit a template, we fall back to human review. This is the right tradeoff for a support system where consistency matters more than creativity.
Deterministic routing over LLM-based routing
Once you have a classification, routing should be a lookup table, not another LLM call. LLMs are non-deterministic. You don't want a P1 bug ticket sometimes going to engineering and sometimes not because the LLM decided differently. Rules are cheap, fast, and predictable.
10-minute auto-response delay
We A/B tested this. Instant responses got a 12% lower satisfaction score than responses with a 10-minute delay, even when the content was identical. Customers assumed instant responses were automated and didn't read them carefully. Delayed responses were perceived as personal.
What we'd do differently
Start with a narrower scope
We built the full triage system — all categories, all sources, auto-responses — in one sprint. In hindsight, we should have started with email-only, classification-only (no auto-responses), and expanded from there. The email source represented 70% of ticket volume, and classification alone would have saved 40% of the triage time.
Build the eval suite larger from the start
Our initial eval set was 200 tickets. That wasn't enough to catch edge cases in the "integration" category (which had more variety than we expected). We expanded to 500 tickets in week 2. Starting with 500 would have caught the issue earlier.
Invest more in the preprocessing pipeline
Dirty input is the biggest source of classification errors. Email parsing, in particular, produces artifacts that confuse the LLM — forwarded headers, HTML remnants, auto-generated signatures. We spent 3 days on preprocessing initially and another 2 days fixing issues in week 3. Front-loading that investment would have been more efficient.
The pattern for building AI agents
The triage agent follows a pattern we've used across multiple projects:
- Understand the workflow before automating it. Spend days observing, not hours assuming.
- Measure the human baseline. You can't improve what you don't measure, and humans aren't as accurate as you think.
- Use the cheapest model that meets your accuracy bar. Speed and cost matter more than marginal accuracy gains.
- Give the agent tools, not just prompts. Without access to customer context and knowledge base, the agent is guessing.
- Keep routing deterministic. LLM for understanding, rules for decisions.
- Build monitoring from day one. Accuracy degrades silently without measurement.
- Ship fast, then iterate. The first version won't be perfect. Production data is the best teacher.
The $340/month AI agent isn't magic. It's a well-scoped problem, a clear rubric, the right model, good tools, and relentless measurement. The same pattern works for invoice processing, lead scoring, content moderation, and any other classification + routing workflow.
The expensive part isn't the AI. It's knowing which 80% to automate and which 20% to leave to humans.
Ready to get started?
Want an AI agent for your product? Production-grade, shipped in 4-6 weeks.