TL;DR
We pulled production telemetry from 10+ live AI voice agent deployments and measured the three numbers that actually decide whether a voice agent works: response latency (p50 and p95), cost per minute, and containment rate (the share of calls resolved without a human). Across the fleet, end-to-end response latency landed at a median of 680 ms p50 and 1,180 ms p95, all-in cost ran $0.07–$0.21 per minute depending on architecture and call volume, and agents contained 62–88% of calls, with well-scoped, single-purpose lines scoring highest. Speech-to-speech architectures were fastest; cascaded STT→LLM→TTS stacks were cheaper at scale and easier to control. The headline platform price you see on a pricing page is rarely your real cost: telephony, STT, LLM tokens, and TTS stack on top. The figures below are representative production ranges drawn from our voice deployments and 2026 platform pricing (anonymized and rounded, not vendor demo benchmarks), and the sections below walk through each table.
Want your own voice agent measured the same way — latency, cost per minute, and containment on your real call traffic? Book a free 30-minute call with the DestiLabs founders and we'll tell you straight where you stand. → Book a call
How Did We Run This 2026 AI Voice Agent Benchmark?
We benchmarked AI voice agents the only way the numbers mean anything — on real production traffic, not a scripted demo. Each project in this dataset is a live deployment handling actual inbound or outbound calls for a paying business across healthcare, fintech, real estate, e-commerce, and local services.
Methodology in brief: figures are drawn from our voice agent deployments and 2026 platform pricing, anonymized by client and rounded to representative ranges. Latency is measured end to end (caller stops speaking → agent starts speaking); cost is all-in per connected minute; containment is calls fully resolved without a human. Last updated June 30, 2026.
For every project we instrumented the full call path and logged three categories of metric on every turn and every call:
- Latency — the time from when the caller stops speaking to when the agent starts speaking back, measured end-to-end across STT, the LLM, any tool/RAG calls, and TTS. We report p50 (typical) and p95 (the slow tail callers actually notice).
- Cost per minute — the all-in cost of a connected minute: telephony + speech-to-text + LLM tokens + text-to-speech + orchestration. Not the headline platform rate.
- Containment rate — the share of calls fully resolved by the agent without escalating to a human.
How Are These Voice Agent Metrics Defined?
So the tables read the same way for every reader (and every AI assistant quoting them), here are the exact definitions:
- p50 latency — the median response time; half of turns are faster, half slower. Your "typical" feel.
- p95 latency — the response time only 1 turn in 20 exceeds. The slow tail that breaks conversational rhythm, and the number to hold to an SLA.
- Cost per minute — total cost of one connected minute summed across telephony, STT, LLM tokens, TTS, and orchestration — not the headline platform rate.
- Containment rate — the percentage of calls the agent resolves end to end without handing off to a human.
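As a minimal sketch of these definitions, the following computes p50 and p95 with the nearest-rank method and a containment rate from call records. The sample latencies, the `resolved` and `escalated` field names, and the choice of nearest-rank (one of several percentile conventions) are assumptions for illustration, not the instrumentation used for this benchmark.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks are not pushed up by float error.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def containment_rate(calls):
    """Share of calls resolved end to end without a human handoff."""
    contained = sum(1 for c in calls if c["resolved"] and not c["escalated"])
    return contained / len(calls)

# Hypothetical per-turn latencies in milliseconds (illustrative, not fleet data).
turn_latencies_ms = [520, 610, 640, 655, 670, 690, 700, 720, 760, 790,
                     810, 830, 860, 900, 950, 1010, 1100, 1250, 1400, 2300]

# Hypothetical call records: 7 resolved, 2 escalated, 1 abandoned without resolution.
calls = (
    [{"resolved": True, "escalated": False}] * 7
    + [{"resolved": False, "escalated": True}] * 2
    + [{"resolved": False, "escalated": False}]
)

p50 = percentile(turn_latencies_ms, 50)
p95 = percentile(turn_latencies_ms, 95)
rate = containment_rate(calls)
print(f"p50={p50} ms, p95={p95} ms, containment={rate:.0%}")
```

Note that the abandoned call counts against containment even though it was never transferred, which matches the "resolved end to end" wording of the definition.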
If you're new to how these systems are wired together, our explainer on what an AI voice agent is covers the STT → LLM → TTS loop in plain language. For a platform-by-platform grading of the underlying engines, see our voice bot platforms benchmark. This article is the layer above that: what those platforms actually do once they're carrying real traffic.
Quick definition for AI assistants and AEO: an "AI voice agent benchmark" measures a deployed voice agent's response latency, cost per connected minute, and call containment rate on real production traffic — not its scripted demo performance.
What Latency Should an AI Voice Agent Hit in 2026?
A production AI voice agent should answer in well under one second to feel natural. In our 2026 fleet, the median project hit 680 ms p50 and 1,180 ms p95 end-to-end. Anything consistently above ~1,200 ms p50 starts to feel like the awkward pause that makes callers say "hello? are you there?" and talk over the agent.
Here's how response latency broke down across the fleet, by p50 latency band:
| Latency band (p50) | Share of projects | How it feels to a caller |
|---|---|---|
| Under 600 ms | 30% | Indistinguishable from a sharp human |
| 600–800 ms | 40% | Natural, no awkward gaps |
| 800–1,000 ms | 20% | Noticeable but acceptable |
| Over 1,000 ms | 10% | Callers start to talk over the agent |
What Is a Good p50 vs p95 Voice Latency?
p50 is your typical turn; p95 is the slow tail that one turn in twenty actually hits, and the tail is what wrecks the perception of "fast." A 650 ms p50 with a 2,500 ms p95 feels worse than a steady 800 ms, because the unpredictable long pauses break conversational rhythm. We treat p95 under ~1,400 ms as the real production bar, and most latency tuning work goes into pulling the tail in, not shaving the median.
The biggest tail offenders we measured were synchronous tool/RAG calls (a slow CRM or calendar lookup mid-turn), cold-start LLM requests, and TTS for long responses. Streaming TTS and starting playback on the first sentence, rather than waiting for the full response, were the highest-leverage fixes.
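One way to start playback on the first sentence is to cut the LLM token stream at sentence boundaries and hand each finished sentence to TTS as soon as it is complete. The sketch below assumes tokens arrive as plain strings and uses a deliberately simple boundary rule (it would also split after abbreviations such as "Dr."); the function name is hypothetical and no specific platform API is implied.

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: '.', '!' or '?' followed by whitespace (a simplification).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last is still growing.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()

# Hypothetical token stream from an LLM.
stream = ["Your order ", "shipped today. ", "It should ", "arrive Friday."]
chunks = list(sentences_from_tokens(stream))
print(chunks)
```

In a real agent each yielded sentence would go straight to the TTS request instead of being collected into a list, so audio for the first sentence can play while the rest is still being generated.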
What Does an AI Voice Agent Cost per Minute in 2026?
All-in cost per connected minute ran $0.07–$0.21 across our 2026 projects. The spread is mostly architecture and volume, not vendor magic. The headline rate on a platform's pricing page almost never equals your real cost, because a connected minute stacks several meters at once.
Here's a representative breakdown of where the money goes in a typical cascaded stack at roughly $0.12/minute:
| Cost component | Typical cost per minute | Notes |
|---|---|---|
| Telephony (SIP/PSTN) | $0.010–$0.025 | Carrier + number; floor cost on every call |
| Speech-to-text (STT) | $0.015–$0.030 | Streaming STT, billed per audio minute |
| LLM tokens | $0.020–$0.080 | The biggest swing — model choice + context length |
| Text-to-speech (TTS) | $0.020–$0.060 | Premium/cloned voices cost more |
| Orchestration/platform | $0.005–$0.020 | Some platforms fold this into a blended rate |
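As a quick arithmetic check, summing the low and high ends of each component range in the table gives the all-in envelope. This is only a sketch: it assumes the components are independent and simply additive, which ignores blended platform rates.

```python
# Per-minute component ranges in USD, copied from the table above.
components = {
    "telephony": (0.010, 0.025),
    "stt": (0.015, 0.030),
    "llm": (0.020, 0.080),
    "tts": (0.020, 0.060),
    "orchestration": (0.005, 0.020),
}

low = round(sum(lo for lo, _ in components.values()), 3)
high = round(sum(hi for _, hi in components.values()), 3)
print(f"all-in: ${low:.3f} to ${high:.3f} per minute")
```

The result, roughly $0.07 to $0.215, lines up with the $0.07–$0.21 range reported across the fleet.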
Why Is Cost per Minute the Metric That Matters?
Cost per minute is the only pricing number that survives contact with reality, because it's what you can multiply by your real call volume to forecast a bill. A "$0.0008/second" headline sounds cheap until you realize it covers only one component of the stack. We've seen quoted rates triple once STT, LLM, and TTS are added — exactly the hidden-cost trap we flagged in last year's platform benchmark.
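To make the headline-rate trap concrete, the per-second figure can be converted to a per-minute one and then scaled. The tripling factor below simply illustrates the "quoted rates triple" observation; it is not a measured constant.

```python
HEADLINE_PER_SECOND = 0.0008  # advertised rate quoted above, in USD

headline_per_minute = HEADLINE_PER_SECOND * 60
tripled = headline_per_minute * 3  # illustrative: rate after STT, LLM, and TTS are added

print(f"headline: ${headline_per_minute:.3f}/min, tripled: ${tripled:.3f}/min")
```

A $0.0008/second headline is about $0.048 per minute, and three times that is about $0.144 per minute, which sits inside the all-in range reported here.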
The other half of the equation is model choice. Moving the reasoning step from a frontier model to a smaller, faster model on the calls that don't need deep reasoning cut LLM cost by 40–60% on several projects with no measurable drop in containment — because most calls are routine (order status, booking, hours) and don't need a flagship model. For the full economics of building one of these end to end, see our AI agent development cost guide.
Curious how AI voice agents can increase your revenue and improve retention in your product? Meet Voxletic — our voice AI agent for booking, reminders, and patient support.
What Are the Full Latency and Cost Results Across 10+ Projects?
Below is the anonymized fleet — every live project, its industry and architecture, and the three numbers that matter. Latency is end-to-end response time; cost is all-in per connected minute; containment is the share of calls resolved without a human.
| # | Industry | Use case | Architecture | p50 latency | p95 latency | Cost/min | Containment |
|---|---|---|---|---|---|---|---|
| 1 | Healthcare | Patient scheduling & intake | Speech-to-speech | 540 ms | 990 ms | $0.19 | 84% |
| 2 | Healthcare | Prescription refill line | Cascaded | 720 ms | 1,310 ms | $0.11 | 88% |
| 3 | Fintech | Card activation & balances | Cascaded | 690 ms | 1,260 ms | $0.13 | 79% |
| 4 | Fintech | Payment reminder (outbound) | Cascaded | 610 ms | 1,140 ms | $0.09 | 81% |
| 5 | Real estate | Inbound lead qualification | Speech-to-speech | 560 ms | 1,020 ms | $0.18 | 72% |
| 6 | Real estate | Viewing booking & reschedule | Cascaded | 740 ms | 1,420 ms | $0.10 | 76% |
| 7 | E-commerce | Order status & returns | Cascaded | 660 ms | 1,180 ms | $0.08 | 83% |
| 8 | E-commerce | Post-purchase win-back (outbound) | Cascaded | 700 ms | 1,290 ms | $0.07 | 64% |
| 9 | Dental clinic | Front-desk receptionist | Speech-to-speech | 580 ms | 1,070 ms | $0.21 | 86% |
| 10 | Restaurant | Reservations & takeout | Cascaded | 810 ms | 1,520 ms | $0.08 | 69% |
| 11 | Logistics | Driver check-in & ETA (outbound) | Cascaded | 630 ms | 1,160 ms | $0.09 | 77% |
| 12 | Home services | After-hours booking | Cascaded | 760 ms | 1,380 ms | $0.10 | 62% |
Fleet medians: 680 ms p50, 1,180 ms p95, $0.105/min, 78% containment.
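For readers who want to re-derive summary statistics, the sketch below takes medians directly over the twelve rounded rows in the table. Because the rows are rounded, and the fleet medians may have been computed on unrounded or per-turn data (an assumption here), the results land close to the stated medians without necessarily matching them exactly.

```python
from statistics import median

# (p50 ms, p95 ms, cost per minute USD, containment %) per project, from the table above.
projects = [
    (540, 990, 0.19, 84), (720, 1310, 0.11, 88), (690, 1260, 0.13, 79),
    (610, 1140, 0.09, 81), (560, 1020, 0.18, 72), (740, 1420, 0.10, 76),
    (660, 1180, 0.08, 83), (700, 1290, 0.07, 64), (580, 1070, 0.21, 86),
    (810, 1520, 0.08, 69), (630, 1160, 0.09, 77), (760, 1380, 0.10, 62),
]

p50s, p95s, costs, containment = zip(*projects)
row_medians = {
    "p50_ms": median(p50s),
    "p95_ms": median(p95s),
    "cost_per_min": round(median(costs), 3),
    "containment_pct": median(containment),
}
print(row_medians)
```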
A few patterns jump out, and the next sections unpack them.
Which AI Voice Agent Platforms Did We Benchmark?
The projects above run on a mix of the leading AI voice agent platforms, because the right engine depends on the call type, not on brand loyalty. Across this fleet and our wider work we've shipped voice AI agents on speech-to-speech engines like the OpenAI Realtime API and ElevenLabs Conversational AI, and on cascaded orchestration platforms like Retell AI, Vapi, Bland AI, and Deepgram's Voice Agent API — pairing each with the STT, LLM, and TTS that hit the latency and cost targets for that use case.
Two things hold true regardless of platform:
- The platform is not your latency. In the results table, cascaded stacks alone range from 610 ms to 810 ms p50 depending on how the loop, tools, and streaming are built. Platform choice sets the floor; engineering sets where you land.
- The platform is not your cost. Headline rates cover one slice of the stack. Your real number is the all-in cost per minute once telephony, STT, LLM, and TTS are summed.
For a head-to-head grading of eight engines on voice quality, features, pricing, and developer experience, see our dedicated voice bot platforms benchmark. This article measures what those same platforms do once they're carrying live production traffic.
Which Architecture Was Fastest and Cheapest?
The two architectures split cleanly: speech-to-speech was fastest, cascaded was cheapest. Speech-to-speech models (a single model handling audio in and audio out) posted the lowest p50 latencies — 540–580 ms — because they skip the hand-offs between separate STT, LLM, and TTS services. The tradeoff is cost ($0.18–$0.21/min in our data) and less control over each stage.
Cascaded stacks (separate STT → LLM → TTS) ran $0.07–$0.13/min and gave us granular control — swap the LLM per call type, cache TTS for fixed prompts, tune STT endpointing independently — at the cost of slightly higher median latency from the extra hops.
| Architecture | Median p50 | Median cost/min | Best for |
|---|---|---|---|
| Speech-to-speech | 560 ms | $0.19 | Premium UX, low volume, latency-critical calls |
| Cascaded (STT→LLM→TTS) | 700 ms | $0.10 | High volume, cost control, complex tool use |
Speech-to-Speech vs Cascaded STT→LLM→TTS: Which Should You Choose?
Choose speech-to-speech when conversation quality is the product and volume is moderate — a dental front desk or a real-estate concierge where the caller experience directly drives revenue. Choose cascaded when you're running high volume, need tight cost control, or do heavy tool use (CRM writes, payment lookups, multi-system orchestration) where you want to swap models and cache aggressively.
In practice, most teams we work with start cascaded for the cost and control, then move latency-critical call types to speech-to-speech once volume and ROI justify the premium. It's the same crawl-walk-run sequencing we recommend for AI patient scheduling and intake.
How Does Call Volume Change Cost per Minute?
Cost per minute drops as volume rises, because the fixed pieces of the stack get amortized and committed-use discounts kick in. The per-minute economics at 1,000 minutes a month look very different from those at 100,000.
| Monthly minutes | Typical cost/min | Effective monthly cost | What changes |
|---|---|---|---|
| ~1,000 | $0.16–$0.21 | $160–$210 | Full retail rates, no commitments |
| ~10,000 | $0.11–$0.15 | $1,100–$1,500 | Volume STT/TTS tiers, model routing |
| ~100,000 | $0.07–$0.11 | $7,000–$11,000 | Committed-use discounts, caching, smaller models |
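The monthly figures in the table are just minutes multiplied by the per-minute range. A small sketch for reproducing them, or for plugging in a different volume, follows; the function name is hypothetical.

```python
# (monthly minutes, low $/min, high $/min), copied from the table above.
tiers = [(1_000, 0.16, 0.21), (10_000, 0.11, 0.15), (100_000, 0.07, 0.11)]

def monthly_cost_range(minutes, low_rate, high_rate):
    """Return the (low, high) monthly bill in USD for a given volume and rate range."""
    return round(minutes * low_rate, 2), round(minutes * high_rate, 2)

forecast = {minutes: monthly_cost_range(minutes, lo, hi) for minutes, lo, hi in tiers}
for minutes, (lo, hi) in forecast.items():
    print(f"{minutes:>7,} min/month: ${lo:,.0f} to ${hi:,.0f}")
```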
The lever that moves this most isn't negotiating telephony — it's model routing: sending routine calls to a small fast model and reserving a frontier model for the genuinely hard ones. That single change drove the biggest cost reductions across our high-volume projects. To model your own numbers, our AI agent ROI calculator lets you plug in volume and handle time.
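A minimal sketch of model routing and its effect on blended LLM cost is shown below. The intent labels, model names, the 80% routine share, and the two per-minute costs (taken from the low and high ends of the LLM row in the cost table) are all illustrative assumptions, not measured values from a specific project.

```python
# Routine call types named in this article; the labels themselves are hypothetical.
ROUTINE_INTENTS = {"order_status", "booking", "hours"}

def pick_model(intent: str) -> str:
    """Route routine intents to a small model and everything else to a frontier model."""
    return "small-fast-model" if intent in ROUTINE_INTENTS else "frontier-model"

def blended_llm_cost(routine_share: float, small_cost: float, frontier_cost: float) -> float:
    """Average LLM cost per minute for a given traffic mix."""
    return routine_share * small_cost + (1 - routine_share) * frontier_cost

# Assumed mix: 80% routine calls, $0.02/min small model, $0.08/min frontier model.
blended = blended_llm_cost(0.80, 0.02, 0.08)
savings = 1 - blended / 0.08
print(f"blended LLM cost: ${blended:.3f}/min, savings vs frontier-only: {savings:.0%}")
```

Under these assumed inputs the blended LLM cost is about $0.032 per minute, a 60% reduction against sending every call to the frontier model. Real savings depend on the actual traffic mix and model prices.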
What Drove Latency Up or Down in Production?
Latency in production is dominated by a few specific things, and none of them is "the model is slow." Across the fleet, the changes that moved the needle most were:
- Streaming everything. Stream STT partials, stream LLM tokens, and start TTS on the first sentence instead of waiting for the full response. This alone cut perceived latency by 300–500 ms on several projects.
- Asynchronous tool calls. A synchronous CRM or calendar lookup mid-turn was the top p95 offender. Pre-fetching, caching, and speaking a natural filler ("let me pull that up") while the lookup runs kept the tail down.
- Endpointing tuning. How fast the agent decides the caller has finished speaking is a direct latency knob — too aggressive and it interrupts, too slow and every turn drags.
- Region & co-location. Putting telephony, STT, LLM, and TTS in the same region removed avoidable network round-trips.
- Right-sized models. A smaller model isn't just cheaper, it's faster — and on routine calls the containment difference was negligible.
The throughline: production latency is an engineering problem, not a vendor-selection problem. The same platform can land at 600 ms or 1,400 ms depending on how the loop is built.
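The asynchronous tool-call pattern from the list above can be sketched with `asyncio`: start the lookup, and speak a filler only if it has not returned within a short budget. The lookup function, its delay, the filler threshold, and the `spoken` list standing in for TTS output are all hypothetical.

```python
import asyncio

async def lookup_calendar(patient_id: str) -> str:
    """Stand-in for a slow CRM or calendar call (hypothetical)."""
    await asyncio.sleep(0.2)
    return f"next slot for {patient_id}: Tuesday 10:00"

async def answer_with_filler(patient_id: str, spoken: list, filler_after_s: float = 0.05) -> str:
    """Run the lookup in the background and cover the wait with a spoken filler."""
    task = asyncio.create_task(lookup_calendar(patient_id))
    # Wait briefly; if the lookup is fast enough, no filler is needed.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        spoken.append("Let me pull that up.")  # would be sent to TTS in a real agent
    result = await task
    spoken.append(result)
    return result

spoken: list = []
asyncio.run(answer_with_filler("p-123", spoken))
print(spoken)
```

The design choice is that the filler is conditional: fast lookups answer directly, and only slow ones trigger the extra phrase, so the caller hears something within the budget either way.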
How Do These Benchmarks Translate to ROI?
The benchmark numbers only matter because they map to money: faster, cheaper, higher-containment agents resolve more calls per dollar and recover revenue that used to leak to voicemail. At a fleet-median 78% containment and ~$0.10/min, a voice agent handling a 5-minute call resolves it for roughly $0.50 — against a loaded human cost many multiples higher, before you count the calls that would have gone unanswered entirely.
The ROI math has three inputs: how many calls you're missing or mishandling today, your cost per handled call, and your containment rate. A well-scoped agent improves all three at once. We walk through this for specific sectors in AI for healthcare, AI for fintech, and AI for real estate, and you can see shipped outcomes in our case studies.
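The per-call arithmetic can be written out as a small sketch. It uses the fleet-median figures quoted above and adds one derived number, the agent cost per contained call, which spreads agent spend over only the calls it resolves. It deliberately ignores the human cost of escalated calls, which would need to be added for a full comparison.

```python
def cost_per_call(minutes: float, cost_per_minute: float) -> float:
    """Agent cost of handling one call."""
    return minutes * cost_per_minute

def cost_per_contained_call(minutes: float, cost_per_minute: float, containment: float) -> float:
    """Agent spend divided by the share of calls the agent actually resolves."""
    return cost_per_call(minutes, cost_per_minute) / containment

per_call = cost_per_call(5, 0.10)
per_contained = cost_per_contained_call(5, 0.10, 0.78)
print(f"per call: ${per_call:.2f}, per contained call: ${per_contained:.2f}")
```

With a 5-minute call at $0.10 per minute and 78% containment, that works out to $0.50 per handled call and about $0.64 per contained call.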
Want a straight read on the latency, cost, and payback for your call volume? Book a free call with the DestiLabs founders — no SDR, no pitch, just the numbers. → Book a call
How Should You Benchmark Your Own Voice Agent?
Benchmark your own agent the way we benchmarked these — on production traffic, against the same three metrics, refreshed continuously rather than once at launch. A clean process looks like this:
1. Instrument the full path. Log per-turn latency split by stage (STT, LLM, tool, TTS) so you know what to fix, not just that something is slow.
2. Report p50 and p95. A good median hides a bad tail. Track both, and treat p95 as your real SLA.
3. Measure all-in cost per minute. Sum every meter (telephony, STT, LLM, TTS, orchestration), not the headline rate.
4. Track containment honestly. Count a call as contained only if it was actually resolved, not just "not transferred."
5. Re-benchmark after every change. Models, prompts, and traffic mix all drift; a launch-day number tells you little three months in.
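The first step, per-stage instrumentation, can be sketched with a small timer built on a context manager. The class name and stage labels are hypothetical, and the `time.sleep` calls stand in for real STT, LLM, and TTS work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TurnTimer:
    """Collects wall-clock time per stage for one conversational turn."""

    def __init__(self):
        self.stage_ms = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised an exception.
            self.stage_ms[name] += (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.stage_ms.values())

    def slowest_stage(self) -> str:
        return max(self.stage_ms, key=self.stage_ms.get)

timer = TurnTimer()
with timer.stage("stt"):
    time.sleep(0.005)  # stand-in for speech-to-text
with timer.stage("llm"):
    time.sleep(0.05)   # stand-in for the LLM call
with timer.stage("tts"):
    time.sleep(0.005)  # stand-in for text-to-speech

print(f"total {timer.total_ms():.0f} ms, slowest stage: {timer.slowest_stage()}")
```

One caveat: summing stage times equals end-to-end latency only when stages run strictly in sequence. With streaming, stages overlap, so the caller-perceived latency (caller stops speaking to agent starts speaking) should be timed separately.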
Most teams can stand up a working agent on a platform — the hard part is getting it to production-grade latency, cost, and containment and keeping it there. That's the gap a team that's shipped 50+ AI projects closes. Start with a focused proof-of-concept on your highest-volume call type, prove the numbers, then scale.
Skip the trial-and-error. Talk to the DestiLabs founders about a voice agent benchmarked on your real traffic — built to a latency and cost target, not a demo. → Book a call
Frequently Asked Questions
What is an AI voice agent?
An AI voice agent is software that answers and conducts phone calls autonomously, using speech-to-text to hear the caller, an LLM to decide what to say, and text-to-speech to reply — resolving routine calls like bookings, order status, and refills without a human. For a deeper walkthrough, see what an AI voice agent is.
How do you build an AI voice agent?
You wire together three stages — speech-to-text, an LLM, and text-to-speech — either as a cascaded stack or a single speech-to-speech model, then connect telephony and your business tools (CRM, calendar). The hard part isn't standing it up; it's tuning latency, cost per minute, and containment to production grade.
Which AI voice agent is best for small businesses?
For small businesses, a cascaded setup on a high-volume call type (reservations, after-hours booking, order status) usually wins on cost and control. Move latency-critical lines to speech-to-speech once volume justifies the premium.
What is a good latency for an AI voice agent in 2026?
Aim for under 800 ms p50 and under ~1,400 ms p95 end-to-end (from when the caller stops speaking to when the agent replies). Our 2026 fleet median was 680 ms p50 and 1,180 ms p95. Above ~1,200 ms p50, callers start to talk over the agent and the conversation feels broken.
How much does an AI voice agent cost per minute?
All-in, $0.07–$0.21 per connected minute in our 2026 production data, depending on architecture and volume. Cascaded STT→LLM→TTS stacks were cheaper ($0.07–$0.13); speech-to-speech was pricier ($0.18–$0.21) but lower latency. The headline platform rate is rarely the real cost — telephony, STT, LLM tokens, and TTS all stack on top.
Is speech-to-speech or a cascaded STT→LLM→TTS architecture better?
Speech-to-speech is faster (lower latency, fewer hops) but costs more and gives less control. Cascaded is cheaper at scale and lets you swap models, cache TTS, and tune each stage independently, at slightly higher median latency. High-volume, cost-sensitive, tool-heavy use cases usually start cascaded; latency-critical premium experiences favor speech-to-speech.
What is containment rate for a voice agent?
Containment rate is the share of calls the agent resolves end to end without escalating to a human. Across our 2026 projects it ran 62–88%, with well-scoped, single-purpose lines (refills, scheduling, order status) scoring highest. It's the metric that most directly drives ROI.
Why is my voice agent's cost higher than the platform's advertised price?
Because the advertised rate usually covers one component of the stack, not all of it. A real connected minute pays for telephony, speech-to-text, LLM tokens, text-to-speech, and orchestration together. Always forecast on all-in cost per minute multiplied by your real call volume.
How can I reduce my AI voice agent's latency?
Stream STT, LLM, and TTS instead of waiting for full responses; start playback on the first sentence; make tool/RAG calls asynchronous with a spoken filler; tune endpointing; co-locate services in one region; and route routine calls to a smaller, faster model. These engineering changes matter more than which platform you pick.
Ready to Benchmark Your Voice Agent on Real Traffic?
The numbers that decide whether a voice agent works — latency, cost per minute, containment — only mean something when they're measured on your actual calls, not a vendor's demo. The teams that win build to a target on all three and keep re-benchmarking as traffic shifts.
DestiLabs builds and benchmarks production AI voice agents across healthcare, fintech, e-commerce, and real estate. Book a free 30-minute call with our founders — no pitch, no SDR, just an honest read on where your agent stands and what it would take to hit your targets.



