A 1.4-Second Latency Bug Was Invisible to APM. Here's How to Find It.
A voice agent's dead air wasn't in the LLM or the ASR. Marcus Chen's post-mortem shows how to find latency in the gaps between instrumented spans, where most APM tools are blind.
A customer call on June 3rd included 1.4 seconds of dead air. The user, hearing only silence, asked "hello?" before the AI agent responded. The founder, Marcus Chen, reports that his observability platform showed a perfectly healthy system. End-to-end p95 latency was 980ms, well within budget, and every individual component trace was green. The dashboard insisted everything was fine while the product was failing.
The latency that broke the user experience was not in any single component. It was in the unattributed time between them.
The anatomy of a voice turn
Chen's voice pipeline is a sequence of discrete services: Voice Activity Detection (VAD) determines when the user has stopped speaking, Automatic Speech Recognition (ASR) transcribes the audio, an LLM generates a response, and Text-to-Speech (TTS) converts that response back into audio.
The company maintained a latency budget for each stage, which Chen shared in his post. The per-stage p95 latencies in the table below sum to 1,350ms.
| Stage | p95 |
|---|---|
| VAD / turn-detection | 120ms |
| ASR (streaming) | 310ms |
| LLM TTFT | 380ms |
| LLM full response | 260ms |
| TTS first byte | 190ms |
| Network (both legs) | 90ms |
The system's reported end-to-end p95 latency was 980ms. This is lower than the summed total because a single request rarely hits the 95th percentile on every stage simultaneously. A 1.4-second stall exceeded even the sum of every stage's p95, so by these metrics the dead air on the June 3rd call should not have been possible.
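The relationship between per-stage p95s and end-to-end p95 can be checked with a small simulation. This is a minimal sketch, not Chen's data: it assumes each stage's latency is independent and lognormally distributed, with a spread (`SIGMA`) chosen purely for illustration and a location chosen so each stage's own p95 matches the table.

```python
import math
import random

# p95 budget per stage in ms, copied from the table above.
STAGE_P95_MS = {
    "vad": 120, "asr": 310, "llm_ttft": 380,
    "llm_full": 260, "tts_first_byte": 190, "network": 90,
}
SIGMA = 0.4   # assumed spread of each stage's latency (illustrative only)
Z95 = 1.645   # standard normal quantile for the 95th percentile


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def simulate_turns(n_turns=20_000, seed=7):
    """Draw independent per-stage latencies and return the total per turn."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_turns):
        total = 0.0
        for stage_p95 in STAGE_P95_MS.values():
            # Choose mu so this stage's own p95 matches the table value.
            mu = math.log(stage_p95) - Z95 * SIGMA
            total += rng.lognormvariate(mu, SIGMA)
        totals.append(total)
    return totals


sum_of_p95s = sum(STAGE_P95_MS.values())
end_to_end_p95 = p95(simulate_turns())
print(f"sum of stage p95s: {sum_of_p95s} ms")
print(f"simulated end-to-end p95: {end_to_end_p95:.0f} ms")
```

Under these assumptions the simulated end-to-end p95 lands well below the sum of the stage p95s. If stage latencies are strongly correlated in production, for example when one overloaded host slows several stages at once, the two numbers move closer together.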
Optimizing the wrong spans
Standard Application Performance Monitoring (APM) tools generate waterfall charts, visualizing each operation as a span. The conventional wisdom is to find the longest span and shorten it. Chen reports spending two days on this path. He optimized the ASR and cached prompts to shave 40ms off the LLM's time-to-first-token.
The component spans got shorter. The dead air remained. The core error was assuming the problem was inside one of the visible bars on the chart. Chen's post argues this is a fundamental flaw in applying traditional APM to multi-component AI systems. Voice agents do not break inside the LLM call; they break in the audio pipeline, in the handoffs nobody owns a span for.
Instrumenting the gaps
The 1.4 seconds of silence was never a span. It was the white space between the VAD span ending and the ASR span beginning. This handoff, the moment audio data is passed from one service to the next, was not being measured. APM tools are built to instrument work, not waiting.
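Finding that white space does not require a new tool, only a pass over the span timestamps the tracer already exports. The sketch below assumes spans are available as simple `(name, start, end)` records in milliseconds. The `Span` type, the `find_gaps` helper, the 50ms threshold, and the timestamps are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float
    end_ms: float


def find_gaps(spans, min_gap_ms=50.0):
    """Return (prev_name, next_name, gap_ms) for idle time between spans."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0].end_ms
    last_name = ordered[0].name
    for span in ordered[1:]:
        # Time after the latest finished work and before this span starts.
        gap = span.start_ms - covered_until
        if gap >= min_gap_ms:
            gaps.append((last_name, span.name, gap))
        # Track the furthest end time so overlapping spans are handled.
        if span.end_ms > covered_until:
            covered_until, last_name = span.end_ms, span.name
    return gaps


# Hypothetical turn; timestamps are invented for illustration only.
turn = [
    Span("vad", 0, 110),
    Span("asr", 1510, 1790),
    Span("llm", 1795, 2300),
    Span("tts", 2305, 2480),
]
for before, after, gap_ms in find_gaps(turn):
    print(f"{gap_ms:.0f} ms unattributed between {before} and {after}")
```

Running a check like this across every traced turn turns the gap into a metric that can be charted and alerted on, rather than something a person has to spot in a waterfall view.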
To find the gap, Chen had to manually reconstruct the timeline. The team then created a dedicated span that measures the handoff itself. This "meta-span" starts when the VAD service finishes and ends when the ASR service begins processing. By instrumenting the gap, the team made the invisible latency visible and attributed the dead air to a specific orchestration delay.
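A handoff span of this kind can be sketched in a few lines. The post does not show Chen's implementation, so everything here is an assumption: `recorded_spans` stands in for a real tracing exporter, and `work_span` and `HandoffTimer` are hypothetical helpers. The sketch also uses a single process clock; across separate services the release timestamp would have to travel with the request, and the hosts' clocks would need to be synchronized.

```python
import time
from contextlib import contextmanager

recorded_spans = []  # stand-in for a real tracing exporter


def record_span(name, start_ns, end_ns):
    recorded_spans.append({
        "name": name,
        "start_ns": start_ns,
        "end_ns": end_ns,
        "duration_ms": (end_ns - start_ns) / 1e6,
    })


@contextmanager
def work_span(name):
    """Ordinary span: measures the work done inside the block."""
    start_ns = time.monotonic_ns()
    try:
        yield
    finally:
        record_span(name, start_ns, time.monotonic_ns())


class HandoffTimer:
    """Measures the wait between one stage finishing and the next starting."""

    def __init__(self, name):
        self.name = name
        self._released_ns = None

    def release(self):
        # Call when the upstream stage (for example VAD) hands off its output.
        self._released_ns = time.monotonic_ns()

    def accept(self):
        # Call as the downstream stage (for example ASR) begins processing.
        if self._released_ns is None:
            raise RuntimeError("accept() called before release()")
        record_span(self.name, self._released_ns, time.monotonic_ns())
        self._released_ns = None


handoff = HandoffTimer("handoff.vad_to_asr")
with work_span("vad"):
    pass  # turn detection would run here
handoff.release()
time.sleep(0.05)  # simulated orchestration delay
handoff.accept()
with work_span("asr"):
    pass  # transcription would run here

for span in recorded_spans:
    print(f"{span['name']:20s} {span['duration_ms']:8.2f} ms")
```

The handoff now appears as its own bar in the trace, so the waiting has an owner and a duration just as the work does.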
What we'd change
Chen’s post provides a powerful diagnostic playbook but stops short of detailing the fix. Identifying the gap is critical, but reducing it is a separate engineering challenge. The unattributed time likely stems from one of several common sources in distributed systems.
First is network transit and serialization. The time it takes to package audio data from the VAD service and transmit it to the ASR service can be significant, especially with large audio chunks. Second is queueing and resource contention. If the ASR service was handling concurrent requests, the caller's audio might have been waiting in a queue for a worker to become available. This is common in systems that rely on GPU resources.
Finally, the delay could be a cold start on the ASR service itself. If the container or process handling the request was not warm, the initialization time would appear as a handoff delay. A complete playbook would involve instrumenting each of these potential failure points within the gap. A span for "time in queue" or "deserialization time" would provide a more granular diagnosis than a single handoff span.
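One way to get that granularity is to record a handful of checkpoints along the handoff and split the gap into named segments. The sketch below is hypothetical throughout: the checkpoint names, the `HandoffTimestamps` type, and the numbers are invented for illustration and are not taken from Chen's post. It also assumes the checkpoints share a clock, which across services means synchronized hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffTimestamps:
    """Hypothetical checkpoints (ms) recorded along the VAD-to-ASR handoff."""
    vad_done: float      # VAD emitted end-of-speech
    payload_sent: float  # audio serialized and written to the network
    enqueued: float      # ASR service received and queued the request
    dequeued: float      # a worker picked the request up
    asr_started: float   # worker finished initializing and began transcribing


def breakdown(ts):
    """Split the handoff gap into named segments that sum to the total."""
    return {
        "serialization": ts.payload_sent - ts.vad_done,
        "network_transit": ts.enqueued - ts.payload_sent,
        "time_in_queue": ts.dequeued - ts.enqueued,
        "worker_startup": ts.asr_started - ts.dequeued,
    }


# Invented numbers for illustration; not taken from Chen's post.
example = HandoffTimestamps(0, 15, 40, 1240, 1400)
segments = breakdown(example)
for name, ms in sorted(segments.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {ms:6.0f} ms")
```

Because the segments sum to the full gap, whichever one dominates points directly at the fix: a smaller payload, more workers, or a warm pool.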
This playbook is also most relevant for teams building their own voice pipeline from discrete components. Founders using integrated platforms like Vapi or Bland have less control over inter-service orchestration. For them, the takeaway is to demand this level of visibility from their vendors. If a platform cannot account for handoff latency, it is selling an incomplete observability story.
Landing
The critical insight from Chen's investigation is that for complex AI systems, the most expensive failures often occur in the orchestration layer, not the model layer. Traditional APM tools, designed for monolithic applications or simpler microservices, can create critical blind spots by focusing only on the execution time of individual components. The total user-experienced latency is the sum of the work and the waiting. Instrumenting the waiting is no longer optional.
The investor read
This post-mortem signals a maturation of the voice AI market. The challenge is shifting from simply making models work to achieving production-grade reliability and performance. The durable companies in this space will be defined by operational excellence in orchestration and observability, not just access to the fastest models. This creates an opportunity for 'pick-and-shovel' plays in AI-native observability that can visualize and diagnose these inter-service gaps, a problem traditional APM tools were not built to solve. When evaluating voice AI startups, investors should probe for this level of diagnostic capability. A team obsessed with hunting down unattributed milliseconds is a team building a resilient, enterprise-ready product.
Every claim ties to a primary source. See our methodology.