A founder's playbook for benchmarking AI notetakers with synthetic meetings
Tien Nguyen's method for evaluating AI notetakers like Granola and Otter uses synthetic audio to create a ground truth. This is a reproducible, verifiable way to measure raw transcription accuracy.
THE ANSWER UP FRONT
This methodology is for founders and product teams building or buying AI transcription tools who need objective, reproducible accuracy metrics. It's a playbook for anyone who wants to move beyond subjective 'it feels right' evaluations. Skip it if you only care about qualitative features like summarization and don't need to verify the underlying transcript's fidelity. The bottom line: without a ground truth you cannot score a transcript, and generating the meeting audio from a script you wrote is a cheap, repeatable way to get one for benchmarking the core speech-to-text performance of an AI notetaker.
METHODOLOGY
This is a v0 review based on a single source signal. It analyzes the method for benchmarking AI notetakers proposed by founder Tien Nguyen in a June 2026 blog post. The source provides a clear rationale and a complete bash script for implementation. Our analysis covers the logic of the approach, its components, and its strategic implications. It does not include our own independent execution of the benchmark against the target tools (Granola, Fathom, Otter) or a quantitative comparison of their results. That testing is reserved for a v2 review.
- Tool Name: Synthetic Meeting Benchmark Methodology
- Version/Date Observed: Published June 15, 2026
- Source Signal: "You can't benchmark an AI notetaker against a real meeting... So I generated the meeting," via dev.to
This v0 review draws on the founder's published claims and artifacts at the source URL; independent benchmarks are pending. Update cadence: this review will be updated to a v1 or v2 when we execute the benchmark internally and can provide our own measured results.
WHAT IT DOES
Nguyen's method provides a framework for creating an objective test for automated speech recognition (ASR) systems. It addresses the fundamental flaw in most anecdotal comparisons: the lack of a 'ground truth' transcript to score against.
The problem: no answer key
The author's initial attempt to compare Granola, Fathom, and Otter involved recording a real-world meeting. The flaw became immediately apparent. To calculate an accuracy score, one needs a fully correct reference transcript. But for a live meeting, the only available records were the very transcripts being evaluated. The comparison becomes circular: you are grading the test against the students' own, potentially flawed, answers.
Generate the meeting, keep the script
The solution is to manufacture the ground truth. The process starts by writing a script for a synthetic meeting. Nguyen's example is an 80-second, two-speaker dialogue designed to include words and phrases that ASR models often misinterpret: financial jargon (churn, cohort), product terms (SSO, p95), names (Sarah, Priya), and numerical figures (Q3, 5.2%, $19). This script is the answer key.
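Because the script is the answer key, a first sanity check is simply whether the deliberately tricky terms survive transcription. The sketch below is a minimal illustration, not part of Nguyen's published script: the term list is taken from the description above, while the function name `missing_terms` and the sample transcript are invented for the example.

```python
import re

# Terms come from the article's description of the script. The transcript
# below is a made-up example, not output from any real notetaker.
KEY_TERMS = ["churn", "cohort", "SSO", "p95", "Sarah", "Priya", "Q3", "5.2%", "$19"]


def missing_terms(transcript: str, terms: list[str]) -> list[str]:
    """Return the terms that do not appear (case-insensitively) in the transcript."""
    missing = []
    for term in terms:
        # Lookarounds stop "SSO" from matching inside a longer word like "lasso".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if not re.search(pattern, transcript, flags=re.IGNORECASE):
            missing.append(term)
    return missing


hypothetical_transcript = (
    "Sara said churn in the Q3 cohort rose to 5.2%, "
    "and SSO p95 latency is fine at $19."
)
print(missing_terms(hypothetical_transcript, KEY_TERMS))
```

A check like this is coarser than a full error rate, but it points directly at the names and figures a tool got wrong.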
From text to multi-speaker audio
With the script as the source, the method uses the ElevenLabs text-to-speech API to generate audio for each line. Each of the two speakers is assigned a distinct synthetic voice. A provided bash script automates the API calls and then uses ffmpeg to stitch the individual audio clips into a single meeting file. A short pause is inserted between speaker turns to give diarization algorithms (which determine who is speaking when) a clean boundary to detect, so that speaker-attribution errors are less likely to be mistaken for transcription errors.
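The author's implementation is a bash script, which is not reproduced here. As a rough illustration of the same two steps, the Python sketch below builds one text-to-speech request per line and an ffmpeg concat list with a silence clip at each speaker change. The voice IDs, file names, and helper names are placeholders, and the endpoint shape reflects our understanding of the public ElevenLabs API, so check it against the current documentation before use. Nothing is sent over the network.

```python
import json
import urllib.request

# Placeholder voice IDs; real ones come from your ElevenLabs account.
VOICES = {"A": "VOICE_ID_A", "B": "VOICE_ID_B"}


def tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for one script line."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def concat_list(lines: list[tuple[str, str]], silence_path: str = "silence.mp3") -> str:
    """Return ffmpeg concat-demuxer list text for (speaker, clip_path) pairs.

    A silence clip is inserted whenever the speaker changes.
    """
    entries = []
    previous_speaker = None
    for speaker, clip_path in lines:
        if previous_speaker is not None and speaker != previous_speaker:
            entries.append(f"file '{silence_path}'")
        entries.append(f"file '{clip_path}'")
        previous_speaker = speaker
    return "\n".join(entries) + "\n"


# Command to run after writing the list to list.txt. Without "-c copy",
# ffmpeg re-encodes, which avoids mismatches between clips and silence.
FFMPEG_CMD = ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "list.txt", "meeting.mp3"]

print(concat_list([("A", "line01.mp3"), ("B", "line02.mp3"), ("B", "line03.mp3")]))
```

The silence clip itself has to be created separately, for example with ffmpeg's `anullsrc` source, at whatever pause length you choose.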
WHAT'S INTERESTING / WHAT'S NOT
The most interesting aspect of this approach is its disciplined, scientific mindset. It isolates a key variable, transcription accuracy, and makes it measurable with metrics like Word Error Rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference script. In a market saturated with qualitative claims about AI-powered summaries and insights, this is a necessary return to first principles. If the underlying transcript is wrong, all subsequent AI features are built on a faulty foundation.
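To make the metric concrete, here is a minimal WER sketch using a word-level edit distance. It is our own illustration, not the scoring code from the source; the tokenization rule (lowercase, strip surrounding punctuation) is an assumption, and the sentences are invented. Real evaluations usually need more careful normalization of numbers and abbreviations.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [word.strip(".,!?;:\"'()") for word in text.lower().split()]
    return [token for token in tokens if token]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


script_line = "Churn in the Q3 cohort was 5.2%."
transcribed = "Turn in the Q3 cohort was 5.2%."
print(round(wer(script_line, transcribed), 3))  # one substitution in seven words
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error rate rather than a percentage of words correct.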
The provided bash script makes the method immediately actionable. It's not a theoretical proposal but a working tool that any engineer can adapt. This is the correct way to evaluate a core system component.
However, the method has limitations. By design, it only tests transcription and diarization. It offers no insight into the quality of summarization, action-item detection, or other abstractive features that differentiate these products. Furthermore, the synthetic audio from ElevenLabs is likely much cleaner than a typical meeting. It lacks the cross-talk, background noise, non-native accents, and poor microphone quality that define real-world audio. A model could perform perfectly on this synthetic test but fail on a messy, real-world call. The test is a valuable baseline, not a comprehensive simulation.
PRICING
The methodology and script are free. The only cost is the use of the ElevenLabs API for audio generation.
- ElevenLabs Free Tier: 10,000 characters per month.
- ElevenLabs Starter Tier: $5/month for 30,000 characters.
The 80-second script described by the author would contain roughly 200-300 words, or 1,000-1,500 characters. A single benchmark run would therefore use at most about 15% of the free tier's 10,000-character monthly allowance.
(Pricing snapshot: June 15, 2026)
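The arithmetic behind that estimate can be sketched as follows. The quotas and the per-run character range are the figures quoted above; the assumption that each run generates every character exactly once, with no retries or regenerated lines, is ours.

```python
# Monthly character quotas as quoted in this review's pricing snapshot.
QUOTAS = {"free": 10_000, "starter": 30_000}

# Estimated characters per full benchmark run (low and high estimates).
CHARS_PER_RUN_LOW = 1_000
CHARS_PER_RUN_HIGH = 1_500


def runs_per_month(quota: int, chars_per_run: int) -> int:
    """Number of complete runs that fit in a monthly character quota."""
    return quota // chars_per_run


for tier, quota in QUOTAS.items():
    worst = runs_per_month(quota, CHARS_PER_RUN_HIGH)
    best = runs_per_month(quota, CHARS_PER_RUN_LOW)
    print(f"{tier}: between {worst} and {best} full runs per month")
```

Under these assumptions the free tier covers a handful of full regenerations per month, which matters if you iterate on the script or try several voices.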
VERDICT
For any team making a quantitative decision about an AI notetaker or building their own ASR pipeline, this synthetic benchmark approach is the correct one. It replaces subjective, anecdotal comparisons with a reproducible, verifiable test of core transcription quality. While it doesn't assess higher-level AI features like summarization, it provides a crucial, non-negotiable baseline. You must first know if the tool can accurately write down what was said. This method lets you measure that directly. It is the starting point for any serious evaluation.
WHAT WE'D TEST NEXT
For a v2 review, we would execute this benchmark across a dozen AI notetakers and speech-to-text services, including the original three plus Fireflies.ai and AssemblyAI's core models. We would calculate and publish the Word Error Rate (WER) and Diarization Error Rate (DER) for each. We would also expand the test suite by generating audio with a wider variety of synthetic voices to simulate different accents. Finally, we would add a second track of controlled background noise (cafe sounds, keyboard typing) to the audio file to begin testing performance under more realistic, non-ideal conditions.
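One way to keep the background noise "controlled" is to mix it at a fixed signal-to-noise ratio (SNR), so each tool hears the same degradation. The sketch below is an assumption about how we might do this, not something described in the source. It works on plain lists of samples, whereas a real pipeline would read and write audio files.

```python
import math


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise track is silent")
    # SNR in dB is 10 * log10(speech_power / scaled_noise_power); solve for the gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals for illustration only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=20.0)
print([round(sample, 3) for sample in mixed])
```

Running the benchmark at several SNR levels would show how quickly each tool's error rate degrades as conditions worsen.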
THE INVESTOR READ
This methodology highlights a critical market dynamic: as base speech-to-text (STT) models from major providers commoditize, raw transcription accuracy is assumed to be 'good enough.' This is often false. The value in the AI notetaker space is shifting up the stack to summarization and workflow integration, but Nguyen's playbook shows that a defensible moat can still be built at the foundation with superior accuracy on domain-specific language (e.g., legal, medical, financial). An investment in this space requires evidence of a moat beyond being a thin wrapper. This benchmark is also a signal for the broader MLOps market. As AI features become standard, tooling for verifiable, reproducible evaluation becomes a valuable category in itself. Companies that help others prove their AI works will find a ready market.