Tools·Jun 16, 2026

Kokoro vs. Supertonic 3: Benchmarking Self-Hosted TTS for Indie Founders

This review analyzes a detailed benchmark of Kokoro 82M and Supertonic 3, comparing their performance, audio quality, and resource usage for self-hosted text-to-speech on CPU-only hardware.

The Answer Up Front

For indie founders building batch processing pipelines where audio quality is paramount, Kokoro 82M remains the superior choice. Its natural-sounding voice justifies the longer generation time (25 seconds versus 16 for Supertonic 3's quality mode on the benchmark paragraph). If your application demands lower latency for interactive use cases, and a slightly synthetic voice is acceptable, Supertonic 3 in quality mode offers a compelling speed improvement. Supertonic's fast mode produces robotic, slurred audio and should be skipped for anything beyond basic, non-critical alerts.

Methodology

This v0 review draws on a detailed benchmark published by Reddit user gvij on 2026-05-19. The evaluation ran on a test box with 4 CPU cores and 16GB RAM, explicitly without a GPU, simulating common self-hosted or mini-PC environments. gvij conducted 120 timed runs across four model configurations and six text lengths, ranging from single sentences to full essays; the figures in this review cover three of those configurations. The core performance numbers focus on a typical paragraph of approximately 850 characters, which translates to roughly 60 seconds of speech. The source provides a public repository containing all 24 generated audio samples, the raw timing CSV data, and the benchmark script, allowing independent verification of the reported figures. The benchmark harness was built, and initial runtime issues were handled, by "Neo AI Engineer", with manual review by gvij.

This review covers gvij's reported performance metrics, subjective quality assessments, memory footprint, and licensing information. It does not include independent performance benchmarks by Founderr Pulse, long-term workflow integration analysis, or extensive edge-case testing.
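For readers who want a feel for what a run grid like this involves, here is a minimal sketch in Python. It is not gvij's script, which lives in the source repository. The `fake_tts` placeholder, the column names, and the choice of five repeats per combination are our assumptions; five is simply 120 divided by the 24 configuration-and-length pairs, not a figure the source states.

```python
import csv
import io
import time
from typing import Callable, Dict, List


def time_runs(synthesize: Callable[[str], object], text: str, repeats: int) -> List[float]:
    """Return wall-clock seconds for each of `repeats` calls to synthesize(text)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return timings


def run_grid(
    configs: Dict[str, Callable[[str], object]],
    texts: Dict[str, str],
    repeats: int,
) -> List[dict]:
    """Time every (configuration, text) pair `repeats` times."""
    rows = []
    for config_name, synthesize in configs.items():
        for text_name, text in texts.items():
            for run_index, seconds in enumerate(time_runs(synthesize, text, repeats)):
                rows.append(
                    {
                        "config": config_name,
                        "text": text_name,
                        "chars": len(text),
                        "run": run_index,
                        "seconds": seconds,
                    }
                )
    return rows


def rows_to_csv(rows: List[dict]) -> str:
    """Serialize timing rows to CSV text (column names are our own choice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["config", "text", "chars", "run", "seconds"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_tts(text: str) -> bytes:
    """Hypothetical placeholder; swap in a real model call to benchmark it."""
    return text.encode("utf-8")


if __name__ == "__main__":
    configs = {f"config_{i}": fake_tts for i in range(4)}   # four configurations
    texts = {f"text_{i}": "word " * (20 * (i + 1)) for i in range(6)}  # six lengths
    rows = run_grid(configs, texts, repeats=5)
    print(len(rows), "timed runs")
    print(rows_to_csv(rows).splitlines()[0])
```

Passing each model as a plain callable keeps the timing loop identical across configurations, so differences in the CSV reflect the model rather than the harness.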

What It Does

Performance Metrics

The benchmark directly compares the generation speed of Kokoro 82M against Supertonic 3 in both its 'fast' and 'quality' modes. For an 850-character paragraph, gvij reports the following generation times:

  • Supertonic 3 (fast mode): 8 seconds
  • Supertonic 3 (quality mode): 16 seconds
  • Kokoro 82M: 25 seconds

These numbers put Supertonic 3 at roughly 1.6x (quality mode) to 3.1x (fast mode) the speed of Kokoro 82M on the specified CPU-only hardware. All three configurations run faster than real time: each generates the roughly 60 seconds of audio in less time than it takes to play back, which is critical for batch processing.
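The arithmetic behind those ratios is easy to reproduce. The sketch below uses only the reported timings and the approximate 60-second audio length cited in the methodology; the function names are ours. The real-time factor (RTF) is generation time divided by audio duration, so anything below 1.0 is faster than playback.

```python
# Reported generation times (seconds) for the ~850-character paragraph.
REPORTED_SECONDS = {
    "Supertonic 3 (fast)": 8.0,
    "Supertonic 3 (quality)": 16.0,
    "Kokoro 82M": 25.0,
}
AUDIO_SECONDS = 60.0  # approximate speech duration cited in the review


def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = generation time / audio duration; below 1.0 beats real time."""
    return generation_s / audio_s


def speedup_over(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    kokoro_s = REPORTED_SECONDS["Kokoro 82M"]
    for name, seconds in REPORTED_SECONDS.items():
        rtf = real_time_factor(seconds, AUDIO_SECONDS)
        speedup = speedup_over(kokoro_s, seconds)
        print(f"{name}: RTF {rtf:.2f}, {speedup:.2f}x vs Kokoro 82M")
```

Because the 60-second duration is an approximation, the RTF values should be read as rough indicators rather than precise measurements.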

Audio Quality Differences

Quality is where the models diverge significantly. Supertonic 3 in fast mode is described as "listenable but robotic, words slur together," making it unsuitable for content requiring sustained listening. Supertonic 3 in quality mode is rated as "genuinely fine": clear, intelligible, and slightly synthetic, which suits notifications or other content where information delivery outweighs voice naturalness. Kokoro 82M is noted to be "on a different level," sounding like a real person reading. That matters for longer-form content such as articles or audiobooks, where a natural voice reduces listener fatigue.

Resource Footprint

Both models are lightweight in terms of memory consumption. Kokoro 82M, at 82M parameters, and Supertonic 3, of similar magnitude, both run comfortably within 2GB of resident memory. This makes them viable for deployment on modest hardware, such as a home server or a beefy mini PC, without significant resource strain.
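To check the memory claim on your own box, one option is to read the process's peak resident set size after running a synthesis job. This is a sketch under assumptions: it relies on Python's standard `resource` module, which is available on Unix-like systems only, and the 2 GiB budget is just the figure quoted above, not a threshold from the source's script.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def fits_budget(peak_mib: float, budget_gib: float = 2.0) -> bool:
    """True if the measured peak stays within the memory budget."""
    return peak_mib <= budget_gib * 1024


if __name__ == "__main__":
    # Run the model's synthesis call here first, then read the peak.
    measured = peak_rss_mib()
    print(f"peak RSS: {measured:.1f} MiB, within 2 GiB: {fits_budget(measured)}")
```

Peak RSS covers the whole process, including the Python interpreter and any loaded libraries, so it is a conservative upper bound on what the model itself needs.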

Licensing Terms

Licensing is a key differentiator for commercial projects. Kokoro is released under the Apache 2.0 license, offering permissive terms with no significant restrictions. Supertonic, however, uses its own specific license, which gvij advises checking carefully for any commercial use cases.

What's Interesting / What's Not

The most interesting aspect of this benchmark is the clear, quantified trade-off between speed and quality for self-hosted TTS models on CPU-only hardware. gvij's explicit recommendation for specific use cases (Kokoro for batch, Supertonic quality for interactive) based on empirical data is valuable. The methodology, including 120 timed runs and varied text lengths, provides a robust foundation for the claims. The provision of raw timing CSV and audio samples in a public repository is a strong point, enabling others to verify the findings or make their own subjective quality assessments. This level of transparency is rare in user-contributed benchmarks and elevates its utility for founders evaluating options. The use of "Neo AI Engineer" to build the evaluation harness also suggests emerging tooling for AI model benchmarking and deployment, which is a trend worth watching.

What's less interesting, or rather, what highlights current limitations, is the persistent quality gap in faster TTS models. Even Supertonic's 'quality' mode is only "genuinely fine," not matching Kokoro's human-like output. This indicates that achieving truly natural speech synthesis at high speeds on commodity hardware remains a challenge. Supertonic's custom license, while not inherently negative, adds a layer of friction for commercial adoption compared to Kokoro's permissive Apache 2.0. Finally, the benchmark appears to cover a single voice and the English language only, leaving questions about versatility and multilingual support unanswered.

Pricing

Neither Kokoro nor Supertonic 3 carries a usage fee for self-hosting, so the primary costs are the hardware and the developer time for integration and maintenance. Licensing terms differ: Kokoro is under Apache 2.0, while Supertonic ships under its own license, which may have implications for commercial use. (Pricing snapshot: 2026-05-19)

Verdict

For indie founders prioritizing natural-sounding audio for long-form content or batch processing, Kokoro 82M is the clear choice. Its superior voice quality, despite being slower, provides a better user experience for sustained listening. If your project requires faster, interactive responses and can tolerate a slightly synthetic voice, Supertonic 3 in quality mode is a viable alternative. However, avoid Supertonic's fast mode; its quality degradation makes it unsuitable for most applications where the content needs to be clearly understood. The decision hinges directly on the application's latency requirements versus the acceptable level of audio naturalness.
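That trade-off can be written down as a simple selection rule. The sketch below is illustrative only: the timings are the reported figures for the roughly 850-character paragraph, while the naturalness ranks (1 to 3) are our ordinal reading of gvij's descriptions, not measured scores, and `pick_model` is a hypothetical helper.

```python
from typing import Optional

# (name, reported seconds for the ~850-character paragraph, naturalness rank)
# Ranks are an ordinal reading of the quality descriptions; higher is more natural.
CANDIDATES = [
    ("Supertonic 3 (fast)", 8.0, 1),
    ("Supertonic 3 (quality)", 16.0, 2),
    ("Kokoro 82M", 25.0, 3),
]


def pick_model(max_seconds: float, min_naturalness: int = 2) -> Optional[str]:
    """Return the most natural-sounding option that fits the latency budget.

    Returns None when nothing satisfies both constraints.
    """
    eligible = [
        candidate
        for candidate in CANDIDATES
        if candidate[1] <= max_seconds and candidate[2] >= min_naturalness
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda candidate: candidate[2])[0]


if __name__ == "__main__":
    print(pick_model(max_seconds=30.0))  # generous budget, e.g. overnight batch
    print(pick_model(max_seconds=20.0))  # tighter budget, e.g. interactive use
```

The default `min_naturalness=2` mirrors the advice to skip fast mode; lowering it to 1 admits fast mode for the non-critical alerts mentioned earlier.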

What We'd Test Next

Our next steps would involve replicating gvij's benchmark on Founderr Pulse's own hardware to independently verify the performance claims, particularly on different CPU architectures. We would also investigate the impact of GPU acceleration on both models, as many modern self-hosted setups might include entry-level GPUs. Further testing would explore the ease of integrating these models into various application stacks (e.g., web services, mobile apps) and their performance with streaming input rather than batch processing. We would also assess the availability and quality of different voices, as well as multilingual support, to determine their versatility for a broader range of use cases.

The investor read

The self-hosted TTS market, driven by open-source models like Kokoro and Supertonic 3, signals a shift in tooling spend away from high-cost SaaS providers for specific use cases. The ability to run performant, albeit CPU-bound, TTS models on commodity hardware democratizes access to advanced speech synthesis. This trend creates opportunities for infrastructure plays that simplify the deployment, management, and scaling of these open-source models. Tools like the mentioned "Neo AI Engineer" that automate benchmarking and runtime management could become investable, as they reduce the operational overhead for founders leveraging these free models. The key for investors is identifying companies that can abstract away the complexity of self-hosting, offering a managed experience without the prohibitive costs of traditional TTS APIs, or those that build specialized applications on top of these foundational models.

Sources · how we verified
  1. gvij, "Hosting a Text to Speech model can be challenging. So I benchmarked 2 recently released TTS models - Kokoro vs Supertonic!", Reddit, 2026-05-19.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.