Tools·Jul 3, 2026

Manticore Search claims 14x faster ONNX embeddings via C++ rewrite

By Riley · Tools desk·Human-reviewed·✓ Verified Jul 3, 2026·6 min read·1 source

Manticore Search's co-founder details a multi-stage C++ optimization for their ONNX embedding path, claiming a significant performance boost. We analyze the methodology and what it means for production search.

THE ANSWER UP FRONT

This deep-dive is for engineering teams running self-hosted search and vector workloads who need maximum performance from their hardware. If you're optimizing embedding generation at scale and are comfortable with C++, Manticore's documented approach is a compelling case study and a reason to evaluate the tool. Teams satisfied with managed vector databases like Pinecone, or those without intense throughput requirements, can skip this. The bottom line: Manticore provides a credible, transparent engineering account of a large claimed performance win, which positions it as a strong contender for high-throughput, self-hosted vector search.

METHODOLOGY

This v0 review analyzes the claims made by Manticore Search co-founder Sergey Nikolaev in a technical blog post published on the company's website. The review is based entirely on the methodology, benchmarks, and code explanations provided in that single source. We have not independently verified the performance claims.

Tool: Manticore Search (specific version not cited, analysis focuses on the ONNX embedding path)
Date Observed: July 3, 2026
Source Signal: "14× faster embeddings: how we rebuilt the ONNX path in Manticore" at https://manticoresearch.com/blog/onnx-embeddings-speedup/
What's Covered: The founder's step-by-step description of optimizing the ONNX model inference process, the rationale for moving from Python to C++, and the first-party performance benchmarks reported at each stage.
What's Not Covered: Independent reproduction of the benchmarks, performance on different hardware configurations, comparison against other vector databases under controlled conditions, or the operational cost of the new implementation.

This v0 review draws on the founder's published claims; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

WHAT IT DOES

Manticore Search is an open-source search engine. This update focuses specifically on accelerating the process of generating vector embeddings from text using ONNX models, a critical step for modern semantic search.

The original Python bottleneck

The initial implementation used a Python daemon to run ONNX models. The founder reports that this approach, while easy to implement, was slow. The main Manticore daemon (written in C++) talked to the Python process over JSON-RPC, so every embedding request paid for serialization and an inter-process round trip. This setup became a significant performance bottleneck, especially under concurrent load.
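To make that overhead concrete, here is a minimal sketch of what one embedding round trip over JSON-RPC involves. The method name `embed`, the 384-dimension vector size, and the payload shape are all assumptions for illustration; the actual wire format Manticore used is not described here.

```python
import json
import struct

DIM = 384  # assumed embedding size; not stated in the source


def build_request(texts, request_id=1):
    """Serialize a hypothetical JSON-RPC 2.0 'embed' call."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": "embed", "params": {"texts": texts}, "id": request_id}
    )


def build_response(vectors, request_id=1):
    """Serialize the vectors as JSON text, as an out-of-process daemon must."""
    return json.dumps({"jsonrpc": "2.0", "result": vectors, "id": request_id})


def binary_size(vectors):
    """Bytes needed to hold the same vectors as packed float32 values."""
    return sum(len(struct.pack(f"{len(v)}f", *v)) for v in vectors)


texts = ["fast full-text search", "vector embeddings"]
# Stand-in vectors; a real daemon would run the ONNX model here.
vectors = [[0.123456789 * (i + 1)] * DIM for i in range(len(texts))]

request = build_request(texts)
response = build_response(vectors)
decoded = json.loads(response)["result"]  # the caller must parse the text back

print(f"JSON response: {len(response)} bytes")
print(f"packed float32: {binary_size(vectors)} bytes")
```

Every float travels as decimal text and is parsed again on the other side, on top of the socket hop between processes. An in-process call avoids both costs, which is consistent with the motivation the post gives for the rewrite.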

The C++ rewrite

To address the bottleneck, the team rewrote the ONNX inference path entirely in C++. This eliminated the Python dependency and the overhead of inter-process communication between different language runtimes. The post details the direct integration of the ONNX Runtime C++ API into Manticore. This initial rewrite alone, the founder claims, yielded a 2.5x speedup over the Python version.

Further C++ optimizations

The team didn't stop at a direct port. The blog post documents several subsequent optimization layers:

Tokenizer Caching: Reusing tokenizer objects for subsequent requests to reduce initialization overhead.
Session Caching: Caching ONNX Runtime sessions to avoid reloading models.
Parallelism: Introducing multi-threading to process multiple documents in parallel, which reportedly scaled performance almost linearly with the number of CPU cores.
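The caching and parallelism ideas above can be sketched in a few lines. This is an illustration of the pattern only, not Manticore's code: `FakeSession`, `get_session`, and `embed_batch` are hypothetical names, the "embedding" is a toy calculation, and the sketch assumes a session is safe to share across threads. It is written in Python for brevity; the near-linear CPU scaling the post reports comes from native threads in C++, which this sketch does not reproduce.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Stand-in for an expensive-to-build inference session (hypothetical)."""

    loads = 0  # counts how many times a "model" was loaded

    def __init__(self, model_path):
        FakeSession.loads += 1
        self.model_path = model_path

    def embed(self, text):
        # Toy "embedding", not a real model output.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


_sessions = {}
_lock = threading.Lock()


def get_session(model_path):
    """Return a cached session, building it only on first use."""
    with _lock:
        if model_path not in _sessions:
            _sessions[model_path] = FakeSession(model_path)
        return _sessions[model_path]


def embed_batch(model_path, docs, workers=4):
    """Embed documents on a thread pool while sharing one cached session."""
    session = get_session(model_path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with docs.
        return list(pool.map(session.embed, docs))


docs = ["alpha", "beta", "gamma", "delta"]
first = embed_batch("model.onnx", docs)
second = embed_batch("model.onnx", docs)
print(f"model loads: {FakeSession.loads}")
```

The second call reuses the session built by the first, so the load cost is paid once per model rather than once per request. The same keyed-cache shape would apply to tokenizer objects.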

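According to Nikolaev, these changes together produced the claimed 14x improvement over the original Python implementation. Taken with the 2.5x from the initial port, that implies the caching and parallelism stages contribute roughly a further 5.6x (14 / 2.5), assuming the gains compound multiplicatively.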

WHAT'S INTERESTING / WHAT'S NOT

The most interesting aspect is the detailed transparency. Manticore isn't just marketing a number; they are publishing the playbook. The post shows the incremental gains from each optimization, from the initial C++ port to tokenizer caching and finally to parallelization. This builds credibility and serves as a useful guide for other engineers working on similar ML inference performance problems. Staying in C++, the language of the main Manticore daemon, prioritizes raw performance and in-process integration over the memory safety that languages like Go or Rust would offer.

What's less novel is the problem itself. Python's performance limitations for high-throughput, CPU-bound tasks are well-known, and moving performance-critical code to a compiled language like C++ is a standard optimization pattern. The 14x figure is impressive but highly dependent on the specific hardware, model, and workload used in their internal test. It's a benchmark, not a guarantee. Without a public, reproducible test harness, the number remains a claim. The core value here is the detailed execution, not a conceptual breakthrough.

PRICING

(As of July 2026)

Manticore Search: Open source and free to use (GPLv2 license).
Manticore Cloud: A managed service with pricing based on instance size and storage. Tiers range from a free "Play" instance to dedicated production clusters starting around $59/month and scaling up.
Enterprise Support: Custom pricing for dedicated support, consulting, and enterprise features.

VERDICT

Manticore Search has made a strong, well-documented case for its performance on self-hosted ONNX embedding generation. For engineering teams that manage their own infrastructure and require high-throughput semantic search capabilities, the detailed C++ rewrite is a compelling reason to consider Manticore. The company's transparency about its optimization process builds significant trust. While the 14x performance claim is specific to their internal benchmarks and requires independent validation, the engineering principles are sound. If you need to squeeze maximum performance out of your hardware for vector search, Manticore has made a credible case that it's a serious contender.

WHAT WE'D TEST NEXT

To move this from a review of claims to a verified benchmark, we would need to test several things. First, we'd attempt to reproduce the 14x claim on standardized hardware using the same model mentioned in the post. Second, we would broaden the test to include different ONNX models of varying sizes to see how performance scales. Third, we would run an end-to-end comparison against other self-hosted vector databases like Qdrant and Weaviate, measuring both indexing throughput and query latency on a public dataset. SIFT1M is a common choice for the vector search side, but it ships precomputed vectors and never touches the embedding path, so the embedding comparison would need a text corpus. Finally, we would assess the memory usage and operational complexity of the new C++ implementation versus the previous Python-based one.
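A reproduction along these lines could start from a small timing harness such as the sketch below. The `slow_embed` and `fast_embed` functions are hypothetical stand-ins that only sleep; a real run would call the old and new embedding paths on the same documents, model, and hardware.

```python
import statistics
import time


def throughput(embed_fn, docs, repeats=5, warmup=1):
    """Return median documents/second for embed_fn over several timed runs."""
    for _ in range(warmup):
        embed_fn(docs)  # warm caches so one-time setup cost is not timed
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        embed_fn(docs)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
        rates.append(len(docs) / elapsed)
    return statistics.median(rates)


def speedup(baseline_rate, candidate_rate):
    """Speedup factor of the candidate over the baseline."""
    return candidate_rate / baseline_rate


# Hypothetical stand-ins for the two implementations under test.
def slow_embed(docs):
    time.sleep(0.05)
    return [[0.0] for _ in docs]


def fast_embed(docs):
    time.sleep(0.001)
    return [[0.0] for _ in docs]


docs = ["example document"] * 10
base = throughput(slow_embed, docs, repeats=3)
cand = throughput(fast_embed, docs, repeats=3)
print(f"speedup: {speedup(base, cand):.1f}x")
```

Reporting the median over several runs, after a warmup pass, keeps one-time costs such as model loading from distorting the comparison. That matters here because session and tokenizer caching are among the optimizations being measured.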

The investor read

This signals the maturation of the AI infrastructure layer. As models commoditize, the performance of the surrounding data plumbing, like vector databases, becomes a key differentiator. Manticore is executing a classic open-source strategy: win technical users with superior, transparently documented performance to build a moat against managed, closed-source competitors. Their challenge is converting this technical credibility into revenue through their Cloud and Enterprise offerings. An investment thesis would depend on seeing evidence that this performance edge translates into commercial adoption, particularly among high-value enterprise customers who are willing to pay a premium for self-hosted control and throughput. This is a bet on performance-sensitive workloads remaining on-prem or in a VPC rather than moving to managed services.

Sources · how we verified

"14× faster embeddings: how we rebuilt the ONNX path in Manticore", Manticore Search blog, https://manticoresearch.com/blog/onnx-embeddings-speedup/

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
