Tools·Jul 5, 2026

DeepSeek open-sources DSpark, claiming 60–85% faster LLM inference

By Riley · Tools desk·Human-reviewed·✓ Verified Jul 5, 2026·5 min read·1 source

DeepSeek's new open-source library, DSpark, combines speculative decoding and custom CUDA kernels to accelerate inference. We review the technical paper's claims against established frameworks like vLLM and TensorRT-LLM.

THE ANSWER UP FRONT

This is for engineering teams self-hosting DeepSeek models who need maximum throughput and are comfortable integrating a new, specialized library. Teams using hosted APIs, requiring broad model compatibility, or prioritizing stability with established frameworks should skip it for now. The bottom line: DSpark presents a credible path to significant performance gains through architecture-specific optimizations, but its claims are currently unverified by independent benchmarks and its utility is narrow.

METHODOLOGY

This v0 review is based on the technical paper for DSpark, published by DeepSeek AI on June 27, 2026. The source artifact is the paper itself, available in the deepseek-ai/DeepSpec GitHub repository. The review covers the techniques and performance claims presented by the authors. Specifically, we analyze their implementation of speculative decoding (DeepSpec), their custom CUDA kernels for operations like Attention and RMSNorm, and their benchmark results comparing DSpark to vLLM and TensorRT-LLM on DeepSeek-V2 models. This review does not include independent performance verification, long-term stability analysis, or an evaluation of the integration effort required. All performance figures cited are claims from the source paper. An updated review will follow once independent benchmarks become available.

WHAT IT DOES

A specialized speculative decoding engine

DSpark’s core performance claim rests on DeepSpec, a speculative decoding strategy. A smaller, faster "draft" model proposes a short sequence of candidate tokens, and the larger, more accurate "target" model then verifies them in a single parallel forward pass. Every accepted candidate saves one sequential pass through the target model, so when the draft guesses well, generation is much faster than the standard autoregressive process of producing one token per pass. The paper claims the method is particularly effective for DeepSeek's Mixture-of-Experts (MoE) models, where, according to the authors, verifying tokens is significantly cheaper than generating them.
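To make the draft-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is not DSpark's or DeepSpec's code: the function names, the toy "models" and the exact-match acceptance rule are all our own assumptions for illustration, and the paper's acceptance scheme may differ. In a real engine the verification loop would be a single batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens, one at a time.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal. A real engine would score
    #    all k positions in one parallel pass; this loop is a stand-in.
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every guess matched, so the target's own next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify rounds were used."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[:len(prefix) + max_new], rounds


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_baseline(prefix: List[Token], target: NextToken, max_new: int) -> List[Token]:
    """Plain autoregressive decoding with the target model only."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


tokens, rounds = generate([0], toy_draft, toy_target, k=3, max_new=12)
baseline = greedy_baseline([0], toy_target, max_new=12)
```

With greedy acceptance, the output matches what the target model would have produced on its own; the saving is that `rounds` is smaller than the number of generated tokens, so fewer sequential target passes are needed.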

Fused CUDA kernels for DeepSeek models

Beyond the decoding strategy, DSpark implements custom, fused CUDA kernels specifically for the DeepSeek model architecture. The paper highlights optimizations for key components like the SwiGLU activation function, RMSNorm, and attention mechanisms. Fusing combines several operations into a single kernel launch, so intermediate results stay in on-chip registers or shared memory instead of being written to and re-read from GPU global memory, a common bottleneck in LLM inference. This level of optimization is possible because the kernels target a single known architecture, in contrast to general-purpose engines that must accommodate many different model structures.
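A rough way to see why fusion helps is to count memory transfers. The sketch below is a simple cost model in Python, not DSpark's kernels: it covers only the elementwise part of SwiGLU (`silu(gate) * up`, without the surrounding matrix multiplications), and the function names and the "one transfer per element read or written" accounting are our own simplifying assumptions.

```python
import math
from typing import List, Tuple


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate: List[float], up: List[float]) -> Tuple[List[float], int]:
    """Two separate 'kernels'; returns (output, element transfers to/from memory)."""
    n = len(gate)
    traffic = 0
    # Kernel 1: read gate, write an intermediate activation buffer.
    act = [silu(g) for g in gate]
    traffic += n + n
    # Kernel 2: read the intermediate buffer and up, write the output.
    out = [a * u for a, u in zip(act, up)]
    traffic += 2 * n + n
    return out, traffic


def swiglu_fused(gate: List[float], up: List[float]) -> Tuple[List[float], int]:
    """One 'kernel': the activation never leaves registers in this model."""
    n = len(gate)
    out = [silu(g) * u for g, u in zip(gate, up)]
    traffic = 2 * n + n  # read gate and up, write the output
    return out, traffic


gate = [0.5, -1.0, 2.0, 0.0]
up = [1.0, 2.0, -0.5, 3.0]
unfused_out, unfused_traffic = swiglu_unfused(gate, up)
fused_out, fused_traffic = swiglu_fused(gate, up)
```

Under this model the fused version moves 3 elements per output instead of 5 while computing the same result. Real savings depend on the hardware, data types and which operations are fused, none of which this sketch captures.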

An alternative to general-purpose engines

The paper positions DSpark as a high-performance alternative to existing inference engines like vLLM and TensorRT-LLM. The benchmarks provided show DSpark outperforming these established tools, particularly in throughput (tokens per second). The authors attribute this lead to the tight co-design of their decoding strategy and custom kernels with the specific architecture of their DeepSeek-V2 and DeepSeek-MoE models. It is not presented as a general-purpose, drop-in replacement.

WHAT'S INTERESTING / WHAT'S NOT

The most interesting aspect is the source. A major model provider is open-sourcing its internal, high-performance inference stack. This is a strategic move to build an ecosystem around its models. By making it easier and cheaper to run DeepSeek models, the company encourages adoption over competitors. The performance gains, if they hold up to scrutiny, are not merely incremental. A reported 60–85% throughput improvement would translate into a significant reduction in serving cost for any team operating at scale.
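As a back-of-the-envelope check on what that range would mean for cost, the snippet below converts a throughput gain into a cost-per-token reduction. It assumes a fixed hourly hardware cost and full utilization before and after, which is our simplification and not something the paper states; the function name is ours.

```python
def cost_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token for a given fractional throughput gain.

    Assumes the same hardware bill produces (1 + gain) times as many tokens.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


low = cost_reduction(0.60)   # claimed lower bound
high = cost_reduction(0.85)  # claimed upper bound
print(f"cost per token falls by {low:.1%} to {high:.1%}")
```

Under those assumptions, the claimed 60–85% throughput gain corresponds to roughly a 37.5% to 46% lower cost per token, not a 60–85% cost cut.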

What's less clear is the generalizability. The entire paper focuses on DeepSeek models. While the concepts of speculative decoding and kernel fusion are universal, DSpark's implementation is explicitly tailored. The benchmarks are vendor-provided and run on the vendor's own models, which is a best-case scenario rather than a neutral comparison. We have not seen how it performs on other popular architectures like Llama or Mistral, or whether it supports them at all. This specialization is a double-edged sword: it's the source of the performance gains, but it also limits the tool's immediate applicability for teams running a diverse set of open-source models.

PRICING

DSpark is open-source under the Apache 2.0 License. It is free to use, modify, and distribute. (Pricing snapshot: June 27, 2026.)

VERDICT

DSpark is a compelling, high-expertise tool for teams committed to the DeepSeek model ecosystem and seeking maximum inference throughput. The claimed performance improvements of 60–85% are substantial enough to warrant evaluation for any large-scale deployment. However, its value proposition currently rests on unverified, vendor-provided benchmarks and a narrow set of supported models. Teams that prioritize stability or broad model compatibility, or that lack the engineering resources for a new integration, should stick with mature, well-documented frameworks like vLLM for the time being.

WHAT WE'D TEST NEXT

The next version of this review would require hands-on benchmarking. First, we would attempt to reproduce the paper's throughput and latency claims on identical hardware (NVIDIA H800 GPUs) using the provided code. Second, we would evaluate the performance on different hardware, like the more common A100 or L40S GPUs. Third, we would assess the effort required to adapt DSpark to run a non-DeepSeek model, such as Llama 3, to test its claimed modularity. Finally, we would measure memory usage and time-to-first-token under various batching scenarios to build a more complete performance profile.
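A measurement harness for that kind of test could look roughly like the sketch below. It times any streaming generator and reports time-to-first-token and tokens per second. Everything here is hypothetical scaffolding: `measure`, `RunStats` and `fake_stream` are our own names, and `fake_stream` is a stand-in that simulates latency with `time.sleep`; it does not call DSpark, vLLM or any real engine.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class RunStats:
    tokens: int      # number of tokens received
    ttft_s: float    # time to first token, in seconds
    total_s: float   # wall-clock time for the whole response

    @property
    def tokens_per_s(self) -> float:
        return self.tokens / self.total_s if self.total_s > 0 else float("inf")


def measure(stream: Callable[[str], Iterable[str]], prompt: str) -> RunStats:
    """Time one streamed generation from request to last token."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has arrived
        count += 1
    total = time.perf_counter() - start
    if ttft is None:  # the stream produced no tokens at all
        ttft = total
    return RunStats(tokens=count, ttft_s=ttft, total_s=total)


def fake_stream(prompt: str) -> Iterator[str]:
    """Stand-in engine: a simulated prefill delay, then 20 evenly spaced tokens."""
    time.sleep(0.01)
    for i in range(20):
        time.sleep(0.001)
        yield f"tok{i}"


stats = measure(fake_stream, "hello")
print(f"{stats.tokens} tokens, TTFT {stats.ttft_s * 1000:.1f} ms, "
      f"{stats.tokens_per_s:.0f} tok/s")
```

A real comparison would swap `fake_stream` for a wrapper around each engine's streaming interface, run many prompts at several batch sizes, and record GPU memory alongside these timings.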

The investor read

DeepSeek's release of DSpark is a classic ecosystem play, using open-source tooling to build a moat around its own models. By open-sourcing a highly optimized inference engine, the company lowers the barrier to deploying its models at scale, directly competing with API-first providers and other open-source models. This puts pressure on general-purpose inference solutions like vLLM and TGI, which must now compete with specialized engines co-designed with the models themselves. For investors, this signals a shift where value may accrue to the model provider's full stack, not just the inference layer. The most durable companies in this space might be those offering managed services and support for these increasingly complex, model-specific open-source stacks.

Sources · how we verified

DeepSeek open-sources inference optimizations with 60–85% faster generation [pdf]

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
