llama.cpp integrates PDL for NVIDIA GPUs, boosting token generation
This review examines Programmatic Dependent Launch (PDL) integration into llama.cpp, detailing its technical implementation and claimed performance uplifts for token generation on modern NVIDIA GPUs.
The Answer Up Front
Users running llama.cpp on modern NVIDIA GPUs (Compute Capability 9.0 or higher) should enable Programmatic Dependent Launch (PDL) for a reported 3-10% speedup in token generation. The optimization requires specific hardware and careful kernel tuning, and it gets its gain by letting consecutive kernels in the same CUDA stream overlap. If your hardware does not meet the Compute Capability 9.0 requirement, or if your primary bottleneck is context prefill rather than token generation, this feature will not provide a benefit.
Methodology
This v0 review draws on the PR author's published claims in the linked GitHub Pull Request and a Reddit user's observations; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.
- Tool Name + Version + Date Observed: llama.cpp (specifically, the integration of PDL via PR #22522), observed 2026-05-21.
- Source Signal URL: https://www.reddit.com/r/LocalLLaMA/comments/1tj393d/build_9254_fixes_my_tg_regression_and_adds_pdl/
- What's Covered in this Review: aendk's technical description of Programmatic Dependent Launch (PDL) within llama.cpp, its intended benefits, implementation details (sync barrier, launch signal, new kernel launch function), and reported performance uplifts on specific NVIDIA GPUs. Also covered are the observations from Reddit user Bulky-Priority6824 regarding a token generation (TG) uplift.
- What's NOT Covered: Independent performance verification, long-term workflow impact, stability across a wide range of models or edge cases, or the specific regression fix mentioned by the Reddit user (beyond the observed uplift with PDL enabled).
What It Does
Overlapping Kernel Execution
Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (Compute Capability 9.0 or higher). Its core function is to let consecutive CUDA kernels in the same CUDA stream overlap. Traditionally, kernels in a single stream execute strictly in order: the next kernel does not start until the previous one has finished. With PDL, the next kernel can be launched and begin its setup while the previous kernel is still running, then wait only at the point where it actually needs the previous kernel's output. Like CUDA graphs, this reduces kernel launch overhead on the device, and the author reports that the two benefits are additive.
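To make the overlap concrete, here is a toy timeline model in Python. It is a sketch under stated assumptions, not a measurement: the `Kernel` fields, the microsecond values, and the idea that each kernel splits cleanly into a "setup" part (launch latency plus work that reads no upstream data) and a dependent "body" are all hypothetical simplifications introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    name: str
    setup_us: float   # launch latency + prologue that reads no upstream data
    body_us: float    # work that depends on the previous kernel's output
    signal_at: float  # fraction of the body after which the next kernel may start


def serialized_time(kernels):
    """Strict in-order execution: every kernel pays setup + body in full."""
    return sum(k.setup_us + k.body_us for k in kernels)


def overlapped_time(kernels):
    """PDL-style execution: setup may hide behind the previous kernel's tail."""
    total = 0.0
    prev_tail = 0.0  # how long the previous kernel keeps running after signalling
    for k in kernels:
        hidden = min(k.setup_us, prev_tail)       # setup time that overlaps
        total += (k.setup_us - hidden) + k.body_us
        prev_tail = k.body_us * (1.0 - k.signal_at)
    return total


# Made-up numbers: three identical kernels that signal halfway through.
chain = [Kernel(f"k{i}", setup_us=4.0, body_us=20.0, signal_at=0.5) for i in range(3)]

print(serialized_time(chain))                           # 72.0
print(overlapped_time(chain))                           # 64.0
print(serialized_time(chain) / overlapped_time(chain))  # 1.125
```

In this model the first kernel pays its full setup cost, and every later kernel hides its setup behind its predecessor's tail. The body still waits for the predecessor to finish, which is the role the synchronization barrier plays. Real gains depend on how much of each kernel is setup and on where the launch signal sits, which is why the reported uplifts are far smaller than this toy ratio.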
Technical Integration
For llama.cpp to leverage PDL, kernels require two new features: a synchronization barrier (GGML_CUDA_PDL_SYNC) and a launch signal (GGML_CUDA_PDL_LC). The synchronization barrier ensures a kernel waits on data written by preceding kernels, preventing race conditions. The launch signal indicates when the current kernel can tolerate the start of the next kernel alongside it. Additionally, kernels must be launched using a new ggml_cuda_kernel_launch() function. aendk reports that this integration was carefully applied to kernels used in models like gpt-oss 20b, qwen3.5, and nemotron 120B Super.
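The roles of the two primitives can be illustrated with a host-side analogy in Python. This is only an analogy built on `threading.Event`: it is not llama.cpp or CUDA code, and the names `launch_signal`, `upstream_done`, `producer`, and `consumer` are hypothetical. In the CUDA runtime, the corresponding device-side calls are `cudaTriggerProgrammaticLaunchCompletion()` and `cudaGridDependencySynchronize()`; that the llama.cpp macros wrap these calls is an assumption here, not something the source states.

```python
import threading


def run_pair():
    launch_signal = threading.Event()  # stands in for the launch signal
    upstream_done = threading.Event()  # what the sync barrier waits on
    buffer = []                        # output of the first "kernel"
    result = []

    def producer():
        buffer.extend([1, 2, 3])  # bulk of the work
        launch_signal.set()       # "the next kernel may start alongside me"
        buffer.append(4)          # tail work that still writes output
        upstream_done.set()       # all writes are complete and visible

    def consumer():
        launch_signal.wait()      # cannot start before the signal
        scratch = [0] * 4         # prologue: touches no upstream data
        upstream_done.wait()      # barrier before the first dependent read
        for i, value in enumerate(buffer):
            scratch[i] = value * 2
        result.append(sum(scratch))

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(run_pair())  # 20
```

The consumer's prologue runs while the producer is still writing its last element, but the dependent read happens only after the barrier, so the result is always the same. Moving the signal earlier gives the consumer more time to overlap; removing the barrier would let it read a partially written buffer, which is the race condition the barrier exists to prevent.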
Reported Performance Uplifts
aendk reports speedups in the token generation phase: 10% on an RTX PRO 6000 and 4-5% on a DGX Spark. Reddit user Bulky-Priority6824 also reports a 3% uplift on a 2x RTX 5060 Ti 16GB setup, reaching 127 tokens/s in token generation on qwen3.6-35b-a3b-Q4_K_XL with PDL enabled. The prefill/context phases are largely unaffected by this optimization.
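A short Python calculation shows what the Reddit figures imply in absolute terms. It assumes that the 127 figure is the with-PDL token generation rate in tokens per second and that the 3% uplift is measured against the same setup without PDL; the function names are illustrative.

```python
def baseline_rate(rate_with_pdl, uplift_pct):
    """Rate before the uplift, given the rate after it."""
    return rate_with_pdl / (1.0 + uplift_pct / 100.0)


def ms_per_token(tokens_per_s):
    """Average time spent generating one token, in milliseconds."""
    return 1000.0 / tokens_per_s


with_pdl = 127.0                       # reported rate with PDL enabled
base = baseline_rate(with_pdl, 3.0)    # implied rate without PDL
saved = ms_per_token(base) - ms_per_token(with_pdl)

print(f"implied baseline: {base:.1f} tokens/s")   # implied baseline: 123.3 tokens/s
print(f"time saved: {saved:.2f} ms per token")    # time saved: 0.24 ms per token
```

Under those assumptions the uplift is worth roughly a quarter of a millisecond per generated token, which fits an optimization that trims launch overhead between kernels and leaves the kernels themselves unchanged.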
What's Interesting / What's Not
What's interesting about this llama.cpp integration is its targeted approach to a specific performance bottleneck: token generation. The ability to overlap kernel execution within a single CUDA stream is a non-trivial optimization, moving beyond simple sequential execution. The reported additive benefits when combined with CUDA graphs (PDL + CG > CG > PDL) suggest the two techniques remove different sources of overhead. The specific performance claims (10% on RTX PRO 6000, 4-5% on DGX Spark, and 3% on 2x RTX 5060 Ti) are substantial for an inference workload, particularly as the local LLM ecosystem matures and performance gains become harder to find.
What's less interesting, or rather, what limits its broad applicability, is the strict hardware requirement. PDL is only effective on NVIDIA GPUs with Compute Capability 9.0 or higher, which excludes the widely used Ada Lovelace generation (Compute Capability 8.9) as well as Ampere and older cards. This means a significant portion of the llama.cpp user base will not benefit. Furthermore, the optimization primarily targets token generation, leaving the prefill/context phase largely untouched. The need to hand-tune and benchmark the launch signal placement for each kernel also raises the barrier to entry for developers looking to extend or customize kernels with PDL support.
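The hardware cutoff can be written as a small check in Python. This is an illustrative sketch, not llama.cpp's own gating logic: `supports_pdl` is a hypothetical helper, and it assumes the threshold "90" encodes Compute Capability as major * 10 + minor. The listed capabilities are NVIDIA's published values for those GPUs.

```python
PDL_MIN_CC = 90  # the review's threshold, read as major * 10 + minor (assumption)


def encode_cc(major, minor):
    """Fold a (major, minor) compute capability into a single integer."""
    return major * 10 + minor


def supports_pdl(major, minor):
    """True if the GPU meets the Compute Capability 9.0 requirement."""
    return encode_cc(major, minor) >= PDL_MIN_CC


EXAMPLES = {
    "RTX 3090 (Ampere)": (8, 6),
    "RTX 4090 (Ada Lovelace)": (8, 9),
    "H100 (Hopper)": (9, 0),
    "RTX 5060 Ti (Blackwell)": (12, 0),
}

for name, cc in EXAMPLES.items():
    status = "eligible" if supports_pdl(*cc) else "not eligible"
    print(f"{name}: CC {cc[0]}.{cc[1]} -> {status}")
```

The check makes the generational divide visible: an RTX 4090 misses the cutoff by one minor version, while every Hopper and Blackwell part clears it.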
Pricing
llama.cpp is an open-source project, distributed under the MIT License. There is no direct cost associated with using the software or this PDL feature. Users bear the cost of their own hardware and electricity.
Verdict
For llama.cpp users with compatible NVIDIA GPUs (Compute Capability 9.0 or higher), enabling Programmatic Dependent Launch is a clear recommendation. The reported performance uplifts of 3-10% in token generation are meaningful, especially for latency-sensitive applications or when pushing the limits of local inference. However, users with older NVIDIA hardware or non-NVIDIA GPUs will see no benefit. This is a specialized optimization for a specific hardware generation, and its value is directly tied to that compatibility. We recommend enabling it if your hardware supports it and you prioritize token generation speed.
What We'd Test Next
Our next steps would involve independent verification of the claimed performance uplifts across a broader range of NVIDIA GPUs with Compute Capability 9.0 or higher, including different models and quantization levels. We would benchmark the combined effect of PDL and CUDA graphs against CUDA graphs alone to confirm the additive benefits. Investigating the impact of varying batch sizes on PDL's effectiveness, particularly in multi-user scenarios, would also be crucial. Finally, we would explore the developer experience of integrating custom kernels with the GGML_CUDA_PDL_SYNC and GGML_CUDA_PDL_LC primitives, assessing the complexity and potential pitfalls of hand-tuning the launch signal placement for new kernels.
The investor read
The integration of Programmatic Dependent Launch (PDL) into llama.cpp signals a continued trend towards highly specialized, hardware-specific optimizations for local LLM inference. As the market for local AI grows, performance differentiation at the low-level hardware interaction becomes critical. llama.cpp remains a foundational platform in this space, and its ability to absorb and leverage cutting-edge GPU features like PDL reinforces its technical leadership. This move highlights that future performance gains will increasingly come from deep collaboration with hardware vendors and intricate kernel-level tuning, rather than just model architecture improvements. For investors, this indicates that companies building on top of llama.cpp or developing competing inference engines must demonstrate similar levels of hardware-aware optimization to remain competitive. The focus on newer NVIDIA architectures also suggests a growing divide in performance capabilities between different GPU generations, potentially driving upgrade cycles.
Every claim ties to a primary source. See our methodology.