Deplodock: A From-Scratch LLM Compiler Demonstrates PyTorch-Beating Kernels
A Reddit user details 'deplodock', an LLM compiler built from pure Python and CUDA, showcasing a multi-stage IR pipeline, extensive optimization passes, and autotuning for performance.
The Answer Up Front
Deplodock is not a commercial product but a technical deep dive into building an LLM compiler from first principles. It is an educational resource for engineers who want to understand how ML compilation works at each level. For those needing a production-ready solution, existing stacks like PyTorch's TorchDynamo/TorchInductor (torch.compile) or TVM remain the standard. For researchers and low-level optimization specialists, however, deplodock provides a transparent, hackable blueprint for reaching competitive performance on specific kernel shapes.
Methodology
This v0 review draws on the author's published claims and the technical details presented in a three-part Reddit series by user NoVibeCoding. Independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.
- Tool Name & Version: deplodock (project by NoVibeCoding), as observed on 2026-05-19.
- Source Signal URL: https://www.reddit.com/r/LocalLLaMA/comments/1thu987/an_overview_of_modern_llm_compiler_stack_writing/
- What's Covered: The author's architectural description, the six intermediate representations (IRs), the sixteen optimization passes, the autotuning approach, and reported performance comparisons against the PyTorch production stack on an RTX 5090. Code artifacts (Python, CUDA, CLI diffs) are referenced in the source.
- What's Not Covered: Independent performance verification, long-term workflow integration, broader model support beyond small transformers (TinyLlama, Qwen2.5-7B), or edge case behavior. This review does not assess the project's maintainability or community support.
What It Does
NoVibeCoding's 'deplodock' project is an interactive, hackable ML compiler built from scratch using pure Python and raw CUDA. It aims to demystify the complex world of production ML compiler stacks by providing a transparent, step-by-step implementation.
A Six-Stage IR Pipeline
The compiler lowers transformer operations through six distinct intermediate representations, each closer to hardware execution than the last. The pipeline begins with a Torch IR (a captured FX graph) and decomposes operations into a Tensor IR (elementwise, reduction, and index-map operations). It then transforms these into a Loop IR (fused loop nests), a Tile IR (GPU scheduling), and a Kernel IR (hardware primitives), and finally emits CUDA source code ready for nvcc compilation. Because each stage has its own representation, an optimization can be applied and inspected at the level where it is easiest to express.
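The staged lowering described above can be pictured with a minimal Python sketch. Only the six stage names come from the article; the `Program` class, the `lower` and `compile_to_cuda` functions, and the string-wrapping "transformation" are hypothetical placeholders and do not reflect deplodock's actual code.

```python
from dataclasses import dataclass, field
from typing import List

# Stage names follow the article; everything else here is a hypothetical sketch.
STAGES = ["torch_ir", "tensor_ir", "loop_ir", "tile_ir", "kernel_ir", "cuda_source"]


@dataclass
class Program:
    stage: str                      # which IR the program is currently expressed in
    body: str                       # placeholder for the real IR contents
    history: List[str] = field(default_factory=list)  # stages already passed through


def lower(program: Program) -> Program:
    """Advance a program by exactly one stage (placeholder transformation)."""
    idx = STAGES.index(program.stage)
    if idx == len(STAGES) - 1:
        raise ValueError("already at the final stage")
    nxt = STAGES[idx + 1]
    return Program(
        stage=nxt,
        body=f"{nxt}({program.body})",
        history=program.history + [program.stage],
    )


def compile_to_cuda(program: Program) -> Program:
    """Apply single-stage lowerings until the program reaches CUDA source."""
    while program.stage != STAGES[-1]:
        program = lower(program)
    return program


result = compile_to_cuda(Program(stage="torch_ir", body="matmul"))
print(result.stage)
print(result.history)
```

Keeping each lowering as a separate function is what makes the intermediate forms inspectable: any stage can be printed or diffed before the next one runs.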
Sixteen Mechanical Passes
The lower half of the pipeline, from Loop IR to CUDA, involves sixteen mechanical Tile-IR passes. These passes systematically transform the loop nest into a GPU schedule, mimicking the optimizations a human CUDA engineer would apply. Examples include splitting computations into blocks, mapping operations to threads, and staging inputs into shared memory (smem). Each pass is presented as a distinct CLI diff, illustrating the incremental changes to the kernel's schedule.
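To make the idea of a scheduling pass concrete, here is a small Python sketch of two such steps: splitting a loop into blocks and mapping the pieces to GPU block and thread indices. The `Loop` class and the `split`, `bind`, and `covered_indices` helpers are illustrative assumptions, not deplodock's API, and the sketch covers only the even-division case.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Loop:
    var: str
    extent: int
    binding: str = "serial"  # "serial", "blockIdx.x", or "threadIdx.x"


def split(loops: List[Loop], var: str, factor: int) -> List[Loop]:
    """Replace loop `var` with an outer loop of extent/factor and an inner loop of factor."""
    out = []
    for loop in loops:
        if loop.var != var:
            out.append(loop)
            continue
        if loop.extent % factor != 0:
            raise ValueError("this sketch only handles evenly divisible extents")
        out.append(Loop(f"{var}_outer", loop.extent // factor, loop.binding))
        out.append(Loop(f"{var}_inner", factor, loop.binding))
    return out


def bind(loops: List[Loop], var: str, binding: str) -> List[Loop]:
    """Mark loop `var` as executed by a GPU block or thread index."""
    return [Loop(l.var, l.extent, binding) if l.var == var else l for l in loops]


def covered_indices(nest: List[Loop]) -> List[int]:
    """Reconstruct the original index from (outer, inner) to check the split is lossless."""
    outer, inner = nest
    return [o * inner.extent + t for o in range(outer.extent) for t in range(inner.extent)]


nest = [Loop("i", 1024)]
nest = split(nest, "i", 128)                  # 1024 iterations -> 8 blocks of 128
nest = bind(nest, "i_outer", "blockIdx.x")    # outer loop becomes the grid dimension
nest = bind(nest, "i_inner", "threadIdx.x")   # inner loop becomes the thread dimension
for loop in nest:
    print(loop)
```

Each pass returns a new loop nest instead of mutating the old one, which is one simple way to produce the kind of before-and-after diff the article describes.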
Autotuning for Performance
Initially, the optimization parameters (e.g., block size, register tile, K-chunk, staging, double-buffering) were hand-picked using heuristics. The project later replaced these heuristics with an autotuning search loop, with the author specifically referencing Monte Carlo tree search. The goal of this shift is to generalize performance beyond the kernel shapes the heuristics were tuned for, letting the compiler discover good configurations automatically.
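The shape of an autotuning loop can be sketched in a few lines of Python. Note the substitution: this sketch uses plain exhaustive search over a tiny space, not the Monte Carlo tree search the author references. The parameter names echo the article, but the candidate values, the `toy_cost` formula, and the `autotune` function are made up for illustration; a real tuner would compile and time each candidate kernel instead.

```python
import itertools

# Hypothetical search space; parameter names echo the article, values are invented.
SEARCH_SPACE = {
    "block_size": [64, 128, 256],
    "register_tile": [2, 4, 8],
    "k_chunk": [16, 32],
    "double_buffer": [False, True],
}


def toy_cost(cfg: dict) -> float:
    """Stand-in for compiling and timing a kernel. The formula is invented."""
    cost = abs(cfg["block_size"] - 128) / 128
    cost += abs(cfg["register_tile"] - 4) / 4
    cost += abs(cfg["k_chunk"] - 32) / 32
    cost += 0.0 if cfg["double_buffer"] else 0.25
    return cost


def autotune(space: dict, measure) -> tuple:
    """Exhaustive search: evaluate every configuration and keep the cheapest."""
    keys = list(space)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


best_cfg, best_cost = autotune(SEARCH_SPACE, toy_cost)
print(best_cfg, best_cost)
```

Exhaustive search stops being practical once the space grows past a few thousand configurations, which is the usual motivation for guided strategies such as tree search.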
What's Interesting / What's Not
The most interesting aspect of deplodock is its from-scratch approach using pure Python and CUDA. This gives an unusual degree of transparency into the ML compilation process, in sharp contrast to the hundreds of thousands of lines of C++ in frameworks like TVM. The explicit, multi-stage IR pipeline and the pass-by-pass breakdown of sixteen optimizations make it a strong teaching resource for engineers who want to see how high-level ML graphs become efficient GPU code.
The performance claims are also notable. NoVibeCoding reports that the autotuned stack achieves a geomean of 0.96x versus the PyTorch production stack on an RTX 5090. Crucially, 32 of 84 kernel shapes reportedly beat PyTorch's hand-optimized kernels, with a maximum speedup of 5.6x. These are the author's claims and have not been independently verified, but they suggest that a custom compiler with shape-specific tuning can find performance headroom that even highly optimized existing frameworks leave untapped for specific workloads.
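A geomean near 1.0 and a handful of large per-shape wins can coexist, which is worth seeing in a small calculation. The Python sketch below assumes the ratios are expressed as speedups (values above 1.0 meaning faster than the baseline), which the source summary does not state explicitly. The per-shape numbers are hypothetical and are not deplodock's measurements.

```python
import math


def geomean(ratios):
    """Geometric mean: the exponential of the average of the logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-shape speedups versus a baseline (not measured data).
speedups = [0.7, 0.8, 0.9, 1.5]

overall = geomean(speedups)
wins = sum(1 for s in speedups if s > 1.0)

print(f"geomean speedup: {overall:.3f}")
print(f"shapes faster than baseline: {wins} of {len(speedups)}")
```

The geometric mean is the conventional choice for averaging ratios because a 2x win and a 0.5x loss cancel to exactly 1.0, whereas an arithmetic mean would report 1.25.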
What's less interesting, or rather, what limits its immediate practical applicability, is the project's scope. It focuses on
The Investor Read
The 'deplodock' project highlights the intense, low-level optimization frontier in AI infrastructure. While this specific project is educational, it underscores the performance gains still available from deeply customizing the ML compiler stack. The market for specialized compilers and hardware-aware optimization tools is growing, driven by the increasing cost and scale of LLM inference. Companies like Modular (Mojo) and Groq, along with internal efforts at NVIDIA, Google, and Meta (TorchDynamo/TorchInductor in PyTorch), are all tackling this problem, seeking to extract more performance from existing silicon. An investable company in this space would likely offer a more generalized, user-friendly, and production-hardened solution that abstracts away much of this complexity, or provide tooling that significantly accelerates the development of such custom compilers, rather than requiring a from-scratch build.
Every claim ties to a primary source. See our methodology.