Tools·Jun 19, 2026

DeepSeek-V4-Flash on Legacy GPUs: Custom Kernels Deliver 255 Prefill tok/s

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 19, 2026·6 min read·1 source

A Reddit user details a custom setup for running DeepSeek-V4-Flash on 4x RTX 2080 Ti GPUs. This review examines the technical optimizations and claimed performance on budget hardware.

The ability to run large language models (LLMs) on consumer-grade hardware is a persistent challenge, often requiring significant compromises or high-end, expensive GPUs. Known_Ice9380, a Reddit user, published a detailed account of running DeepSeek-V4-Flash (284B total, 13B active) on a budget machine built with four legacy RTX 2080 Ti GPUs, claiming a prefill performance of 255 tokens/s.

The Answer Up Front

This project is for system and compiler hackers, budget-conscious researchers, or anyone determined to extract maximum performance from older hardware. It demonstrates that significant LLM inference capabilities can be achieved without an H100 cluster, provided one is willing to invest substantial engineering effort into low-level optimization. Skip this approach if you require off-the-shelf performance, enterprise-grade support, or lack the expertise for deep technical customization. The bottom line is that specialized hardware-software co-optimization can push the boundaries of legacy systems for demanding AI workloads.

Methodology

This v0 review draws on Known_Ice9380's published claims and technical details shared on Reddit, specifically the post titled "Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!" The review covers the author's stated hardware configuration, the described technical breakthroughs, and the reported performance metric. Independent benchmarks are pending, as is the full technical report the author submitted to arXiv. This review does not cover long-term workflow integration, broader applicability to other MoE models, or edge-case performance. Update cadence: re-tested when claims diverge from observed behavior.

Tool/Project Name: DeepSeek-V4-Flash custom inference setup
Version: Not explicitly versioned; implementation linked to a GitHub repository.
Date Observed: 2026-05-20
Source Signal URL: https://www.reddit.com/r/LocalLLaMA/comments/1ti5sxu/running_deepseekv4_locally_with_4x_legacy_rtx/
What's Covered: The author's claims, hardware specifications, technical approach (CUDA kernel development, memory management, communication overlap), and the reported prefill tokens/s performance.
What's NOT Covered: Independent performance verification, long-term stability, generalizability to other LLMs, or detailed inference latency beyond prefill.

What It Does

The project demonstrates a method for running the DeepSeek-V4-Flash model on a consumer-grade, budget-friendly hardware setup. The core of the approach lies in a series of technical optimizations designed to circumvent the limitations of older GPUs and PCIe Gen3 bandwidth.

Custom Turing CUDA Kernels

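Known_Ice9380 developed custom CUDA kernels specifically for the Turing architecture of the RTX 2080 Ti. These kernels accelerate W8A8 matrix multiplication (8-bit integer weights and 8-bit integer activations), a critical operation in quantized LLM inference. Storing each value in one byte instead of the two used by FP16 halves both the memory footprint and the data that has to move between devices, which targets the VRAM and PCIe bandwidth bottlenecks of the older hardware.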
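To make the W8A8 idea concrete, here is a minimal pure-Python sketch of a quantized matrix multiply. It is not the author's CUDA kernel. The symmetric per-row scaling and the helper names (quantize_rows, w8a8_matmul) are assumptions for illustration, since the post does not describe the quantization granularity.

```python
# Illustrative W8A8 matmul in pure Python; a sketch, not the author's Turing kernel.

def quantize_rows(matrix):
    """Symmetric per-row INT8 quantization (assumed scheme).

    Returns (rows of ints in [-127, 127], one float scale per row).
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(activations, weights):
    """Approximate activations @ weights^T using INT8 operands.

    activations: M x K floats, quantized per row (per token)
    weights:     N x K floats, quantized per row (per output channel)
    """
    qa, sa = quantize_rows(activations)
    qw, sw = quantize_rows(weights)
    out = []
    for a_row, a_scale in zip(qa, sa):
        out_row = []
        for w_row, w_scale in zip(qw, sw):
            # Integer multiply-accumulate; a GPU kernel would keep this in INT32.
            acc = sum(a * w for a, w in zip(a_row, w_row))
            # Dequantize with the product of the two scales.
            out_row.append(acc * a_scale * w_scale)
        out.append(out_row)
    return out


def float_matmul(activations, weights):
    """Reference result in floating point, for comparison."""
    return [[sum(a * w for a, w in zip(a_row, w_row)) for w_row in weights]
            for a_row in activations]


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, 0.4, -0.6], [1.0, -0.5, 0.3]]
approx = w8a8_matmul(x, w)
exact = float_matmul(x, w)
max_err = max(abs(p - q) for pr, qr in zip(approx, exact) for p, q in zip(pr, qr))
print(max_err)  # small quantization error relative to the float result
```

The inner product runs entirely on small integers, which is the part a Tensor Core kernel would accelerate. The floating-point work is limited to one scale multiplication per output element.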

Heterogeneous Inference

The implementation combines static memory splitting with dynamic offloading. According to the author, this makes full use of the combined VRAM of the four GPUs (11GB or 22GB per card, so 44GB to 88GB in total) together with 1TB of system RAM, allowing the large DeepSeek-V4-Flash model to fit within the available memory footprint.
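A rough budget shows why offloading is unavoidable here. The sketch below assumes one byte per weight (W8A8) and ignores quantization scales, activations, and the KV cache. The greedy planner and the toy tensor names and sizes are hypothetical and only illustrate what a static split could look like; they are not taken from the author's code.

```python
GB = 10**9


def weight_bytes(num_params, bits_per_weight=8):
    """Bytes needed for the weights alone at the given precision."""
    return num_params * bits_per_weight // 8


def plan_static_split(tensors, gpu_budgets):
    """Greedy static placement (hypothetical policy).

    tensors:     list of (name, size_bytes, priority); higher priority is placed first
    gpu_budgets: list of free bytes per GPU
    Returns (placement dict, remaining free bytes per GPU).
    """
    free = list(gpu_budgets)
    placement = {}
    for name, size, _priority in sorted(tensors, key=lambda t: -t[2]):
        for gpu, room in enumerate(free):
            if size <= room:
                free[gpu] -= size
                placement[name] = f"gpu{gpu}"
                break
        else:
            # Nothing fits on any GPU: keep the tensor in host RAM.
            placement[name] = "cpu"
    return placement, free


# 284B total parameters at 8 bits each, against 4 cards of 11 GB or 22 GB.
total = weight_bytes(284 * 10**9)
for per_card_gb in (11, 22):
    vram = 4 * per_card_gb * GB
    print(per_card_gb, total // GB, vram // GB, (total - vram) // GB)
# prints "11 284 44 240" then "22 284 88 196"

# Toy example with made-up tensor groups and two 10 GB devices.
toy = [("attention", 6 * GB, 2), ("shared_expert", 4 * GB, 2),
       ("experts_a", 9 * GB, 1), ("experts_b", 9 * GB, 1)]
placement, free = plan_static_split(toy, [10 * GB, 10 * GB])
print(placement)
```

Even with 22GB cards, roughly 196GB of weights would have to live in system RAM under these assumptions, so what stays resident on the GPUs and what is streamed on demand largely determines throughput.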

Computation-Communication Overlap

To mitigate the significant multi-GPU communication overhead inherent in Mixture-of-Experts (MoE) models, the project employs a pipelined execution strategy. This technique aims to hide communication latency by overlapping it with computation, a common optimization in distributed systems.
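The benefit of overlap can be estimated with a toy timeline model. In the sketch below, each chunk of work needs some compute time followed by a transfer, with one compute stream and one communication link. The chunk timings are made-up numbers, and the model is a simplification, not the author's scheduler.

```python
def serial_time(compute, comm):
    """No overlap: every transfer blocks the next chunk's compute."""
    return sum(compute) + sum(comm)


def pipelined_time(compute, comm):
    """Two-stage pipeline: chunk i is transferred while chunk i+1 is computed.

    A transfer starts once its chunk is computed and the link is free.
    """
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c                              # compute runs back to back
        comm_done = max(comm_done, compute_done) + m   # wait for data and for the link
    return comm_done


# Four chunks, 4 time units of compute and 3 of communication each (hypothetical).
compute = [4.0, 4.0, 4.0, 4.0]
comm = [3.0, 3.0, 3.0, 3.0]
print(serial_time(compute, comm))     # 28.0
print(pipelined_time(compute, comm))  # 19.0
```

In this model only the final transfer remains exposed, so the total approaches the pure compute time as long as each transfer is shorter than the compute it hides behind. When communication dominates, the link becomes the bottleneck and overlap helps less.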

Budget Hardware Configuration

The entire setup, costing less than $2,500 (the post's title describes it as a "$2k budget setup"), comprises an Intel Xeon E5-2696 v4 CPU, four RTX 2080 Ti GPUs (each with 11GB or 22GB VRAM), and 1TB of DDR4 ECC RAM. The configuration pairs a high-core-count CPU for general system tasks with the Tensor Cores of the 2080 Ti for AI acceleration.

What's Interesting / What's Not

The most interesting aspect is the depth of optimization. Custom CUDA kernel development for a specific, legacy architecture (Turing) to accelerate W8A8 quantization is a non-trivial undertaking. It highlights that significant performance gains are still available at the hardware-software interface, even on older silicon, for those willing to invest the engineering effort. This is a direct counter-narrative to the prevailing idea that only the latest H100s or equivalent can handle frontier models.

The heterogeneous inference and computation-communication overlap strategies are well-established techniques in distributed computing and large model inference. Their application here is sound and necessary given the hardware constraints, but not novel in principle. What is noteworthy is their effective integration into a cohesive system that delivers a claimed 255 prefill tokens/s on such a constrained budget. The project's open-source nature, with a linked GitHub repository, provides a concrete artifact for review and potential replication.

What's less interesting, from an analytical standpoint, is the reliance on a single, unverified performance claim. While the technical approach is sound, the reported 255 prefill tokens/s lacks independent corroboration or a detailed public benchmark report (the technical report is pending arXiv clearance). Until that changes, it remains the author's claim, not a verified benchmark. The post also reports only prefill performance and leaves decoding speed unaddressed, even though decoding is often the bottleneck for interactive LLM use.

Pricing

The described hardware setup cost less than $2,500 total, reflecting a budget-conscious approach to local LLM inference. The implementation itself is open-sourced and available at no cost.

Verdict

For developers and researchers with deep systems-level expertise and a limited budget, this project offers a compelling blueprint for running large MoE models like DeepSeek-V4-Flash on readily available legacy hardware. If the reported numbers hold up under independent testing, it shows that sophisticated software optimization can compensate for limited raw compute and push the performance envelope of older GPUs. If your primary constraint is hardware cost and you have the engineering talent to implement and maintain custom kernels, this approach is a viable path. However, for those seeking plug-and-play solutions or enterprise-grade stability, the required technical investment makes it a niche, albeit powerful, solution.

What We'd Test Next

Our next steps would involve independently replicating the setup using the provided GitHub repository to verify the claimed 255 prefill tokens/s. We would also benchmark decoding latency under various batch sizes and sequence lengths, as this is critical for real-time applications. Further testing would include evaluating how well the custom Turing kernels generalize to other MoE architectures and to non-MoE models, assessing the performance impact of different quantization schemes (e.g., W4A16), and measuring long-term stability and resource utilization during extended inference sessions. We would also measure the power consumption of the entire system under load.
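A replication would need to time prefill and decoding separately, since the two phases stress the hardware differently. The harness below is a minimal sketch of that measurement. The prefill_fn and decode_fn callables are hypothetical stand-ins for whatever interface the repository exposes, and the scripted clock exists only to make the example deterministic.

```python
import time


def measure_throughput(prefill_fn, decode_fn, prompt_tokens, new_tokens,
                       clock=time.perf_counter):
    """Time one prefill pass and `new_tokens` decode steps separately."""
    start = clock()
    state = prefill_fn(prompt_tokens)
    prefill_s = clock() - start

    start = clock()
    for _ in range(new_tokens):
        state = decode_fn(state)
    decode_s = clock() - start

    return {
        "prefill_tok_s": len(prompt_tokens) / prefill_s,
        "decode_tok_s": new_tokens / decode_s,
    }


# Stand-ins for a real engine (hypothetical): they only track a token count.
def fake_prefill(tokens):
    return len(tokens)


def fake_decode(state):
    return state + 1


# Scripted clock: prefill "takes" 5 s and decoding 4 s. These are not measurements.
ticks = iter([0.0, 5.0, 5.0, 9.0])
result = measure_throughput(fake_prefill, fake_decode, list(range(1000)), 20,
                            clock=lambda: next(ticks))
print(result)  # {'prefill_tok_s': 200.0, 'decode_tok_s': 5.0}
```

On a real GPU backend, the device would need to be synchronized before each clock reading, because kernel launches return before the work finishes. Repeating the measurement across prompt lengths and batch sizes would show whether a single headline figure is representative.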

The investor read

This project highlights a significant market segment for highly optimized, local LLM inference, particularly for budget-constrained users, small businesses, or those prioritizing data privacy and control. The ability to extract substantial performance from legacy hardware signals a demand for tooling that can bridge the gap between cutting-edge AI models and accessible compute. Companies specializing in efficient inference, custom compiler technologies, or hardware-aware quantization (such as Groq), along with the various open-source inference engines, are well-positioned. While this specific project is an open-source, deliberately small-scale effort by an individual, it underscores the value of expertise in low-level optimization. An investable company in this space would offer a productized, user-friendly solution that abstracts away the complexity of custom kernel development, providing similar performance gains across a wider range of hardware without requiring deep engineering skill from the end user.

Sources · how we verified

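Known_Ice9380, "Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!", Reddit, r/LocalLLaMA. https://www.reddit.com/r/LocalLLaMA/comments/1ti5sxu/running_deepseekv4_locally_with_4x_legacy_rtx/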

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
