Tools·Jun 2, 2026

DeepSeek V4 Flash: Patching GGUFs for Local 3x3090 Performance

This review details a community-developed technical fix for running DeepSeek V4 Flash locally on specific llama.cpp forks, addressing GGUF compatibility issues and providing concrete performance metrics.

TL;DR

Best for: Developers with multi-GPU setups (e.g., 3x RTX 3090) who want to run DeepSeek V4 Flash locally, especially those hitting GGUF loading errors caused by llama.cpp fork mismatches.

Skip if: You lack substantial memory (roughly 200GB of combined system RAM and VRAM is recommended) or would rather use pre-patched GGUFs (such as teamblobfish's) than patch files manually.

Bottom line: A community-driven patch makes DeepSeek V4 Flash loadable on specific llama.cpp forks by correcting metadata and tensor-naming inconsistencies, and delivers solid local performance for a complex MoE model.

METHODOLOGY

This v0 review draws on the author's published claims in the Reddit post listed under Sources; independent benchmarks are pending. Update cadence: we re-test when claims diverge from observed behavior.

This review covers DeepSeek V4 Flash, observed through the lens of a technical fix detailed by Reddit user etaoin314 on May 28, 2026. The source post, "DeepSeek V4 Flash at 8.4 tok/s on 3×3090: patching the GGUFs that won't load on cchuter's llama.cpp fork," outlines a specific problem and solution for local deployment. We analyze the author's claims regarding performance (8.4 tok/s on 3x3090), the technical details of the GGUF patching process (12 metadata keys, ~393 tensor renames), and the provided Python script. This v0 review does not cover independent performance verification, long-term workflow integration, or edge-case compatibility with other llama.cpp forks or hardware configurations.

WHAT IT DOES

Enables Local DeepSeek V4 Flash

DeepSeek V4 Flash is a 284B-total / 13B-active Mixture-of-Experts (MoE) model featuring a new architecture, including Compressed Sparse Attention with a lightning indexer, Sinkhorn-normalized hyperconnections, 256-expert routing, and native FP4/FP8 weights. Mainline llama.cpp does not yet support this architecture; it lives only in specific forks. The fix targets cchuter/llama.cpp @ feat/v4-port-cuda, allowing users with substantial local hardware (e.g., 3x RTX 3090 and 128GB DDR4 RAM) to run the model at approximately 8.4 tokens per second generation speed.
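As a rough sanity check on the hardware requirement, the arithmetic below estimates how large the weights alone would be at a few quantization levels and compares that with the poster's 72GB VRAM plus 128GB RAM. The bits-per-weight values are illustrative assumptions, not figures from the post, and the estimate ignores KV cache, activations, and operating-system overhead.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 284e9         # 284B total parameters, as stated in the review
MEMORY_BUDGET_GB = 72 + 128  # poster's rig: 72 GB VRAM + 128 GB system RAM

# Illustrative quantization levels (assumptions, not taken from the post).
for bpw in (2.0, 4.0, 5.0, 8.0):
    size = weight_footprint_gb(TOTAL_PARAMS, bpw)
    verdict = "fits" if size < MEMORY_BUDGET_GB else "does not fit"
    print(f"{bpw:.1f} bits/weight: ~{size:.0f} GB ({verdict} in {MEMORY_BUDGET_GB} GB)")
```

Under these assumptions, an average of about 5 bits per weight is close to the ceiling for a 200GB machine once runtime overhead is accounted for.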

Corrects GGUF Incompatibility

The core problem addressed is the incompatibility of many popular DeepSeek V4 Flash GGUFs, such as lovedheart/DeepSeek-V4-Flash-GGUF, with the cchuter fork. These GGUFs were often quantized against earlier forks (e.g., nisparks, PR #22378) that used different metadata schemas and tensor naming conventions. The fix resolves the "error loading model: key not found" failure and the missing-tensor errors that follow it, so that the actual weights load correctly.
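Before patching, it can help to confirm what a file declares in its fixed-size header. The sketch below reads the first 24 bytes of a GGUF file (magic, version, tensor count, metadata key count), assuming GGUF version 2 or later, where both counts are little-endian 64-bit integers. The function name is hypothetical and is not part of the poster's script.

```python
import struct


def read_gguf_header(path):
    """Return the fixed header fields of a GGUF v2+ file as a dict."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24:
        raise ValueError("file too short to contain a GGUF header")
    # 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count.
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Comparing the metadata key count of a failing file against a known-working one gives a quick first hint that keys are missing, before inspecting individual key names.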

Patches Metadata Keys

The cchuter loader requires 12 specific metadata keys that nisparks-era GGUFs lack. The post identifies these keys and their correct values, sourced from the official deepseek-ai/DeepSeek-V4-Flash/config.json and cross-checked against known-working GGUFs (teamblobfish's). Examples include deepseek4.attention.output_lora_rank (1024), deepseek4.attention.compress_ratios (a 44-integer array), and deepseek4.hyper_connection.sinkhorn_iterations (20).
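A minimal sketch of the kind of check this implies: given a GGUF's metadata as a plain dictionary, report which of the keys named above are missing or carry unexpected values. Only the three keys quoted in this review are included; the other nine are not listed here, so they are left out rather than guessed. The function and constant names are hypothetical.

```python
# Scalar keys and values quoted in the review.
EXPECTED_SCALARS = {
    "deepseek4.attention.output_lora_rank": 1024,
    "deepseek4.hyper_connection.sinkhorn_iterations": 20,
}
# The review gives only the length of this array, not its contents.
COMPRESS_RATIOS_KEY = "deepseek4.attention.compress_ratios"
COMPRESS_RATIOS_LEN = 44


def check_metadata(metadata):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key, expected in EXPECTED_SCALARS.items():
        if key not in metadata:
            problems.append(f"missing: {key}")
        elif metadata[key] != expected:
            problems.append(f"{key}: expected {expected}, found {metadata[key]}")
    ratios = metadata.get(COMPRESS_RATIOS_KEY)
    if ratios is None:
        problems.append(f"missing: {COMPRESS_RATIOS_KEY}")
    elif len(ratios) != COMPRESS_RATIOS_LEN:
        problems.append(
            f"{COMPRESS_RATIOS_KEY}: expected {COMPRESS_RATIOS_LEN} entries, found {len(ratios)}"
        )
    return problems
```

In practice the dictionary would be filled from the file's metadata section by whatever GGUF reader is at hand; that step is outside this sketch.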

Renames ~393 Tensors

Beyond metadata, approximately 393 tensor names differ between the nisparks and cchuter forks. The fix details specific renaming patterns: adding .weight to bare tensor names, changing hc_head_ prefixes to output_hc_, and changing attn_compress_ and indexer.compress_ to attn_compressor_ and indexer_compressor_, respectively. A notable special case is blk.N.exp_probs_b, which takes a .bias suffix rather than .weight; this tensor holds the aux-loss-free routing bias, so getting it right is crucial.
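The renaming patterns described above can be sketched as a single function. This is an illustration under stated assumptions, not the poster's script: it treats a "bare" name as one that ends in neither .weight nor .bias, and it reads the exp_probs_b rule as appending .bias. The tensor names in the usage lines are hypothetical examples.

```python
import re


def rename_tensor(name: str) -> str:
    """Map a nisparks-style tensor name to the cchuter-style name (sketch)."""
    # Prefix and infix renames named in the review.
    name = name.replace("indexer.compress_", "indexer_compressor_")
    name = name.replace("attn_compress_", "attn_compressor_")
    name = name.replace("hc_head_", "output_hc_")
    # Names that already carry a suffix are left alone (assumption).
    if name.endswith((".weight", ".bias")):
        return name
    # Special case: the routing bias gets .bias instead of .weight (assumption).
    if re.fullmatch(r"blk\.\d+\.exp_probs_b", name):
        return name + ".bias"
    return name + ".weight"


# Hypothetical tensor names, for illustration only.
for old in ("blk.0.exp_probs_b", "hc_head_scale", "blk.1.indexer.compress_gate"):
    print(old, "->", rename_tensor(old))
```

The function is written to be idempotent, so running it over a file that has already been patched leaves the names unchanged.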

WHAT'S INTERESTING / WHAT'S NOT

What's interesting here is the concrete, community-driven solution to a common problem in the fast-moving local LLM ecosystem: fork fragmentation and GGUF compatibility. The detailed breakdown of 12 metadata keys and ~393 tensor renames provides actionable intelligence for anyone attempting to run DeepSeek V4 Flash or similar bleeding-edge models locally. The explicit performance claim of 8.4 tokens per second on a 3x RTX 3090 setup offers a valuable baseline for a 284B-total / 13B-active MoE model, demonstrating that complex architectures can achieve usable inference speeds on consumer-grade hardware with sufficient VRAM. The provided Python patching script, which stream-copies the weight blob into a new file with a rewritten header, is an elegant solution to the immutability of GGUF files, completing in about four minutes on NVMe.
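The stream-copy idea can be sketched as follows. In the GGUF format, tensor offsets are stored relative to the start of the aligned data section, so the weight blob can be copied byte for byte behind a header of a different length. The function name and parameters are hypothetical; building the new header bytes and locating the old data offset are assumed to happen elsewhere, and the alignment of 32 is the GGUF default when general.alignment is not set.

```python
import shutil


def write_patched_file(src_path, dst_path, old_data_offset, new_header, alignment=32):
    """Write new_header, pad to alignment, then stream-copy the tensor data.

    Returns the offset at which the data section starts in the new file.
    """
    padding = (-len(new_header)) % alignment
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(new_header)
        dst.write(b"\x00" * padding)
        # Skip the old header and copy the rest in 16 MiB chunks,
        # so the full file never has to fit in memory.
        src.seek(old_data_offset)
        shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
    return len(new_header) + padding
```

Writing to a new file rather than editing in place also keeps the original GGUF intact if the patch turns out to be wrong.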

What's less interesting, or rather what highlights the challenges, is that such a patch is necessary at all. The rapid evolution of llama.cpp forks and model quantization conventions creates a compatibility gap that users must bridge manually. While the solution is effective, it underscores the fragility of relying on specific, non-mainline forks for cutting-edge models. The author's self-description as a "vibe coder" reinforces the ad-hoc, exploratory nature of this work, which, while valuable, suggests the fix may not have been tested across all variations of hardware, llama.cpp builds, or GGUF quantizations. The post does not discuss how DeepSeek V4 Flash's architectural innovations affect the patching process or how the patch might need to change with future llama.cpp updates.

PRICING

DeepSeek V4 Flash is an open-source model. The llama.cpp fork and the patching script are also open-source and free to use. Running the model locally requires significant hardware investment; the poster used a 3x RTX 3090 setup with 72GB VRAM total and 128GB DDR4 RAM. The only mentioned subscription is the author's Claude subscription, which is unrelated to the tool itself. Pricing snapshot: May 28, 2026.

VERDICT

For developers committed to running DeepSeek V4 Flash locally on llama.cpp, particularly those with multi-GPU setups like 3x RTX 3090, this patching method is currently the most direct route to achieving functional inference. The detailed breakdown of metadata and tensor renames, coupled with the provided Python script, directly addresses the GGUF incompatibility issues arising from llama.cpp fork evolution. While the solution is specific to the cchuter/llama.cpp @ feat/v4-port-cuda fork and requires manual intervention, it delivers a concrete 8.4 tokens per second generation speed, making a complex MoE model accessible for local experimentation and use. This is a crucial fix for early adopters, bridging the gap until broader llama.cpp support or standardized GGUF releases emerge.

WHAT WE'D TEST NEXT

Our next steps would involve independently verifying the claimed 8.4 tokens per second generation speed on a similar 3x RTX 3090 setup, using the patched lovedheart/DeepSeek-V4-Flash-GGUF. We would also test the patch's compatibility with other DeepSeek V4 Flash GGUFs available on Hugging Face, beyond the lovedheart quantization. Further investigation would include benchmarking performance across different llama.cpp build configurations (e.g., different CUDA versions, compiler flags) and evaluating the patch's robustness against minor updates to the cchuter fork. We would also explore the impact of varying system RAM and VRAM configurations on inference speed and stability, particularly for setups with less than 200GB combined memory.

Sources · how we verified
  1. DeepSeek V4 Flash at 8.4 tok/s on 3×3090: patching the GGUFs that won't load on cchuter's llama.cpp fork

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.