RTX 5080 16GB: Qwen3.6 MoE Benchmarks Find MTP Counterproductive at 128k Context
This review analyzes performance benchmarks for Qwen3.6 MoE models on an RTX 5080 16GB, detailing the unexpected impact of Multi-Token Prediction (MTP) on local LLM inference speeds and VRAM utilization.
The Answer Up Front
For indie founders and developers targeting local LLM inference on an RTX 5080 16GB, the Qwen3.6 35B Q4_K_XL model without Multi-Token Prediction (MTP) is the best of the configurations tested. This setup delivers 97 tokens/second (tok/s) by keeping as many expert layers as possible in GPU VRAM, making it the fastest option tested for high-context workloads. Skip MTP for MoE models on this specific hardware: its VRAM reservation pushes additional expert layers to the CPU, creating a bottleneck that outweighs the gains from speculative generation. If 56k context is sufficient, or if you are working with a 12GB card, the 27B IQ3 model is a viable alternative.
Methodology
This v0 review draws on the published claims and detailed benchmark tables in a Reddit post by gaztrab on r/LocalLLaMA. Independent benchmarks are pending, and we will re-test when claims diverge from observed behavior. The tests were conducted on an RTX 5080 16GB GPU paired with a Ryzen 9 9950X CPU and 128GB RAM. All inference used llama.cpp, specifically version b9204 (mainline). Three Qwen3.6 MoE model configurations were evaluated:
- Qwen3.6-27B MTP-UD-IQ3_XXS (12.45 GB)
- Qwen3.6-35B-A3B MTP-UD-Q4_K_XL (~22 GB)
- Qwen3.6-35B-A3B MTP-Q8_0 (~36 GB)
Benchmarks focused on token generation speed (tok/s), VRAM usage, and the impact of MTP at various context lengths, including real coding-agent context lengths up to 131k tokens. The flags common to the MTP runs were: -np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2. This review covers the reported performance metrics and the technical explanations offered for the observed behavior. It does not cover independent performance verification, long-term workflow integration, or edge-case handling.
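As a rough illustration of how the runs differ, the flags quoted above can be assembled into an argument list. This is a sketch under stated assumptions, not the poster's actual script: the binary name llama-server and the model path are placeholders, and it assumes the non-MTP run simply drops the two --spec-* flags. The --fit-target values come from the figures discussed later in this review.

```python
import shlex

# Flags quoted in the benchmark post, split into a shared part and the MTP-only part.
BASE_FLAGS = "-np 1 --fit on -fa on -t 20 --no-mmap --jinja -ctk q8_0 -ctv q8_0"
MTP_FLAGS = "--spec-type draft-mtp --spec-draft-n-max 2"


def build_command(model_path, mtp=True, fit_target=None, binary="llama-server"):
    """Return the argument list for one benchmark run.

    Assumptions (not confirmed by the source): the binary name, and that the
    non-MTP run is the same command minus the --spec-* flags.
    """
    cmd = [binary, "-m", model_path, *shlex.split(BASE_FLAGS)]
    if fit_target is not None:
        cmd += ["--fit-target", str(fit_target)]
    if mtp:
        cmd += shlex.split(MTP_FLAGS)
    return cmd


MODEL = "models/qwen3.6-35b-a3b-q4_k_xl.gguf"  # hypothetical path

with_mtp = build_command(MODEL, mtp=True, fit_target=1536)
without_mtp = build_command(MODEL, mtp=False, fit_target=0)

print(shlex.join(with_mtp))
print(shlex.join(without_mtp))
```

Keeping the two variants in one function makes it harder for an A/B comparison to drift apart on an unrelated flag.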
What It Does
Qwen3.6 MoE models
The Qwen3.6 series represents a family of Mixture-of-Experts (MoE) large language models, designed for efficient inference by activating only a subset of 'expert' networks per token. The models tested include 27B and 35B parameter counts, with various quantizations like IQ3_XXS, Q4_K_XL, and Q8_0, which reduce model size and VRAM footprint at the cost of some precision.
Multi-Token Prediction (MTP)
MTP, recently merged into llama.cpp at b9190, is a speculative decoding technique. It aims to accelerate inference by predicting multiple tokens ahead and then verifying them with the full model, accepting sequences of correct predictions. This can significantly boost generation speed if the speculative draft model is accurate, reducing the number of full model passes.
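The draft-then-verify idea can be sketched with a toy greedy decoder. This is a minimal illustration, not llama.cpp's implementation: draft_next and target_next are hypothetical stand-ins for the cheap predictor and the full model, and a real engine verifies all drafted tokens in one batched pass rather than one call per token.

```python
def speculative_step(context, draft_next, target_next, draft_len):
    """Run one draft-then-verify step under greedy decoding.

    Returns the tokens emitted by this step: the accepted drafts, plus one
    token supplied by the full model (a correction or a bonus token).
    """
    # Draft phase: cheaply propose draft_len tokens.
    proposed = []
    for _ in range(draft_len):
        proposed.append(draft_next(context + proposed))

    # Verify phase: keep drafted tokens only while the full model agrees.
    accepted = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the full model's correction ends the step
            return accepted
        accepted.append(tok)

    # Every draft was accepted, so the full model adds one more token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models: the "full model" counts modulo 5; the drafter is wrong
# whenever the context length is a multiple of 3.
def target_next(ctx):
    return len(ctx) % 5


def draft_next(ctx):
    return 9 if len(ctx) % 3 == 0 else len(ctx) % 5


print(speculative_step([0], draft_next, target_next, draft_len=2))     # [1, 2, 3]
print(speculative_step([0, 1], draft_next, target_next, draft_len=2))  # [2, 3]
```

In the first call both drafts are accepted and the step yields three tokens; in the second the drafter's second guess is rejected, so the step yields two.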
Local inference on consumer hardware
The core activity is running these large language models directly on a consumer-grade GPU (RTX 5080 16GB) rather than relying on cloud APIs. This requires careful management of VRAM, model quantization, and llama.cpp configuration parameters like --fit-target to optimize for speed and context length.
What's Interesting / What's Not
The most significant finding is the counter-intuitive performance of Multi-Token Prediction (MTP) on the RTX 5080 16GB with Qwen3.6 MoE models. While MTP is designed to accelerate generation, the gaztrab benchmarks show it makes the 35B Q4_K_XL configuration slower, not faster: the model achieved 97 tok/s without MTP and 74 tok/s with MTP enabled, a drop of roughly 24% (23/97).
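The size of the gap follows directly from the two reported throughput figures. A short check, using only the 97 tok/s and 74 tok/s numbers from the post:

```python
def relative_slowdown(baseline_tps, slower_tps):
    """Fraction of throughput lost relative to the baseline."""
    return (baseline_tps - slower_tps) / baseline_tps


def relative_speedup(baseline_tps, faster_tps):
    """Fraction of throughput gained relative to the baseline."""
    return (faster_tps - baseline_tps) / baseline_tps


NO_MTP_TPS = 97.0    # reported, 35B Q4_K_XL without MTP
WITH_MTP_TPS = 74.0  # reported, 35B Q4_K_XL with MTP

print(f"MTP is {relative_slowdown(NO_MTP_TPS, WITH_MTP_TPS):.1%} slower")   # 23.7%
print(f"Dropping MTP is {relative_speedup(WITH_MTP_TPS, NO_MTP_TPS):.1%} faster")
```

The two percentages differ because they use different baselines, which is worth keeping in mind when comparing figures across posts.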
This performance degradation is attributed to MTP's VRAM requirements. MTP necessitates reserving approximately 1.5 GB of VRAM for its compute buffer (--fit-target 1536). For MoE models, this VRAM reservation forces about three additional expert layers to offload from the GPU to the CPU. Since CPU-bound expert layers become the primary bottleneck for MoE inference, MTP's speculative gains (reported at a 79.5% acceptance rate for the 35B Q4_K_XL model) are insufficient to compensate for the increased CPU overhead.
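A back-of-the-envelope model shows why a 79.5% acceptance rate can still lose. The sketch below assumes each drafted token is accepted independently with the reported rate, which is a simplification (the post reports an aggregate rate, not a per-position probability), and that each verification pass always emits one token from the full model. The draft length of 2 is taken from --spec-draft-n-max 2.

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens emitted per verification pass.

    Assumes independent per-token acceptance: 1 + p + p^2 + ... + p^draft_len.
    """
    return sum(accept_prob ** k for k in range(draft_len + 1))


def implied_pass_cost_ratio(tps_plain, tps_spec, tokens_per_pass):
    """How many times longer a speculative pass takes than a plain decode step,
    given the measured throughputs and the modeled tokens per pass."""
    plain_step_s = 1.0 / tps_plain
    spec_pass_s = tokens_per_pass / tps_spec
    return spec_pass_s / plain_step_s


tokens = expected_tokens_per_pass(0.795, 2)
ratio = implied_pass_cost_ratio(97.0, 74.0, tokens)

print(f"expected tokens per pass: {tokens:.2f}")
print(f"implied cost of one MTP pass vs one plain step: {ratio:.2f}x")
```

Under these assumptions each pass yields a little over 2.4 tokens, so MTP only wins if a pass costs less than about 2.4 plain steps. The reported throughputs imply each pass costs noticeably more than that, which is consistent with extra expert layers running on the CPU.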
For developers, this means blindly enabling new llama.cpp features like MTP without benchmarking can lead to suboptimal performance. The --fit-target 0 configuration, which allows llama.cpp to fully utilize available VRAM without reserving space, yielded the best results for the 35B Q4_K_XL model, achieving 97 tok/s and using 15,815 MiB of VRAM. This highlights the critical importance of VRAM management and understanding the specific architectural demands of MoE models on constrained hardware.
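The effect of the reservation on layer placement can be sketched with simple integer arithmetic. The numbers here are illustrative assumptions, not measurements: 16384 MiB is the nominal card size, the 2048 MiB fixed overhead is hypothetical, and the 512 MiB per expert layer is inferred from the post's claim that a 1536 MiB reservation displaces about three layers.

```python
def layers_on_gpu(vram_mib, fixed_mib, reserve_mib, per_layer_mib):
    """Number of whole expert layers that fit after fixed costs and the reserve."""
    usable = vram_mib - fixed_mib - reserve_mib
    return max(0, usable // per_layer_mib)


VRAM_MIB = 16384       # nominal 16 GiB card
FIXED_MIB = 2048       # hypothetical: KV cache, non-expert weights, buffers
PER_LAYER_MIB = 512    # assumption inferred from "1.5 GB displaces ~3 layers"

no_reserve = layers_on_gpu(VRAM_MIB, FIXED_MIB, 0, PER_LAYER_MIB)
with_reserve = layers_on_gpu(VRAM_MIB, FIXED_MIB, 1536, PER_LAYER_MIB)

print(f"layers on GPU with --fit-target 0:    {no_reserve}")
print(f"layers on GPU with --fit-target 1536: {with_reserve}")
print(f"layers pushed to CPU by the reserve:  {no_reserve - with_reserve}")
```

The absolute layer counts depend entirely on the assumed overhead; the point is that the difference between the two settings is fixed by the reserve size divided by the per-layer footprint.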
The 27B IQ3 model, at 12.45 GB, fits entirely on the 16GB GPU and reaches 73 tok/s with MTP. That makes it a strong contender for users who do not require the full 128k context length, and a candidate for 12GB cards, although a 12.45 GB file leaves little or no room for the KV cache there, so some offloading is likely. For maximum performance on 16GB, the 35B Q4_K_XL without MTP is the clear winner.
Pricing
This review focuses on local LLM inference. There are no direct software pricing tiers for llama.cpp or the Qwen3.6 models, as they are open-source. The primary cost is the hardware itself, specifically the RTX 5080 16GB GPU and supporting system components. (Pricing snapshot: 2026-05-20)
Verdict
For indie developers and founders leveraging an RTX 5080 16GB for local LLM inference, the optimal configuration is the Qwen3.6 35B Q4_K_XL model running on llama.cpp without Multi-Token Prediction. This setup maximizes GPU utilization, delivering 97 tok/s by avoiding the CPU bottleneck introduced when MTP's VRAM reservation pushes MoE expert layers off the GPU. While MTP offers speculative decoding benefits, its VRAM overhead on 16GB cards with MoE models proves detrimental. The 27B IQ3 model is a strong alternative for users with less VRAM or lower context requirements, but the 35B Q4_K_XL without MTP provides the best performance for high-context tasks on the specified hardware.
What We'd Test Next
Our next steps would involve independently reproducing gaztrab's benchmarks to verify the MTP performance anomaly across a wider range of MoE models and hardware configurations. We would specifically test the impact of MTP on non-MoE models to isolate whether the issue is MoE-specific or a general characteristic of MTP's VRAM reservation. Further investigation would include benchmarking with different CPU architectures and RAM speeds to quantify their influence on CPU-bound expert layers. We would also evaluate the long-term stability and generation quality of the recommended configuration under sustained load, and explore the --fit-target parameter's sweet spot for various VRAM capacities beyond 16GB.
The investor read
This benchmark highlights the continued importance of hardware optimization and efficient inference frameworks like llama.cpp for local LLM deployment. The finding that a new feature (MTP) can degrade performance due to VRAM constraints underscores the complexity of optimizing for specific hardware, particularly for MoE models. This signals a persistent demand for high-VRAM consumer GPUs and specialized software solutions that can intelligently manage memory and computation. Companies developing tools for local LLM deployment, especially those focusing on quantization, dynamic offloading, or novel speculative decoding methods that are less VRAM-hungry, are well-positioned. The community-driven nature of these detailed benchmarks on platforms like Reddit also indicates a strong, active user base willing to invest time in performance tuning, suggesting a market for premium, pre-optimized local inference solutions or hardware bundles.