Qwen3.6 35B-A3B MTP achieves 249 t/s on a consumer GPU
This review analyzes Qwen3.6 35B-A3B MTP's performance on a 24GB laptop-class RTX 5090, focusing on what its Mixture-of-Experts (MoE) architecture and Multi-Token Prediction (MTP) speculative decoding offer for local LLM inference.
TL;DR
Best for: Developers needing high-throughput local LLM inference on consumer-grade GPUs with 24GB VRAM, especially for tasks benefiting from speculative decoding and MoE efficiency.
Skip if: Your workflow requires 'thinking mode' outputs from Qwen, or if you need PagedAttention for multi-user concurrency.
Bottom line: Qwen3.6 35B-A3B MTP delivers exceptional token generation rates by combining MoE efficiency with speculative decoding, making 24GB consumer GPUs viable for advanced local LLM use cases.
METHODOLOGY
This v0 review draws on the claims published by the original poster at the Reddit URL below; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The tool reviewed is unsloth/Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL, observed on 2026-05-23. The source signal URL is https://www.reddit.com/r/LocalLLaMA/comments/1tlfp5s/qwen36_35ba3b_mtp_hits_249_ts_on_a_24gb_consumer/.
This review covers the performance claims, architectural details (MoE, MTP), llama.cpp integration specifics, and context scaling data as presented by aurelienams. The test machine was a laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s memory bandwidth) running Linux, with a ggml-org/llama.cpp master build from a few days before the post. That build included am17an's MTP merge (#22673), ggerganov's n_max=3 default cleanup (#23269), and NVIDIA backend sampling work (#23287, merged 2026-05-20). Performance was measured over 10 back-to-back runs of a 'Space Invaders HTML completion' task, generating 2000 tokens each, in a single-user stream.
Not covered: independent performance verification, long-term workflow stability under sustained agentic load (beyond 3.5 minutes), and edge-case behaviors not detailed in the source. Concurrency with PagedAttention was also not tested.
WHAT IT DOES
MoE architecture for efficiency
Qwen3.6 35B-A3B MTP uses a Mixture-of-Experts (MoE) architecture with 128 routed experts plus 1 shared expert. During a forward pass, the router activates approximately 8 experts per token, so most expert weights sit idle for any given token. The model holds 35 billion parameters in total, but only about 3 billion are active per token (the "A3B" in the name), which makes the per-token compute cost far lower than that of a dense model of similar total size.
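To make the routing idea concrete, here is a minimal sketch of top-k expert selection using only the expert counts quoted above. The scoring scheme (softmax over the selected logits) and the random logits are assumptions for illustration; the source does not describe how Qwen's router actually normalizes or scores experts.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # from the review
TOP_K = 8                 # "approximately 8 experts per token" in the review


def route(logits, k=TOP_K):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1. The real router may normalize differently.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy router scores for one token (random stand-ins, not model output).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
chosen = route(logits)

# Share of total parameters touched per token, using the review's round numbers.
active_fraction = 3 / 35

print(f"experts used: {len(chosen)} of {NUM_ROUTED_EXPERTS} routed (+1 shared)")
print(f"active parameter share: {active_fraction:.1%}")
```

Only the selected experts (plus the shared one) would run for this token, which is where the compute saving comes from.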
MTP for speculative decoding
The model integrates Multi-Token Prediction (MTP) for speculative decoding. The MTP heads draft several future tokens cheaply, and the main model then verifies the drafts in a single pass, keeping the ones that match what it would have generated itself. The reported draft acceptance rate for Qwen3.6 35B-A3B MTP was 86.6% with n_max=3. If that rate is read as the share of drafted tokens accepted, each decode step yields about 1 + 0.866 x 3 ≈ 3.6 tokens on average, out of a maximum of 4. That 3.6x figure is an upper bound on the speedup over non-speculative decoding, since drafting and verification add their own cost.
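The tokens-per-step arithmetic can be sketched as follows. This assumes the acceptance rate means accepted draft tokens divided by drafted tokens, and that every step drafts exactly n_max tokens; the source does not spell out either definition, and the function name is illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, n_max: int) -> float:
    """Average tokens emitted per decode step under speculative decoding.

    Assumption: acceptance_rate = accepted drafts / drafted tokens, and each
    step drafts n_max tokens. The main model always contributes one token
    per step, so the result lies between 1 and n_max + 1.
    """
    return 1 + acceptance_rate * n_max


reported = expected_tokens_per_step(0.866, 3)  # non-thinking mode, per the source
thinking = expected_tokens_per_step(0.40, 3)   # ~40% acceptance with thinking mode on

print(f"reported acceptance: {reported:.2f} tokens/step")
print(f"thinking-mode acceptance: {thinking:.2f} tokens/step")
```

Under the same assumption, the ~40% acceptance rate reported with 'thinking mode' enabled would cut the yield to roughly 2.2 tokens per step.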
llama.cpp integration
The performance was observed using ggml-org/llama.cpp master, which recently incorporated key merges for MTP and NVIDIA backend sampling. The specific llama.cpp arguments used included --spec-type draft-mtp, --spec-draft-n-max 3, --ctx-size 262144, --cache-type-k q4_0, --cache-type-v q4_0, --batch-size 512, --ubatch-size 512, --parallel 1, and --flash-attn on. In short: MTP drafting with up to 3 draft tokens, a 262,144-token context window, a KV cache quantized to q4_0 to keep that context within VRAM, and a single inference slot for one user stream.
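As a convenience, the reported flags can be assembled into a single command line. This is a sketch only: the source does not name the binary or the model file, so llama-server and the .gguf filename below are placeholders, and the flags are copied from the post rather than verified against a particular llama.cpp build.

```python
import shlex

# Hypothetical values: the source names neither the binary nor the model path.
BINARY = "llama-server"
MODEL_PATH = "Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf"

# Flags exactly as listed in the source post.
FLAGS = {
    "--spec-type": "draft-mtp",
    "--spec-draft-n-max": "3",
    "--ctx-size": "262144",
    "--cache-type-k": "q4_0",
    "--cache-type-v": "q4_0",
    "--batch-size": "512",
    "--ubatch-size": "512",
    "--parallel": "1",
    "--flash-attn": "on",
}


def build_command(binary: str = BINARY, model: str = MODEL_PATH) -> list[str]:
    """Return the argv list for launching llama.cpp with the reported flags."""
    argv = [binary, "--model", model]
    for flag, value in FLAGS.items():
        argv += [flag, value]
    return argv


print(shlex.join(build_command()))
```

Keeping the flags in one mapping makes it easy to vary a single setting, such as --ctx-size, when reproducing the context scaling runs.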
WHAT'S INTERESTING / WHAT'S NOT
What's interesting about Qwen3.6 35B-A3B MTP is that it reaches 249.30 t/s on a 24GB laptop-class RTX 5090. That is roughly a 3.4x speedup over the 27B dense MTP variant, which ran at 74.28 t/s on the same hardware and configuration. The gain is attributed to the MoE-A3B architecture, which sharply reduces per-token compute, combined with the 86.6% MTP draft acceptance rate at n_max=3. Context scaling also stayed nearly flat, with throughput barely changing from 32K to 262K tokens, and the whole setup fit within 24GB of VRAM (22.4 GB total usage at 262K context).
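The headline ratios follow directly from the reported numbers. The short check below uses only figures quoted in this review; the variable names are illustrative.

```python
# Figures as reported in the source post.
moe_tps = 249.30         # Qwen3.6 35B-A3B MTP, tokens per second
dense_tps = 74.28        # 27B dense MTP variant, same hardware and config
vram_total_gb = 24.0     # laptop-class RTX 5090
vram_used_gb = 22.4      # total usage at 262K context
tokens_per_run = 2000    # generation length in each benchmark run

speedup = moe_tps / dense_tps
headroom_gb = vram_total_gb - vram_used_gb
seconds_per_run = tokens_per_run / moe_tps

print(f"speedup over dense: {speedup:.2f}x")           # 3.36x
print(f"VRAM headroom: {headroom_gb:.1f} GB")          # 1.6 GB
print(f"time per 2000-token run: {seconds_per_run:.2f} s")  # 8.02 s
```

The 1.6 GB of headroom at full context helps explain why a larger quantization leaves little room on a 24GB card.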
What's less compelling, or at least requires care, is the restriction around 'thinking mode'. Re-enabling it drops the MTP draft acceptance rate to ~40%, which suggests the MTP draft heads were trained on non-thinking outputs. This limits the model's usefulness for tasks that depend on explicit reasoning traces. Additionally, the Q4_K_XL quantization does not fit within 24GB of VRAM, so Q3_K_XL is the largest viable quantization. Testing was also limited to a single-stream, single-user scenario, so the benefits of PagedAttention concurrency are not explored. The long-term stability under sustained agentic load, particularly regarding potential degradation patterns observed in other models (e.g.,