Tools·May 26, 2026

Qwen3.6 35B-A3B MTP achieves 249 t/s on consumer GPU

By Riley · Tools desk · Human-reviewed · Verified May 26, 2026 · 1 source

This review analyzes Qwen3.6 35B-A3B MTP's performance on a 24GB RTX 5090M, focusing on how its Mixture-of-Experts (MoE) architecture and Multi-Token Prediction (MTP) speculative decoding benefit local LLM inference.

TL;DR

Best for: Developers needing high-throughput local LLM inference on consumer-grade GPUs with 24GB VRAM, especially for tasks benefiting from speculative decoding and MoE efficiency.

Skip if: Your workflow requires 'thinking mode' outputs from Qwen, or if you need PagedAttention for multi-user concurrency.

Bottom line: Qwen3.6 35B-A3B MTP delivers exceptional token generation rates by combining MoE efficiency with speculative decoding, making 24GB consumer GPUs viable for advanced local LLM use cases.

METHODOLOGY

This v0 review draws on the founder's published claims at the provided Reddit URL; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The tool reviewed is unsloth/Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL, observed on 2026-05-23. The source signal URL is https://www.reddit.com/r/LocalLLaMA/comments/1tlfp5s/qwen36_35ba3b_mtp_hits_249_ts_on_a_24gb_consumer/.

This review covers the performance claims, architectural details (MoE, MTP), llama.cpp integration specifics, and context scaling data as presented by aurelienams. The test machine was a laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s memory bandwidth) running Linux, with a ggml-org/llama.cpp master build from a few days before the post. That build included am17an's MTP merge (#22673), ggerganov's n_max=3 default cleanup (#23269), and NVIDIA backend sampling work (#23287, merged 2026-05-20). Performance was measured over 10 back-to-back runs of a 'Space Invaders HTML completion' task, generating 2000 tokens each, in a single-user stream.
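The measurement procedure above (10 runs of 2000 tokens each) reduces to a simple aggregation. The following is a minimal sketch of how such runs could be summarized; the function name and the durations are hypothetical placeholders for illustration, not figures from the source.

```python
import statistics

TOKENS_PER_RUN = 2000  # tokens generated per run, per the review


def summarize_runs(durations_s, tokens_per_run=TOKENS_PER_RUN):
    """Return (mean, sample standard deviation) of tokens/second across runs."""
    rates = [tokens_per_run / d for d in durations_s]
    return statistics.mean(rates), statistics.stdev(rates)


# PLACEHOLDER wall-clock durations in seconds. These are illustrative only
# and were NOT measured by the source or by this review.
placeholder_durations = [8.0, 8.1, 7.9, 8.0, 8.2, 8.0, 7.9, 8.1, 8.0, 8.0]

mean_tps, spread_tps = summarize_runs(placeholder_durations)
print(f"mean {mean_tps:.2f} t/s, stdev {spread_tps:.2f} t/s")
```

Reporting the spread alongside the mean makes it easier to tell whether back-to-back runs are stable or drifting, for example from thermal throttling on a laptop GPU.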

Not covered in this review: independent performance verification, long-term workflow stability under sustained agentic load (beyond 3.5 minutes), and edge-case behaviors not detailed in the source. Concurrency with PagedAttention was also not tested.

WHAT IT DOES

MoE architecture for efficiency

Qwen3.6 35B-A3B MTP uses a Mixture-of-Experts (MoE) architecture with 128 experts plus 1 shared expert. During a forward pass, the router activates approximately 8 experts per token. This design significantly reduces the per-token compute cost compared to a dense model of similar overall parameter count. The model has 35 billion parameters in total, but only about 3 billion are active per token, which is where the efficiency gain comes from.
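To make the routing idea concrete, here is a minimal sketch of generic top-k expert selection using the expert counts quoted above. It is not Qwen's actual router: the random logits, the `route_token` name, and the choice to renormalize weights over the selected experts are all assumptions for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, per the review
TOP_K = 8          # experts activated per token, per the review


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the chosen logits only, so the weights sum to 1
    # (an assumed normalization scheme, not confirmed for this model).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)  # fixed seed; the logits are random placeholders
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = route_token(logits)

print(len(routes), "of", NUM_EXPERTS, "routed experts used for this token")
print(f"active parameter share: {3 / 35:.1%}")  # ~3B active of 35B total
```

Only the selected experts' feed-forward weights are evaluated for a given token, which is why per-token compute tracks the roughly 3 billion active parameters rather than the full 35 billion.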

MTP for speculative decoding

The model uses Multi-Token Prediction (MTP) for speculative decoding. Its MTP draft heads propose several future tokens at once, and the main model then verifies them in a single pass. The reported draft acceptance rate for Qwen3.6 35B-A3B MTP was 86.6% with n_max=3, so most drafted tokens are kept. At that rate, each decode step yields roughly 3.6 tokens on average: the one token the main model produces itself plus about 2.6 accepted drafts out of 3. That is up to 3.6x the tokens per step of non-speculative decoding, before accounting for drafting and verification overhead.
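The tokens-per-step figure can be reproduced with a short calculation. This sketch assumes the acceptance rate is the fraction of drafted tokens accepted, that every step drafts n_max tokens, and that the verifying pass always contributes one token of its own; the function name is hypothetical.

```python
def expected_tokens_per_step(n_max, acceptance_rate):
    """Average tokens emitted per decode step under the assumptions above.

    One token always comes from the main model's verifying pass; on average
    n_max * acceptance_rate drafted tokens are accepted alongside it.
    """
    return 1 + n_max * acceptance_rate


tokens_per_step = expected_tokens_per_step(n_max=3, acceptance_rate=0.866)
print(f"{tokens_per_step:.2f} tokens per decode step")
```

The result is an upper bound on speedup rather than a measured one, since each speculative step also costs more than a plain decode step.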

llama.cpp integration

The performance was observed using ggml-org/llama.cpp master, which recently incorporated key merges for MTP and NVIDIA backend sampling. The specific llama.cpp arguments used included --spec-type draft-mtp, --spec-draft-n-max 3, --ctx-size 262144, --cache-type-k q4_0, --cache-type-v q4_0, --batch-size 512, --ubatch-size 512, --parallel 1, and --flash-attn on. The first two enable MTP drafting with up to 3 draft tokens per step, the q4_0 cache types quantize the KV cache, and --parallel 1 restricts the server to a single stream. Together these settings underpin the reported throughput and the 262,144-token context window.
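For readers who want to try the configuration, the flags listed above can be assembled into a launch command. This is a sketch under assumptions: the source does not say which llama.cpp binary was used, so `llama-server` is a guess, and the model path is a hypothetical local filename.

```python
import shlex

# Hypothetical local path; the source only names the repo and quantization.
MODEL_PATH = "models/Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf"


def build_server_args(model_path):
    """Build an argv list from the flags reported in the source post."""
    return [
        "llama-server",            # assumed binary, not stated in the source
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", "3",
        "--ctx-size", "262144",
        "--cache-type-k", "q4_0",
        "--cache-type-v", "q4_0",
        "--batch-size", "512",
        "--ubatch-size", "512",
        "--parallel", "1",
        "--flash-attn", "on",
    ]


argv = build_server_args(MODEL_PATH)
print(shlex.join(argv))  # shell-safe command line for copy and paste
```

Keeping the arguments as a list rather than a single string means the same value can be passed straight to `subprocess.run(argv)` without shell quoting issues.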

WHAT'S INTERESTING / WHAT'S NOT

What's interesting about Qwen3.6 35B-A3B MTP is that it reaches 249.30 t/s on a 24GB laptop-class RTX 5090. That is a 3.4x speedup over the 27B dense MTP variant, which ran at 74.28 t/s on the same hardware and configuration. The gain is attributed to the MoE-A3B architecture, which sharply reduces per-token compute, combined with MTP's 86.6% draft acceptance rate at n_max=3. Context scaling also stayed flat: throughput barely changed from 32K to 262K tokens, and the full 262K context fit within 24GB of VRAM (22.4 GB total usage).
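The headline ratios follow directly from the reported figures. The short check below uses only numbers quoted in this review; the variable names are illustrative.

```python
MOE_TPS = 249.30    # Qwen3.6 35B-A3B MTP, reported tokens/second
DENSE_TPS = 74.28   # 27B dense MTP variant, reported tokens/second

VRAM_TOTAL_GB = 24.0
VRAM_USED_GB = 22.4  # reported total usage at 262K context

speedup = MOE_TPS / DENSE_TPS
headroom_gb = VRAM_TOTAL_GB - VRAM_USED_GB

print(f"speedup over dense 27B: {speedup:.2f}x")
print(f"VRAM headroom at 262K context: {headroom_gb:.1f} GB")
```

The headroom figure is worth noting: with under 2 GB free at full context, there is little room for a larger quantization or a second concurrent stream on this card.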

What's less compelling, or at least needs careful consideration, is the limitation around 'thinking mode'. Re-enabling 'thinking mode' drops the MTP draft acceptance rate to ~40%, which suggests the MTP draft heads were trained specifically on non-thinking outputs. This restricts the model's flexibility for certain types of generative tasks. Additionally, the Q4_K_XL quantization does not fit within 24GB of VRAM, leaving Q3_K_XL as the largest viable quantization. Testing so far is also limited to single-stream, single-user scenarios, so the benefits of PagedAttention concurrency are unexplored. The long-term stability under sustained agentic load, particularly regarding potential degradation patterns observed in other models (e.g., …

Sources · how we verified

Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image ↗

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
