Tools·Jun 3, 2026

Multi-Token Prediction delivers 3.34x faster LLM inference on Gemma 4

This review analyzes Multi-Token Prediction (MTP) performance on Gemma 4 and Qwen 3.6 models using vLLM and llama.cpp, based on recent community benchmarks.

TL;DR

Best for: Local LLM inference that needs substantial speedups on dense models, particularly Gemma 4 on vLLM.

Skip if: You require independently verified quality degradation metrics or precise VRAM impact, or if your specific models or inference engines lack MTP support.

Bottom line: Multi-Token Prediction offers significant inference speedups, but the best speculative token count has to be benchmarked per model and per engine.

METHODOLOGY

This v0 review draws on the founder FantasticNature7590's published claims on Reddit; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

Multi-Token Prediction (MTP) is a technique for accelerating LLM inference by predicting multiple tokens ahead of the main model. This review covers MTP's performance as benchmarked by FantasticNature7590, using the MTP implementations in vLLM and llama.cpp.

The tests were conducted on Gemma 4 31B and Qwen 3.6 27B. The models specifically tested were unsloth/Qwen3.6-27B-MTP-GGUF (Q8_0) via llama.cpp, and RedHatAI/gemma-4-31B-it-FP8-block alongside Qwen/Qwen3.6-27B-FP8 via vLLM. In other words, Qwen 3.6 was tested in both GGUF and FP8 formats, while Gemma 4 was tested in FP8 only.

The hardware was an AMD Ryzen 9 9950X CPU and an NVIDIA RTX PRO 6000 Blackwell GPU with 96GB VRAM, with 92GB RAM, CUDA 13.1, and Ubuntu 24.04. Each benchmark session involved 10 runs generating 1500 tokens per run, using a consistent prompt with prefix caching turned off. vLLM was run in sequential mode.

This review covers the founder's own claims regarding speedups and tokens-per-second metrics, along with observations on optimal speculative token counts. It does NOT cover independent performance verification, long-term workflow integration, or comprehensive evaluation of edge cases, quality degradation, or precise VRAM consumption, as these were explicitly noted as beyond the scope of the original benchmark.
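As a rough illustration of the run protocol described in this section (10 runs of 1500 generated tokens each), the sketch below shows one way per-run timings could be turned into a tokens-per-second summary. It is not FantasticNature7590's harness: the `Run` type, the `summarize` helper, and the timings in the demo are all hypothetical, and the source does not say whether the reported figures are means, medians, or best runs.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One benchmark run (hypothetical structure, not from the source)."""
    tokens_generated: int
    elapsed_seconds: float


def tokens_per_second(run: Run) -> float:
    """Decode throughput for a single run."""
    return run.tokens_generated / run.elapsed_seconds


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Aggregate per-run throughput into mean, spread, and range."""
    rates = [tokens_per_second(r) for r in runs]
    return {
        "runs": len(rates),
        "mean_tok_s": statistics.mean(rates),
        # Sample standard deviation needs at least two runs.
        "stdev_tok_s": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        "min_tok_s": min(rates),
        "max_tok_s": max(rates),
    }


if __name__ == "__main__":
    # Made-up timings for illustration only; these are NOT measured values.
    demo_runs = [Run(1500, 11.2 + 0.05 * i) for i in range(10)]
    for key, value in summarize(demo_runs).items():
        print(f"{key}: {value:.2f}")
```

Reporting the spread next to the mean makes it easier to tell a real difference between two settings from run-to-run noise.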

WHAT IT DOES

Accelerated LLM Inference with MTP

Multi-Token Prediction (MTP) is an inference technique designed to speed up token generation from large language models. A lightweight drafter proposes a short sequence of tokens, which the larger, more accurate target model then verifies in parallel. In MTP the drafter is typically a set of extra prediction layers trained alongside the model rather than a separate, smaller draft model. When the leading drafted tokens match what the target model would have produced, they are accepted together in a single step rather than one by one, which raises throughput. FantasticNature7590's benchmarks show MTP delivering up to 3.34x faster inference on Gemma 4 and 2.59x faster on Qwen 3.6.
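The draft-then-verify idea can be sketched with toy models. The code below is a minimal illustration of generic speculative decoding with greedy acceptance, not the vLLM or llama.cpp implementation. `target_next`, `draft_next`, and `speculative_generate` are hypothetical names, and the "models" are simple functions over integer tokens. A real engine verifies all drafted positions in one batched forward pass; here that pass is simulated with per-position calls.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_generate(next_fn: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(next_fn(ctx))
    return ctx[len(prompt):]


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    n_spec: int,
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of simulated target passes)."""
    out = list(prompt)
    produced = 0
    target_passes = 0
    while produced < max_new_tokens:
        # 1. The drafter proposes up to n_spec tokens.
        k = min(n_spec, max_new_tokens - produced)
        draft: List[Token] = []
        ctx = list(out)
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2. The target checks every drafted position (one pass in a real engine).
        target_passes += 1
        accepted = 0
        for i, token in enumerate(draft):
            if target_next(out + draft[:i]) != token:
                break
            accepted += 1
        out.extend(draft[:accepted])
        produced += accepted
        # 3. The target contributes one token of its own: a correction after a
        #    rejection, or a bonus token when every draft was accepted.
        if produced < max_new_tokens:
            out.append(target_next(out))
            produced += 1
    return out[len(prompt):], target_passes


def target_next(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the target except after tokens 2, 5, and 8."""
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = speculative_generate(target_next, draft_next, [0], 12, 3)
    print("tokens:", tokens)
    print("target passes:", passes)
    print("matches plain decoding:", tokens == greedy_generate(target_next, [0], 12))
```

Under greedy acceptance, the output is identical to what the target would generate alone, however poor the drafter is. Only the number of target passes changes, which is where the speedup comes from.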

Engine-Specific Performance for MTP

The benchmarks show different performance characteristics across inference engines. vLLM reached 132.52 tok/s with Gemma 4, while llama.cpp peaked at 117.70 tok/s with Qwen 3.6. A key caveat is that llama.cpp did not support Gemma 4 MTP at the time of testing, so these two figures come from different models and do not amount to an apples-to-apples comparison between engines. The vLLM implementation of MTP is also noted as more mature.

Optimal Speculative Token Counts are not always highest

The number of speculative tokens (n) significantly impacts MTP performance. FantasticNature7590 found that the highest n value does not always yield the best speed. For vLLM with Gemma 4, n=5 was optimal, reaching 132.52 tok/s. Conversely, for llama.cpp with Qwen 3.6, n=3 was the sweet spot at 117.70 tok/s, with performance oscillating at higher n values. This indicates that the optimal speculative token count is specific to the model and engine combination, requiring empirical tuning.
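One way to see why a larger n is not automatically faster is a standard back-of-the-envelope model from the speculative decoding literature. If each drafted token is accepted independently with probability alpha, the expected number of tokens produced per target pass is (1 - alpha^(n+1)) / (1 - alpha), while drafting adds a cost of roughly n times the drafter's relative cost per token. The sketch below encodes that model. The acceptance rate and draft cost are illustrative assumptions, not values measured in the benchmark, and the model ignores effects such as the extra cost of verifying longer batches, so it will not reproduce the oscillation reported for llama.cpp.

```python
from typing import Iterable


def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per target pass with n drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(n + 1)
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, n: int, draft_cost: float) -> float:
    """Modeled speedup over plain decoding.

    draft_cost is the time to draft one token, as a fraction of one target pass.
    """
    return expected_tokens_per_step(alpha, n) / (1.0 + n * draft_cost)


def best_n(alpha: float, draft_cost: float, candidates: Iterable[int]) -> int:
    """Pick the candidate n with the highest modeled speedup."""
    return max(candidates, key=lambda n: modeled_speedup(alpha, n, draft_cost))


if __name__ == "__main__":
    alpha, draft_cost = 0.7, 0.1  # hypothetical values, for illustration only
    for n in range(1, 9):
        print(f"n={n}: modeled speedup {modeled_speedup(alpha, n, draft_cost):.3f}")
    print("best n under this model:", best_n(alpha, draft_cost, range(1, 9)))
```

In this model the gain from each extra drafted token shrinks geometrically while the drafting cost grows linearly, so the curve peaks at a moderate n and then declines. In practice the peak still has to be found by measurement on the actual model and engine.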

MTP gains are largest on dense models

The benchmarks focused on dense models, specifically Gemma 4 31B and Qwen 3.6 27B, as these are considered ideal for measuring speculative decoding gains. The observed speedups were 3.34x for Gemma 4 and 2.59x for Qwen 3.6 on vLLM. No sparse (mixture-of-experts) models were included for comparison, so this is not a universal rule. It does suggest that dense architectures are well suited to MTP, possibly because every token uses the same weights, which keeps the cost of verifying a batch of drafted tokens predictable.
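For a sense of scale, the reported speedup can be converted back into an implied baseline. The sketch below assumes that the 3.34x figure for Gemma 4 is paired with the 132.52 tok/s peak on vLLM, which the review does not state explicitly. The derived baseline and per-run times are therefore estimates, not reported measurements. The helper names are hypothetical.

```python
def implied_baseline_tok_s(mtp_tok_s: float, speedup: float) -> float:
    """Baseline throughput implied by an MTP throughput and a speedup factor."""
    return mtp_tok_s / speedup


def seconds_for_tokens(tok_s: float, tokens: int = 1500) -> float:
    """Time to generate `tokens` tokens at a steady throughput."""
    return tokens / tok_s


if __name__ == "__main__":
    # Reported figures for Gemma 4 on vLLM; pairing them is an assumption.
    mtp_tok_s, speedup = 132.52, 3.34
    baseline = implied_baseline_tok_s(mtp_tok_s, speedup)
    print(f"implied baseline: {baseline:.2f} tok/s")
    print(f"1500-token run without MTP: {seconds_for_tokens(baseline):.1f} s")
    print(f"1500-token run with MTP:    {seconds_for_tokens(mtp_tok_s):.1f} s")
```

The same arithmetic is not applied to Qwen 3.6 here, because the 117.70 tok/s peak is a llama.cpp figure while the 2.59x speedup is attributed to vLLM.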

WHAT'S INTERESTING / WHAT'S NOT

What's interesting about these findings is the magnitude of the speedups. A 3.34x increase in inference speed on a model as large as Gemma 4 31B is a significant practical improvement for local LLM deployment. The architectural argument is that MTP's design makes quality degradation unlikely, because the target model still verifies every token, and this gives adoption a strong theoretical basis. It suggests that users can expect substantial speed gains without compromising output fidelity, although the benchmark did not measure output quality directly. Furthermore, the empirical finding that the optimal speculative token count (n) is not simply

Sources · how we verified
  1. I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.