Tools·Jun 2, 2026

Llama.cpp for MTP, KV cache, and long context: User reports on performance

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 2, 2026·3 min read·1 source

This review evaluates llama.cpp and its forks for multi-token prediction (MTP) support, KV cache quantization, and long-context handling, based on one user's reported experience and performance numbers.

TL;DR

Best for: Users with a single NVIDIA 3090 GPU who want faster initial token generation for local LLMs via MTP and can tolerate a slowdown as context grows.

Skip if: You need consistent token generation rates across very long contexts, or you prioritize mainline llama.cpp stability and broader hardware compatibility over specialized MTP forks.

Bottom line: llama.cpp-mtp reportedly starts fast on a single 3090 (60 tks), but the reported speed falls to 20 tks as context fills, so mainline llama.cpp may be the more predictable choice for sustained inference.

METHODOLOGY

This v0 review draws on one user's published claims and observations on Reddit about llama.cpp and related projects. Independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior or when new official releases address these areas.

The primary signal is a Reddit post by user GodComplecs, dated 2026-05-28, discussing llama.cpp for MTP, KV cache quantization, and long context. GodComplecs references two GitHub repositories: noonghunna/club-3090 and Indras-Mirror/llama.cpp-mtp. The review covers the user's reported token generation speeds (60 tks and 20 tks, where tks means tokens per second), context lengths (4k and 20-40k), and KV cache quantization (Q4), all with a Qwen 3.6 27b Q4 model.

Not covered: independent performance verification, long-term workflow integration, and edge cases beyond the specific scenario the user describes.

WHAT IT DOES

Multi-Token Prediction (MTP) Support

GodComplecs discusses MTP support within the llama.cpp ecosystem on a single NVIDIA 3090 GPU. MTP here means multi-token prediction: the model proposes several upcoming tokens per step and the proposals are then verified, much like speculative decoding with a built-in draft. The user reports running Indras-Mirror/llama.cpp-mtp, a fork that, going by its name, adds MTP support to llama.cpp. GodComplecs notes an initial speed of 60 tks with this setup.
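To give a feel for why drafting several tokens per step can raise throughput, here is a minimal sketch of the standard expected-yield estimate used for speculative decoding. It is not taken from the fork's code: the function name, the independence assumption, and the example numbers are all illustrative assumptions, and it does not model why the reported speed falls as context fills.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption (not from the source): each drafted token is accepted
    independently with probability accept_prob, and every pass yields at
    least one token from the main model.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        # Every drafted token is kept, plus the main model's own token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only for illustration.
    for a in (0.5, 0.8):
        print(a, round(expected_tokens_per_step(a, draft_len=3), 3))
```

Under these assumptions the gain depends entirely on how often drafted tokens are accepted, which is workload-dependent and not reported in the source.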

KV Cache Quantization

KV cache quantization reduces the memory footprint of the key-value cache during LLM inference, which lets a longer context fit in limited VRAM. GodComplecs mentions using a q4 cache with mainline llama.cpp, meaning the cache is stored at roughly 4 bits per element rather than 16, trading some precision for memory. With this setup, the user reports a practical ceiling of about 4k tokens of context at good speed.
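The memory saving can be estimated with simple arithmetic. The sketch below assumes the user's "q4 cache" corresponds to llama.cpp's q4_0 type (selectable in mainline via --cache-type-k and --cache-type-v), which stores 32 elements in an 18-byte block. The model shape is a hypothetical placeholder, not the actual Qwen 3.6 27b configuration, and the estimate assumes every layer keeps a full attention KV cache.

```python
# Bytes needed to store 32 cache elements in each format.
# f16 is unblocked (2 bytes per element); q8_0 and q4_0 are ggml block formats.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> int:
    """Approximate KV cache size in bytes for one sequence."""
    # Keys and values are both cached, hence the factor of 2.
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    total_elems = elems_per_token * n_ctx
    return total_elems * BYTES_PER_BLOCK[cache_type] // BLOCK_ELEMS


if __name__ == "__main__":
    # Hypothetical shape for illustration only.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for ctx in (4096, 32768):
        for ctype in ("f16", "q4_0"):
            gib = kv_cache_bytes(n_ctx=ctx, cache_type=ctype, **shape) / 2**30
            print(f"ctx={ctx:>6} {ctype:>5}: {gib:.2f} GiB")
```

With these placeholder dimensions, q4_0 takes 18/64 of the f16 footprint, roughly a 3.6x reduction. Real savings depend on the model's actual architecture.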

Long Context Performance

Long context performance refers to how well an inference setup handles extended input histories. GodComplecs gives direct numbers here. With llama.cpp-mtp, generation starts at 60 tks but drops to 20 tks as the context fills, a threefold slowdown. With mainline llama.cpp and a Q4 cache, the user reports a maximum context of about 4k at good speed but gives no tks figure, so the two setups cannot be compared directly from the post alone.
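To put the reported drop in concrete terms, the small sketch below converts tokens per second into per-token latency and total generation time. The 1,000-token response length is an arbitrary example, and the calculation assumes the speed stays constant for the whole response, which the source does not claim.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a constant speed (a simplification)."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    # 60 and 20 tks are the user's reported speeds; 1000 tokens is illustrative.
    for tps in (60.0, 20.0):
        print(f"{tps:.0f} tks: {ms_per_token(tps):.1f} ms/token, "
              f"{seconds_to_generate(1000, tps):.1f} s per 1000 tokens")
```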

`club-3090` Project

GodComplecs also mentions noonghunna/club-3090, which the user first ran in its vLLM version for contexts of 20-40k. The user notes that the project has since introduced a new patched llama.cpp version. However, GodComplecs expresses concern that the project's readme is becoming

Sources · how we verified

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
