Anbeeld's KV Cache Quantization Benchmarks: TurboQuant's Nuanced Value
This review analyzes Anbeeld's benchmarks of KV cache quantization techniques, including TurboQuant, q5, and q8, on an RTX 3090 with Qwen 3.6 27B at long contexts.
TL;DR
Best for: Developers optimizing LLM inference on 24 GB VRAM GPUs, especially at long contexts, who want aggressive KV cache compression and can use TCQ-enabled TurboQuant variants.
Skip if: VRAM is not your bottleneck, or you judge quality by perplexity (PPL) alone without examining the tail of the distribution.
Bottom line: Aggressive KV cache quantization is viable on VRAM-constrained setups, but the scheme should be chosen on KL divergence rather than PPL. Asymmetric quants and TCQ offer significant advantages.
METHODOLOGY
This v0 review draws on Anbeeld's published claims and data at the linked Reddit post and article. Independent benchmarks are pending. This review will be re-tested when claims diverge from observed behavior or new versions are released.
The study, conducted by Anbeeld, used BeeLlama v0.1.2 to benchmark various KV cache quantization techniques. Tests were run on a single RTX 3090, a common setup for 24 GB VRAM users, so the results reflect practical constraints. The model tested was Qwen 3.6 27B (in its Q5_K_S and IQ4_XS quants) at 64k and 128k context lengths. Key metrics were Perplexity (PPL) and 99.9% KL divergence (KLD), which Anbeeld argues gives a more accurate picture of quality degradation.
The review covers Anbeeld's detailed findings on TurboQuant, TCQ, q5, q8, and various symmetric and asymmetric quantization schemes. Not covered in this v0 review: independent performance verification, long-term workflow integration, and edge case analysis beyond the scope of Anbeeld's published data.
WHAT IT DOES
Benchmarking KV Cache Quantization
Anbeeld's work benchmarks different methods for quantizing the Key-Value (KV) cache in large language models. KV cache quantization reduces the memory footprint of storing attention keys and values, enabling larger context windows or larger models on VRAM-limited hardware like the RTX 3090. The study specifically compares techniques such as bf16 (baseline), q4_0, q5_0, q8_0, turbo3, turbo4, turbo2_tcq, turbo3_tcq, and combinations like q5_0/q4_0 and q8_0/q5_0. The goal is to identify which quantization schemes offer the best trade-off between VRAM savings and model quality, measured by PPL and KLD.
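As a rough illustration of why cache precision matters at long context, the sketch below estimates KV cache size from bits per element. The function name and the layer and head dimensions are hypothetical placeholders, not Qwen 3.6 27B's actual configuration, and the sketch assumes a plain attention cache in which every layer stores one key and one value vector per token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_k, bits_v):
    """Estimate KV cache size in bytes for a standard attention cache.

    bits_k / bits_v are the average bits stored per key / value element,
    including any per-block scale overhead of the chosen format.
    """
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    total_bits = elements_per_tensor * (bits_k + bits_v)
    return total_bits / 8


# Hypothetical architecture, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL = dict(n_layers=16, n_kv_heads=4, head_dim=256)
GIB = 2**30

bf16_64k = kv_cache_bytes(**HYPOTHETICAL, n_ctx=65536, bits_k=16, bits_v=16)
bf16_128k = kv_cache_bytes(**HYPOTHETICAL, n_ctx=131072, bits_k=16, bits_v=16)
int8_128k = kv_cache_bytes(**HYPOTHETICAL, n_ctx=131072, bits_k=8, bits_v=8)

print(f"bf16 @ 64k:   {bf16_64k / GIB:.1f} GiB")
print(f"bf16 @ 128k:  {bf16_128k / GIB:.1f} GiB")
print(f"8-bit @ 128k: {int8_128k / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, so halving the bits per element roughly doubles the context that fits in the same cache budget.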
PPL vs. KLD for Quality Assessment
Anbeeld highlights a critical distinction between Perplexity (PPL) and KL divergence (KLD) for evaluating quantization quality. PPL, while a common metric, can mask significant degradation in the tail of the probability distribution. In Anbeeld's results, q4_0 stays within 0.01 PPL of bf16, and turbo3_tcq adds only ~0.02 PPL. The 99.9% KL divergence tells a different story: q4_0's tail KLD is 32% worse than q5_0's, which can severely impact structured outputs like tool calls and JSON generation. The takeaway is that an average over token predictions is not sensitive enough on its own, and a tail metric is needed alongside it.
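The effect can be reproduced on toy data. The sketch below is not Anbeeld's measurement code: it uses synthetic three-token distributions, hypothetical helper names, and reads "99.9% KLD" as the 99.9th percentile of per-token KL divergence, which is an assumption about the source's metric. It shows how a handful of badly degraded positions barely move PPL while dominating the tail.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(ref_token_probs):
    """exp of the mean negative log-probability of the reference tokens."""
    mean_nll = -sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)
    return math.exp(mean_nll)


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 10,000 positions, the reference token is always index 0.
baseline = [0.90, 0.07, 0.03]       # full-precision cache (toy values)
slightly_off = [0.899, 0.071, 0.03]  # typical position under quantization
badly_off = [0.40, 0.55, 0.05]       # rare position where the top token flips

n_positions, n_bad = 10_000, 20
quantized = [badly_off] * n_bad + [slightly_off] * (n_positions - n_bad)

ppl_base = perplexity([baseline[0]] * n_positions)
ppl_quant = perplexity([dist[0] for dist in quantized])

klds = [kl_divergence(baseline, dist) for dist in quantized]
mean_kld = sum(klds) / len(klds)
tail_kld = percentile(klds, 99.9)

print(f"PPL delta:      {ppl_quant - ppl_base:.4f}")
print(f"mean KLD:       {mean_kld:.5f}")
print(f"99.9th pct KLD: {tail_kld:.5f}")
```

In this toy setup only 0.2% of positions are badly degraded, yet those are exactly the positions a tail percentile reports, while the PPL delta stays small because it averages over all positions.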
TurboQuant and TCQ Effectiveness
The benchmarks also evaluate TurboQuant, a KV cache quantization technique. Anbeeld notes that at 4 bits, turbo4 offers no quality advantage over q4_0 and runs 17% slower, suggesting its primary value lies in more aggressive 2-3 bit compression where alternatives are scarce. A key finding is the significant improvement offered by TCQ (Trellis-Coded Quantization). turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq outperforms turbo2. TCQ is presented as a legitimate solution for cases requiring aggressive compression without unacceptable quality loss.
WHAT'S INTERESTING / WHAT'S NOT
What's interesting in Anbeeld's benchmarks is the explicit focus on practical VRAM constraints (24GB GPUs) and the detailed exploration of metrics beyond simple perplexity. The finding that PPL hides the tail while KLD exposes it is a crucial insight for anyone relying on LLMs for structured output. This directly addresses the common frustration of models generating plausible but ultimately incorrect JSON or tool calls, even with seemingly low PPL. The data-backed argument for KLD as a more sensitive metric for quality degradation is a significant contribution.
The strong performance of TCQ-enabled TurboQuant at lower bitrates (2-3 bits) is also noteworthy. This provides a clear path for users needing extreme VRAM savings. Similarly, the demonstration that asymmetric KV beats symmetric at the same memory footprint (q5_0/q4_0 outperforming q4_1/q4_1) offers a concrete optimization strategy. Anbeeld's observation that higher model precision means more cache damage (e.g., Q5_K_S taking more damage than IQ4_XS at the same cache quant) highlights the interconnectedness of model and KV cache quantization, suggesting a balanced approach is best.
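The "same memory footprint" comparison can be checked with simple arithmetic on the GGML block layouts, where each block holds 32 elements plus per-block metadata. The sketch below is illustrative: the helper name is hypothetical, it assumes the key and value tensors hold the same number of elements, and it reads a pair like q5_0/q4_0 as key format first, value format second (the order does not affect the total). The TurboQuant formats are left out because their block layouts are not given in the source.

```python
# Bytes per 32-element block in GGML's legacy quant formats.
BLOCK_BYTES = {
    "q4_0": 18,  # 16 bytes of 4-bit values + fp16 scale
    "q4_1": 20,  # 16 bytes of 4-bit values + fp16 scale + fp16 minimum
    "q5_0": 22,  # 16 bytes low bits + 4 bytes high bits + fp16 scale
    "q8_0": 34,  # 32 int8 values + fp16 scale
}
BLOCK_ELEMENTS = 32
BF16_BITS = 16


def bits_per_element(fmt):
    """Average storage bits per cached element, including block metadata."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_ELEMENTS


def fraction_of_bf16(k_fmt, v_fmt):
    """KV cache size relative to a bf16/bf16 cache, assuming equal K and V sizes."""
    return (bits_per_element(k_fmt) + bits_per_element(v_fmt)) / (2 * BF16_BITS)


for pair in [("q5_0", "q4_0"), ("q4_1", "q4_1"), ("q8_0", "q5_0"), ("q8_0", "q8_0")]:
    print(f"{pair[0]}/{pair[1]}: {fraction_of_bf16(*pair):.1%} of bf16")
```

Under these assumptions q5_0/q4_0 and q4_1/q4_1 both average 5.0 bits per element, which is why they can be compared at equal footprint, and the same arithmetic reproduces the 43.8% and 53.1% figures quoted in this review for q8_0/q5_0 and q8_0/q8_0.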
What's not as interesting, or perhaps less surprising, is the conclusion that q8 is mostly a luxury tier. While q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2%, full q8_0/q8_0 at 53.1% is mostly for scenarios where VRAM is not a bottleneck. This confirms what many VRAM-constrained users already suspect: pushing for the absolute highest quality KV cache often comes at a disproportionate VRAM cost that could be better spent elsewhere. The brief mention of the vLLM study as a
- Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM
- KV Cache Quantization Benchmarks for Long Context
Every claim ties to a primary source. See our methodology.