Tools·May 28, 2026

Shard achieves 10x KV cache compression for Llama-3.1-8B

This review analyzes Shard's technical design, which uses PCA and int4 quantization for K, and Hadamard rotation with vector quantization for V, assessing its claimed performance benefits.

TL;DR

Best for: Developers running Llama-3.1-8B locally who are constrained by GPU memory and require significant KV cache compression without measurable performance degradation on standard benchmarks.

Skip if: Your workflow demands verified, consistent performance across a wide range of models and benchmarks, or if you cannot tolerate any potential, unmeasured impact on less common evaluation metrics.

Bottom line: Shard offers a technically innovative approach to KV cache compression, claiming substantial memory savings with no measurable performance hit on common benchmarks, making it a compelling option for memory-bound local LLM inference.

METHODOLOGY

This v0 review draws on the founder's published claims at krishgarg.com/shard, linked from the Reddit post by /u/Thrumpwart. It covers the technical design principles, the compression techniques (PCA + int4 on K, Hadamard + vector quantization on V), the claimed compression ratios, and the reported performance on NIAH and LongBench. Shard is presented as a drop-in HuggingFace Cache. The version reviewed is identified only by the publication date of the Reddit post and project page, May 26, 2026. This v0 review does not cover independent performance benchmarks, long-term workflow integration, or edge-case behavior with other models, context lengths, or hardware configurations. Update cadence: this review will be revisited when independent benchmarks become available or when observed behavior diverges from the founder's claims.

WHAT IT DOES

Drop-in HuggingFace Cache

Shard integrates as a direct replacement for the standard HuggingFace KV cache. This design choice aims to minimize integration effort for developers already using HuggingFace's ecosystem, allowing for immediate application of its compression benefits without extensive code changes. The project's GitHub repository, krish1905/shard, provides the implementation details.

Differential compression for K and V

Shard employs distinct compression strategies for the Key (K) and Value (V) matrices within the KV cache. For the K matrix, it applies Principal Component Analysis (PCA) followed by int4 quantization. The founder notes that the K matrix is effectively low-rank after undoing RoPE (Rotary Position Embeddings), making PCA an effective strategy. For the V matrix, Shard uses a Hadamard rotation combined with vector quantization. This differential treatment acknowledges the distinct mathematical properties and roles of K and V in the attention mechanism.
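
As a rough illustration of the two pipelines described above, the following NumPy sketch applies PCA plus symmetric int4 quantization to a synthetic low-rank stand-in for K, and an orthonormal Hadamard rotation to a synthetic V with one outlier channel. Everything in it is an assumption made for illustration: the data is random, the rank of 16, the per-row scaling scheme, and the helper names quantize_int4 and hadamard are hypothetical, RoPE handling is ignored, and the vector quantization step for V is omitted. It is not Shard's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, head_dim, rank = 512, 128, 16  # illustrative sizes, not Shard's settings

# --- K side: PCA projection followed by int4 quantization --------------------
# Synthetic stand-in for keys with RoPE undone: rank-16 signal plus small noise.
K = rng.standard_normal((tokens, rank)) @ rng.standard_normal((rank, head_dim))
K += 0.01 * rng.standard_normal((tokens, head_dim))

mean = K.mean(axis=0)
# Rows of Vt are the principal directions; keep the top `rank` of them.
_, _, Vt = np.linalg.svd(K - mean, full_matrices=False)
basis = Vt[:rank]                # (rank, head_dim), orthonormal rows
coeffs = (K - mean) @ basis.T    # (tokens, rank) low-dimensional coordinates


def quantize_int4(x):
    """Symmetric per-row quantization to integers in [-7, 7] with one scale per row.

    Values are held in int8 here; packing two 4-bit values per byte is omitted.
    """
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale


q_k, scale_k = quantize_int4(coeffs)
K_hat = (q_k * scale_k) @ basis + mean  # reconstruction, only to measure error
k_rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"K relative reconstruction error: {k_rel_err:.3f}")


# --- V side: Hadamard rotation (vector quantization step omitted) ------------
def hadamard(n):
    """Orthonormal Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


V = rng.standard_normal((tokens, head_dim))
V[:, 3] *= 50.0                  # inject one outlier channel
H = hadamard(head_dim)
V_rot = V @ H                    # spreads the outlier's energy across channels

print(f"max |V| before rotation: {np.abs(V).max():.1f}")
print(f"max |V| after rotation:  {np.abs(V_rot).max():.1f}")
# The rotation is orthonormal, so it can be undone exactly with H.T.
v_roundtrip_ok = np.allclose(V_rot @ H.T, V)
```

The rotation loses no information on its own. In this toy setup it only flattens the value distribution so that a downstream quantizer does not have to spend its range on a single outlier channel.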

Direct attention on compressed K

A key technical detail is that attention calculations run directly on the compressed K matrix. This avoids the need for fp16 reconstruction of the K matrix before attention, which would introduce latency and potentially negate some of the memory benefits. This direct computation is critical for maintaining inference speed while operating on a significantly smaller cache.
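
One way to see why reconstruction can be skipped is the algebra of a PCA-style representation: if each key is stored as low-dimensional coefficients against a shared basis, the query can be projected into that basis once and scored against the coefficients directly. The sketch below checks that identity on random data. The variable names, the rank of 16, and the omission of RoPE and int4 scales are all assumptions for illustration; Shard's actual kernel is not described in enough detail in the source to reproduce here.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens, head_dim, rank = 256, 128, 16  # illustrative sizes only

# Hypothetical compressed state for one attention head.
basis, _ = np.linalg.qr(rng.standard_normal((head_dim, rank)))
basis = basis.T                                # (rank, head_dim)
coeffs = rng.standard_normal((tokens, rank))   # per-token coordinates
mean = rng.standard_normal(head_dim)           # shared offset
query = rng.standard_normal(head_dim)

# Reference path: rebuild full-width keys, then take dot products.
K_full = coeffs @ basis + mean                 # (tokens, head_dim)
scores_ref = K_full @ query

# Compressed path: project the query once, then score in `rank` dimensions.
query_low = basis @ query                      # (rank,)
scores_fast = coeffs @ query_low + mean @ query

# Both paths agree because matrix multiplication is associative:
# (coeffs @ basis) @ query == coeffs @ (basis @ query).
scores_match = np.allclose(scores_ref, scores_fast)
print("scores match:", scores_match)
```

In this sketch the compressed path never materializes the (tokens, head_dim) key matrix, which is the property the design relies on.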

Significant memory reduction

The primary function of Shard is to reduce the memory footprint of the KV cache. The founder claims a 10x reduction for Llama-3.1-8B at an 8K context length, increasing to an 11x reduction at 32K context. These figures suggest substantial memory savings, which are particularly beneficial for running larger models on resource-constrained hardware, such as local GPUs.
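
To put the claimed ratios in absolute terms, the short calculation below estimates the uncompressed fp16 KV cache size from Llama-3.1-8B's published architecture (32 layers, 8 key/value heads, head dimension 128) and divides it by the founder's ratios. It assumes batch size 1, reads "8K" and "32K" as 8,192 and 32,768 tokens, and ignores any fixed overhead such as PCA bases or codebooks; the function name is hypothetical and the ratios remain unverified claims.

```python
# Published Llama-3.1-8B architecture values (grouped-query attention).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_FP16 = 2


def fp16_kv_cache_bytes(context_len: int, batch_size: int = 1) -> int:
    """Uncompressed cache size: K and V for every layer, KV head, and token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_FP16
    return per_token * context_len * batch_size


# (context length, founder's claimed compression ratio)
for context_len, claimed_ratio in [(8 * 1024, 10), (32 * 1024, 11)]:
    full = fp16_kv_cache_bytes(context_len)
    compressed = full / claimed_ratio
    print(f"{context_len:>6} tokens: fp16 {full / 2**20:8.1f} MiB -> "
          f"~{compressed / 2**20:6.1f} MiB at a claimed {claimed_ratio}x")
```

Under these assumptions the uncompressed cache is 1 GiB at 8K tokens and 4 GiB at 32K, so the claimed ratios would bring it to roughly 102 MiB and 372 MiB respectively.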

WHAT'S INTERESTING / WHAT'S NOT

The most interesting part of Shard is its differential compression of K and V. The founder's observation that K is effectively low-rank once RoPE is undone, while V calls for a different treatment, motivates tailored techniques rather than a single uniform quantizer. Applying PCA and int4 to K, and Hadamard rotation with vector quantization to V, matches each method to the structure of the matrix it compresses. If the claims hold, this should preserve more quality at a given compression ratio than quantizing the entire cache uniformly, which often leads to greater quality degradation.

The claim of

Sources · how we verified
  1. Shard - getting to 10× KV cache compression


Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.