Tools·May 23, 2026

ik_llama.cpp boosts local LLM inference by 22% on 12GB VRAM

This review analyzes a detailed performance comparison between ik_llama.cpp and llama.cpp, highlighting a significant speed increase on consumer-grade hardware for local large language model inference.

TL;DR

Best for: Indie founders and developers running large language models (LLMs) locally on consumer GPUs with 12GB VRAM, particularly those seeking optimized multi-token prediction (MTP) performance.

Skip if: You require the absolute highest token acceptance rates over raw token generation speed, or if your workflow does not involve GGUF models and llama.cpp-like inference engines.

Bottom line: ik_llama.cpp delivers a substantial 22% speed improvement over llama.cpp for Qwen3.6-35B inference on an RTX 4070 Super 12GB, making it a strong contender for local LLM acceleration.

METHODOLOGY

This v0 review draws on the claims published by Reddit user janvitos, who benchmarked ik_llama.cpp (developed by ikawrakow) against llama.cpp. The signal was observed on 2026-05-21. The review covers the performance comparison, hardware configuration, command-line parameters, and benchmark results as detailed in the source. The analysis focuses on the reported token generation rates (tok/s) and multi-token prediction (MTP) acceptance rates across the benchmarked tasks. It rests on a single, self-reported benchmark. Not covered: independent performance verification, long-term stability in diverse workflows, and a deep dive into ik_llama.cpp's underlying architectural changes. Update cadence: This tool will be re-tested when claims diverge from observed behavior in independent benchmarks.

WHAT IT DOES

Optimized Multi-Token Prediction

ik_llama.cpp is a fork of the popular llama.cpp project, specifically engineered to enhance multi-token prediction (MTP) performance. The Reddit post highlights that llama.cpp's MTP implementation, after a recent merge, saw a performance decrease. ik_llama.cpp aims to restore and surpass previous MTP speeds, particularly when offloading parts of the model to the CPU. The benchmark uses a Qwen3.6-35B-A3B-IQ4_XS-4.19bpw GGUF quantization, a model noted for its balance of accuracy and compact size (4GB smaller than Unsloth's Q4_K_XL).

Detailed Hardware and Software Stack

The benchmark was conducted on a specific hardware configuration: an RTX 4070 Super 12GB GPU, an AMD Ryzen 7 9700X CPU, and 48GB DDR5-6000 EXPO I RAM. The operating system used was CachyOS. This detailed specification provides a precise context for the reported performance gains, allowing other users with similar setups to potentially replicate the results. The llama-server command-line parameters for both llama.cpp and ik_llama.cpp are provided, including --fit, --ctx-size 131072, and specific cache types (q8_0) for both key/value and draft caches. These parameters are crucial for maximizing performance and managing VRAM on 12GB GPUs.
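To illustrate why a q8_0 cache type matters at a 131072-token context on a 12GB card, here is a minimal sketch of a KV-cache size estimate. The per-element sizes follow the GGML block layouts (f16 uses 2 bytes per element; q8_0 stores 32 int8 values plus a 2-byte scale per block). The layer count, KV head count, and head dimension are hypothetical placeholders, not the actual Qwen3.6-35B-A3B configuration, and the sketch assumes every counted layer keeps a standard key/value cache and ignores the draft cache.

```python
# Rough KV-cache size estimate: f16 vs q8_0 at the benchmark's context length.
# The architecture numbers below are HYPOTHETICAL placeholders, not the real
# Qwen3.6-35B-A3B configuration; substitute values from the model's metadata.

CTX_SIZE = 131072  # --ctx-size value given in the source

# f16 stores 2 bytes per element; q8_0 stores blocks of 32 int8 values plus
# one 2-byte scale, i.e. 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, cache_type):
    """Bytes needed for keys plus values across all counted layers."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return elements_per_token * ctx_size * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Hypothetical attention shape (assumption for illustration only).
ARCH = dict(n_layers=10, n_kv_heads=2, head_dim=256)

f16_gib = gib(kv_cache_bytes(ctx_size=CTX_SIZE, cache_type="f16", **ARCH))
q8_gib = gib(kv_cache_bytes(ctx_size=CTX_SIZE, cache_type="q8_0", **ARCH))

print(f"f16 : {f16_gib:.2f} GiB")
print(f"q8_0: {q8_gib:.2f} GiB ({q8_gib / f16_gib:.1%} of f16)")
```

Whatever the real architecture values are, the ratio between the two cache types is fixed by the formats: a q8_0 cache takes a little over half the memory of an f16 cache, which leaves more of the 12GB for model weights.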

Benchmark Results

Reddit user janvitos ran the benchmarks with mtp-bench.py across nine tasks, including code_python, code_cpp, summarize, and qa_factual. The llama.cpp setup achieved an average of 89.76 tok/s with an aggregate acceptance rate of 0.9393. ik_llama.cpp delivered an average of 110.24 tok/s with an aggregate acceptance rate of 0.8749. From those two averages the throughput increase works out to about 22.8%, which this review cites as 22%. The launch commands for the two implementations differ in one notable parameter: ik_llama.cpp uses --fit-margin where llama.cpp uses --fit-target.
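As a quick check on the headline figure, the sketch below recomputes the relative throughput change and the acceptance-rate difference from the four averages reported in the source. The dictionary and function names are illustrative only.

```python
# Averages as reported in the source benchmark (nine tasks).
LLAMA_CPP = {"tok_s": 89.76, "accept": 0.9393}
IK_LLAMA_CPP = {"tok_s": 110.24, "accept": 0.8749}


def relative_change(baseline, new):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Throughput: relative change against the llama.cpp baseline.
speedup = relative_change(LLAMA_CPP["tok_s"], IK_LLAMA_CPP["tok_s"])

# Acceptance rate: absolute difference, expressed in percentage points.
accept_delta = IK_LLAMA_CPP["accept"] - LLAMA_CPP["accept"]

print(f"throughput change: {speedup:+.1%}")
print(f"acceptance change: {accept_delta * 100:+.2f} percentage points")
```

This prints a throughput change of +22.8% and an acceptance change of -6.44 percentage points, so the 22% in the headline is a slightly conservative rounding of the reported numbers.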

WHAT'S INTERESTING / WHAT'S NOT

What's interesting is the concrete, reproducible performance uplift. A 22% speed increase on a 12GB RTX 4070 Super is significant for local LLM inference, especially for indie developers who rely on consumer hardware. The detailed command-line parameters, including --fit-margin for ik_llama.cpp, offer a clear path for others to replicate these results. The use of CachyOS, a performance-oriented Linux distribution, also highlights the attention to detail in squeezing maximum performance from the system. The benchmark explicitly addresses a performance regression in llama.cpp's MTP merge, positioning ik_llama.cpp as a direct solution to that problem. This makes ik_llama.cpp particularly relevant for users who experienced a slowdown after llama.cpp's MTP update.

What's less clear-cut, and requires further investigation, is the drop in the aggregate acceptance rate: 0.8749 for ik_llama.cpp versus 0.9393 for llama.cpp, a difference of about 6.4 percentage points. Raw token generation speed is higher, but a lower acceptance rate means a larger share of drafted tokens is rejected. In standard speculative decoding, rejected drafts are replaced by the main model's own prediction, so this is a question of efficiency rather than output quality; how much it matters depends on the application. The source also mentions ik_llama.cpp being

Sources · how we verified
  1. 110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.