Optimizing Local LLM Inference on Consumer GPUs for 30+ TPS
A founder posting on Reddit achieved 30+ tokens per second (TPS) with a 262K context window on a Qwen3.6-35B-A3B model using an 8GB RTX 3070 Ti. This required specific quantization, OS choices, and llama-server configurations.
Alternative-Cat-1347 reported achieving 30+ tokens per second (TPS) inference with a 262,144-token context window for the Qwen3.6-35B-A3B model on an 8GB NVIDIA RTX 3070 Ti. This performance exceeded that of other users with superior hardware, who often reduced context to 64K or less for comparable TPS. The approach involved precise VRAM management, specific quantization methods, and a lean operating system environment.
WHAT THEY DID
Targeting MoE Model VRAM Requirements
The Qwen3.6-35B-A3B model is a Mixture-of-Experts (MoE) architecture. This design is critical because it does not require the entire model to reside in VRAM simultaneously. Alternative-Cat-1347 determined that only approximately 3.5 billion parameters needed to be in VRAM during runtime. This insight allowed for a tighter VRAM allocation strategy on the 8GB 3070 Ti.
The 8GB VRAM was partitioned as follows: active model layers consumed about 3GB, GPU buffers used approximately 2GB, and the 262,144-token KV cache, quantized at q8_0, occupied 2.56GB. This totaled 7.56GB, leaving roughly 0.44GB of headroom. Attempts to force all layers into VRAM or to adjust engine parameters such as -sm (split mode) or -fa (flash attention) resulted in slowdowns or VRAM exhaustion, indicating that the default MoE handling worked best for this setup.
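The partition above is simple arithmetic, and a small sketch makes it easy to re-check when one component changes. The figures are the ones reported in the write-up; the function and variable names are ours, chosen for illustration only.

```python
# Reported VRAM partition on the 8GB RTX 3070 Ti (values in GB, from the write-up).
VRAM_TOTAL_GB = 8.0
reported_budget_gb = {
    "active_model_layers": 3.0,
    "gpu_buffers": 2.0,
    "kv_cache_q8_0_262144_tokens": 2.56,
}


def vram_headroom(budget_gb, total_gb):
    """Return (used, free) in GB for a dict of component sizes."""
    used = sum(budget_gb.values())
    return used, total_gb - used


used_gb, free_gb = vram_headroom(reported_budget_gb, VRAM_TOTAL_GB)
print(f"used: {used_gb:.2f} GB, free: {free_gb:.2f} GB")
```

Swapping in measured values for a different GPU or model shows quickly whether a planned configuration leaves any headroom at all.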
Selecting Advanced Quantization Methods
Quantization was a key factor in fitting the model and its context into the limited VRAM while maintaining performance. Alternative-Cat-1347 employed APEX-I-Quality or Q4_K_XL quantizations, noting these performed better than Q4_K_M. For context windows exceeding 512K, IQ4_NL_XL quantization was used. The KV cache was specifically set to q8_0 quantization, a balance between memory footprint and performance for context management.
This selection allowed for pushing the context to 320K, 400K, 512K, and even 1M tokens, though performance noticeably declined beyond 150K. The ability to reach larger contexts, even with a TPS reduction, provided flexibility for specific use cases requiring extensive context processing.
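For a rough sense of what those larger contexts imply for the KV cache, the reported figure (2.56GB at 262,144 tokens with q8_0) can be scaled linearly. This is a back-of-the-envelope sketch, not a measurement: it assumes the cache grows in proportion to context length at a fixed cache type, and it reads "320K", "400K", "512K", and "1M" as multiples of 1,024 tokens. The source does not report cache sizes at these lengths, and the estimate says nothing about where the cache lived in the reported runs.

```python
# Reported data point from the write-up: q8_0 KV cache at a 262,144-token context.
REPORTED_TOKENS = 262_144
REPORTED_KV_GB = 2.56


def estimate_kv_gb(context_tokens):
    """Linear estimate of KV cache size in GB (assumption: size scales with context)."""
    return REPORTED_KV_GB * context_tokens / REPORTED_TOKENS


# Context labels from the article, interpreted here as multiples of 1,024 tokens.
for label, k_tokens in [("256K", 256), ("320K", 320), ("400K", 400), ("512K", 512), ("1M", 1024)]:
    tokens = k_tokens * 1024
    print(f"{label:>5} ({tokens:>9,} tokens): ~{estimate_kv_gb(tokens):.2f} GB")
```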
Optimizing Operating System Environment
Operating system choice significantly impacted inference performance and resource utilization. Ubuntu Server, installed on a dedicated 160GB NVMe partition and run from a terminal, consumed approximately 800MB of system RAM during inference. This contrasted sharply with Windows 11, which used over 28GB of system memory for the same llama.cpp parameters.
This lean Linux environment provided a 25% boost to TPS. On Windows, inference was below 27 TPS and degraded quickly beyond 100K context. On Ubuntu Server, inference consistently reached ~34 TPS, often peaking at ~37 TPS during token generation, and maintained stability up to 1M context with IQ4_NL_XL quantization. The remaining 8GB of system RAM on Ubuntu was available for other applications, provided they did not consume VRAM.
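The size of the gain can be checked from the reported throughput numbers. The sketch below treats 27 TPS as the Windows baseline; since the write-up says Windows ran below 27 TPS, the result is a lower bound on the relative gain. The helper name is ours.

```python
def relative_gain(baseline_tps, improved_tps):
    """Fractional throughput gain of improved_tps over baseline_tps."""
    return (improved_tps - baseline_tps) / baseline_tps


WINDOWS_TPS = 27.0       # reported upper bound on Windows 11
UBUNTU_TPS = 34.0        # reported typical rate on Ubuntu Server
UBUNTU_PEAK_TPS = 37.0   # reported peak during token generation

print(f"typical gain: {relative_gain(WINDOWS_TPS, UBUNTU_TPS):.1%}")
print(f"peak gain:    {relative_gain(WINDOWS_TPS, UBUNTU_PEAK_TPS):.1%}")
```

The typical-rate figure comes out close to the 25% boost the founder reported.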
Configuring llama-server for Performance
The llama-server command-line interface was configured with specific parameters to achieve the reported performance. For a 256K context, the command used was:
llama-server \
    -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
    --jinja \
    --parallel 1 \
    --temp 0.7 \
    --top-k 20 \
    --top-p 0.95 \
    --min-p 0 \
    --reasoning-budget 4096 \
    -n 32768 \
    --no-context-shift \
    --no-mmap \
    -c 262144 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --host 0.0.0.0
Key parameters included -c 262144 for the context size, --cache-type-k q8_0 and --cache-type-v q8_0 for KV cache quantization, and --no-mmap to load the model into memory instead of memory-mapping the file, which can affect performance on systems with limited RAM. The --parallel 1 setting limited the server to a single slot, which likely favored single-stream inference, while --reasoning-budget 4096 limited the reasoning budget and -n 32768 capped the number of generated tokens. For a 512K context, the -c value was adjusted accordingly.
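Because only the context size and cache type change between profiles, the command can be generated rather than hand-edited. The following is a minimal sketch: the flags and their values are copied from the command above, while the helper name and its parameters are hypothetical. Reading "512K" as 512 x 1,024 = 524,288 tokens is our assumption, and the write-up does not give the model filename used at that context, so the 256K filename is reused here purely as a placeholder.

```python
import shlex


def build_llama_server_args(model_path, context_tokens, kv_cache_type="q8_0"):
    """Assemble the llama-server argument list from the article's command."""
    return [
        "llama-server",
        "-m", model_path,
        "--jinja",
        "--parallel", "1",
        "--temp", "0.7",
        "--top-k", "20",
        "--top-p", "0.95",
        "--min-p", "0",
        "--reasoning-budget", "4096",
        "-n", "32768",
        "--no-context-shift",
        "--no-mmap",
        "-c", str(context_tokens),
        "--cache-type-k", kv_cache_type,
        "--cache-type-v", kv_cache_type,
        "--host", "0.0.0.0",
    ]


# Placeholder model file; only the -c value differs from the 256K profile.
args = build_llama_server_args("Qwen3.6-35B-A3B-Q4_K_XL.gguf", 512 * 1024)
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues when model paths contain spaces.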
WHAT WE'D CHANGE
The playbook outlined by Alternative-Cat-1347 provides a detailed path for optimizing local LLM inference, but its direct applicability for all founders requires consideration of context and evolving technology. The specific performance numbers are tied to the Qwen3.6-35B-A3B MoE model and the 8GB RTX 3070 Ti. Non-MoE models or GPUs with different VRAM configurations would necessitate a re-evaluation of VRAM partitioning and quantization strategies.
The emphasis on Ubuntu Server for a 25% TPS boost is significant. However, for founders who prioritize ease of use, existing development environments, or integration with other Windows-specific tools, a dedicated Linux server setup might introduce an unacceptable operational overhead. The performance gain must be weighed against the cost of maintaining a dual-boot or dedicated Linux machine, especially if the primary development environment remains Windows-based. While the founder noted the profiles should work under Windows 11, the memory constraints would be substantial.
Furthermore, the llama.cpp ecosystem and quantization techniques are under continuous development. APEX-I-Quality, Q4_K_XL, and IQ4_NL_XL were optimal at the time of the report. Newer quantization schemes or llama.cpp optimizations might offer better performance or VRAM efficiency, potentially reducing the need for such extreme OS-level tuning. Founders should validate these specific quantization methods against current llama.cpp benchmarks and newer model releases. The --no-mmap flag, while beneficial for this specific setup, might not be universally optimal, as mmap can be efficient in other memory configurations.
LANDING
Optimizing local LLM inference on consumer hardware remains a balance of model architecture, hardware constraints, and software configuration. Alternative-Cat-1347's approach demonstrates that significant performance gains, including high TPS and extended context windows, are achievable on mid-range GPUs by understanding MoE VRAM dynamics, selecting precise quantization, and adopting a minimalist operating system. This granular control over the llama-server parameters offers a template for founders aiming to maximize local LLM utility without investing in datacenter-grade hardware, provided they are willing to engage with the underlying technical specifics.