Tools·Jun 2, 2026

Qwen3.6-35B-A3B-APEX: Optimizing local LLM inference on RTX 3060 12GB

This review examines old-mike's detailed benchmarks for running Qwen3.6-35B-A3B-APEX on an NVIDIA RTX 3060 12GB, focusing on specific software optimizations and performance metrics.

TL;DR

Best for: Developers who need to run large language models (35B parameters) locally on consumer-grade NVIDIA GPUs with limited VRAM (e.g., a 12GB RTX 3060) and who prioritize generation speed and context retention.

Skip if: You have ample VRAM (24GB+), where MTP (Multi-Token Prediction) might be beneficial, or you need out-of-the-box performance without specific llama-server tuning.

Bottom line: The combination of Spiritbuun's llama.cpp fork and Mudler's APEX quantizations enables surprisingly performant 35B LLM inference on 12GB VRAM.

METHODOLOGY

This v0 review draws on the detailed performance claims and configuration shared by Reddit user old-mike on May 28, 2026. Independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

The review covers the optimization strategies and performance of the Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf model, along with mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf for multimodal capabilities, running via Spiritbuun's llama.cpp fork. No specific commit or release of the fork is cited; it is identified only by its repository, github.com/spiritbuun/buun-llama-cpp. The tests were conducted inside an Incus (LXC) container on a system with an NVIDIA RTX 3060 12GB GPU (110W power limit), a Xeon E5-2678 v3 CPU, and 128 GB of DDR4-2133 RAM.

old-mike's methodology included:

  • Generation Speed: Measured in tokens per second (t/s) for both prompt processing and token generation, using a 72K token prompt followed by 100 generated tokens.
  • Context Degradation: Performance metrics recorded at 72K and 129K context fill levels, compared to a "fresh" context.
  • Perplexity (PPL): Evaluated using llama-perplexity on an enwik8 subset with 64K context, turbo4 settings, and flash-attn.
  • Needle-in-a-Haystack: Manual testing across 5 trials with hidden codes in 150K–200K token academic markdown texts, at varying depths.
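The needle-in-a-haystack procedure in the list above can be sketched in a few lines. This is a minimal illustration, not old-mike's harness: the filler sentences, the secret code, the depth values, and the function names `build_haystack` and `retrieved` are all hypothetical, and the step that actually queries the model is left out.

```python
import random


def build_haystack(filler_sentences, needle, depth, total_sentences, seed=0):
    """Return (text, position) with `needle` inserted at a fractional depth (0.0 to 1.0)."""
    rng = random.Random(seed)
    body = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    position = round(depth * total_sentences)
    body.insert(position, needle)
    return " ".join(body), position


def retrieved(model_answer, secret_code):
    """Score a trial as a hit if the hidden code appears in the model's answer."""
    return secret_code in model_answer


# Hypothetical filler and code; the source used academic markdown texts instead.
FILLER = ["The committee reviewed the draft.", "Results were consistent across runs."]
SECRET = "ZX-4471"
NEEDLE = f"The hidden code is {SECRET}."

for depth in (0.1, 0.25, 0.5, 0.75, 0.9):
    text, pos = build_haystack(FILLER, NEEDLE, depth, total_sentences=1000)
    # In a real run, `text` plus a retrieval question would be sent to the model here.
    print(f"depth={depth:.2f} needle at sentence {pos}")
```

Varying `depth` is what distinguishes this from a single lookup: a model that only attends well to the start or end of its context would pass some depths and fail others.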

What's NOT covered: This review does not include independent performance verification, long-term workflow integration, or edge-case analysis beyond the specific scenarios detailed by old-mike. The impact of the specific CPU and RAM on overall system performance, beyond VRAM offloading, is also not independently assessed.

WHAT IT DOES

Efficient 35B inference on 12GB VRAM

This setup demonstrates the ability to run a 35-billion parameter model, Qwen3.6-35B-A3B-APEX-MTP-I-Compact, on an NVIDIA RTX 3060 12GB GPU. The model file is 17.3 GB, so it cannot sit entirely in the card's 12 GB of VRAM; the remainder is offloaded to system RAM, an arrangement that is typically slow on consumer-grade hardware. The primary goal is to maximize token generation speed while maintaining a large context window, which matters for complex local LLM applications. The setup handles a 128K context window, with a reported generation speed of 37.17 t/s at 72K context.
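The size mismatch can be made concrete with a little arithmetic. The sketch below only computes a lower bound on how much of the model must live outside VRAM; the `reserved_gb` parameter is a hypothetical allowance for KV cache, compute buffers, and the mmproj, none of which the source breaks down.

```python
MODEL_GB = 17.3   # model file size reported in the source
VRAM_GB = 12.0    # RTX 3060 capacity


def min_spill_gb(model_gb, vram_gb, reserved_gb=0.0):
    """Lower bound on the weights that must stay in system RAM.

    `reserved_gb` is a hypothetical allowance for anything else that
    needs VRAM (KV cache, compute buffers, multimodal projector).
    """
    usable = max(vram_gb - reserved_gb, 0.0)
    return max(model_gb - usable, 0.0)


spill = min_spill_gb(MODEL_GB, VRAM_GB)
print(f"at least {spill:.1f} GB ({spill / MODEL_GB:.0%}) of the weights live outside VRAM")
```

With nothing reserved, at least 5.3 GB of the 17.3 GB file sits in system RAM; any VRAM set aside for context or the projector pushes that figure higher.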

Spiritbuun's CUDA optimizations

Key to this performance are the CUDA optimizations from Spiritbuun's buun-llama-cpp fork. These include a "fused MMA fix," "TurboQuant," and "fattn improvements." These low-level GPU optimizations are specifically designed for NVIDIA hardware, enabling more efficient memory utilization and faster computation. The llama-server command explicitly uses flash-attn on, which is likely a component of these improvements, contributing to the high prompt processing speeds and reduced context degradation.

Mudler's APEX quantizations

The model itself, Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf, uses Mudler's APEX I-Compact quantization. old-mike highlights this scheme as providing the "best perplexity/speed trade-off" among the variants tested. Quantization reduces a model's memory footprint by representing weights with fewer bits, often at some cost in accuracy. Mudler's work appears to strike a balance: the file is small enough (17.3 GB) to run on the 12GB card with partial offloading, while retaining strong quality metrics such as a perplexity of 3.25 on the enwik8 subset at 64K context.
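For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The expression below assumes the standard definition (the source does not spell out how llama-perplexity computes it), with N tokens x_1 through x_N:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i})\right),
\qquad
\mathrm{PPL} = 3.25 \;\Longrightarrow\; -\frac{1}{N}\sum_{i=1}^{N}\ln p(x_i \mid x_{<i}) = \ln 3.25 \approx 1.18 \text{ nats per token}
```

Lower is better, and figures are only comparable when the dataset, context length, and tokenizer match.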

Context handling and stability

The configuration supports a 128K context window, set with -c 131072 in the llama-server command. Performance degradation with increasing context is documented: generation speed drops from ~45 t/s (fresh) to 37.17 t/s at 72K context (roughly 17% slower) and to 28.08 t/s at 129K context (roughly 38% slower). The setup also accommodates the multimodal projector (mmproj) by using -fitt 1500, which limits the GPU VRAM used by the model to leave room for the mmproj component and so prevents out-of-memory errors. The needle-in-a-haystack test, with 100% retrieval across 5 trials in 150K–200K token texts, suggests robust long-context retrieval, although five trials is a small sample.
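The context size and the slowdown figures follow from simple arithmetic on the reported numbers. The sketch below treats the approximate "fresh" speed of 45 t/s as the baseline, so the percentages are only as precise as that figure; the helper name `slowdown_pct` is ours.

```python
CONTEXT_TOKENS = 128 * 1024  # 131072, the value passed to -c

# Generation speeds reported in the source (tokens per second).
FRESH_TPS = 45.0  # approximate "fresh" figure
SPEEDS = {"72K": 37.17, "129K": 28.08}


def slowdown_pct(baseline, measured):
    """Percentage drop in throughput relative to the baseline."""
    return (baseline - measured) / baseline * 100


for fill, tps in SPEEDS.items():
    pct = slowdown_pct(FRESH_TPS, tps)
    print(f"{fill} fill: {tps} t/s, about {pct:.0f}% slower than fresh")
```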

WHAT'S INTERESTING / WHAT'S NOT

The most interesting aspect is the sheer capability demonstrated: running a 35B parameter LLM with a 128K context on a consumer-grade 12GB GPU like the RTX 3060. This pushes the boundaries of what is typically considered feasible for local inference on such hardware. The reported 37.17 t/s generation speed at 72K context is a strong result for a model of this size and VRAM constraint. It suggests that specialized llama.cpp forks and carefully selected quantizations can significantly extend the lifespan and utility of mid-range GPUs for local LLM development and experimentation.

The specific combination of Spiritbuun's CUDA optimizations (fused MMA fix, TurboQuant, fattn improvements) and Mudler's APEX I-Compact quantization is the most concrete and checkable finding. old-mike explicitly states these are "the key" to achieving the reported numbers. The detailed llama-server command gives others a reproducible artifact for replicating the results, which is a hallmark of good benchmarking. The 100% retrieval rate on the needle-in-a-haystack test, even in 150K–200K token texts, is particularly noteworthy, because long-context retention and recall is a common weak point for LLMs.

Less encouraging on this specific setup is the performance impact of MTP (Multi-Token Prediction). old-mike observes that MTP drops generation speed by 41% on the 3060 12GB with heavy offloading. This is a crucial finding for users with similar hardware: MTP may help on cards with more VRAM, but it is a detriment here. It underlines the importance of hardware-specific tuning rather than assuming that optimizations apply universally. The CPU, a Xeon E5-2678 v3, is an older server-grade chip, which suggests that the reported gains come from GPU-side and VRAM-side tuning rather than from CPU horsepower, though the CPU's contribution is not assessed here.

PRICING

This review covers open-source models and local inference tools. There are no direct costs associated with using Spiritbuun's llama.cpp fork or Mudler's quantized models. The only costs are for the underlying hardware (RTX 3060 12GB, CPU, RAM) and electricity. Pricing snapshot: 2026-05-28.

VERDICT

For developers and enthusiasts constrained by 12GB VRAM on NVIDIA GPUs, the combination of Spiritbuun's llama.cpp fork and Mudler's APEX I-Compact quantizations for the Qwen3.6-35B model is a compelling solution. It enables surprisingly high-performance local inference, achieving 37.17 t/s generation at 72K context and robust long-context understanding, as evidenced by the 100% needle-in-a-haystack retrieval. Users with similar hardware should adopt old-mike's llama-server configuration, specifically noting the -fitt parameter for multimodal models and disabling MTP. This setup offers a viable path to run large LLMs locally without requiring top-tier GPUs.
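As a starting point, the flags quoted in this review can be assembled into a launch command. This is a partial sketch, not old-mike's full command: `-m` and `--mmproj` are assumed here as the usual llama.cpp flags for the model and projector files, the `models/` directory is a placeholder, and any other settings from the original post (cache type, MTP options) are deliberately left out because the review does not quote them.

```python
import shlex

# File names are taken from the review; the directory is a placeholder.
MODEL = "models/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf"
MMPROJ = "models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf"

args = [
    "llama-server",
    "-m", MODEL,            # assumption: standard llama.cpp model flag
    "--mmproj", MMPROJ,     # assumption: standard llama.cpp projector flag
    "-c", "131072",         # 128K context, as quoted in the review
    "--flash-attn", "on",   # the review quotes "flash-attn on"
    "-fitt", "1500",        # VRAM headroom setting quoted in the review
]

# Print a shell-safe command line for inspection before running it.
print(shlex.join(args))
```

Check the flags against the fork's own `llama-server --help` output before relying on them, since forks can diverge from upstream.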

WHAT WE'D TEST NEXT

We would independently verify old-mike's reported benchmarks across multiple RTX 3060 12GB cards to confirm reproducibility and consistency. Further testing would involve evaluating the setup on other consumer-grade NVIDIA GPUs with varying VRAM, such as the RTX 4060 8GB or RTX 4070 12GB, to understand the generalizability of these optimizations. We would also explore the power consumption implications of these high-performance settings, especially with the 110W power limit noted. A broader range of LLM tasks beyond simple generation and needle-in-a-haystack, including summarization, coding assistance, and creative writing, would provide a more comprehensive understanding of real-world utility. Finally, we would investigate the specific technical details of Spiritbuun's "fused MMA fix," "TurboQuant," and "fattn improvements" to understand their underlying mechanisms and potential applicability to other llama.cpp forks or models.
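A first pass at the power question needs only the numbers already reported. The sketch below estimates GPU energy per generated token under the assumption that the card draws its full 110W limit during generation; it ignores CPU and system power, which are likely significant with heavy offloading, so treat the result as a rough GPU-only ceiling rather than a measurement.

```python
POWER_LIMIT_W = 110.0  # GPU power cap noted in the source


def joules_per_token(power_w, tokens_per_s):
    """Energy per generated token if the GPU draws `power_w` continuously."""
    return power_w / tokens_per_s


for label, tps in {"72K": 37.17, "129K": 28.08}.items():
    energy = joules_per_token(POWER_LIMIT_W, tps)
    print(f"{label} context: at most {energy:.2f} J/token on the GPU")
```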

Sources · how we verified
  1. Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.