Running Gemma 4 26B on a GTX 1080 with llama.cpp
This review examines a detailed process for optimizing Google's Gemma 4 26B-A4B Mixture-of-Experts model for local inference on legacy NVIDIA hardware using the llama.cpp framework.
The Answer Up Front
For developers and enthusiasts with older NVIDIA Pascal-era GPUs (like the GTX 1080) and a willingness to dive into llama.cpp configurations, this optimization guide for Gemma 4 26B offers a path to surprisingly capable local LLM inference. Those expecting a plug-and-play solution or working with non-NVIDIA hardware should skip this specific process. The bottom line: it demonstrates that modern MoE models can run effectively on hardware considered obsolete for traditional LLM inference, provided the right architectural understanding and software tuning.
Methodology
This v0 review draws on the author's published claims at the source URL; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The review covers the process for running Google's Gemma 4 26B-A4B Mixture-of-Experts (MoE) model and, for comparison, the Qwen 3.6 35B-A3B model. The specific hardware configuration includes an Intel i7-6700 (Skylake, 4c/8t, 2015 CPU), 32 GiB of system RAM, and an NVIDIA GeForce GTX 1080 with 8 GiB VRAM (Pascal, 2016 GPU), all running on Fedora 42. The core tool under examination is llama.cpp, specifically its capabilities for splitting MoE weights between CPU and GPU. Performance metrics, such as tokens/second and context length, are reported as claimed by the author. This review does not cover independent performance verification, long-term workflow integration, or edge cases beyond the described setup.
- Tool Name + Version + Date Observed: llama.cpp (specific version not stated, but implied current with Gemma 4 support), observed 2026-05-24.
- Source Signal URL: https://dev.to/mdda/running-gemma-4-26B-on-an-old-gtx-1080-with-llamacpp-4fi5
- What's Covered: Author's claims regarding Gemma 4 26B-A4B and Qwen 3.6 35B-A3B performance on specific legacy hardware, llama.cpp configuration flags, NVIDIA driver pinning, and identified performance bottlenecks.
- What's NOT Covered: Independent performance benchmarks, long-term stability, power consumption, or generalizability to other hardware configurations or operating systems.
What It Does
Leveraging Mixture-of-Experts Architecture
The core insight is Gemma 4 26B-A4B's Mixture-of-Experts (MoE) architecture. While the model has 25.2 billion total parameters, only 3.8 billion are active per token. This allows llama.cpp to keep the larger, "cold" expert weights in system RAM and stream them over PCIe as needed, while the smaller active working set resides on the GPU. This strategy circumvents the VRAM limitations of the 8 GiB GTX 1080, enabling the model to run at a claimed ~24.5 tokens/second with 128k context.
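The arithmetic behind that split can be sketched in a few lines of Python. The parameter counts and the 8 GiB VRAM figure come from the review; the bits-per-weight value is an assumption, since the quantization used is not stated here, and the estimate ignores the KV cache and other runtime buffers that a 128k context would add.

```python
# Back-of-envelope weight sizes for a 25.2B-total / 3.8B-active MoE model.
TOTAL_PARAMS = 25.2e9    # total parameters (from the review)
ACTIVE_PARAMS = 3.8e9    # parameters active per token (from the review)
VRAM_GIB = 8.0           # GTX 1080 VRAM (from the review)

# ASSUMPTION: average bits per weight after quantization.
# The review does not state the quantization; 4.5 is only a placeholder.
ASSUMED_BITS_PER_WEIGHT = 4.5


def weights_gib(params: float, bits_per_weight: float) -> float:
    """Size of `params` weights in GiB at the given average bit width."""
    return params * bits_per_weight / 8 / 2**30


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters touched for a single token."""
    return active / total


total_gib = weights_gib(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
active_gib = weights_gib(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT)

print(f"all weights:    {total_gib:.1f} GiB (VRAM: {VRAM_GIB:.0f} GiB)")
print(f"active weights: {active_gib:.1f} GiB")
print(f"active share:   {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
```

Under that assumed bit width, the full weight set does not fit in 8 GiB while the per-token active share does, which is the gap the CPU/GPU split is meant to bridge. A different quantization moves both numbers but not the ratio between them.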
llama.cpp Configuration for MoE Offloading
The llama.cpp framework provides specific flags to manage MoE weight placement. The --n-cpu-moe N flag dictates that the MoE weights of the first N layers are kept on the CPU, while --n-gpu-layers 999 attempts to offload everything else to the GPU. For the described 128k context setup, the sweet spot was found to be --n-cpu-moe 21 when using Gemma 4's MTP assistant head, or --n-cpu-moe 20 without it. This fine-grained control is crucial for balancing VRAM usage, system RAM, and PCIe bandwidth.
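As a rough illustration of how these flags fit together, the following Python sketch assembles a llama-server command line. The model path is a hypothetical placeholder, the 131072-token context is an assumed reading of "128k", and the flags needed to enable the MTP assistant head are left out because the review does not give them.

```python
import shlex
from typing import List


def build_llama_server_args(
    model_path: str,
    n_cpu_moe: int,
    ctx_size: int = 131072,      # ASSUMPTION: "128k" taken as 128 * 1024 tokens
    n_gpu_layers: int = 999,     # value quoted in the review
    binary: str = "llama-server",
) -> List[str]:
    """Return an argv list that keeps the first `n_cpu_moe` layers' expert
    weights on the CPU and offloads everything else to the GPU."""
    if n_cpu_moe < 0:
        raise ValueError("n_cpu_moe must be non-negative")
    return [
        binary,
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--n-cpu-moe", str(n_cpu_moe),
    ]


# Hypothetical GGUF path; 20 is the value the review reports for runs
# without the MTP assistant head.
cmd = build_llama_server_args("models/gemma-4-26b-a4b.gguf", n_cpu_moe=20)
print(shlex.join(cmd))
```

Building the argument list in code rather than by hand makes it easier to vary --n-cpu-moe one step at a time when re-tuning for a different card or context length.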
NVIDIA Driver Pinning
Given the GTX 1080's age, a specific NVIDIA driver version is required on Fedora 42. The process involves pinning akmod-nvidia to the 580xx branch using dnf swap akmod-nvidia akmod-nvidia-580xx --allowerasing --releasever=44. This step ensures compatibility and stability for the legacy hardware, preventing issues with newer driver branches that may drop support or introduce regressions for older cards.
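After pinning, it can be useful to confirm which driver branch is actually loaded. The Python sketch below is one way to do that, assuming nvidia-smi is installed and on the PATH; the helper names are illustrative and not part of the author's process.

```python
import subprocess


def driver_major(version: str) -> int:
    """Extract the leading branch number from a driver version string."""
    return int(version.strip().split(".")[0])


def is_on_branch(version: str, branch: int = 580) -> bool:
    """True if the version string belongs to the expected driver branch."""
    return driver_major(version) == branch


def installed_driver_version() -> str:
    """Ask nvidia-smi for the loaded driver version (first GPU only).

    Raises if nvidia-smi is missing or exits with an error.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()[0].strip()


# Usage on a machine with the NVIDIA driver loaded:
#   version = installed_driver_version()
#   print(version, "OK" if is_on_branch(version) else "unexpected branch")
```

A check like this is mainly useful after a system upgrade, when a package transaction might have replaced the pinned akmod with a newer branch.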
What's Interesting / What's Not
What's interesting here is the demonstration that architectural understanding, specifically of MoE models, can extend the useful life of legacy hardware for demanding AI workloads. The claimed ~24.5 tokens/second on a 2016-vintage GTX 1080 would be a strong result, particularly at 128k context. The author reports getting there by identifying the system's bottleneck, PCIe bandwidth, and tuning around it. The author's observation that “PCIe maxed out and GPU half-idle indicates a bandwidth-limited, not compute-limited, scenario” is a key insight: a faster GPU alone would not necessarily improve performance in this setup, because the data transfer rate is the limiting factor. The comparison with Qwen 3.6 35B-A3B, which reportedly produced slower output and was more verbose, suggests that Gemma 4 is the more efficient choice for assistant workloads on this hardware.
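The bandwidth-limited argument can be made concrete with a deliberately simple model: if every generated token requires moving a fixed number of bytes across the link, throughput cannot exceed link bandwidth divided by bytes per token. The Python sketch below assumes a PCIe 3.0 x16 link at its theoretical peak of about 15.75 GB/s; the review does not state the link generation or measured throughput, and real transfers run below the theoretical figure.

```python
# ASSUMPTION: PCIe 3.0 x16 theoretical peak, one direction.
# 8 GT/s * 16 lanes * (128/130 encoding) / 8 bits ~= 15.75 GB/s.
PCIE3_X16_BYTES_PER_S = 15.75e9


def bandwidth_ceiling_tokens_per_s(
    link_bytes_per_s: float, bytes_moved_per_token: float
) -> float:
    """Upper bound on tokens/second when the link is the only limit.

    GPU compute speed does not appear in this formula, which is why a
    faster GPU cannot raise the ceiling in a bandwidth-limited setup.
    """
    if bytes_moved_per_token <= 0:
        raise ValueError("bytes_moved_per_token must be positive")
    return link_bytes_per_s / bytes_moved_per_token


def implied_bytes_per_token(link_bytes_per_s: float, tokens_per_s: float) -> float:
    """Largest per-token transfer consistent with an observed token rate."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return link_bytes_per_s / tokens_per_s


# At the claimed ~24.5 tokens/second, a saturated link of this assumed
# speed could be moving at most this many bytes per token:
budget = implied_bytes_per_token(PCIE3_X16_BYTES_PER_S, 24.5)
print(f"per-token transfer budget: {budget / 1e6:.0f} MB")
```

This ignores overlap between transfer and compute, as well as any work done on the CPU side, so it is a ceiling for reasoning about the bottleneck rather than a prediction of measured speed.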
What's less interesting, or rather, what presents a barrier, is the highly manual and specific nature of the optimization. This is not a general solution; it requires deep familiarity with Linux system administration (Fedora 42), NVIDIA driver management, and llama.cpp internals. The specific --n-cpu-moe values are hardware-dependent, meaning users with slightly different setups would need to re-benchmark. While the results are impressive, the path to achieving them is steep, limiting its applicability to dedicated enthusiasts or researchers rather than general users seeking easy local inference.
Pricing
llama.cpp is an open-source project, available at no cost. Google's Gemma 4 models are free to download and use for research and commercial purposes, subject to their licensing terms. The only costs associated with this setup are the hardware (the author purchased a used GTX 1080 for under $200 USD in 2025) and electricity.
Verdict
For users committed to maximizing the utility of older hardware for local LLM inference, llama.cpp with Gemma 4 26B is a viable, albeit technically demanding, option. The critical factor is the MoE architecture's ability to leverage system RAM and manage PCIe bandwidth, transforming what would typically be an underpowered setup into a functional local inference machine. If you possess a compatible legacy NVIDIA GPU, have a strong grasp of Linux and llama.cpp parameters, and are willing to experiment, this approach offers significant value. Otherwise, the complexity and hardware specificity make it a niche solution.
What We'd Test Next
Our next steps would involve independently reproducing these benchmarks on similar hardware, specifically focusing on the --n-cpu-moe parameter's impact across a range of values and different context lengths. We would also investigate the power consumption implications of constant PCIe saturation. Further testing would include evaluating the setup's performance with other MoE models and non-MoE models to quantify the architectural advantage more broadly. Finally, we would explore the ease of setup on other Linux distributions and Windows, as well as the impact of faster CPUs or PCIe generations on overall throughput, to understand the true bottlenecks and scalability.
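A sweep of that kind could be organized along the lines of the Python sketch below. The measure callable is hypothetical: in a real run it would launch llama.cpp with the given --n-cpu-moe value and return measured tokens/second. The toy function at the bottom is synthetic and stands in only to show the control flow; it is not benchmark data.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_n_cpu_moe(
    values: Iterable[int],
    measure: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Run `measure` for each candidate value and return (best, all results).

    `measure(n)` should return tokens/second for --n-cpu-moe n, or 0.0 if
    that configuration fails to load (for example, out of VRAM).
    """
    results = {n: measure(n) for n in values}
    if not results:
        raise ValueError("no candidate values given")
    best = max(results, key=results.get)
    return best, results


def toy_measure(n: int) -> float:
    """Synthetic stand-in that peaks at n == 3. Not real benchmark data."""
    return 10.0 - abs(n - 3)


best, results = sweep_n_cpu_moe(range(0, 6), toy_measure)
print("best value in toy sweep:", best)
```

Keeping the measurement behind a callable means the same loop can be reused across context lengths, models, or machines by swapping only the function that runs the benchmark.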
The Investor Read
The ability to run large language models like Gemma 4 26B on legacy consumer hardware signals a significant trend towards democratized local inference. This reduces reliance on expensive cloud GPUs and opens up new use cases for edge computing and personal AI. The success of llama.cpp highlights the value of highly optimized, open-source inference engines, which are becoming critical infrastructure for the broader AI ecosystem. Companies that can develop or leverage efficient model architectures (like MoE) and highly optimized inference runtimes will capture market share as local AI becomes more prevalent. While llama.cpp itself is an open-source project, its widespread adoption and the technical insights it enables point to investable opportunities in hardware-aware model quantization, specialized local inference accelerators, and tools that simplify complex llama.cpp-like optimizations for a broader user base. This also signals a potential shift in tooling spend from pure cloud compute to hybrid or local-first solutions.