ollama's Heterogeneous GPU Support: Custom Weighting Optimizes Mixed Hardware
This review analyzes comperr's technical modifications to ollama/main for heterogeneous GPU setups. It details how compute-aware layer distribution and output layer prioritization aim to improve performance and VRAM utilization on mixed hardware configurations.
The problem of efficiently running large language models (LLMs) on consumer-grade hardware is complex, especially when dealing with heterogeneous GPU configurations. ollama/main, a popular local LLM runner, has faced challenges in this area. comperr's custom implementation addresses these shortcomings by introducing compute-aware GPU weighting and intelligent layer splitting, moving beyond ollama/main's default VRAM-only allocation strategy.
The Answer Up Front
Users with heterogeneous GPU setups, particularly those combining cards with disparate compute capabilities like an RTX 5090 and an RTX 3090, should pay close attention to comperr's approach. This work directly targets the inefficiency of ollama/main in such environments, where a stronger GPU might be underutilized due to the weaker card bottlenecking the system. If you are running local LLMs and have mixed GPUs, comperr's method offers a technically sound path to better performance and VRAM utilization. Skip this if your setup is homogeneous or if you are not using ollama. The bottom line is that this custom implementation offers a significant, targeted improvement for local LLM inference on mixed GPU hardware by intelligently distributing layers based on actual compute power, not just available VRAM.
Methodology
This v0 review draws on the founder's published claims at the provided Reddit URL; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The tool under review is ollama/main as modified by comperr, observed on 2026-05-28. This review covers comperr's specific code logic, formulas, and stated performance claims as detailed in the Reddit post. It does not cover independent performance benchmarks, long-term workflow integration, edge-case stability, or the specifics of ollama/main's Windows compilation issues, which comperr mentions but which fall outside the scope of the GPU distribution logic itself.
What It Does
Compute Power Weighting
comperr's primary innovation lies in its findBestFit() function, which departs from ollama/main's default VRAM-only allocation. In main, GPU free memory is used verbatim, meaning a 3090 (24 GB) and a 5090 (32 GB) are assigned layers purely based on their VRAM capacity. comperr introduces a method to compute raw power per GPU, using SMCount * ClockMHz or falling back to ComputeMajor*100+ComputeMinor if the former reports uniform values. This raw power is then used to calculate a powerShare and computeCapacity multiplier. This computeCapacity scales the FreeMemory of each GPU before greedyFit runs, ensuring the 5090 receives layers proportional to its compute power, not just its VRAM.
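The weighting described above can be sketched in a few lines. This is a minimal illustration in Python, not comperr's code (ollama itself is written in Go). The source does not give the exact formula for computeCapacity, so the sketch assumes it is powerShare multiplied by the number of GPUs, which makes the average multiplier 1.0. The GPU class, the function names, and the hardware figures in the usage note are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class GPU:
    """Hypothetical stand-in for the per-GPU fields named in the post."""
    name: str
    free_memory: int    # bytes
    sm_count: int
    clock_mhz: int
    compute_major: int
    compute_minor: int


def raw_power(gpus):
    """Raw power per GPU: SMCount * ClockMHz, with the compute-capability fallback."""
    primary = [g.sm_count * g.clock_mhz for g in gpus]
    # If every GPU reports the same value, the metric cannot tell them apart,
    # so fall back to ComputeMajor*100 + ComputeMinor.
    if len(set(primary)) <= 1:
        return [g.compute_major * 100 + g.compute_minor for g in gpus]
    return primary


def scaled_free_memory(gpus):
    """Scale each GPU's free memory by an assumed computeCapacity multiplier."""
    power = raw_power(gpus)
    total = sum(power)
    if total == 0:
        # No usable compute information: leave free memory untouched.
        return [g.free_memory for g in gpus]
    scaled = []
    for gpu, p in zip(gpus, power):
        power_share = p / total
        # ASSUMPTION: computeCapacity = powerShare * GPU count (average of 1.0).
        compute_capacity = power_share * len(gpus)
        scaled.append(int(gpu.free_memory * compute_capacity))
    return scaled
```

With illustrative figures for a 3090 (82 SMs, 1695 MHz, 24 GB) and a 5090 (170 SMs, 2407 MHz, 32 GB), the 5090's multiplier comes out near 1.5 and the 3090's near 0.5, so a fitter reading the scaled values would steer most layers to the 5090. The scaled numbers are weights rather than real capacity, so an actual implementation would still have to respect each card's physical VRAM.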
Optimized Layer Iteration
comperr identifies the iteration direction within greedyFit() as the
The investor read
This technical deep dive into ollama's GPU management highlights a critical, underserved niche in the local LLM inference market: optimizing for heterogeneous consumer hardware. As LLMs become more accessible, the demand for efficient resource utilization on diverse, often mismatched, GPU setups will only grow. Existing solutions like llama.cpp offer flexibility, but comperr's explicit focus on compute-aware layer distribution for ollama signals a maturing market segment that requires more sophisticated orchestration. A productized version of this, perhaps as a robust ollama fork or a plugin that simplifies deployment and tuning for various mixed GPU configurations, could be highly investable. This is likely a deliberately small, bootstrapped play, given its open-source nature and specific technical focus, but it points to a broader trend of optimizing local inference for the long tail of hardware configurations.
Every claim ties to a primary source. See our methodology.