Tools·Jun 11, 2026

ollama's Heterogeneous GPU Support: Custom Weighting Optimizes Mixed Hardware

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 11, 2026·3 min read·1 source

This review analyzes comperr's technical modifications to ollama/main for heterogeneous GPU setups. It details how compute-aware layer distribution and output layer prioritization aim to improve performance and VRAM utilization on mixed hardware configurations.

Running large language models (LLMs) efficiently on consumer-grade hardware is hard, and heterogeneous GPU configurations make it harder. ollama/main, a popular local LLM runner, allocates layers by available VRAM alone, which serves mixed setups poorly. comperr's custom implementation addresses this by adding compute-aware GPU weighting and adjusted layer splitting on top of the default VRAM-only strategy.

The Answer Up Front

Users with heterogeneous GPU setups, particularly those combining cards with very different compute capabilities such as an RTX 5090 and an RTX 3090, should pay close attention to comperr's approach. The work targets a specific inefficiency in ollama/main: the stronger GPU can sit underutilized while the weaker card bottlenecks the system. If you run local LLMs on mixed GPUs, comperr's method offers a technically sound path to better performance and VRAM utilization. Skip this if your setup is homogeneous or if you are not using ollama. The bottom line: by distributing layers according to compute power rather than available VRAM alone, this custom implementation offers a targeted improvement for local LLM inference on mixed GPU hardware, though the size of the gain still awaits independent benchmarks.

Methodology

This v0 review draws on the founder's published claims in the Reddit post listed under Sources; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The tool under review is ollama/main as modified by comperr, observed on 2026-05-28. The review covers comperr's specific code logic, formulas, and stated performance claims as detailed in the Reddit post. It does not cover independent performance benchmarks, long-term workflow integration, edge-case stability, or the specifics of ollama/main's Windows compilation issues, which comperr mentions but which fall outside the scope of the GPU distribution logic itself.

What It Does

Compute Power Weighting

comperr's main change sits in the findBestFit() function, which departs from ollama/main's default VRAM-only allocation. In main, each GPU's free memory is used verbatim, so an RTX 3090 (24 GB) and an RTX 5090 (32 GB) are assigned layers purely according to VRAM. comperr instead computes a raw power figure per GPU as SMCount * ClockMHz, falling back to ComputeMajor * 100 + ComputeMinor if the former reports uniform values. From this raw power it derives a powerShare and a computeCapacity multiplier. The computeCapacity multiplier scales each GPU's FreeMemory before greedyFit runs, so the 5090 receives layers in proportion to its compute power, not just its VRAM.
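
The weighting described above can be sketched in Go, the language ollama is written in. This is a minimal illustration and not comperr's code. The GPU struct, the helper names rawPower, powerShares and scaledFreeMemory, and the device figures in main are hypothetical. The post names powerShare and computeCapacity but this review does not reproduce their formulas, so the sketch assumes powerShare is a GPU's fraction of total raw power and computeCapacity is that fraction multiplied by the GPU count.

```go
package main

import "fmt"

const gib = 1 << 30

// GPU is a hypothetical, simplified device record. The field names mirror
// the identifiers quoted in the review, not ollama's actual struct.
type GPU struct {
	Name         string
	FreeMemory   uint64 // bytes
	SMCount      int
	ClockMHz     int
	ComputeMajor int
	ComputeMinor int
}

// rawPower scores each GPU as SMCount*ClockMHz. If that product is identical
// across all devices, it falls back to ComputeMajor*100+ComputeMinor.
func rawPower(gpus []GPU) []float64 {
	scores := make([]float64, len(gpus))
	uniform := true
	for i, g := range gpus {
		scores[i] = float64(g.SMCount * g.ClockMHz)
		if scores[i] != scores[0] {
			uniform = false
		}
	}
	if uniform {
		for i, g := range gpus {
			scores[i] = float64(g.ComputeMajor*100 + g.ComputeMinor)
		}
	}
	return scores
}

// powerShares normalizes scores so they sum to 1 (ASSUMED formula).
// If every score is zero, each GPU gets an equal share.
func powerShares(scores []float64) []float64 {
	total := 0.0
	for _, s := range scores {
		total += s
	}
	shares := make([]float64, len(scores))
	for i, s := range scores {
		if total == 0 {
			shares[i] = 1 / float64(len(scores))
		} else {
			shares[i] = s / total
		}
	}
	return shares
}

// scaledFreeMemory multiplies each GPU's FreeMemory by an ASSUMED
// computeCapacity of share*len(gpus), so an even power split leaves
// memory unchanged. The result is a weight for the fitter, not real VRAM.
func scaledFreeMemory(gpus []GPU, shares []float64) []uint64 {
	scaled := make([]uint64, len(gpus))
	for i, g := range gpus {
		computeCapacity := shares[i] * float64(len(gpus))
		scaled[i] = uint64(float64(g.FreeMemory) * computeCapacity)
	}
	return scaled
}

func main() {
	// Illustrative device figures only; real values come from the driver.
	gpus := []GPU{
		{Name: "RTX 3090", FreeMemory: 24 * gib, SMCount: 82, ClockMHz: 1695, ComputeMajor: 8, ComputeMinor: 6},
		{Name: "RTX 5090", FreeMemory: 32 * gib, SMCount: 170, ClockMHz: 2407, ComputeMajor: 12, ComputeMinor: 0},
	}
	shares := powerShares(rawPower(gpus))
	scaled := scaledFreeMemory(gpus, shares)
	for i, g := range gpus {
		fmt.Printf("%s share=%.3f scaled=%.1f GiB\n", g.Name, shares[i], float64(scaled[i])/gib)
	}
}
```

Under these assumptions the scaled figure for the stronger card can exceed its physical VRAM, so it works only as a relative weight. A real fitter would still need to check that the assigned layers fit in each card's actual free memory.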

Optimized Layer Iteration

comperr identifies the iteration direction within greedyFit() as the

The investor read

This technical deep dive into ollama's GPU management highlights an underserved niche in the local LLM inference market: optimizing for heterogeneous consumer hardware. As LLMs become more accessible, demand for efficient resource utilization on diverse, often mismatched GPU setups will only grow. Existing solutions like llama.cpp offer flexibility, but comperr's explicit focus on compute-aware layer distribution for ollama signals a maturing market segment that requires more sophisticated orchestration. A productized version, perhaps a robust ollama fork or a plugin that simplifies deployment and tuning for various mixed GPU configurations, could be highly investable. Given its open-source nature and narrow technical focus, this is likely a deliberately small, bootstrapped play, but it points to a broader trend of optimizing local inference for the long tail of hardware configurations.

Sources · how we verified

Heterogeneous GPU Weighting & Layer Splitting ↗

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.