Tools·Jun 11, 2026

Benchmarking LLMs on AMD MI60 for Home Assistant and Frigate

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 11, 2026·6 min read·1 source

This review examines a systematic llama-bench optimization process for Gemma 4 and Qwen3 LLMs on an AMD MI60 GPU, targeting performance gains for local Home Assistant and Frigate applications.

The Answer Up Front

For users running local LLMs on AMD GPUs, particularly the MI60 or MI50, this benchmarking approach offers a repeatable way to find faster llama.cpp settings. According to the post, systematic tuning of llama.cpp parameters produced noticeably lower latency in Home Assistant and Frigate. If you are struggling with latency on AMD hardware for local inference, the methodology is worth replicating. Users on NVIDIA hardware, or those not tuning llama.cpp in depth, can skip the AMD-specific settings, though the general principle of systematic benchmarking still applies.

Methodology

This v0 review draws on the author's published claims in the linked Reddit post; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The review covers the optimization methodology and the reported performance improvements for llama.cpp when running specific LLMs on AMD MI60 hardware. It does not cover independent performance verification, long-term workflow integration, or edge-case behavior beyond the scope of the original post.

The source signal, a Reddit post by user FantasyMaster85, details a multi-parameter sweep using a custom script that orchestrated llama-bench runs. The core tool is llama.cpp, deployed via the mixa3607/ML-gfx906 Docker container, which simplifies ROCm setup on Ubuntu 24.04. The hardware under test was an AMD MI60 GPU with 32GB VRAM. Two models were tested: Gemma 4 26B.A4B Q4_1 and Qwen3 35B.A3B Q4_0.

The benchmarking involved 30 total runs across 8 sections, with each run internally repeated 5 times by llama-bench for statistical stability. The parameters varied were KV cache pre-fill depth (0, 1,000, and 6,000 tokens), flash attention (on/off), KV cache quantization (f16, q8_0, q4_0), micro-batch (ubatch) size (512, 2048, 4096, 8192), logical batch size (2048, 8192), CPU thread count (8, 12, 24), and two ROCm-specific environment variables: GGML_ROCM_FORCE_MMQ (1 vs. 0) and HSA_ENABLE_SDMA (enabled vs. disabled).
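To make the parameter list concrete, here is a minimal sketch of how one configuration could be turned into a llama-bench invocation. It is not the script from the post, which was not reproduced in this review. The flag names (`-d`, `-fa`, `-ctk`, `-ctv`, `-ub`, `-b`, `-t`, `-r`, `-o`) reflect our understanding of the llama-bench CLI and should be checked against `llama-bench --help` for your build; the baseline values, the model path, and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical baseline: the post's actual baseline values are not given in this review.
BASELINE = {
    "depth": 0,            # KV cache pre-fill depth in tokens
    "flash_attn": 1,       # 1 = on, 0 = off
    "cache_type": "f16",   # applied to both the K and V caches
    "ubatch": 512,
    "batch": 2048,
    "threads": 8,
}


def llama_bench_command(
    model_path: str, cfg: Dict, reps: int = 5
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for a single llama-bench run.

    cfg may carry an optional "env" dict for ROCm-related variables such as
    HSA_ENABLE_SDMA; those are returned separately so the caller can merge
    them into the process environment.
    """
    argv = [
        "llama-bench",
        "-m", model_path,
        "-d", str(cfg["depth"]),
        "-fa", str(cfg["flash_attn"]),
        "-ctk", cfg["cache_type"],
        "-ctv", cfg["cache_type"],
        "-ub", str(cfg["ubatch"]),
        "-b", str(cfg["batch"]),
        "-t", str(cfg["threads"]),
        "-r", str(reps),       # repetitions per test, 5 in the post
        "-o", "json",          # machine-readable output
    ]
    extra_env = {name: str(value) for name, value in cfg.get("env", {}).items()}
    return argv, extra_env
```

A caller would typically pass the result to `subprocess.run(argv, env={**os.environ, **extra_env}, capture_output=True, text=True)` and store the JSON output next to the configuration that produced it.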

What It Does

Systematic Parameter Exploration

The core of FantasyMaster85's work is a methodical exploration of llama.cpp and ROCm parameters. The script executed 30 benchmark runs, divided into 8 sections. Sections 1 through 7 isolated individual parameters, varying one at a time while holding others constant at a baseline. This approach allows for clear attribution of performance changes to specific settings. Section 8 then combined the most promising individual results, stacking three different configurations (e.g., SDMA disabled with q8_0 KV) to assess compounded gains or conflicts. This structured testing is crucial for identifying optimal configurations in a complex parameter space.
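The one-parameter-at-a-time design, followed by stacking, can be sketched as below. The value lists come from the parameters named in this review; the choice of baseline values and the helper names `one_factor_at_a_time` and `stack` are assumptions for illustration. The sketch does not attempt to reproduce the post's exact run count or section layout.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Values as listed in the review; which value served as baseline is an assumption.
GRID: Dict[str, List[Any]] = {
    "depth": [0, 1000, 6000],
    "flash_attn": [1, 0],
    "cache_type": ["f16", "q8_0", "q4_0"],
    "ubatch": [512, 2048, 4096, 8192],
    "batch": [2048, 8192],
    "threads": [8, 12, 24],
    "GGML_ROCM_FORCE_MMQ": [1, 0],
    "HSA_ENABLE_SDMA": [1, 0],
}

# Hypothetical baseline: the first value of every list.
BASELINE: Dict[str, Any] = {name: values[0] for name, values in GRID.items()}


def one_factor_at_a_time(
    baseline: Dict[str, Any], grid: Dict[str, List[Any]]
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (label, config) pairs that differ from the baseline in one key."""
    yield "baseline", dict(baseline)
    for name, values in grid.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            config = dict(baseline)
            config[name] = value
            yield f"{name}={value}", config


def stack(baseline: Dict[str, Any], *overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Combine several individually promising overrides into one config."""
    config = dict(baseline)
    for override in overrides:
        config.update(override)
    return config
```

Because each isolated run changes exactly one key, any difference from the baseline result can be attributed to that key. A stacked configuration such as `stack(BASELINE, {"HSA_ENABLE_SDMA": 0}, {"cache_type": "q8_0"})` then tests whether the individual gains add up or interfere.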

Targeted LLM Optimization

The benchmark targeted two models: Gemma 4 26B at Q4_1 and Qwen3 35B at Q4_0. The choice of Q4_1 and Q4_0 was deliberate; the founder notes that MI60 and MI50 GPUs run these two quantization formats faster than others. Qwen3 was run at Q4_0 even though the founder reports Q4_1 as generally faster, because of VRAM constraints: the smaller Q4_0 weights left enough memory for the desired context size across three concurrent LLM slots, each with its own cache.
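The VRAM trade-off can be illustrated with a back-of-the-envelope KV cache estimate. This is a simplified sketch for a plain transformer with one K and one V tensor per layer; models with sliding-window or other hybrid attention layers will deviate from it. The bytes-per-element figures follow the ggml block layouts (f16 uses 2 bytes, q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes). The layer count, head count, head dimension, and context size in the usage note are hypothetical placeholders, not the real dimensions of either model.

```python
# Storage cost per cached element for each KV cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    cache_type: str,
    n_slots: int = 1,
) -> float:
    """Estimate KV cache size when every slot holds n_ctx tokens of context."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]
    return per_token * n_ctx * n_slots


def gib(n_bytes: float) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30
```

With placeholder dimensions of 40 layers, 4 KV heads, a head dimension of 128, and 16,384 tokens per slot, `gib(kv_cache_bytes(40, 4, 128, 16384, "f16", n_slots=3))` comes to 3.75 GiB, and switching the cache type to q8_0 cuts that to roughly 53% of the f16 figure. On a 32GB card, that is the kind of margin that decides whether a third slot or a larger context fits alongside the weights.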

Docker-based Benchmarking

A key practical element of this methodology is the use of the mixa3607/ML-gfx906 Docker container. The founder explicitly states that getting MI60/MI50 GPUs to work with Ubuntu 24.04 can be challenging, and this container streamlines the setup, making llama.cpp operational in minutes. This significantly reduces the overhead typically associated with building from source and configuring ROCm environments, allowing more focus on the actual benchmarking.
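For readers who have not run ROCm workloads in a container before, the sketch below shows the general shape of such an invocation. The `--device /dev/kfd`, `--device /dev/dri`, and `--group-add video` options are the usual way to expose an AMD GPU to Docker. The image reference is deliberately left as a placeholder, because this review does not record the exact image name or tag published by the mixa3607/ML-gfx906 project. The assumption that the image accepts a command such as `llama-bench` directly is also ours, so check the project's README before relying on it.

```python
from typing import Dict, List, Optional

# Placeholder only: look up the real image name and tag in the project's README.
IMAGE_PLACEHOLDER = "<image-from-ML-gfx906-readme>"


def docker_run_argv(
    image: str,
    models_dir: str,
    inner_cmd: List[str],
    env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build a `docker run` argv that exposes an AMD GPU to the container."""
    argv = [
        "docker", "run", "--rm",
        "--device", "/dev/kfd",   # ROCm compute interface
        "--device", "/dev/dri",   # GPU render nodes
        "--group-add", "video",   # group that usually owns the GPU device files
        "-v", f"{models_dir}:/models:ro",  # mount GGUF files read-only
    ]
    for name, value in (env or {}).items():
        argv += ["-e", f"{name}={value}"]
    return argv + [image] + list(inner_cmd)
```

Passing environment variables with `-e` keeps ROCm settings such as `HSA_ENABLE_SDMA` scoped to a single benchmark run instead of baking them into the image.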

What's Interesting / What's Not

What stands out is the rigorous, almost scientific, approach to optimizing llama.cpp for a specific hardware and application stack. The explicit isolation of variables in the initial benchmark sections, followed by the stacking of promising configurations, is a robust methodology that many local LLM users overlook. The reported real-world impact (less than 1.2 seconds for Home Assistant voice commands and under 18 seconds for Frigate footage summaries) suggests that deep, systematic tuning can translate directly into a better user experience. The use of a pre-configured Docker container to abstract away the notorious complexity of ROCm setup on AMD GPUs is also a significant practical win, lowering the barrier to entry for similar optimizations.

What is less interesting, or rather, what limits the generalizability, is the lack of raw, normalized performance metrics (e.g., tokens/second) in the public post. While application-specific times are valuable for the founder's use case, they make direct comparison across different hardware or LLM tasks difficult. The reliance on Claude to generate the script and summarize results, while convenient, introduces a potential for unverified claims or subtle inaccuracies in the methodology description, though the provided detail appears coherent. Finally, the specific findings are highly tailored to the MI60/MI50 architecture and the chosen LLMs, meaning direct applicability to other GPU types or models may vary significantly, requiring a similar re-benchmarking effort.
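The gap between application timings and normalized metrics is easy to see with a rough model: end-to-end LLM time is approximately the prompt length divided by prompt-processing throughput, plus the number of generated tokens divided by generation throughput. The sketch below encodes that approximation. The function name is ours, and the model deliberately ignores everything outside the LLM call, such as speech-to-text, image encoding, and network overhead.

```python
def estimated_latency_s(
    n_prompt: int, n_gen: int, pp_tps: float, tg_tps: float
) -> float:
    """Approximate LLM latency from prompt and generation throughput.

    pp_tps: prompt-processing speed in tokens/second
    tg_tps: token-generation speed in tokens/second
    """
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("throughput values must be positive")
    return n_prompt / pp_tps + n_gen / tg_tps
```

As a purely hypothetical illustration, a 1,200-token prompt with 40 generated tokens at 800 and 50 tokens/second would take about 1.5 + 0.8 = 2.3 seconds. Without the underlying tokens/second figures and prompt sizes, a reported application time cannot be decomposed this way, which is why it is hard to carry over to other hardware or workloads.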

Pricing

This review focuses on optimizing open-source software (llama.cpp, Gemma, Qwen) on user-owned hardware. There are no direct costs associated with the tools or methodology described, beyond the initial hardware investment. The Docker container used is also open-source. (Pricing snapshot date: 2026-05-23)

Verdict

This benchmarking exercise makes a compelling case for systematic optimization of local LLM deployments, particularly on less-common hardware like AMD's MI60. The founder's methodical approach, varying parameters one by one before combining the most effective settings, is a blueprint others can reuse. For anyone running llama.cpp on AMD GPUs and seeing suboptimal performance, this process of tuning KV cache settings, ubatch size, CPU threads, and ROCm environment variables offers a clear, actionable path to better application responsiveness. The reported results for Home Assistant and Frigate suggest that dedicated optimization effort can pay off in practice.

What We'd Test Next

Our next steps would involve reproducing these benchmarks on similar AMD MI60/MI50 hardware to independently verify the reported performance improvements and quantify them with standard metrics like tokens/second. We would expand the test matrix to include a wider range of LLMs and quantization levels, particularly those popular in the local LLM community. A deeper dive into the specific interactions of GGML_ROCM_FORCE_MMQ and HSA_ENABLE_SDMA with different llama.cpp builds would be valuable. Finally, we would compare the optimized AMD MI60 performance against similarly priced or specced NVIDIA GPUs to provide a broader context for hardware selection in local LLM inference scenarios.
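A reproduction along these lines would compare tokens/second per test between a baseline and a tuned configuration. The sketch below assumes llama-bench was run with `-o json` and that each result object carries `n_prompt`, `n_gen`, and `avg_ts` fields, with `n_depth` present on newer builds. That matches our understanding of the tool's JSON output, but the field names should be confirmed against the build in use. The helper names are ours.

```python
import json
from typing import Dict, Tuple

# A test is identified by (prompt tokens, generated tokens, pre-fill depth).
TestKey = Tuple[int, int, int]


def throughput_by_test(raw_json: str) -> Dict[TestKey, float]:
    """Map each test in a llama-bench JSON result to its average tokens/second."""
    rows = json.loads(raw_json)
    return {
        (row["n_prompt"], row["n_gen"], row.get("n_depth", 0)): row["avg_ts"]
        for row in rows
    }


def speedups(baseline_json: str, candidate_json: str) -> Dict[TestKey, float]:
    """Return candidate/baseline throughput ratios for tests present in both runs."""
    base = throughput_by_test(baseline_json)
    cand = throughput_by_test(candidate_json)
    return {key: cand[key] / base[key] for key in base if key in cand}
```

A ratio above 1.0 means the candidate configuration is faster on that test. Keeping prompt processing and token generation as separate keys matters because a setting can help one and hurt the other.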

The investor read

This signal highlights the ongoing, significant investment of time and effort into optimizing local LLM inference, particularly on non-NVIDIA hardware. The challenges with ROCm setup, even on Ubuntu 24.04, underscore the need for robust, easy-to-deploy solutions for AMD GPUs. Tools or services that abstract away this complexity, like the mixa3607/ML-gfx906 Docker container, are highly valuable. For investors, this points to a growing market for performance tooling and specialized infrastructure layers that cater to the diverse hardware landscape of local AI. Companies that can provide automated, systematic benchmarking platforms or highly optimized, hardware-specific LLM runtimes (beyond just llama.cpp) could capture significant market share. The ability to unlock substantial performance on existing, potentially underutilized, hardware like the MI60 also suggests a strong demand for cost-effective local AI solutions, which could drive investment in specialized software rather than just new silicon.

Sources · how we verified

Did a 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32gb VRAM GPU - two models tested Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
