Tools·May 20, 2026

RDNA2 Flash Attention fix for llama.cpp doubles token generation speed


By Riley · Tools desk·Human-reviewed·✓ Verified May 20, 2026·5 min read·1 source

This review examines a custom llama.cpp build that bypasses an assert statement, enabling Flash Attention on AMD RDNA2 GPUs. We detail the reported performance gains and specific hardware/model compatibility.

TL;DR

Best for: AMD RDNA2 GPU owners (gfx1030/gfx1031) who want to enable Flash Attention in llama.cpp for faster local LLM inference, specifically with the Qwen 3.6 35B and 27B models.

Skip if: You need broad model compatibility or stability, or are not comfortable with custom builds and debugging on ROCm. This workaround is not for general-purpose local LLM use.

Bottom line: DiscipleofDeceit666's custom llama.cpp build reportedly raises token generation from 30 tok/s (Vulkan) to 70-80 tok/s on specific RDNA2 GPUs and Qwen models by bypassing an assert that blocks Flash Attention on ROCm.

METHODOLOGY

This v0 review draws on the author's published claims and technical details, shared on Reddit by user DiscipleofDeceit666. The post, titled "RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed," was observed on 2026-05-19. We cover the specific assert bypass, the reported performance figures (30 tok/s on Vulkan vs 70-80 tok/s on the patched build), the linked GitHub repository, and the build and server commands as presented by the author. This review does not include independent performance benchmarks, long-term workflow integration assessments, or testing of edge cases beyond what the author explicitly mentioned. Update cadence: This review will be updated when independent benchmarks become available or if the claims diverge from observed community behavior.

WHAT IT DOES

Enables Flash Attention on RDNA2 GPUs

DiscipleofDeceit666's custom llama.cpp binary specifically targets AMD Radeon RDNA2 GPUs (gfx1030/gfx1031). The core functionality is to enable Flash Attention, a technique designed to accelerate transformer model inference, which is reportedly disabled in stock llama.cpp ROCm builds for these specific hardware configurations. The author claims this custom build is the "fastest possible setup" for these GPUs.

Bypasses a critical assert statement

The workaround addresses a specific crash encountered when attempting to run Flash Attention on ROCm with RDNA2 hardware. The error, "GGML_ASSERT(max_blocks_per_sm > 0) failed", is raised at ggml/src/ggml-cuda/fattn-common.cuh:1054 and indicates that hipOccupancyMaxActiveBlocksPerMultiprocessor reports a value of 0, which the author treats as incorrect for this hardware. The custom build patches this assert so that Flash Attention can execute. The author provides a GitHub repository, https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround, containing the patched build and technical findings.
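To make the failure mode concrete, here is a toy Python model of the check described above. It is not the author's patch and not llama.cpp code: the function name resolve_blocks_per_sm, the strict parameter, and the fallback value of 1 are all assumptions chosen for illustration. The source only says that the assert is patched, not how.

```python
def resolve_blocks_per_sm(reported: int, strict: bool = True) -> int:
    """Toy model of an occupancy check (hypothetical, for illustration only).

    reported: the value an occupancy query hands back.
    strict:   True mimics the stock behaviour (abort on a non-positive value);
              False mimics one possible bypass (fall back to a minimum of 1).
    """
    if reported > 0:
        return reported
    if strict:
        # Mirrors the message quoted in the review.
        raise AssertionError("GGML_ASSERT(max_blocks_per_sm > 0) failed")
    # Assumption: continue with the smallest usable occupancy instead of aborting.
    return 1


if __name__ == "__main__":
    print(resolve_blocks_per_sm(4))                # a healthy report passes through
    print(resolve_blocks_per_sm(0, strict=False))  # bypassed: falls back to 1
```

The sketch also shows why this kind of change sidesteps the problem rather than resolving it: the reported value is still 0, and the code simply continues with a substitute.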

Reported performance gains

According to the author, this custom build delivers significant performance improvements. Stock ROCm builds reportedly "[don't] run" with Flash Attention on this hardware, and Vulkan achieves 30 tok/s, while the patched build achieves 70-80 tok/s. That is roughly 2.3x to 2.7x the Vulkan token generation rate, and it enables functionality that was previously unavailable on ROCm.
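The reported figures translate into a speedup range with simple division. The short Python check below uses only the numbers quoted by the author (30 tok/s on Vulkan, 70-80 tok/s on the patched build); the variable and function names are ours.

```python
# Reported throughput figures from the author's post (tokens per second).
vulkan_tps = 30.0
patched_low_tps, patched_high_tps = 70.0, 80.0


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new_tps / old_tps


low = speedup(patched_low_tps, vulkan_tps)
high = speedup(patched_high_tps, vulkan_tps)
print(f"{low:.2f}x to {high:.2f}x")  # prints "2.33x to 2.67x"
```

These are ratios of self-reported numbers, not independent measurements.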

Specific build and server commands

The author provides the exact cmake command used to build the patched llama.cpp (llama-bench target) with ROCm support for gfx1030 and gfx1031 targets, including the -DGGML_FATTN_TRACE flag. Additionally, the llama-server flags are detailed, specifying parameters like --spec-type draft-mtp, -fa on, -ngl 50, -ts 16,10, and -c 64192, which are crucial for replicating the reported performance and functionality.
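As a rough sketch, the server flags listed above can be assembled into a single command line. The Python snippet below only builds and prints the command; it does not launch anything. The binary path and model path are placeholders we made up, and the -m flag for the model file is our addition from standard llama.cpp usage. The remaining flags are the ones the author lists; the inline descriptions of them are our reading, not the author's.

```python
import shlex
from typing import List


def build_server_command(binary: str, model_path: str) -> List[str]:
    """Assemble a llama-server argv using the flags reported by the author."""
    return [
        binary,
        "-m", model_path,             # model file (placeholder path, our assumption)
        "--spec-type", "draft-mtp",   # speculative decoding type, as listed by the author
        "-fa", "on",                  # turn Flash Attention on
        "-ngl", "50",                 # number of layers to offload to the GPU
        "-ts", "16,10",               # tensor split across GPUs
        "-c", "64192",                # context size in tokens
    ]


# Hypothetical paths, for illustration only.
cmd = build_server_command("./build/bin/llama-server", "models/model.gguf")
print(shlex.join(cmd))
```

Note that -ts 16,10 splits tensors across two devices, so the reported setup appears to involve more than one GPU.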

WHAT'S INTERESTING / WHAT'S NOT

What's interesting here is the direct, actionable solution to a specific hardware-software incompatibility. The author's willingness to diagnose and patch a low-level assert statement (GGML_ASSERT(max_blocks_per_sm > 0) failed) in llama.cpp to unlock a core performance feature like Flash Attention is a significant contribution to the local LLM community. The reported performance increase from 30 tok/s (Vulkan) to 70-80 tok/s (patched ROCm) is compelling, demonstrating that substantial gains are possible when hardware-specific optimizations are correctly implemented. The provision of both a pre-built binary and the exact build commands lowers the barrier to entry for other RDNA2 users facing the same issue. This is a pragmatic, engineering-focused fix rather than a high-level feature addition.

What's less convincing is the limited scope of compatibility and stability. The author explicitly states that only the Qwen 3.6 35B and 27B models are confirmed working, with Gemma crashing on larger contexts and Deepseek running very slowly. This suggests the fix may be specific to certain model architectures or even specific model weights, which limits its general applicability. The author's warning, "Buyer Beware, local AI on rocm crash often," underscores the instability of ROCm for local AI, even with this patch. While the performance gain is impressive for Qwen, the lack of broader model support means this is not a universal solution for RDNA2 owners running llama.cpp. The root cause, hipOccupancyMaxActiveBlocksPerMultiprocessor reporting 0 for these GPUs, also remains unresolved: the patch sidesteps it rather than fixing it at the driver or HIP level.

PRICING

This custom llama.cpp build and workaround are entirely free and open-source. The GitHub repository provides access to the patched code and releases at no cost. (Pricing snapshot: 2026-05-19)

VERDICT

For owners of AMD RDNA2 GPUs (specifically gfx1030/gfx1031) struggling to enable Flash Attention in llama.cpp with ROCm, DiscipleofDeceit666's custom build is a high-impact workaround. It bypasses the assert failure that blocks Flash Attention, unlocking a reported increase from 30 tok/s on Vulkan to 70-80 tok/s (roughly 2.3x to 2.7x) for compatible models such as Qwen 3.6 35B and 27B. However, this is a highly specialized solution. Its utility is constrained by limited model compatibility and the general instability of ROCm for local AI, as acknowledged by the author. If your primary use case involves Qwen models on RDNA2, this build is a strong recommendation for immediate performance gains. For broader model exploration or greater stability, alternative hardware or more mature software stacks may be necessary.

WHAT WE'D TEST NEXT

Our next steps would involve independent verification of the claimed performance benchmarks across various RDNA2 GPU models (e.g., RX 6700 XT, RX 6800, RX 6900 XT). We would conduct a systematic compatibility matrix test with a wider range of llama.cpp-supported LLMs, including different quantizations and context window sizes, to precisely map the limits of the workaround. Investigating the underlying HIP driver behavior that causes hipOccupancyMaxActiveBlocksPerMultiprocessor = 0 would also be crucial, potentially leading to a more robust, upstream fix. We would also assess long-term stability under sustained load and explore potential interactions with other llama.cpp features or ROCm versions.
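One way to organize the compatibility matrix described above is to enumerate every combination of model, quantization, and context size up front. The Python sketch below is a hypothetical planning aid, not an existing test harness. The model names come from the author's post; the quantization labels and the 8192 context size are illustrative assumptions, while 64192 is the context size from the author's server flags.

```python
from itertools import product
from typing import Dict, List


def build_matrix(models: List[str], quants: List[str], contexts: List[int]) -> List[Dict]:
    """Return one test case per (model, quantization, context size) combination."""
    return [
        {"model": m, "quant": q, "ctx": c}
        for m, q, c in product(models, quants, contexts)
    ]


# Models mentioned in the author's post.
models = ["Qwen 3.6 35B", "Qwen 3.6 27B", "Gemma", "Deepseek"]
# Assumed values for illustration; the source does not specify quantizations.
quants = ["Q4_K_M", "Q8_0"]
# 64192 comes from the author's flags; 8192 is an assumed smaller baseline.
contexts = [8192, 64192]

matrix = build_matrix(models, quants, contexts)
print(len(matrix))  # prints 16
print(matrix[0])
```

Each entry would then be run against both the stock and patched builds, recording whether it crashes, and if not, its tok/s.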

Sources · how we verified

RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed (Reddit post by DiscipleofDeceit666)

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
