RDNA2 Flash Attention fix for llama.cpp doubles token generation speed
This review examines a custom llama.cpp build that bypasses an assert statement, enabling Flash Attention on AMD RDNA2 GPUs. We detail the reported performance gains and specific hardware/model compatibility.
TL;DR
Best for: AMD RDNA2 GPU owners (gfx1030/gfx1031) seeking to enable Flash Attention in llama.cpp for significantly faster local LLM inference, specifically with Qwen 3.6 35B and 27B models.
Skip if: You require broad model compatibility, absolute stability, or are not comfortable with custom builds and potential debugging on ROCm. This workaround is not for general-purpose local LLM use.
Bottom line: DiscipleofDeceit666's custom llama.cpp build reportedly delivers a substantial performance uplift on specific RDNA2 GPUs and Qwen models by bypassing an assert that otherwise blocks Flash Attention under ROCm.
METHODOLOGY
This v0 review draws on the author's published claims and technical details shared on Reddit by user DiscipleofDeceit666. The post, titled "RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed," was observed on 2026-05-19. We cover the specific assert bypass, the reported performance figures (30 tok/s on Vulkan vs 70-80 tok/s on the patched ROCm build), the provided GitHub repository link, and the build and server commands as presented by the author. This review does not include independent performance benchmarks, long-term workflow integration assessments, or testing of edge cases beyond what the author explicitly mentioned. Update cadence: This review will be re-tested when independent benchmarks become available or if claims diverge from observed community behavior.
WHAT IT DOES
Enables Flash Attention on RDNA2 GPUs
DiscipleofDeceit666's custom llama.cpp binary specifically targets AMD Radeon RDNA2 GPUs (gfx1030/gfx1031). The core functionality is to enable Flash Attention, a technique designed to accelerate transformer model inference, which is reportedly disabled in stock llama.cpp ROCm builds for these specific hardware configurations. The author claims this custom build is the "fastest possible setup" for these GPUs.
Bypasses a critical assert statement
The workaround addresses a specific crash encountered when attempting to run Flash Attention on ROCm with RDNA2 hardware. The error message, "GGML_ASSERT(max_blocks_per_sm > 0) failed", originates from ggml/src/ggml-cuda/fattn-common.cuh:1054 and is reported to occur because hipOccupancyMaxActiveBlocksPerMultiprocessor incorrectly returns 0 on these GPUs. The custom build bypasses this assert, allowing Flash Attention to execute. The author provides a GitHub repository, https://github.com/Minerest/llama.cpp_RDNA2_FlashAttnEnabled/releases/tag/mtp-fa-workaround, containing the patched build and technical findings.
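To make the control flow concrete, here is a minimal Python model of the difference between a hard assert and a bypass. It is only an illustration: the real code is HIP/C++ inside llama.cpp, the function names are hypothetical, and the fallback value of 1 is an assumption, since the source does not describe how the author's patch handles the zero value.

```python
# Illustrative model only; not the author's patch and not llama.cpp code.

def stock_check(reported_blocks_per_sm: int) -> int:
    """Mimic the stock behavior: abort when the occupancy query reports 0."""
    if reported_blocks_per_sm <= 0:
        raise RuntimeError("GGML_ASSERT(max_blocks_per_sm > 0) failed")
    return reported_blocks_per_sm


def bypassed_check(reported_blocks_per_sm: int, fallback: int = 1) -> int:
    """Hypothetical bypass: substitute an assumed fallback instead of aborting."""
    if reported_blocks_per_sm > 0:
        return reported_blocks_per_sm
    return fallback  # assumption: 1 block per multiprocessor is a safe minimum


if __name__ == "__main__":
    reported = 0  # the value HIP is said to return on gfx1030/gfx1031
    try:
        stock_check(reported)
    except RuntimeError as err:
        print("stock:", err)
    print("bypassed:", bypassed_check(reported))
```

A fallback like this keeps the kernel launch path alive but may pick a suboptimal launch configuration, which is one reason a fix at the HIP level would be preferable.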
Reported performance gains
According to the author, this custom build delivers significant performance improvements. Stock ROCm builds reportedly "[don't] run" with Flash Attention on this hardware, so the only available baseline is Vulkan at 30 tok/s; the patched build is reported at 70-80 tok/s. That is roughly a 2.3x to 2.7x speedup over Vulkan, and it enables Flash Attention on ROCm where it previously did not work at all.
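The speedup range follows directly from the reported figures. The short calculation below uses only the author's numbers (30 tok/s on Vulkan, 70-80 tok/s on the patched build); the variable and function names are ours.

```python
# Reported figures from the author's post; not independently measured.
VULKAN_TOK_S = 30.0
PATCHED_ROCM_TOK_S = (70.0, 80.0)


def speedup(new_rate: float, baseline_rate: float) -> float:
    """Return the throughput ratio of new_rate over baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return new_rate / baseline_rate


low, high = (speedup(rate, VULKAN_TOK_S) for rate in PATCHED_ROCM_TOK_S)
print(f"{low:.2f}x to {high:.2f}x")  # prints "2.33x to 2.67x"
```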
Specific build and server commands
The author provides the exact cmake command used to build the patched llama.cpp (llama-bench target) with ROCm support for gfx1030 and gfx1031 targets, including the -DGGML_FATTN_TRACE flag. Additionally, the llama-server flags are detailed, specifying parameters like --spec-type draft-mtp, -fa on, -ngl 50, -ts 16,10, and -c 64192, which are crucial for replicating the reported performance and functionality.
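As a rough sketch of how the reported server flags fit together, the snippet below assembles them into a single command line. The flag values are the ones quoted above; the model path and the use of -m to pass it are assumptions for illustration, since the source does not list the model argument. The -ts 16,10 tensor split suggests the author's setup spreads the model across two GPUs.

```python
import shlex

# Hypothetical placeholder: the source does not give a model path.
MODEL_PATH = "/path/to/model.gguf"


def build_server_argv(model_path: str) -> list[str]:
    """Assemble a llama-server command line from the flags the author reported."""
    return [
        "llama-server",
        "-m", model_path,            # assumption: model passed via -m
        "--spec-type", "draft-mtp",  # as reported by the author
        "-fa", "on",                 # Flash Attention enabled
        "-ngl", "50",                # layers offloaded to the GPU
        "-ts", "16,10",              # tensor split across devices
        "-c", "64192",               # context size in tokens
    ]


argv = build_server_argv(MODEL_PATH)
print(shlex.join(argv))
```

Building the command as a list rather than a string makes it straightforward to pass to subprocess.run without shell quoting issues.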
WHAT'S INTERESTING / WHAT'S NOT
What's interesting here is the direct, actionable solution to a specific hardware-software incompatibility. The author's willingness to diagnose and patch a low-level assert statement (GGML_ASSERT(max_blocks_per_sm > 0) failed) in llama.cpp to unlock a core performance feature like Flash Attention is a significant contribution to the local LLM community. The reported performance increase from 30 tok/s (Vulkan) to 70-80 tok/s (patched ROCm) is compelling, demonstrating that substantial gains are possible when hardware-specific optimizations are correctly implemented. The provision of both a pre-built binary and the exact build commands lowers the barrier to entry for other RDNA2 users facing the same issue. This is a pragmatic, engineering-focused fix rather than a high-level feature addition.
What's not interesting, or rather what raises concerns, is the limited scope of compatibility and stability. The author explicitly states that only Qwen 3.6 35B and 27B models are confirmed working, with Gemma crashing on larger contexts and Deepseek running very slowly. This suggests the workaround may be specific to certain model architectures or even specific model weights, which limits its general applicability. The author's warning, "Buyer Beware, local AI on rocm crash often," underscores that ROCm remains unstable for local AI even with this patch. The performance gain is impressive for Qwen, but without broader model support this is not a universal solution for RDNA2 owners running llama.cpp. The root cause also remains open: hipOccupancyMaxActiveBlocksPerMultiprocessor reportedly returning 0 on these GPUs is a HIP-level problem that the patch sidesteps and does not resolve at the driver or HIP level.
PRICING
This custom llama.cpp build and workaround are entirely free and open-source. The GitHub repository provides access to the patched code and releases at no cost. (Pricing snapshot: 2026-05-19)
VERDICT
For owners of AMD RDNA2 GPUs (specifically gfx1030/gfx1031) struggling to enable Flash Attention in llama.cpp with ROCm, DiscipleofDeceit666's custom build is a high-impact workaround. It bypasses a specific assert failure and unlocks a reported speedup of roughly 2.3x to 2.7x over Vulkan (30 tok/s to 70-80 tok/s) for compatible models such as Qwen 3.6 35B and 27B. However, this is a highly specialized solution. Its utility is constrained by limited model compatibility and the general instability of ROCm for local AI, as acknowledged by the author. If your primary use case involves Qwen models on RDNA2, this build is worth trying for immediate performance gains, with the caveat that the numbers are not independently verified. For broader model exploration or greater stability, alternative hardware or more mature software stacks may be necessary.
WHAT WE'D TEST NEXT
Our next steps would involve independent verification of the claimed performance benchmarks across various RDNA2 GPU models (e.g., RX 6700 XT, RX 6800, RX 6900 XT). We would conduct a systematic compatibility matrix test with a wider range of llama.cpp-supported LLMs, including different quantizations and context window sizes, to precisely map the limits of the workaround. Investigating the underlying HIP driver behavior that causes hipOccupancyMaxActiveBlocksPerMultiprocessor = 0 would also be crucial, potentially leading to a more robust, upstream fix. We would also assess long-term stability under sustained load and explore potential interactions with other llama.cpp features or ROCm versions.
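A compatibility matrix of this kind could be enumerated with a few lines of Python. This is a planning sketch only: the GPU and model names come from this review, 64192 is the author's context size, and the quantization labels and the smaller context size are our assumptions for illustration.

```python
from itertools import product

# Named in this review.
GPUS = ["RX 6700 XT", "RX 6800", "RX 6900 XT"]
MODELS = ["Qwen 3.6 35B", "Qwen 3.6 27B", "Gemma", "Deepseek"]

# Assumptions for illustration; the source does not state quantizations.
QUANTIZATIONS = ["Q4_K_M", "Q8_0"]
CONTEXT_SIZES = [8192, 64192]  # 64192 is the author's -c value; 8192 is ours


def build_test_matrix() -> list[dict]:
    """Enumerate every GPU/model/quantization/context combination to test."""
    return [
        {"gpu": gpu, "model": model, "quant": quant, "ctx": ctx}
        for gpu, model, quant, ctx in product(
            GPUS, MODELS, QUANTIZATIONS, CONTEXT_SIZES
        )
    ]


matrix = build_test_matrix()
print(len(matrix), "test cases")  # prints "48 test cases"
```

Each entry would then be run with and without Flash Attention, recording tok/s and whether the run crashed.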
Every claim ties to a primary source. See our methodology.