Custom Strix Halo + Dual 3090 eGPU NVLink Mod Boosts LLM Inference
This review analyzes Rattling33's custom hardware setup, combining a Strix Halo APU with dual NVIDIA RTX 3090 eGPUs via NVLink, and its performance implications for local LLM inference.
TL;DR
Best for: Developers and researchers running 27B or 31B dense LLMs locally who are bottlenecked by eGPU bandwidth and willing to undertake custom hardware modifications. By the author's account, the setup pays off most in multi-agent coding scenarios, where prompt processing speed is critical.
Skip if: Your primary focus is on power efficiency for very large models (e.g., 122B), where Strix Halo alone with llama.cpp proved more efficient. Also, skip if you require an off-the-shelf solution without custom cooling or NVLink bridge modifications.
Bottom line: Rattling33's custom Strix Halo and dual 3090 eGPU setup, enhanced with a 2-slot NVLink bridge and a DIY cooling mod, significantly improves prompt processing and token generation for specific dense LLMs by overcoming eGPU bandwidth limitations.
METHODOLOGY
This v0 review draws on the author's published claims and detailed setup description on Reddit; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior. The "tool" under review is a custom-engineered hardware and software configuration designed to maximize local LLM inference performance.
- Tool Name: Strix Halo with Dual 3090 eGPUs and NVLink Mod
- Version: Strix Halo (Bosgame M5), NVIDIA RTX 3090 (x2), 2-slot NVLink bridge, custom 3D-printed cooling duct, 120mm fans. Software includes Fedora 43, llama.cpp (build 9221), and vLLM.
- Date Observed: 2026-05-23
- Source Signal URL: https://www.reddit.com/r/LocalLLaMA/comments/1tkulbk/scrambling_to_max_strixhalo_nvlink_dual_egpu_3090/
- What's Covered: This review covers Rattling33's detailed description of the custom hardware integration, including the NVLink bridge modification and cooling solution. It examines the claimed performance improvements (PP/s, TG/s) for 27B and 31B dense models, the observed power efficiency trade-offs, and the specific software configurations used (llama.cpp with layer splitting, vLLM with various KV cache types). The review also includes the rationale behind the modifications and the author's specific findings regarding NVLink's impact on different software.
- What's NOT Covered: This review does not include independent performance benchmarks or long-term stability testing of the custom setup. Edge cases, detailed thermal profiles under various loads, and a comprehensive cost-benefit analysis beyond the NVLink bridge price are also outside the scope of this v0 assessment.
WHAT IT DOES
Rattling33's setup integrates a Strix Halo APU-based system with two external NVIDIA RTX 3090 GPUs to enhance local LLM inference capabilities. The core idea is to combine the Strix Halo's substantial 124GB UMA VRAM with the raw processing power and dedicated VRAM of dual 3090s.
Strix Halo as Base
The Bosgame M5 Strix Halo serves as the primary compute unit, offering 124GB of Unified Memory Architecture (UMA) VRAM. This substantial memory capacity is effective for running larger models like 122B via llama.cpp in a power-efficient manner, as observed by Rattling33. However, its prompt processing (PP/s) on smaller, dense models like 27B and 31B can be relatively slow, prompting the need for acceleration.
Dual 3090 eGPUs via NVMe
To overcome the Strix Halo's limitations on dense models, two NVIDIA RTX 3090 GPUs are connected as external GPUs (eGPUs) via NVMe PCIe 4x4 slots. Each card adds 24GB of dedicated VRAM (48GB across the pair) along with considerably more compute than the APU alone. The initial setup saw improvements, but the limited bandwidth of the eGPU links remained a bottleneck.
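As a rough illustration of what 48GB of dedicated VRAM means for these model sizes, the following sketch estimates the weight-only footprint of a dense model at a few bit widths. It is back-of-the-envelope arithmetic, not a measurement from the source: the 4GB headroom figure and the function names are assumptions made for this example, and the estimate ignores KV cache, activations, and runtime overhead, all of which grow with context length.

```python
# Weight-only footprint estimate for dense models on two RTX 3090s.
# Assumption: 1 GB = 1e9 bytes; KV cache and runtime overhead are NOT modeled.

GPU_VRAM_GB = 24   # dedicated VRAM per RTX 3090
NUM_GPUS = 2       # dual eGPU setup described in the review


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in the combined VRAM.

    headroom_gb is a hypothetical allowance for KV cache and buffers.
    """
    total_vram = GPU_VRAM_GB * NUM_GPUS
    return weight_footprint_gb(params_billion, bits_per_weight) + headroom_gb <= total_vram


if __name__ == "__main__":
    for params in (27, 31):
        for bits in (16, 8, 4):
            size = weight_footprint_gb(params, bits)
            print(f"{params}B @ {bits}-bit: {size:.1f} GB weights, "
                  f"fits in {GPU_VRAM_GB * NUM_GPUS} GB: {fits_in_vram(params, bits)}")
```

Under these assumptions, neither model fits on the two cards at 16-bit precision, while 8-bit and 4-bit variants leave room for context. The actual quantization used in the source setup is not stated here, so treat the bit widths as examples only.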
NVLink Bandwidth Mitigation
The crucial modification is a 2-slot NVLink bridge installed between the two 3090 eGPUs. Without it, traffic between the GPUs has to travel over each card's narrow eGPU link and through the host; the bridge gives the two cards a high-speed direct interconnect instead. Rattling33 reports that this modification can lead to "up to several times better PP/s and TG/s on small densed models", with the benefit showing up mainly in vLLM workloads.
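To give a sense of scale for the gap the bridge is meant to close, here is a small sketch comparing nominal link bandwidths. It rests on two assumptions that the source does not confirm: that "PCIe 4x4" means a PCIe 4.0 x4 link running at full speed (16 GT/s per lane, 128b/130b encoding), and that the bridge delivers NVIDIA's published 112.5 GB/s bidirectional NVLink figure for the RTX 3090. Real eGPU adapters may negotiate lower speeds, so these are upper bounds rather than observed numbers.

```python
# Nominal bandwidth comparison: PCIe 4.0 x4 eGPU link vs. RTX 3090 NVLink.
# All figures are theoretical peaks, not measurements from the reviewed setup.


def pcie_gb_per_s(gt_per_s: float, lanes: int,
                  enc_num: int = 128, enc_den: int = 130) -> float:
    """Per-direction PCIe bandwidth in GB/s after line-encoding overhead."""
    return gt_per_s * lanes * enc_num / enc_den / 8


# Assumption: each eGPU sits on a PCIe 4.0 x4 link (16 GT/s per lane).
PCIE4_X4_PER_DIR = pcie_gb_per_s(16.0, 4)

# NVIDIA's published RTX 3090 NVLink figure is 112.5 GB/s bidirectional.
NVLINK_3090_PER_DIR = 112.5 / 2

ratio = NVLINK_3090_PER_DIR / PCIE4_X4_PER_DIR

if __name__ == "__main__":
    print(f"PCIe 4.0 x4, per direction: {PCIE4_X4_PER_DIR:.2f} GB/s")
    print(f"NVLink (3090), per direction: {NVLINK_3090_PER_DIR:.2f} GB/s")
    print(f"Nominal ratio: {ratio:.1f}x")
```

A nominal ratio of this size is at least consistent with the "several times better" claim for communication-heavy workloads, but it says nothing about how much of a given workload is actually bound by inter-GPU traffic.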
Custom Cooling Mod
To accommodate the 2-slot NVLink bridge, which requires closer GPU spacing than standard 3-slot thick 3090 cards allow, Rattling33 implemented a custom cooling solution. This involved removing the stock 3-fan shroud from the top 3090 and attaching 120mm fans with a 3D-printed side blow duct. This ingenious modification not only allowed the NVLink bridge to fit but, surprisingly, resulted in the modded 3090 running at lower temperatures than the unmodded card below it.
WHAT'S INTERESTING / WHAT'S NOT
What's interesting about Rattling33's setup is the pragmatic ingenuity applied to a common local LLM inference challenge. The core problem, eGPU bandwidth limitations affecting performance on dense models, is tackled with a hands-on, hardware-level solution. The observation that a 2-slot NVLink bridge, typically used for internal multi-GPU setups, can be adapted for eGPUs with a custom cooling mod is a significant finding for the LocalLLaMA community. The reported "several times better PP/s and TG/s" on 27B and 31B dense models points to a tangible performance uplift that directly addresses the initial frustration with slower prompt processing on the Strix Halo alone. This approach offers a clear path for users seeking to maximize throughput for specific model sizes without investing in a full-blown server-grade system. The detailed account of vLLM's performance variations with different KV cache types and context lengths also provides actionable guidance for software configuration.
What's not explicitly detailed, but implied, is the level of effort and technical expertise required. The post gives a clear overview, but replicating the custom cooling mod and keeping it stable would demand significant DIY skills. The "several times better" claim also comes without specific benchmark numbers, which is understandable for a Reddit post but leaves readers with only a qualitative assessment. A crucial nuance is that NVLink did not benefit llama.cpp's layer splitting: Rattling33 saw PP/s decrease with the bridge, despite a TG/s gain. NVLink's advantages therefore depend heavily on the software stack and how it uses multi-GPU communication, which makes the bridge a targeted optimization for vLLM or similar frameworks rather than a universal fix.
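One plausible reason the software stack matters so much is the amount of data each splitting strategy moves between GPUs. The sketch below compares per-token inter-GPU traffic for a layer split (one hidden-state vector handed over at the GPU boundary) against a Megatron-style tensor-parallel split (two all-reduces per transformer layer). This is an illustrative model, not a description of how llama.cpp or vLLM are implemented internally, and the model dimensions (hidden size 5120, 64 layers, 2-byte activations) are hypothetical placeholders rather than the dimensions of the models in the source.

```python
# Per-token inter-GPU traffic: layer split vs. tensor-parallel split.
# Illustrative model only; dimensions below are hypothetical placeholders.


def layer_split_bytes_per_token(hidden_size: int, bytes_per_elem: int = 2,
                                num_gpus: int = 2) -> int:
    """One hidden-state vector crosses each boundary between consecutive GPUs."""
    return hidden_size * bytes_per_elem * (num_gpus - 1)


def tensor_parallel_bytes_per_token(hidden_size: int, num_layers: int,
                                    bytes_per_elem: int = 2,
                                    num_gpus: int = 2) -> float:
    """Assumes two all-reduces per layer (after attention and after the MLP).

    A ring all-reduce sends 2 * (n - 1) / n of the payload from each GPU.
    """
    payload = hidden_size * bytes_per_elem
    per_allreduce = 2 * (num_gpus - 1) / num_gpus * payload
    return 2 * num_layers * per_allreduce


if __name__ == "__main__":
    hidden, layers = 5120, 64  # hypothetical dense-model dimensions
    ls = layer_split_bytes_per_token(hidden)
    tp = tensor_parallel_bytes_per_token(hidden, layers)
    print(f"layer split:     {ls} bytes/token")
    print(f"tensor parallel: {tp:.0f} bytes/token")
    print(f"ratio:           {tp / ls:.0f}x")
```

Under these assumptions a tensor-parallel split moves on the order of a hundred times more data per token than a layer split, which would explain why a faster interconnect helps the former far more than the latter. The sketch does not attempt to explain the PP/s decrease Rattling33 observed with llama.cpp; that would need direct measurement.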
PRICING
- NVLink 2-slot bridge: Approximately $250, including customs fees.
- NVIDIA RTX 3090 GPUs: Acquired via the local second-hand market; no specific price provided.
- Strix Halo (Bosgame M5): No specific price provided.
Pricing snapshot: 2026-05-23
VERDICT
Rattling33's custom Strix Halo and dual 3090 eGPU setup, featuring an NVLink bridge and a custom cooling mod, is by the author's account a highly effective, albeit DIY, solution for accelerating local LLM inference. It is best suited for users targeting 27B or 31B dense models who are currently bottlenecked by eGPU bandwidth and are comfortable with hardware modifications. The reported gains in prompt processing and token generation, particularly with vLLM, make this a compelling approach for multi-agent coding scenarios or other applications demanding high throughput on these specific model sizes. However, for 122B models, the Strix Halo alone with llama.cpp offers better power efficiency. The NVLink modification does not improve every LLM software stack: with llama.cpp's layer splitting it showed limited benefit and even drawbacks. This setup demonstrates that targeted hardware modifications can yield substantial performance improvements, but it requires careful consideration of model size, software, and power efficiency goals.
WHAT WE'D TEST NEXT
Our next steps would involve a rigorous benchmarking effort to quantify the claimed performance improvements. We would measure specific PP/s and TG/s figures for 27B and 31B dense models across several configurations: Strix Halo alone, Strix Halo with a single 3090 eGPU, and the full dual 3090 eGPU setup both with and without the NVLink bridge. We would also measure and compare the actual power consumption for each configuration and model size to validate the power efficiency observations. Long-term thermal stability and noise levels of the custom cooling mod under sustained load would be assessed. Finally, we would expand testing to a wider array of LLM architectures and quantization levels, and compare the performance against a dedicated workstation with internal NVLink to establish a baseline for maximum theoretical performance.
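The test plan above can be sketched as a small harness that enumerates the configuration matrix and derives the three metrics of interest from raw timings. This is only an outline under stated assumptions: the configuration labels, the Run record, and its field names are invented for this example, and no timing or power values are supplied because none have been measured yet.

```python
# Sketch of the benchmark matrix and the metrics derived from each run.
# Labels and field names are hypothetical; no measured data is included.
from dataclasses import dataclass
from itertools import product

CONFIGS = [
    "strix-halo-only",
    "strix-halo+1x3090",
    "strix-halo+2x3090",
    "strix-halo+2x3090+nvlink",
]
MODELS = ["27B-dense", "31B-dense"]


@dataclass
class Run:
    prompt_tokens: int
    prefill_seconds: float    # time spent processing the prompt
    generated_tokens: int
    decode_seconds: float     # time spent generating output tokens
    avg_watts: float          # average wall power over the whole run

    @property
    def pp_per_s(self) -> float:
        """Prompt processing speed (PP/s)."""
        return self.prompt_tokens / self.prefill_seconds

    @property
    def tg_per_s(self) -> float:
        """Token generation speed (TG/s)."""
        return self.generated_tokens / self.decode_seconds

    @property
    def tokens_per_joule(self) -> float:
        """Generated tokens per joule consumed over prefill plus decode."""
        total_seconds = self.prefill_seconds + self.decode_seconds
        return self.generated_tokens / (self.avg_watts * total_seconds)


def test_matrix() -> list:
    """Every (configuration, model) pair the plan calls for."""
    return list(product(CONFIGS, MODELS))


if __name__ == "__main__":
    for config, model in test_matrix():
        print(f"pending: {model} on {config}")
```

Reporting tokens per joule alongside PP/s and TG/s keeps the power-efficiency comparison on the same footing as the speed comparison, which matters given the claim that the Strix Halo alone is the more efficient choice for 122B models.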