Tools·Jun 2, 2026

DDR5 bandwidth limits dual-LLM inference on APUs: Benchmarks inside

This review analyzes a deep dive into LLM performance on shared-memory APU hardware, specifically examining how DDR5 bandwidth limitations impact the viability of multi-model agent architectures.

TL;DR
Best for: Understanding hardware bottlenecks for LLM inference on APUs, single MoE LLM deployments.
Skip if: You require efficient concurrent inference of multiple LLMs on shared-memory systems for agentic workflows.
Bottom line: DDR5 bandwidth is the primary bottleneck for running multiple LLMs concurrently on APUs, even with Mixture of Experts models, making dual-LLM agent architectures impractical.

METHODOLOGY

This v0 review draws on the founder's published claims and benchmarks at dev.to. Independent benchmarks are pending; we will re-test on our regular update cadence if the claims diverge from observed behavior. The review covers an analysis of LLM inference performance on specific APU hardware, detailing the observed tokens per second and memory bandwidth constraints. The source signal provides specific hardware and software configurations, including command-line outputs and benchmark screenshots.

Tool/Hardware Tested: The primary hardware under review is the Minisforum UM790Pro, equipped with an AMD Ryzen 9 7940HS CPU (Zen 4, 8C/16T), an AMD Radeon 780M iGPU (12 RDNA 3 compute units), and 96 GB DDR5-5600 RAM, providing approximately 80 GB/s bandwidth. The GPU memory pool is described as 2 GB dedicated VRAM plus 46 GB GTT, totaling 48 GB GPU-accessible memory. Ollama was used as the LLM management tool.

Models Tested: The author tested four models: qwen3.6:35b (a Mixture of Experts model), gemma4-e2b-abliterated, qwen3:4b-instruct, and qwen2.5:1.5b. Tests involved running each model individually and attempting to run them concurrently.

Measurements: Key measurements included observed tokens per second for single-model inference and the impact on performance when attempting to load and run multiple models, specifically highlighting the role of DDR5 bandwidth.

Source Signal URL: https://dev.to/josh_green_dev/why-ddr5-bandwidth-kills-dual-llm-inference-on-apus-benchmarks-inside-42p1

What's Covered: This review covers the founder's claims regarding MoE model efficiency, the specific hardware configuration of the Minisforum UM790Pro, the use of Ollama for LLM management, and the empirical benchmarks demonstrating DDR5 bandwidth as a bottleneck for dual-LLM inference on APUs.

What's NOT Covered: This review does not include independent performance benchmarks, long-term workflow analysis, or an exhaustive exploration of edge cases beyond what the source signal provides. We have not independently verified the claims of 17.8 tokens/second or 80 GB/s bandwidth, but quote them as reported.

WHAT IT DOES

Ollama for LLM Management

Josh Green, the author, uses Ollama to manage and run models on his local machine. Models such as qwen3.6:35b and gemma4-e2b-abliterated are fetched with ollama pull, and ollama ps shows which models are currently loaded into memory and whether they reside on the GPU or CPU. Crucially for these benchmarks, Ollama accepts num_gpu as a model parameter; setting it to 0 forces CPU-only inference, which gives precise control over resource allocation during testing.
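As a rough illustration of the CPU-only control described above, the following Python sketch builds a request for Ollama's local REST endpoint (/api/generate) with num_gpu set to 0 under options, and derives tokens per second from the eval_count and eval_duration fields of the reply. The source does not say how the author passed the parameter, so the endpoint address, the helper names, and the choice of the REST API over a Modelfile are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, cpu_only: bool = False) -> dict:
    """Build a non-streaming generate request.

    With cpu_only=True, num_gpu=0 asks Ollama to offload no layers to the GPU.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    if cpu_only:
        payload["options"] = {"num_gpu": 0}
    return payload


def tokens_per_second(response: dict) -> float:
    """Decode speed from a generate reply; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(payload: dict, url: str = OLLAMA_URL) -> dict:
    """Send the request to a running Ollama server and parse the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With a server running, tokens_per_second(generate(build_request("qwen3:4b-instruct", "Say hello.", cpu_only=True))) would report the decode rate for a CPU-only run; comparing it with cpu_only=False is one way to reproduce the kind of placement test the source describes.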

APU Hardware Configuration

The core hardware for this experiment is the Minisforum UM790Pro. It features an AMD Ryzen 9 7940HS CPU and an AMD Radeon 780M integrated GPU. The system is equipped with 96 GB of DDR5-5600 RAM, which the author notes provides approximately 80 GB/s of bandwidth. The GPU-accessible memory pool is substantial at 48 GB (2 GB dedicated VRAM + 46 GB GTT), but it is carved directly from the main DDR5 system memory. This shared memory architecture means all operations—CPU inference, GPU inference, KV caches, and OS functions—contend for the same 80 GB/s pipe.
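For context on the roughly 80 GB/s figure, the theoretical peak of a DDR5-5600 configuration follows from the transfer rate and the bus width. The sketch below assumes a dual-channel layout with 64 bits per channel, which the source does not state explicitly; the function name is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int,
                        channels: int = 2,
                        bus_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: dual-channel DDR5-5600, 64 bits per channel.
theoretical = peak_bandwidth_gb_s(5600)
reported = 80.0  # GB/s, as quoted from the source

print(f"theoretical peak: {theoretical:.1f} GB/s")        # 89.6 GB/s
print(f"reported figure: {reported / theoretical:.0%} of peak")  # 89% of peak
```

Under that assumption, the reported 80 GB/s sits a little below the theoretical ceiling, which is the usual relationship between a sustained figure and a peak one.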

MoE Model Efficiency

The source highlights the architecture of the qwen3.6:35b model. Despite its 35-billion-parameter count, it is a Mixture of Experts (MoE) model with 256 total experts, of which only 8 are activated per token. According to the author, this design, combined with State Space Model (SSM) components, makes its per-token compute cost comparable to that of a much smaller 4-5B-parameter dense model. The author observed 17.8 tokens/second with qwen3.6:35b running on its own, fast enough for interactive work on this hardware.
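The sparsity described above is easy to quantify. This sketch uses only the expert counts reported in the source; it says nothing about attention, shared layers, or SSM components, whose sizes the source does not give, so it should not be read as the model's full active-parameter ratio.

```python
def active_expert_fraction(active: int, total: int) -> float:
    """Fraction of experts that run for each token in an MoE layer."""
    if not 0 < active <= total:
        raise ValueError("active must be between 1 and total")
    return active / total


frac = active_expert_fraction(8, 256)  # counts as reported for qwen3.6:35b
print(f"{frac:.3%} of experts run per token (1 in {round(1 / frac)})")
# 3.125% of experts run per token (1 in 32)
```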

Shared Memory Bottleneck

The central finding is that the shared DDR5-5600 bandwidth is the critical limiting factor for running multiple LLMs concurrently on APUs. While the 48 GB GPU-accessible memory pool is large enough to hold multiple models, the 80 GB/s bandwidth is insufficient to sustain high-performance inference for more than one LLM at a time. This bottleneck prevents the efficient execution of multi-model agent architectures, even when one of the models is an efficient MoE variant.
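One way to see why capacity alone does not help is a back-of-the-envelope, bandwidth-bound model: if decoding is limited purely by memory traffic, tokens per second cannot exceed the available bandwidth divided by the bytes streamed per token. The sketch below applies that simplification to the reported figures (80 GB/s and 17.8 tokens/second). It assumes the single-model run was already bandwidth-bound and that two concurrent models split bandwidth evenly; both are assumptions for illustration, not measurements from the source.

```python
def max_bytes_per_token(bandwidth_bytes_s: float, tokens_per_s: float) -> float:
    """Upper bound on bytes streamed per token at a given decode rate."""
    return bandwidth_bytes_s / tokens_per_s


def bandwidth_bound_tokens_per_s(bandwidth_bytes_s: float,
                                 bytes_per_token: float) -> float:
    """Decode-rate ceiling when memory traffic is the only limit."""
    return bandwidth_bytes_s / bytes_per_token


BANDWIDTH = 80e9         # bytes/s, as reported in the source
SINGLE_MODEL_TPS = 17.8  # tokens/s, as reported for qwen3.6:35b alone

implied = max_bytes_per_token(BANDWIDTH, SINGLE_MODEL_TPS)
print(f"at most {implied / 1e9:.2f} GB streamed per token")  # 4.49 GB

# Assumption: a second model takes half of the shared bandwidth.
shared = bandwidth_bound_tokens_per_s(BANDWIDTH * 0.5, implied)
print(f"ceiling with half the bandwidth: {shared:.1f} tokens/s")  # 8.9
```

In this simplified model, any bandwidth a second model consumes comes directly out of the first model's token rate, regardless of how much of the 48 GB pool remains free.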

WHAT'S INTERESTING / WHAT'S NOT

The most interesting part of this analysis is the empirical demonstration that MoE models, despite their large total parameter counts, can perform like much smaller dense models because of sparse activation. The author's observation of qwen3.6:35b running at 17.8 tokens/second on an APU is a concrete data point for local LLM enthusiasts. The explicit identification of DDR5 bandwidth as the primary bottleneck for multi-LLM inference on APUs is the key insight: it shifts the focus from raw compute or VRAM capacity to the often-overlooked memory subsystem, and gives a clear architectural reason why multi-model agent pipelines are impractical on this hardware.

Sources · how we verified
  1. Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.