Tools·Jun 12, 2026

Llama.cpp vs. LiteRT on Xiaomi 12 Pro: Mobile LLM Performance

By Riley · Tools desk·Human-reviewed·✓ Verified Jun 12, 2026·6 min read·1 source

This review benchmarks Llama.cpp and LiteRT on a custom-built Xiaomi 12 Pro 24/7 server, detailing bespoke cooling, power solutions, and comparative LLM inference speeds.

The Answer Up Front

For those building always-on, low-power local LLM servers from repurposed mobile hardware, Llama.cpp reportedly offers a more power-efficient and "gentle" CPU load profile, making it suitable for sustained operation. LiteRT is reported to generate slightly faster, but it pushes the hardware harder and draws noticeably more current. If your priority is raw inference speed on mobile silicon and you can manage the increased thermal and power demands, LiteRT may have the edge. For a balanced, long-term deployment on a custom mobile server, however, Llama.cpp appears to be the more practical choice given its reported efficiency.

Methodology

This v0 review draws on the founder Aromatic_Ad_7557's published claims and accompanying video/image artifacts on Reddit, accessed on 2026-05-23. Independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

The review covers the performance comparison of Llama.cpp and LiteRT (by Google) for running the gemma-4-E4B model on a custom Xiaomi 12 Pro (Snapdragon 8 Gen 1) 24/7 server. The hardware setup includes a custom cooling solution with copper heatsinks, thermal pads, and multiple fans, controlled to activate at 40°C and deactivate at 35°C. A custom power supply unit (PSU) was engineered, wired directly to the phone's Battery Management System (BMS) via a capacitor, incorporating input/output fuses and a 4.3V crowbar circuit for overvoltage protection. The entire system is housed in a 3D-printed case with an aluminum extrusion stand. The benchmark prompt used was "Write 2000 words IT essay."

What's not covered: Independent verification of performance numbers, long-term stability beyond the reported "week of testing" for the PSU, power consumption metrics for LiteRT, or a comprehensive analysis of various LLM models and quantization levels.

What It Does

Custom Mobile Server Architecture

Aromatic_Ad_7557 transformed a Xiaomi 12 Pro, powered by a Snapdragon 8 Gen 1 SoC, into a dedicated 24/7 headless AI server. This involved significant hardware modifications to ensure sustained operation. The original screen was removed, and the device was mounted onto an aluminum plate with thermal pads and two cooling fans. An additional copper heatsink with a fan was installed on the phone's back. These fans are thermally controlled, engaging at 40°C and disengaging at 35°C, aiming for stable operating temperatures.
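
The two-threshold fan behavior described above is a hysteresis loop: the gap between the on and off temperatures keeps the fans from rapidly toggling around a single setpoint. The source does not say how the control is implemented (it may well be a hardware thermostat), so the following is only a minimal Python sketch of the logic. The class name, the function name, and the choice of inclusive comparisons at exactly 40°C and 35°C are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FanController:
    """Two-threshold (hysteresis) fan control. Hypothetical helper, not the builder's code."""

    on_at_c: float = 40.0   # from the article: fans engage at 40°C
    off_at_c: float = 35.0  # from the article: fans disengage at 35°C
    running: bool = False

    def update(self, temp_c: float) -> bool:
        """Feed one temperature reading and return the resulting fan state."""
        if temp_c >= self.on_at_c:      # ASSUMPTION: inclusive at the upper threshold
            self.running = True
        elif temp_c <= self.off_at_c:   # ASSUMPTION: inclusive at the lower threshold
            self.running = False
        # Between the two thresholds the previous state is kept.
        return self.running


def simulate(readings):
    """Run a fresh controller over a sequence of temperature readings."""
    controller = FanController()
    return [controller.update(t) for t in readings]


# Illustrative readings only, not measurements from the source.
print(simulate([33.0, 38.0, 40.0, 38.0, 36.0, 35.0, 38.0]))
# [False, False, True, True, True, False, False]
```

Note that 38°C appears three times with different outcomes: the fans stay off while warming up, stay on while cooling down, and stay off again after the 35°C cutoff. That state dependence is the point of using two thresholds.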

Bespoke Power and Housing

The power supply unit is entirely custom-built for safety and reliability. It directly interfaces with the phone's battery management system (BMS) through a capacitor, bypassing the standard charging circuitry. Safety features include two fuses (input and output) and a 4.3V crowbar circuit designed to protect the phone from overvoltage. The PSU itself is passively cooled, with a backup fan that the founder reports was rarely needed during a week of testing. The entire assembly is housed in a custom 3D-printed case, mounted on an aluminum extrusion stand, and features an external power button for convenience.

Llama.cpp Performance

When running the gemma-4-E4B model with the prompt "Write 2000 words IT essay," Llama.cpp achieved a prompt processing speed of 30.6 tokens per second (t/s) and a generation speed of 5.7 t/s. The founder notes that Llama.cpp resulted in a "gentle" CPU load and a lower amp draw from the custom PSU, suggesting efficient resource utilization.
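
To put the reported rates in practical terms, the sketch below estimates how long a 2000-word response would take at 30.6 t/s prompt processing and 5.7 t/s generation. Only those two rates come from the source. The tokens-per-word ratio and the prompt token count are rough assumptions, and the estimate ignores any slowdown as the context grows or the device warms up.

```python
PROMPT_TPS = 30.6  # reported prompt processing speed (tokens/s)
GEN_TPS = 5.7      # reported generation speed (tokens/s)


def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prompt_tps: float = PROMPT_TPS,
                     gen_tps: float = GEN_TPS) -> float:
    """Prompt processing time plus generation time, assuming constant rates."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


TOKENS_PER_WORD = 1.3  # ASSUMPTION: rough English average, varies by tokenizer
PROMPT_TOKENS = 8      # ASSUMPTION: approximate size of the short benchmark prompt

output_tokens = round(2000 * TOKENS_PER_WORD)  # 2600 tokens for 2000 words
total = estimate_seconds(PROMPT_TOKENS, output_tokens)
print(f"{output_tokens} tokens, about {total / 60:.1f} minutes")
# 2600 tokens, about 7.6 minutes
```

Under these assumptions generation dominates: the prompt is processed in a fraction of a second, while the essay itself takes several minutes.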

LiteRT Performance

LiteRT, Google's runtime, was also benchmarked with the same model and prompt. While the founder claims "slightly faster generation," specific numerical benchmarks for LiteRT's prompt and generation speeds were not provided in the source. The founder observed that LiteRT "maxes out the CPUs" and exhibited a "noticeably higher" amp draw from the PSU, indicating a more aggressive use of the mobile SoC's resources compared to Llama.cpp.

What's Interesting / What's Not

The most compelling aspect of this project is the sheer engineering effort dedicated to repurposing consumer mobile hardware for a specialized, always-on LLM inference server. The detailed custom cooling and power supply solutions, particularly the 4.3V crowbar circuit and direct BMS wiring, demonstrate a deep understanding of mobile device power management and a commitment to reliability for 24/7 operation. This goes beyond simple software benchmarking, addressing the fundamental thermal and power constraints that typically limit mobile SoC performance in sustained workloads. The explicit thermal thresholds (40°C/35°C) provide a concrete target for stable operation.

What's less illuminating is the lack of specific, comparable benchmark numbers for LiteRT. While the founder claims "slightly faster generation," the absence of concrete t/s figures for LiteRT makes a direct quantitative comparison difficult. The qualitative observation of "maxes out the CPUs" and "noticeably higher amp draw" for LiteRT is useful for understanding its resource footprint but doesn't allow for a precise performance-per-watt analysis against Llama.cpp. Furthermore, the benchmark is limited to a single prompt and model, which provides a snapshot but not a comprehensive view of either runtime's capabilities across varying workloads or model sizes. The applicability of such a custom setup for the average user is also low, positioning this more as a proof-of-concept for enthusiasts rather than a readily deployable solution.
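
The performance-per-watt comparison that the source data cannot support would reduce to tokens per joule: generation speed divided by power draw, where power is supply voltage times current. The sketch below shows that arithmetic. Only the 5.7 t/s figure is from the source. Every voltage, current, and LiteRT speed value is a hypothetical placeholder chosen to illustrate the calculation, not a measurement.

```python
def tokens_per_joule(tokens_per_second: float, volts: float, amps: float) -> float:
    """Generation efficiency: tokens/s divided by watts (joules/s)."""
    watts = volts * amps
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return tokens_per_second / watts


LLAMA_CPP_TPS = 5.7  # reported generation speed

# HYPOTHETICAL placeholders: the source gives no voltage, current, or LiteRT t/s.
HYPOTHETICAL_VOLTS = 4.0
HYPOTHETICAL_LLAMA_AMPS = 1.5
HYPOTHETICAL_LITERT_TPS = 6.0
HYPOTHETICAL_LITERT_AMPS = 2.0

llama_eff = tokens_per_joule(LLAMA_CPP_TPS, HYPOTHETICAL_VOLTS, HYPOTHETICAL_LLAMA_AMPS)
litert_eff = tokens_per_joule(HYPOTHETICAL_LITERT_TPS, HYPOTHETICAL_VOLTS,
                              HYPOTHETICAL_LITERT_AMPS)
print(f"llama.cpp: {llama_eff:.2f} tokens/J, LiteRT: {litert_eff:.2f} tokens/J")
```

With these placeholder inputs the slightly faster runtime comes out less efficient per joule, which is the kind of outcome the qualitative amp-draw observations hint at but cannot confirm without measured current and t/s for both runtimes.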

Pricing

Llama.cpp and LiteRT are both open-source projects, available at no cost. The "pricing" for this setup is primarily the cost of the Xiaomi 12 Pro (or similar mobile device), custom components (heatsinks, fans, PSU components, 3D printing materials), and the significant engineering time required for assembly and configuration. Pricing snapshot date: 2026-05-23.

Verdict

For individuals or small teams looking to deploy local LLMs on repurposed mobile hardware for continuous operation, Llama.cpp appears to be the safer choice on the evidence available. Its reported "gentle" CPU load and lower amp draw make it better suited to the thermal and power constraints of a 24/7 mobile server, which should favor long-term stability, though that has not been independently verified. While LiteRT might offer a marginal speed advantage, its higher resource utilization and the lack of specific benchmark data make it a less clear recommendation for sustained, low-power applications. The engineering effort behind this custom server highlights the potential for edge AI, but also the significant hurdles in making such solutions broadly accessible.

What We'd Test Next

A v2 review would focus on a more rigorous, quantitative comparison. We would independently benchmark both Llama.cpp and LiteRT across a suite of models (e.g., Llama 3, Mistral, Gemma variants) and quantization levels (Q4, Q8). Crucially, we would measure both prompt and generation tokens per second for LiteRT to enable a direct comparison. Power consumption (watts) for both runtimes during inference would be measured to assess efficiency, alongside sustained thermal performance over several hours. We would also investigate the impact of different prompt lengths and output sizes on performance and stability for both runtimes.

The investor read

This project signals a growing interest in pushing AI inference to the extreme edge, leveraging ubiquitous, powerful mobile SoCs. While Aromatic_Ad_7557's setup is a highly specialized DIY endeavor, it underscores the market demand for efficient, low-power local LLM solutions. Companies like Qualcomm (with Snapdragon) and Google (with LiteRT) are already investing in optimizing their silicon and software for on-device AI. The investable opportunity lies not in custom phone servers themselves, but in software layers that abstract away this hardware complexity, or in purpose-built, low-cost edge AI devices that offer similar performance characteristics without the DIY overhead. This also highlights a potential niche for specialized cooling and power management solutions for edge AI hardware, or even services that pre-configure and deploy such systems. This is a deliberate small/bootstrapped play, demonstrating technical prowess rather than immediate market scalability.

Sources · how we verified

Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign) ↗

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.

