Tools·May 29, 2026

StepFun 3.7 Flash with llama.cpp on M5 Max: Performance at Scale

This review analyzes the prompt processing and token generation speeds of the StepFun 3.7 Flash model running via llama.cpp on an M5 Max, examining performance at context sizes from 0 to 65,536 tokens.

TL;DR

Best for: Local LLM development that needs responsive inference with context windows up to 16K tokens on M5 Max hardware. Moderate context lengths (32K-64K) remain usable, with reduced performance.

Skip if: Your primary workflow demands consistent, high-speed inference at very large context windows (beyond 64K, which the benchmark did not test), or your M5 Max has less than 128 GB of unified memory. The poster reports memory peaking at 120+ GB, so this setup already pushes the limits of a 128 GB machine.

Bottom line: StepFun 3.7 Flash, run with llama.cpp on an M5 Max, performs well at short to medium contexts but loses significant throughput at the largest tested context of 65,536 tokens.

METHODOLOGY

This v0 review draws on the original poster's published claims at https://www.reddit.com/r/LocalLLaMA/comments/1tqqebc/stepfun_37_flash_speed_benchmark_in_m5_max/; independent benchmarks pending. Update cadence: re-tested when claims diverge from observed behavior.

The review covers a performance benchmark of the StepFun 3.7 Flash model (no more specific version is stated than "3.7 Flash") run using a "day-0 shipped llama.cpp's branch." The benchmark was conducted by /u/Beamsters on an M5 Max machine equipped with 128 GB of unified memory. The model was loaded using Q4_K_S quantization.

The test evaluated prompt processing (PP, sometimes called prefill or preprocessing) and token generation (TG) speeds at six context sizes: 0, 2048, 8192, 16384, 32768, and 65536 tokens. For each context size, 128 tokens were generated with a batch size of 1. The reported metrics are total time for prompt processing (T_PP s), prompt processing speed (S_PP t/s), total time for token generation (T_TG s), token generation speed (S_TG t/s), total time (T s), and overall speed (S t/s).

This v0 review does NOT cover independent performance verification, long-term workflow integration, power consumption, or comparisons against other LLM models or inference engines. The analysis is limited to the data and observations provided by the original poster.
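The post does not spell out how the reported columns relate to one another. The following is a minimal sketch, assuming each speed is a token count divided by the matching time and that the overall speed counts prompt plus generated tokens over total time. The function name and the sample timings are hypothetical and are not figures from the benchmark.

```python
def derived_speeds(pp_tokens, tg_tokens, t_pp_s, t_tg_s):
    """Return (S_PP, S_TG, S) in tokens/second.

    Assumed relations (not stated in the source post):
      S_PP = PP / T_PP
      S_TG = TG / T_TG
      S    = (PP + TG) / (T_PP + T_TG)
    """
    s_pp = pp_tokens / t_pp_s if t_pp_s > 0 else 0.0
    s_tg = tg_tokens / t_tg_s if t_tg_s > 0 else 0.0
    total_s = t_pp_s + t_tg_s
    s_total = (pp_tokens + tg_tokens) / total_s if total_s > 0 else 0.0
    return s_pp, s_tg, s_total


# Hypothetical timings chosen for easy arithmetic, NOT benchmark data:
# 1000 prompt tokens in 2.0 s, then 100 generated tokens in 4.0 s.
s_pp, s_tg, s_total = derived_speeds(1000, 100, 2.0, 4.0)
print(f"S_PP={s_pp:.1f} t/s  S_TG={s_tg:.1f} t/s  S={s_total:.1f} t/s")
```

Under these assumptions the overall figure is a blend weighted heavily toward prompt processing whenever the prompt is much longer than the output, so it should not be read as the speed at which text appears on screen.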

WHAT IT DOES

Benchmarks local LLM performance

The /u/Beamsters post benchmarks the StepFun 3.7 Flash model running locally via llama.cpp on Apple's M5 Max hardware. It measures how fast the model processes input prompts (prompt processing) and how fast it generates new tokens (token generation) across a range of context sizes. This provides concrete data on how the model scales with increasing input length.

Quantized model inference

The benchmark explicitly states the use of "Q4_K_S" quantization, a roughly 4-bit format that reduces memory usage and can improve inference speed on consumer hardware like the M5 Max. Even so, the poster reports memory peaking at "~120+ GB" of the M5 Max's 128 GB of unified memory, which suggests that the quantized model is still substantial and pushes the limits of the hardware.

Scales across context windows

The data shows how StepFun 3.7 Flash performs at context sizes ranging from 0 tokens (pure generation) up to 65536 tokens. This scaling behavior matters for developers building applications with varying context needs, from short, immediate responses to processing extensive documents or codebases. The benchmark shows how prompt processing and generation speeds degrade as the context grows.

WHAT'S INTERESTING / WHAT'S NOT

The most interesting result is how throughput falls as the context grows, especially for prompt processing. Token generation speed (S_TG t/s) declines moderately over shorter contexts, from 62.80 t/s at 0 PP to 51.71 t/s at 16384 PP (about 18% lower), then drops further to 45.43 t/s at 32768 PP and 33.92 t/s at 65536 PP. The M5 Max can handle the large context, but generation at the largest tested size runs about 46% slower than the peak.
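The slowdown figures can be checked directly from the token generation speeds quoted above. This short sketch uses only those four reported values; the variable names are illustrative.

```python
# Token generation speed (t/s) by prompt length, as quoted from the post.
tg_speed = {0: 62.80, 16384: 51.71, 32768: 45.43, 65536: 33.92}

peak = tg_speed[0]

# Percentage slowdown relative to the zero-context peak.
slowdown_pct = {ctx: (1 - speed / peak) * 100 for ctx, speed in tg_speed.items()}

# Per-token latency in milliseconds, which is closer to what a user perceives.
ms_per_token = {ctx: 1000 / speed for ctx, speed in tg_speed.items()}

for ctx in tg_speed:
    print(f"{ctx:>6} PP: {tg_speed[ctx]:6.2f} t/s, "
          f"{slowdown_pct[ctx]:5.1f}% below peak, "
          f"{ms_per_token[ctx]:5.1f} ms/token")
```

Expressed as latency, the change is from roughly 16 ms per token with no context to roughly 29 ms per token at 65536 tokens.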

Prompt processing speed (S_PP t/s) declines much more steeply. It starts at 1056.65 t/s for 2048 tokens, falls to 730.52 t/s at 16384 tokens, and reaches 367.71 t/s at 65536 tokens. Feeding a 65K-token prompt therefore costs nearly three times as much per token as feeding a 2K-token prompt, and because the prompt is also 32 times longer, the total wait before the first generated token grows from roughly 2 seconds to roughly 3 minutes. The original poster's observation that "Short context < 16k feels fast and very responsive" is consistent with this data: the overall speed (S t/s) is still above 660 t/s at 16384 tokens. Beyond that, it drops to 488.39 t/s at 32768 tokens and 360.79 t/s at 65536 tokens, in line with the poster's "sluggish but still usable" assessment.
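To make the per-token figures concrete, the sketch below converts the quoted prompt processing speeds into wait times and then rebuilds the overall speed at 65536 tokens. It assumes that ingestion time is prompt length divided by S_PP and that overall speed is prompt plus generated tokens over total time; neither relation is stated in the post, though the rebuilt value lands on the reported 360.79 t/s.

```python
# Prompt processing speed (t/s) by prompt length, as quoted from the post.
pp_speed = {2048: 1056.65, 16384: 730.52, 65536: 367.71}

# Assumed: time to ingest the prompt = prompt tokens / S_PP.
ingest_s = {n: n / speed for n, speed in pp_speed.items()}

# Per-token cost ratio between the largest and smallest quoted prompts.
per_token_ratio = pp_speed[2048] / pp_speed[65536]

# Rebuild overall speed at 65536 tokens from the quoted S_PP and S_TG,
# assuming S = (PP + TG) / (T_PP + T_TG) with TG = 128 generated tokens.
TG_TOKENS = 128
tg_speed_65k = 33.92  # t/s, quoted from the post
total_s_65k = ingest_s[65536] + TG_TOKENS / tg_speed_65k
overall_65k = (65536 + TG_TOKENS) / total_s_65k

for n, seconds in ingest_s.items():
    print(f"{n:>6}-token prompt: about {seconds:.1f} s before the first token")
print(f"per-token cost ratio (65536 vs 2048): {per_token_ratio:.2f}x")
print(f"rebuilt overall speed at 65536: {overall_65k:.2f} t/s")
```

Because the 128 generated tokens add under 4 seconds to a wait of about 178 seconds, the overall figure at this size is almost entirely a measure of prompt processing.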

Several details are missing, which limits how far the data can be interpreted. The specific version of llama.cpp is not identified beyond "day-0 shipped llama.cpp's branch." The exact model size of StepFun 3.7 Flash is also not provided, although it would offer crucial context for the memory usage and performance figures. The benchmark includes no comparison to other models or to previous versions of StepFun, so it is difficult to assess whether "3.7 Flash" represents a significant improvement over its predecessors or competitors. The "Pelican bench" mentioned in the post is an image without numerical data, so it cannot be analyzed here.

PRICING

Neither the StepFun 3.7 Flash model nor the llama.cpp inference engine is a commercial product with listed pricing. StepFun 3.7 Flash is a model, and llama.cpp is an open-source project, so the source lists no direct cost for either software component. The primary cost is the underlying hardware, in this case an M5 Max with 128 GB of unified memory. Pricing snapshot date: 2026-05-29.

VERDICT

StepFun 3.7 Flash, run with llama.cpp on an M5 Max with 128 GB of memory, is best suited to local LLM development that prioritizes responsiveness at context windows up to 16K tokens. The benchmark data supports this: overall speed is still above 660 t/s at 16K tokens, matching the poster's "fast and very responsive" experience. Developers who need moderate context lengths (32K-64K) will find it usable but should expect a noticeable slowdown, with overall speed falling to 488.39 t/s at 32K and 360.79 t/s at 64K. The benchmark did not test beyond 65,536 tokens, but the trend and the reported memory peak suggest this setup is a poor fit for applications that need consistent, high-speed inference at larger contexts. The 128 GB of unified memory is a critical factor: the model uses nearly all of it, so systems with less memory would likely struggle.

WHAT WE'D TEST NEXT

Our next steps would involve a more comprehensive benchmarking suite. We would first aim to identify the exact model size of StepFun 3.7 Flash and the specific commit of llama.cpp used to ensure reproducibility. We would then test various quantization levels (e.g., Q8_0, Q5_K_M) to understand their impact on both memory footprint and performance. A crucial comparison would be against other popular open-source models (e.g., Llama 3 8B/70B, Mixtral) running on the same M5 Max hardware to establish a baseline for "good" performance. We would also investigate the impact of different batch sizes on token generation speed, and measure power consumption to assess the efficiency of running large contexts locally. Finally, we would run specific, real-world tasks (e.g., summarization of long documents, code generation from large repositories) to evaluate practical throughput and latency.

Pull quote: “Short context < 16k feels fast and very responsive.”

Sources · how we verified
  1. StepFun 3.7 Flash - Speed Benchmark in M5 Max

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.