Sigilant-sweep CLI benchmarks LLM configs, finds Q4_K_M beats Q8_0 on TTFT
This review examines Sigilant-sweep, an open-source CLI tool designed to benchmark various LLM configurations against specific hardware setups. It covers the tool's methodology, reported performance metrics, and key findings.
TL;DR
Best for: Local LLM operators and developers seeking to optimize inference performance by systematically testing quantization, KV cache, and context size configurations on their specific hardware.
Skip if: Your primary concern is comprehensive LLM quality evaluation, including aspects like tool calling or structured JSON validity, which Sigilant-sweep does not yet measure in its scoring.
Bottom line: Sigilant-sweep provides a focused, open-source solution for identifying optimal LLM runtime configurations, offering a significant advantage over generic benchmarks by enabling hardware-specific performance tuning.
METHODOLOGY
This v0 review draws on claims published by the founder, diptanshu1991, in a Reddit post. Independent benchmarks are pending. We observed Sigilant-sweep, an open-source CLI tool, on 2026-05-28. The review covers the founder's stated methodology, the reported benchmark results, and the tool's features as described in the Reddit post and implied by the linked GitHub repository (https://github.com/sigilantlabs/sigilant-sweep/). Specifically, we examine the tool's approach to testing 16 distinct configurations (combinations of quantizations, KV cache settings, and context sizes) across multiple trials, and we analyze the reported metrics: Tokens Per Second (TPS), Time To First Token (TTFT), and Perplexity (PPL) on a fixed 3,300-token corpus. Not covered: independent performance verification, long-term workflow integration, and edge cases beyond what the founder reported. Update cadence: this review will be re-tested when claims diverge from observed behavior or when a significant new version is released.
WHAT IT DOES
Automated config sweeps
Sigilant-sweep is an open-source command-line interface (CLI) tool that automates the process of benchmarking different LLM configurations. It tests 16 pre-defined combinations of quantization levels, KV cache settings, and context sizes. Users specify the number of trials for each configuration, and the tool executes these tests against supported LLM backends like llama.cpp and vLLM.
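One way to picture a 16-configuration sweep is as a Cartesian product of the three axes. The sketch below assumes a 4 x 2 x 2 split and uses placeholder axis values; the source gives only the total of 16, so every name and value here is hypothetical and is not taken from the tool's code.

```python
from itertools import product

# Hypothetical axis values: the source states only the total (16), not the split.
QUANTS = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]
KV_CACHE_TYPES = ["f16", "q8_0"]
CONTEXT_SIZES = [4096, 8192]


def build_grid():
    """Return every (quant, kv_cache, ctx) combination as a dict."""
    return [
        {"quant": q, "kv_cache": kv, "ctx": ctx}
        for q, kv, ctx in product(QUANTS, KV_CACHE_TYPES, CONTEXT_SIZES)
    ]


grid = build_grid()
print(len(grid))  # 16
```

Each dict would then be handed to whichever backend runner is in use, once per trial.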
Deterministic benchmarking
The founder, diptanshu1991, highlights a significant effort to achieve deterministic results. Initial runs showed inconsistent winners, which was addressed by implementing deterministic shuffling through cyclic offset. This approach reportedly stabilizes results, making them reliable 9 out of 10 times for a given hardware and backend setup. This focus on reproducibility is critical for trustworthy performance comparisons.
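The post does not spell out the exact scheme, but one plausible reading of "deterministic shuffling through cyclic offset" is that each trial starts the sweep at a different, predictable position, so that no configuration always runs first or last. The following minimal sketch shows that idea; the function name and the rotate-by-trial-index rule are assumptions, not the tool's confirmed implementation.

```python
def trial_order(configs, trial_index):
    """Rotate the config list left by trial_index positions (cyclic offset).

    The same (configs, trial_index) pair always yields the same order,
    so a rerun visits configurations in an identical sequence.
    """
    if not configs:
        return []
    offset = trial_index % len(configs)
    return configs[offset:] + configs[:offset]


configs = ["A", "B", "C", "D"]  # placeholder config names
for trial in range(4):
    print(trial, trial_order(configs, trial))
```

Unlike a random shuffle, this ordering needs no seed management, and over a number of trials equal to the number of configurations each one occupies every position exactly once.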
Performance metrics and scoring
The tool measures several key performance indicators: Tokens Per Second (TPS), Time To First Token (TTFT), and Perplexity (PPL). PPL is calculated on a fixed 3,300-token mixed-domain corpus. After all trials, Sigilant-sweep computes p50 and p95 values for TPS and TTFT for each configuration. These metrics are then normalized and combined into a final score, which is a weighted average based on user-selected profiles (balanced, latency, or quality). The CLI also flags low confidence when the top two configurations are within a 1% score gap, suggesting more trials may be needed.
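The scoring pipeline described above (percentiles, normalization, weighted average, low-confidence flag) can be sketched roughly as follows. The source does not give the normalization method, the profile weights, or whether the 1% gap is absolute or relative, so the min-max scaling, the weights in PROFILES, and the absolute 0.01 threshold are all assumptions made for illustration. The sample numbers are made up and are not benchmark results.

```python
from statistics import quantiles


def p50_p95(samples):
    """Return (p50, p95) of a list with at least two samples."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94]


def normalize(values, higher_is_better):
    """Min-max scale to [0, 1]; invert when a lower raw value is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Hypothetical weights: the real profile weights are not given in the source.
PROFILES = {
    "balanced": {"tps": 0.4, "ttft": 0.3, "ppl": 0.3},
    "latency": {"tps": 0.3, "ttft": 0.6, "ppl": 0.1},
    "quality": {"tps": 0.2, "ttft": 0.2, "ppl": 0.6},
}


def score(results, profile="balanced"):
    """Rank configs. results maps name -> {"tps", "ttft", "ppl"} summary values."""
    names = list(results)
    weights = PROFILES[profile]
    norm = {
        "tps": normalize([results[n]["tps"] for n in names], True),
        "ttft": normalize([results[n]["ttft"] for n in names], False),
        "ppl": normalize([results[n]["ppl"] for n in names], False),
    }
    scores = {
        n: sum(weights[m] * norm[m][i] for m in weights)
        for i, n in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def low_confidence(ranked, gap=0.01):
    """Flag when the top two scores are within `gap` of each other."""
    return len(ranked) >= 2 and (ranked[0][1] - ranked[1][1]) <= gap


# Made-up summary values, for illustration only.
example = {
    "cfg_a": {"tps": 50.0, "ttft": 200.0, "ppl": 6.0},
    "cfg_b": {"tps": 40.0, "ttft": 400.0, "ppl": 5.8},
}
ranked = score(example, "balanced")
print(ranked, low_confidence(ranked))
```

Switching the profile changes only the weights, which is why the same raw measurements can produce a different winner under "quality" than under "latency".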
Context depth profiling
Sigilant-sweep includes a depth profile mode designed to assess how performance changes with increasing context length. This mode specifically tests TPS and TTFT at prompt lengths of 8k, 14k, and 28k. This allows users to understand which configurations maintain optimal performance as context size grows, while perplexity remains measured on the same fixed corpus across all passes.
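To make the depth profile concrete, here is a rough sketch of how TTFT and decode TPS could be timed from any token stream at several prompt lengths. The measure_stream helper and the fake_backend generator are hypothetical stand-ins, not the tool's API, and the depths 8_000, 14_000, and 28_000 are approximations of the "8k, 14k, and 28k" figures from the post.

```python
import time


def measure_stream(token_stream):
    """Return (ttft_seconds, decode_tps) for an iterable that yields tokens."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.perf_counter()  # arrival time of the first token
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = end - first
    # TPS over the decode phase only, excluding the first token's latency.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps


def fake_backend(prompt_tokens, n_out=20):
    """Stand-in generator: a prefill delay that grows with prompt length."""
    time.sleep(prompt_tokens * 1e-6)  # pretend prefill cost
    for i in range(n_out):
        time.sleep(0.001)  # pretend per-token decode cost
        yield i


for depth in (8_000, 14_000, 28_000):
    ttft, tps = measure_stream(fake_backend(depth))
    print(f"{depth:>6} prompt tokens: TTFT={ttft * 1000:.1f} ms, TPS={tps:.0f}")
```

In a real run the generator would wrap a streaming call to the backend under test, and each depth would be repeated over several trials before percentiles are taken.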
WHAT'S INTERESTING / WHAT'S NOT
What's interesting about Sigilant-sweep is its direct attack on a common problem in the local LLM space: the lack of hardware-specific, actionable benchmarks. Generic benchmarks often fail to account for the nuances of individual hardware setups, leading to suboptimal performance for users. Sigilant-sweep's core value proposition is enabling users to find the optimal config for their specific hardware, which is a meaningful improvement over relying on broad recommendations. The emphasis on deterministic results, achieved through cyclic offset, is a strong technical point. Benchmarking is notoriously difficult to make consistent, and the founder's explicit mention of tackling this challenge, and reporting 9/10 stability, adds credibility to the tool's output. The feature to flag low confidence when top configurations are too close is also valuable, guiding users to run more trials rather than making premature decisions based on noisy data. The reported finding that Q4_K_M beat Q8_0 by 274ms TTFT on Qwen2.5-7B (bartowski) on a Modal L4 GPU is a concrete example of the type of actionable insight the tool aims to provide.
What's not covered, or what's missing from the founder's pitch, is a more comprehensive definition of "quality." While PPL is measured, the founder explicitly states the tool does not measure full quality aspects like tool calling or structured JSON validity. These are critical for many real-world LLM applications. The current 5-sample smoke test for quality is not yet integrated into the scoring, which limits the tool's utility for use cases where output correctness and format adherence are paramount. While the tool excels at performance metrics, users needing to balance speed with complex quality requirements might still need to perform additional, manual evaluations. The reliance on pre-defined configurations, while simplifying the sweep, also means users cannot easily test arbitrary or custom quantization schemes or KV cache settings beyond the 16 combinations.
PRICING
Sigilant-sweep is an open-source CLI tool and is free to use. All code is available on GitHub (https://github.com/sigilantlabs/sigilant-sweep/). Pricing snapshot: 2026-05-28.
VERDICT
Sigilant-sweep is best for developers and operators who need to precisely tune their local LLM deployments for maximum performance on specific hardware. Its ability to systematically sweep through configurations and provide deterministic, hardware-specific TPS and TTFT metrics is a significant advantage over relying on generic benchmarks. While it currently excels at performance optimization, users focused on complex LLM quality aspects, such as tool calling or structured output validity, will find its quality evaluation limited to perplexity. For those prioritizing speed and efficiency in local LLM inference, Sigilant-sweep offers a pragmatic, open-source solution to identify optimal settings, as evidenced by its finding of Q4_K_M outperforming Q8_0 in TTFT for Qwen2.5-7B (bartowski) on Modal L4.
WHAT WE'D TEST NEXT
Our next steps would involve independently reproducing the founder's reported benchmarks, specifically the Q4_K_M vs. Q8_0 performance on a Qwen2.5-7B model, across various hardware configurations beyond Modal L4. We would also integrate a custom quality evaluation suite, focusing on common LLM tasks like function calling, JSON schema adherence, and factual accuracy, to see how different configurations impact these critical aspects. Exploring the extensibility of the 16 pre-defined configurations would also be valuable, to understand if users can easily add or modify the sweep parameters for niche quantization methods. Finally, we would assess the CLI's ease of setup and integration into existing MLOps workflows for continuous performance monitoring.