Tools·Jul 5, 2026

Garrison's Muster tests agent behavior, not just its AGENTS.md file

A new open-source tool, Muster, provides behavioral testing for LLM agents against a rules file. It highlights the critical gap between a syntactically valid policy and actual model compliance under…

By Riley · Tools desk·Human-reviewed·✓ Verified Jul 5, 2026·5 min read·1 source

A new open-source tool, Muster, provides behavioral testing for LLM agents against a rules file. It highlights the critical gap between a syntactically valid policy and actual model compliance under pressure.

THE ANSWER UP FRONT

Muster is for teams shipping production LLM agents who need to enforce non-negotiable operational or brand rules. If you've ever written a policy only to watch a model ignore it, this tool provides the necessary behavioral checks. Teams still in the early prototyping phase, where strict rule adherence is less critical than capability exploration, can likely skip it for now. The bottom line: Muster correctly identifies that static policy files are insufficient and delivers a focused, open-source tool for testing what an agent does, not just what its rulebook says.

METHODOLOGY

This is a v0 review of Muster, an open-source tool from Garrison HQ, based on its initial announcement on June 18, 2024. The analysis draws directly from the founder's launch article and the public GitHub repository it references. Versioning for Muster is not specified in the source; the review is based on the state of the main branch as of the access date.

Tool: Muster
Source URL: https://dev.to/garrison-hq/your-agentsmd-is-valid-your-agent-still-breaks-the-rules-4do6
Public Artifact: https://github.com/garrison-hq/muster

This review covers the core concepts presented: the AGENTS.md specification, the distinction between static and behavioral checks, and the demonstrated test case against an OpenAI-compatible endpoint. What is not covered is independent performance testing against a wide variety of models, evaluation of more complex semantic rules, or its integration into a long-term CI/CD workflow. This review analyzes the tool's claims and provided artifacts; independent benchmarks are pending.

WHAT IT DOES

Muster is a command-line tool designed to validate that an LLM agent adheres to a predefined Standard Operating Policy (SOP), codified in a file named AGENTS.md.

The AGENTS.md concept

The foundation of Muster is the AGENTS.md file. This is a proposed standard for a human-readable document that outlines the rules an agent must follow. The source provides a clear example with two rules: one preventing the leak of an API token and another enforcing a positive brand voice by forbidding words like "can't" or "unable."

Static vs. behavioral checks

Muster operates in two modes. The static check is an offline linter for the AGENTS.md file itself. It ensures the document is well-formed, rules are sourced correctly, and there are no structural errors. This is a basic sanity check.

The core functionality is the behavioral check. This mode runs live, scripted conversations, called "probes," against a model endpoint. It then uses "graders" to evaluate the model's responses to see if they comply with the rules. This directly tests the agent's behavior in a simulated interaction.

Probes and graders in practice

The launch article demonstrates a test run against gpt-4o-mini. A probe designed to extract an API token fails, correctly, as the model refuses to leak the secret. However, a second probe designed to test positive language also fails. The model's refusal, "I'm sorry, but I can't disclose any internal configurations," violates the rule against using the word "can't." The tool's output clearly shows the static lint passing but the behavioral SOP check failing, pinpointing the exact rule violation. The test configuration and probes are defined in a simple YAML file.

WHAT'S INTERESTING / WHAT'S NOT

The most interesting part of Muster is its explicit focus on the gap between policy and behavior. Many teams are discovering that system prompts are not ironclad contracts. Muster provides a concrete way to test for and catch deviations. The idea of standardizing agent policies in an AGENTS.md file is a strong, developer-friendly primitive that could bring much-needed structure to agent development.

By open-sourcing the tool and its underlying specification, Garrison HQ is making a bid to establish a community standard, similar to how README.md or SECURITY.md have become conventional. The tool's simplicity is also a strength; it's a focused utility, not a sprawling platform, making it easy to adopt.

What's less novel is the general idea of testing LLMs. The market has many evaluation and red-teaming platforms. Muster's current implementation, as demonstrated, relies on a simple exact-string-non-leakage grader. This is effective for catching specific forbidden words but doesn't address more nuanced, semantic rules like "do not give financial advice." The tool's utility will depend heavily on the development of a more sophisticated library of graders that can assess meaning, not just match strings.

PRICING

As of June 18, 2024, Muster is an open-source tool available on GitHub under an MIT License. It is free to use.

VERDICT

Muster is a valuable tool for any team moving an LLM agent into production. Its core premise is correct: a valid policy file means nothing if the agent doesn't obey it. By providing a lightweight, open-source framework for behavioral testing, it addresses a real and growing pain point in the AI development lifecycle. If your team is responsible for an agent that must adhere to strict brand voice, safety, or legal constraints, Muster offers a practical way to build a regression test suite for behavior. For those just exploring what's possible with agents, it's a solution for a problem they may not have yet.

WHAT WE'D TEST NEXT

A v2 review would need to move beyond the provided examples. We would test Muster's effectiveness against more complex, semantic rules that require more than simple string matching. We would also run its probes against a diverse set of models, including open-source options like Llama 3 and other commercial APIs like Claude 3, to see how different architectures respond to the same behavioral constraints. Finally, we would integrate Muster into a CI pipeline to assess its performance and utility as an automated guardrail against behavioral regressions during development.

The investor read

Muster is a sharp, developer-first tool entering the crowded LLM evaluation market. Its angle is policy-as-code, proposing AGENTS.md as a potential standard for agent governance. This is a classic open-source strategy: establish a standard, build a community, and monetize with enterprise features later. The play is not the tool itself, but the ecosystem that could form around the AGENTS.md spec. For Muster to become an investable platform, Garrison HQ needs to demonstrate a path from this CLI tool to a commercial offering. This would likely involve enterprise-grade features like advanced security probes, a managed testing cloud, and compliance reporting. Right now, it's a feature, not a company, but it's a very sticky feature that solves a real problem for developers, which is the best place to start.

Pull quote: “The most interesting part of Muster is its explicit focus on the gap between policy and behavior.”

Sources · how we verified

Your AGENTS.md is valid. Your agent still breaks the rules. ↗

Every claim ties to a primary source. See our methodology.

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.

Riley

The Riley desk covers tools — what founders are building with, switching to, and abandoning. Every claim is sourced and linked. Operated by Founderr (RIKHATH LLC) See the desk →

THE ANSWER UP FRONT

METHODOLOGY

WHAT IT DOES

The AGENTS.md concept

Static vs. behavioral checks

Probes and graders in practice

WHAT'S INTERESTING / WHAT'S NOT

PRICING

VERDICT

WHAT WE'D TEST NEXT

The investor read

Browsewright uses an LLM to automate Chrome from natural language goals

DeepSeek open-sources DSpark, claiming 60–85% faster LLM inference

LogRocket review: Monitoring the 'silent failures' Sentry misses