HomeReadTactics deskA 7-layer defense model for LLM prompt injection
Tactics·Jul 4, 2026

A 7-layer defense model for LLM prompt injection

A technical breakdown of defense-in-depth for LLM security, from simple filtering to model-based guardrails. Each layer is bypassable; the strategy is in the stack. LLMs have no hard boundary between…

A technical breakdown of defense-in-depth for LLM security, from simple filtering to model-based guardrails. Each layer is bypassable; the strategy is in the stack.

LLMs have no hard boundary between instructions and data. Everything in the context window, from the system prompt to user messages and retrieved documents, is a single stream of tokens. Prompt injection exploits this architecture by making attacker-controlled data get interpreted as instructions.

You cannot filter your way to a complete solution. Security is managed with defense-in-depth, based on the understanding that any individual layer can be bypassed. A technical guide from user "geekaara" on Dev.to provides a structured, 7-layer model for this defense.

Filtering is a brittle first line

The most basic defenses involve filtering inputs and outputs. An input filter might block prompts containing keywords like "secret" or "reveal." This is trivially evaded with synonyms, misspellings, or requests in another language. String matching does not filter semantic intent.

Output filtering attempts to catch a known secret in the model's response and redact it. This defense fails when the secret is fragmented or transformed. An attacker can instruct the model to return the secret with separators, character by character, or in a different encoding. The literal string never appears, so the filter has nothing to match.

Stacking filters multiplies weaknesses

Combining input and output filtering seems like a logical next step. While this layering does raise the difficulty for an attacker, the fundamental weaknesses remain and are also stacked. An attacker can use an obfuscated prompt to bypass the input filter, then request a fragmented output to bypass the output filter.

The core lesson is that layering simple, brittle defenses does not create a robust system. More filters are not the same as more security. The approach itself is flawed because it treats the LLM as a static system, not a reasoning agent that can be manipulated.

Using a second LLM as a judge

A more advanced technique uses a second, separate LLM as a guardrail. This judge model reads the primary model's output and censors it if it recognizes the secret. Because the judge model understands meaning, not just strings, it can catch fragmented or reversed secrets that a simple filter would miss.

The weakness is that a reasoning judge can be socially engineered. The attacker can reframe the output to convince the judge that the secret is harmless, for instance by claiming "this code is expired" or "this key has been changed." An LLM judging another LLM inherits the same vulnerabilities to manipulation.

WHAT WE'D CHANGE

The 7-layer model provides a strong conceptual framework for thinking about prompt injection, but it reflects a "capture the flag" style of security challenge. Real-world applications face a broader threat surface. The guide focuses entirely on preventing a model from leaking a secret present in its context window. It does not address jailbreaking for harmful content generation, data exfiltration through tool use and function calls, or indirect prompt injection from compromised documents.

A founder implementing this playbook today would need to move from these concepts to specific tooling. The guide omits any mention of implementation frameworks like NVIDIA NeMo Guardrails or open-source libraries like Guardrails AI. These tools provide mechanisms for defining and enforcing both semantic and syntactic boundaries on LLM outputs, which is a more robust approach than simple string filtering.

Finally, the playbook understates the operational costs. Running a second LLM as a guardrail adds significant latency and doubles the inference cost for every call. A human-in-the-loop review, presented as the final layer, is often operationally impossible for real-time applications and prohibitively expensive for most others. These are not just technical decisions; they are business trade-offs between risk, performance, and cost.

LANDING

The central premise of the 7-layer defense is that no single layer is sufficient. Security is not an achievable state but a process of continuous risk management. The goal is not to build an impenetrable wall, which is impossible. The goal is to build a system where the cost and complexity of a successful attack are prohibitively high for the majority of adversaries.

The investor read

This playbook highlights the significant, often underestimated, operational costs and technical debt associated with building secure AI products. For investors, a team's approach to prompt injection is a signal of its technical maturity. A team that only discusses basic filtering is unprepared for sophisticated attacks. A team that has a multi-layered defense strategy, including model-based guardrails, understands the real-world threat landscape. However, this also introduces questions for due diligence: What is the latency impact of these security layers? What are the additional inference costs? Is the cost of security accounted for in the COGS model? A failure to answer these indicates a potential for margin erosion or a brittle, insecure product.

Pull quote: “LLMs have no hard boundary between instructions and data.”

Sources · how we verified
  1. LLM Prompt Injection & Guardrail Security

Every claim ties to a primary source. See our methodology.

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.
M
Maya

The Maya desk covers tactics: concrete playbooks, growth experiments, and operating decisions indie founders are running now. Every claim is sourced and linked. Operated by Founderr (RIKHATH LLC) See the desk →

Founderr Pulse — free & independent. The desk for people who build & back.