New blog post

2026-04-11 16:34:25 +02:00
parent e037a78c4e
commit 5b6a607746
1 changed files with 173 additions and 0 deletions
@@ -0,0 +1,173 @@
 ---
 title: "Building a Security Audit Skill for the LLM Age: What We Learned About Making AI Actually Useful for Security"
 description: "How we built a production-grade security audit skill that fights false positives, severity inflation, and hallucination — and the design reasoning behind every decision."
 pubDate: 2026-04-11
 tags: ["security", "ai", "skills", "prompt-engineering", "false-positives"]
 categories: ["Engineering"]
 draft: false
 ---
 ## The Problem
 Everyone has tried using an LLM for security auditing by now. The results are usually the same: a wall of Medium-severity findings that sound authoritative but don't survive contact with a real security engineer. The model finds every `eval()`, every `os.system()`, every place user input exists — and calls them all vulnerabilities.
 They're not. Or rather, most of them aren't, and the ones that are get buried in noise.
 We spent the last several weeks building a security audit skill for [Zaguán Blade](https://zaguan.ai) that tries to solve this problem properly. Not by adding more checks — but by teaching the model when *not* to flag something.
 This post is about the design reasoning, not just the artifact. The prompt itself is [published in full](https://github.com/ZaguanAI/security-audit-skill/blob/main/security-audit.md). What's interesting is *why* each piece exists.
 ## What Most LLM Audit Prompts Get Wrong
 The failure modes are remarkably consistent across models and frameworks:
 1. **Severity inflation.** Every dangerous API is Critical. Every user-controlled input is "attacker-controlled." The model doesn't distinguish between a web-facing SQL injection and a local CLI flag that passes user input to `exec()` — they're both RCE, right?
 2. **Context blindness.** The model assumes everything is a web application. A desktop app executing commands from its own config file gets flagged the same as a server executing commands from an HTTP request body. The trust model is completely different, but the model doesn't know that.
 3. **Hallucinated paths.** The model constructs plausible-sounding attack chains that don't actually exist in the code. "An attacker could send a crafted payload to the `/api/process` endpoint..." — but that endpoint doesn't exist.
 4. **Generic advice.** "Sanitize all inputs." "Use parameterized queries." These are true but useless. A real audit tells you *which* input, *which* query, and *what exactly* to change.
 5. **Missing the real bugs.** While the model is busy flagging every `eval()` in your test suite, the actual vulnerability — a subtle authorization gap in a multi-tenant API, or a deserialization path through a parser the model didn't investigate — goes unnoticed.
 Our goal was to build something that produces audits a security engineer would actually want to read and act on.
 ## The Key Insight: From "Find Scary Things" to "Decide What Matters"
 The single most important design decision was this: **a dangerous sink is not a vulnerability.**
 This sounds obvious, but it's the root of most false positives. The model sees `os.system(user_input)` and flags it. But the question isn't whether the sink is dangerous — it's whether the attacker *crosses a meaningful trust boundary* to reach it, and whether they *gain capability they didn't already have*.
 A desktop application executing commands from its own config file, which only the user can edit, running as that same user? That's not a vulnerability. That's the application working as designed. The user already has all the authority the "exploit" would give them.
 A web server executing commands from an HTTP request body? That's a completely different trust model, and it *is* a vulnerability.
 The skill has to reason about trust boundaries and privilege deltas, not just dangerous function calls.
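To make the contrast concrete, here is a minimal Python sketch of the two cases above. It is an illustration of the reasoning, not code from the skill: the `Flow` type, the actor names, and the capability strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One path from an input source to a dangerous sink (hypothetical model)."""
    source_actor: str   # who controls the input
    sink_actor: str     # whose authority the sink runs with
    before: frozenset   # capabilities the source actor already holds
    after: frozenset    # capabilities held once the sink is reached


def crosses_trust_boundary(flow: Flow) -> bool:
    # A boundary is crossed when input controlled by one actor
    # is executed with the authority of a different actor.
    return flow.source_actor != flow.sink_actor


def privilege_delta(flow: Flow) -> frozenset:
    # Capabilities gained by reaching the sink; empty means nothing new.
    return flow.after - flow.before


# Desktop app running commands from a config only the user can edit.
desktop_config = Flow(
    source_actor="local_user",
    sink_actor="local_user",
    before=frozenset({"run_commands_as_user"}),
    after=frozenset({"run_commands_as_user"}),
)

# Web server running commands taken from an HTTP request body.
http_body = Flow(
    source_actor="remote_client",
    sink_actor="server_process",
    before=frozenset({"send_http_requests"}),
    after=frozenset({"send_http_requests", "run_commands_as_server"}),
)

for name, flow in [("desktop_config", desktop_config), ("http_body", http_body)]:
    print(name, crosses_trust_boundary(flow), sorted(privilege_delta(flow)))
```

Both flows end in the same kind of sink, but only the second crosses a boundary and yields a non-empty delta. A plain set difference like this is too coarse for same-user persistence, which is why the later sections refine the test.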
 ## The Exploit Value Test
 Early versions of the skill used "Exploit Gain" — does the attacker gain new capability? This was good but incomplete. We kept missing a class of issues that are quiet but dangerous:
 - A same-user persistence mechanism (autostart entry, shell hook, CI pipeline poisoning)
 - A config file that gets loaded from a remote sync rather than a static local path
 - A build step that executes code in a different context than the runtime
 These don't give the attacker *new privileges* in the traditional sense. But they give them **persistence**, **stealth**, **scope expansion**, or **context shifting**. That's valuable to an attacker even without privilege escalation.
 So we evolved "Exploit Gain" into **"Exploit Value"**:
 ```
 Exploit Value = Capability Gain + Persistence + Stealth + Scope
 ```
 This is now the gating test before any severity assignment. The sum is qualitative, not arithmetic: Exploit Value ≤ 0 means that none of the terms applies (no new capability, no persistence, no stealth, no scope expansion). In that case the finding is capped at Low or Informational. Usually it belongs in a different category entirely.
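One way to picture the gate is the following Python sketch. It assumes each term is a simple yes/no with equal weight and a five-level severity scale; the skill itself expresses this in prose, so the names `ExploitValue` and `gate_severity` are hypothetical.

```python
from dataclasses import dataclass

# Assumed severity scale, lowest to highest.
SEVERITIES = ["Informational", "Low", "Medium", "High", "Critical"]


@dataclass(frozen=True)
class ExploitValue:
    capability_gain: bool = False
    persistence: bool = False
    stealth: bool = False
    scope: bool = False

    def total(self) -> int:
        # Each term counts as 0 or 1; equal weighting is an assumption.
        return sum([self.capability_gain, self.persistence, self.stealth, self.scope])


def gate_severity(proposed: str, value: ExploitValue) -> str:
    """Cap the proposed severity at Low when exploit value is not positive."""
    if value.total() <= 0:
        cap = SEVERITIES.index("Low")
        return SEVERITIES[min(SEVERITIES.index(proposed), cap)]
    return proposed


# A dangerous sink with no attacker value is capped, whatever the model proposed.
print(gate_severity("Critical", ExploitValue()))
# A same-user persistence mechanism keeps its proposed severity.
print(gate_severity("High", ExploitValue(persistence=True)))
```

The gate only caps severity; it never raises it. Deciding the actual severity of a finding with positive exploit value is a separate step.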
 ## The Classification Taxonomy
 Most audit frameworks have findings and... that's it. We found this forced the model to either inflate borderline issues into "findings" or drop them entirely. Neither is correct.
 The skill now distinguishes five categories:
 - **Confirmed finding** — you can trace the vulnerable path end to end
 - **Likely risk** — dangerous pattern, needs one missing fact confirmed (but you must name a specific file/function, not an abstract category)
 - **Abuse primitive** — not a vulnerability, but a dangerous building block that could be chained in future attacks
 - **Hardening opportunity** — not currently exploitable, but weakens security posture
 - **Design property (by design)** — behavior intentionally exposed to a trusted actor; not a vulnerability, but deserves documentation
 The **Abuse Primitive** category is the one that surprises people. It captures things like "executes arbitrary shell from config" or "evaluates templates dynamically." These aren't vulnerabilities on their own — the config is trusted, the templates are trusted. But they're *perfect building blocks* for an attacker who finds a way to influence that config or those templates through a different path.
 This matters because LLMs and modern attackers are increasingly capable of combining multiple low-severity issues into critical impact. If you only track standalone vulnerabilities, you miss the chains.
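As a rough sketch, the taxonomy could be modeled like this in Python. The category names come from the list above; the `Finding` fields, the validation messages, and `chain_candidates` are hypothetical and only illustrate two of the rules: a Likely risk must name a specific location, and Abuse primitives are kept for chain analysis rather than dropped.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Category(Enum):
    CONFIRMED_FINDING = "Confirmed finding"
    LIKELY_RISK = "Likely risk"
    ABUSE_PRIMITIVE = "Abuse primitive"
    HARDENING_OPPORTUNITY = "Hardening opportunity"
    DESIGN_PROPERTY = "Design property (by design)"


@dataclass(frozen=True)
class Finding:
    title: str
    category: Category
    location: Optional[str] = None      # specific file/function
    missing_fact: Optional[str] = None  # the one fact still to confirm

    def __post_init__(self) -> None:
        # A Likely risk may not be an abstract category of problem.
        if self.category is Category.LIKELY_RISK:
            if not self.location:
                raise ValueError("Likely risk must name a specific file/function")
            if not self.missing_fact:
                raise ValueError("Likely risk must state the fact left to confirm")


def chain_candidates(findings: List[Finding]) -> List[Finding]:
    # Abuse primitives are not vulnerabilities on their own, but they are
    # kept so that later chain analysis can try to combine them.
    return [f for f in findings if f.category is Category.ABUSE_PRIMITIVE]


shell_from_config = Finding("Executes arbitrary shell from config",
                            Category.ABUSE_PRIMITIVE)
print([f.title for f in chain_candidates([shell_from_config])])
```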
 ## The Same-User Exception
 Here's the subtlety that took us the longest to get right.
 Our false-positive downgrade heuristics say: "same-user, same-authority effects should not be Medium+." This correctly kills the most common class of false positives — desktop apps, local tools, trusted config execution.
 But it *over-corrects*. Same-user is NOT safe if the attacker gains:
 - **Persistence** across restarts or sessions
 - **Integrity impact** on future trusted execution (e.g., poisoning a CI pipeline, autostart, or shell hook)
 - **Context shifting** (triggering execution in a different context, like build time vs. run time)
 - **Silent hijacking** of trusted workflows
 This exception rule is critical. Without it, the skill would systematically miss persistence mechanisms, supply chain attacks, and developer tooling compromise — exactly the class of issues that are most valuable to modern attackers.
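The interaction between the downgrade heuristic and its exceptions can be sketched as follows. This is an assumed encoding, not the skill's own wording: the four flags mirror the bullets above, and the severity scale and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumed severity scale, lowest to highest.
SEVERITIES = ["Informational", "Low", "Medium", "High", "Critical"]


@dataclass(frozen=True)
class SameUserEffects:
    persistence: bool = False       # survives restarts or sessions
    integrity_impact: bool = False  # poisons future trusted execution
    context_shift: bool = False     # e.g. build time vs. run time
    silent_hijack: bool = False     # takes over a trusted workflow

    def exception_applies(self) -> bool:
        return any([self.persistence, self.integrity_impact,
                    self.context_shift, self.silent_hijack])


def apply_same_user_rule(proposed: str, same_user: bool,
                         effects: SameUserEffects) -> str:
    """Downgrade same-user findings below Medium unless an exception applies."""
    if not same_user or effects.exception_applies():
        return proposed
    cap = SEVERITIES.index("Low")
    return SEVERITIES[min(SEVERITIES.index(proposed), cap)]


# Trusted config execution with no lasting effect is downgraded.
print(apply_same_user_rule("High", True, SameUserEffects()))
# A poisoned shell hook is same-user but persistent, so it is not downgraded.
print(apply_same_user_rule("High", True, SameUserEffects(persistence=True)))
```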
 ## Design Properties Need Guardrails
 The "Design property (by design)" category is useful, but it's also dangerous. In the real world, "that's by design" is how actual vulnerabilities get dismissed.
 So we require every Design Property to explicitly answer:
 - **Who is trusted?**
 - **Why are they trusted?**
 - **Can trust be violated in practice?** (e.g., config loaded from a remote sync vs. a static local file)
 If you can't answer these, it doesn't belong in this category. This prevents lazy classification and forces the model to reason about whether the trust assumption actually holds.
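A small sketch of that guardrail, assuming the three answers are captured as free text. The `DesignPropertyClaim` type and the example answers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DesignPropertyClaim:
    behavior: str
    who_is_trusted: str = ""
    why_trusted: str = ""
    can_trust_be_violated: str = ""  # a reasoned answer, not just yes/no


def unanswered_questions(claim: DesignPropertyClaim) -> List[str]:
    # Collect every guardrail question that was left blank.
    questions = {
        "Who is trusted?": claim.who_is_trusted,
        "Why are they trusted?": claim.why_trusted,
        "Can trust be violated in practice?": claim.can_trust_be_violated,
    }
    return [q for q, answer in questions.items() if not answer.strip()]


def accept_as_design_property(claim: DesignPropertyClaim) -> bool:
    # The classification is only allowed when all three are answered.
    return not unanswered_questions(claim)


lazy = DesignPropertyClaim(behavior="Runs commands from config")
reasoned = DesignPropertyClaim(
    behavior="Runs commands from config",
    who_is_trusted="The local user who owns the config file",
    why_trusted="Commands run with that same user's authority",
    can_trust_be_violated="Only if the file is synced from a remote source",
)
print(unanswered_questions(lazy))
print(accept_as_design_property(reasoned))
```

A check like this only enforces that the questions were answered. Whether the answers hold up is still a judgment the auditor, human or model, has to make.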
 ## "What Would Prove Me Wrong?"
 This is the safety valve. The skill requires the model to answer one question explicitly, both in its scratchpad reasoning and for every finding it reports: "What would prove me wrong?"
 This matters because the skill is now strong enough to be *convincingly wrong*. It reasons well, sounds authoritative, and filters aggressively. When it makes a mistake, it will be a confident mistake. Forcing it to articulate how its own hypothesis could be falsified is the best defense against that.
 ## The Scratchpad as Execution Environment
 The `<security_scratchpad>` isn't a post-hoc summary — it's the model's live investigation workspace. The model must use it to:
 - Plan which files and routes to investigate
 - Trace data flows from ingress → trust boundary → sink → impact
 - State attacker capability before and after
 - Play Devil's Advocate against its own hypotheses
 - Evaluate exploit chaining
 - State what would prove it wrong
 - Conclude with exact classification
 The scratchpad template mirrors these instructions explicitly — Devil's Advocate and Exploit Chaining are required headers, not just internal reasoning steps. This means they actually appear in the output, not just silently influence it.
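Because the headers are required in the output, their presence can be checked mechanically. The sketch below assumes the scratchpad is closed with a matching `</security_scratchpad>` tag and uses placeholder header names. Only "Devil's Advocate" and "Exploit Chaining" are named in this post; the rest are hypothetical labels for the steps listed above.

```python
import re
from typing import List

# Placeholder header names; the published skill may word them differently.
REQUIRED_HEADERS = [
    "Investigation Plan",
    "Data Flow Trace",
    "Attacker Capability",
    "Devil's Advocate",
    "Exploit Chaining",
    "What Would Prove Me Wrong",
    "Classification",
]

# Assumes a matching closing tag around the scratchpad body.
SCRATCHPAD_RE = re.compile(
    r"<security_scratchpad>(.*?)</security_scratchpad>", re.DOTALL
)


def missing_headers(model_output: str) -> List[str]:
    """Return the required headers that do not appear in the scratchpad."""
    match = SCRATCHPAD_RE.search(model_output)
    if match is None:
        # No scratchpad at all means every header is missing.
        return list(REQUIRED_HEADERS)
    body = match.group(1).lower()
    return [h for h in REQUIRED_HEADERS if h.lower() not in body]


partial = "<security_scratchpad>\n## Devil's Advocate\n...\n</security_scratchpad>"
print(missing_headers(partial))
```

A substring check like this confirms that a section exists, not that the reasoning inside it is sound.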
 ## The 2025-2026 Threat Model
 The skill is explicitly grounded in the current threat landscape, not a 2021-era checklist. Key shifts it reflects:
 - **Broken access control** remains the top application risk
 - **Supply chain failures** are now a core appsec category (SHA pinning, OIDC token scope, artifact integrity, mutable action references)
 - **Mishandling of exceptional conditions** is a first-class security category (fail-open paths, partial transaction recovery, missing rollback)
 - **AI/Agent surfaces** are real attack surfaces (prompt injection, tool-output-to-tool-input leakage, excessive agency)
 - **AI-driven exploit chaining** — LLMs and modern attackers combine multiple low-severity issues to achieve critical impact
 The threat model has a version and a cutoff date (April 2026), so you know when it starts getting stale.
 ## What We Learned From Testing
 We tested the skill against the source code of Openbox, a window manager for the X Window System on Linux and BSD desktops. This was the perfect stress test because it's exactly the kind of codebase that produces false positives with naive audit prompts:
 - Config files that intentionally execute commands
 - Desktop entries that intentionally launch programs
 - IPC mechanisms that intentionally pass messages between same-user processes
 - Session management that intentionally restores state
 A naive audit flags all of these as Critical. Our early versions flagged most of them as Medium+. The final version correctly classified the majority as Design Properties, with specific reasoning about why trust holds (or doesn't) in each case.
 The real vulnerabilities — the subtle authorization gaps, the parser edge cases, the incomplete fixes — actually became *more* visible once the noise was gone.
 ## The Full Skill
 The complete skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill). We're publishing it in full because:
 1. **It's defensive tooling.** Knowing the audit methodology doesn't help attackers bypass it — it tells them what we'll catch, which pushes them toward the gaps we want to find anyway.
 2. **The moat isn't the prompt.** The competitive advantage is the integration into the tool loop, the continuous refinement from real audits, and the execution environment. Anyone can copy the markdown; nobody can copy the flywheel.
 3. **Feedback accelerates quality.** Having multiple frontier models review the skill has already improved it substantially. We expect review by security practitioners to drive another round of refinement.
 4. **It raises the bar.** Most "AI security audit" tools right now are glorified checklists. Publishing something rigorous forces the field to level up.
 ## The One-Sentence Rule
 If you take nothing else from this post, take this:
 > Never declare a security finding unless you can trace attacker-controlled data across a trust boundary to a privileged sink with positive exploit value.
 That's the entire skill compressed into one sentence. Everything else is enforcement machinery.
 ---
 *The Security Audit Skill is part of [Zaguán Blade](https://zaguan.ai), an AI-powered coding environment. The skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill) under the Apache 2.0 license. Feedback, issues, and contributions are welcome.*