From 5b6a6077461cd0438149f5e5cf4a14f4d66f430a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Stig-=C3=98rjan=20Smelror?=
Date: Sat, 11 Apr 2026 16:34:25 +0200
Subject: [PATCH] New blog post

---
 ...-11-security-audit-skill-in-the-llm-age.md | 173 ++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md

diff --git a/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md b/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md
new file mode 100644
index 0000000..e5c3f48
--- /dev/null
+++ b/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md
@@ -0,0 +1,173 @@
+---
+title: "Building a Security Audit Skill for the LLM Age: What We Learned About Making AI Actually Useful for Security"
+description: "How we built a production-grade security audit skill that fights false positives, severity inflation, and hallucination — and the design reasoning behind every decision."
+pubDate: 2026-04-11
+tags: ["security", "ai", "skills", "prompt-engineering", "false-positives"]
+categories: ["Engineering"]
+draft: false
+---
+
+## The Problem
+
+Everyone has tried using an LLM for security auditing by now. The results are usually the same: a wall of Medium-severity findings that sound authoritative but don't survive contact with a real security engineer. The model finds every `eval()`, every `os.system()`, every place user input exists — and calls them all vulnerabilities.
+
+They're not. Most of them aren't. And the ones that are get buried in noise.
+
+We spent the last several weeks building a security audit skill for [Zaguán Blade](https://zaguan.ai) that tries to solve this problem properly. Not by adding more checks — but by teaching the model when *not* to flag something.
+
+This post is about the design reasoning, not just the artifact. The prompt itself is [published in full](https://github.com/ZaguanAI/security-audit-skill/blob/main/security-audit.md).
What's interesting is *why* each piece exists.
+
+## What Most LLM Audit Prompts Get Wrong
+
+The failure modes are remarkably consistent across models and frameworks:
+
+1. **Severity inflation.** Every dangerous API is Critical. Every user-controlled input is "attacker-controlled." The model doesn't distinguish between a web-facing SQL injection and a local CLI flag that passes user input to `exec()` — they're both RCE, right?
+
+2. **Context blindness.** The model assumes everything is a web application. A desktop app executing commands from its own config file gets flagged the same as a server executing commands from an HTTP request body. The trust model is completely different, but the model doesn't know that.
+
+3. **Hallucinated paths.** The model constructs plausible-sounding attack chains that don't actually exist in the code. "An attacker could send a crafted payload to the `/api/process` endpoint..." — but that endpoint doesn't exist.
+
+4. **Generic advice.** "Sanitize all inputs." "Use parameterized queries." These are true but useless. A real audit tells you *which* input, *which* query, and *what exactly* to change.
+
+5. **Missing the real bugs.** While the model is busy flagging every `eval()` in your test suite, the actual vulnerability — a subtle authorization gap in a multi-tenant API, or a deserialization path through a parser the model didn't investigate — goes unnoticed.
+
+Our goal was to build something that produces audits a security engineer would actually want to read and act on.
+
+## The Key Insight: From "Find Scary Things" to "Decide What Matters"
+
+The single most important design decision was this: **a dangerous sink is not a vulnerability.**
+
+This sounds obvious, but it's the root of most false positives. The model sees `os.system(user_input)` and flags it.
But the question isn't whether the sink is dangerous — it's whether the attacker *crosses a meaningful trust boundary* to reach it, and whether they *gain capability they didn't already have*.
+
+A desktop application executing commands from its own config file, which only the user can edit, running as that same user? That's not a vulnerability. That's the application working as designed. The user already has all the authority the "exploit" would give them.
+
+A web server executing commands from an HTTP request body? That's a completely different trust model, and it *is* a vulnerability.
+
+The skill has to reason about trust boundaries and privilege deltas, not just dangerous function calls.
+
+## The Exploit Value Test
+
+Early versions of the skill used "Exploit Gain" — does the attacker gain new capability? This was good but incomplete. We kept missing a class of issues that are quiet but dangerous:
+
+- A same-user persistence mechanism (autostart entry, shell hook, CI pipeline poisoning)
+- A config file that gets loaded from a remote sync rather than a static local path
+- A build step that executes code in a different context than the runtime
+
+These don't give the attacker *new privileges* in the traditional sense. But they give them **persistence**, **stealth**, **scope expansion**, or **context shifting**. That's valuable to an attacker even without privilege escalation.
+
+So we evolved "Exploit Gain" into **"Exploit Value"**:
+
+```
+Exploit Value = Capability Gain + Persistence + Stealth + Scope
+```
+
+This is now the gating test before any severity assignment. If Exploit Value ≤ 0 (no new capability, no persistence, no stealthy scope expansion), the finding is capped at Low or Informational. Usually it belongs in a different category entirely.
+
+## The Classification Taxonomy
+
+Most audit frameworks have findings and... that's it. We found this forced the model to either inflate borderline issues into "findings" or drop them entirely.
Neither is correct.
+
+The skill now distinguishes five categories:
+
+- **Confirmed finding** — you can trace the vulnerable path end to end
+- **Likely risk** — dangerous pattern, needs one missing fact confirmed (but you must name a specific file/function, not an abstract category)
+- **Abuse primitive** — not a vulnerability, but a dangerous building block that could be chained in future attacks
+- **Hardening opportunity** — not currently exploitable, but weakens security posture
+- **Design property (by design)** — behavior intentionally exposed to a trusted actor; not a vulnerability, but deserves documentation
+
+The **Abuse Primitive** category is the one that surprises people. It captures things like "executes arbitrary shell from config" or "evaluates templates dynamically." These aren't vulnerabilities on their own — the config is trusted, the templates are trusted. But they're *perfect building blocks* for an attacker who finds a way to influence that config or those templates through a different path.
+
+This matters because LLMs and modern attackers are increasingly capable of combining multiple low-severity issues into critical impact. If you only track standalone vulnerabilities, you miss the chains.
+
+## The Same-User Exception
+
+Here's the subtlety that took us the longest to get right.
+
+Our false-positive downgrade heuristics say: "same-user, same-authority effects should not be Medium+." This correctly kills the most common class of false positives — desktop apps, local tools, trusted config execution.
+
+But it *over-corrects*. Same-user is NOT safe if the attacker gains:
+
+- **Persistence** across restarts or sessions
+- **Integrity impact** on future trusted execution (e.g., poisoning a CI pipeline, autostart, or shell hook)
+- **Context shifting** (triggering execution in a different context, like build time vs. run time)
+- **Silent hijacking** of trusted workflows
+
+This exception rule is critical.
Without it, the skill would systematically miss persistence mechanisms, supply chain attacks, and developer tooling compromise — exactly the class of issues that are most valuable to modern attackers.
+
+## Design Properties Need Guardrails
+
+The "Design property (by design)" category is useful, but it's also dangerous. In the real world, "that's by design" is how actual vulnerabilities get dismissed.
+
+So we require every Design Property to explicitly answer:
+
+- **Who is trusted?**
+- **Why are they trusted?**
+- **Can trust be violated in practice?** (e.g., config loaded from a remote sync vs. a static local file)
+
+If you can't answer these, it doesn't belong in this category. This prevents lazy classification and forces the model to reason about whether the trust assumption actually holds.
+
+## "What Would Prove Me Wrong?"
+
+This is the safety valve. The skill requires the model to explicitly state, for every finding and in its scratchpad reasoning: "What would prove me wrong?"
+
+This matters because the skill is now strong enough to be *convincingly wrong*. It reasons well, sounds authoritative, and filters aggressively. When it makes a mistake, it will be a confident mistake. Forcing it to articulate how its own hypothesis could be falsified is the best defense against that.
+
+## The Scratchpad as Execution Environment
+
+The scratchpad isn't a post-hoc summary — it's the model's live investigation workspace. The model must use it to:
+
+- Plan which files and routes to investigate
+- Trace data flows from ingress → trust boundary → sink → impact
+- State attacker capability before and after
+- Play Devil's Advocate against its own hypotheses
+- Evaluate exploit chaining
+- State what would prove it wrong
+- Conclude with exact classification
+
+The scratchpad template mirrors these instructions explicitly — Devil's Advocate and Exploit Chaining are required headers, not just internal reasoning steps.
This means they actually appear in the output, not just silently influence it.
+
+## The 2025-2026 Threat Model
+
+The skill is explicitly grounded in the current threat landscape, not a 2021-era checklist. Key shifts it reflects:
+
+- **Broken access control** remains the top application risk
+- **Supply chain failures** are now a core appsec category (SHA pinning, OIDC token scope, artifact integrity, mutable action references)
+- **Mishandling of exceptional conditions** is a first-class security category (fail-open paths, partial transaction recovery, missing rollback)
+- **AI/Agent surfaces** are real attack surfaces (prompt injection, tool-output-to-tool-input leakage, excessive agency)
+- **AI-driven exploit chaining** — LLMs and modern attackers combine multiple low-severity issues to achieve critical impact
+
+The threat model has a version and a cutoff date (April 2026), so you know when it starts getting stale.
+
+## What We Learned From Testing
+
+We tested the skill against the Openbox source code. Openbox is an X11 window manager used on Linux and BSD desktops, which made it the perfect stress test because it's exactly the kind of codebase that produces false positives with naive audit prompts:
+
+- Config files that intentionally execute commands
+- Desktop entries that intentionally launch programs
+- IPC mechanisms that intentionally pass messages between same-user processes
+- Session management that intentionally restores state
+
+A naive audit flags all of these as Critical. Our early versions flagged most of them as Medium+. The final version correctly classified the majority as Design Properties, with specific reasoning about why trust holds (or doesn't) in each case.
+
+The real vulnerabilities — the subtle authorization gaps, the parser edge cases, the incomplete fixes — actually became *more* visible once the noise was gone.
+
+## The Full Skill
+
+The complete skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill).
We're publishing it in full because:
+
+1. **It's defensive tooling.** Knowing the audit methodology doesn't help attackers bypass it — it tells them what we'll catch, which pushes them toward the gaps we want to find anyway.
+2. **The moat isn't the prompt.** The competitive advantage is the integration into the tool loop, the continuous refinement from real audits, and the execution environment. Anyone can copy the markdown; nobody can copy the flywheel.
+3. **Feedback accelerates quality.** We've already gotten massive improvements from having multiple frontier models review it. Opening it to security practitioners will produce another order of refinement.
+4. **It raises the bar.** Most "AI security audit" tools right now are glorified checklists. Publishing something rigorous forces the field to level up.
+
+## The One-Sentence Rule
+
+If you take nothing else from this post, take this:
+
+> Never declare a security finding unless you can trace attacker-controlled data across a trust boundary to a privileged sink with positive exploit value.
+
+That's the entire skill compressed into one sentence. Everything else is enforcement machinery.
+
+---
+
+*The Security Audit Skill is part of [Zaguán Blade](https://zaguan.ai), an AI-powered coding environment. The skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill) under the Apache 2.0 license. Feedback, issues, and contributions are welcome.*