From 5b6a6077461cd0438149f5e5cf4a14f4d66f430a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Stig-=C3=98rjan=20Smelror?= <smelror@gmail.com>
Date: Sat, 11 Apr 2026 16:34:25 +0200
Subject: [PATCH] New blog post

---
 ...-11-security-audit-skill-in-the-llm-age.md | 173 ++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md

diff --git a/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md b/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md
new file mode 100644
index 0000000..e5c3f48
--- /dev/null
+++ b/src/content/blog/2026-04-11-security-audit-skill-in-the-llm-age.md
@@ -0,0 +1,173 @@
+---
+title: "Building a Security Audit Skill for the LLM Age: What We Learned About Making AI Actually Useful for Security"
+description: "How we built a production-grade security audit skill that fights false positives, severity inflation, and hallucination — and the design reasoning behind every decision."
+pubDate: 2026-04-11
+tags: ["security", "ai", "skills", "prompt-engineering", "false-positives"]
+categories: ["Engineering"]
+draft: false
+---
+
+## The Problem
+
+Everyone has tried using an LLM for security auditing by now. The results are usually the same: a wall of Medium-severity findings that sound authoritative but don't survive contact with a real security engineer. The model finds every `eval()`, every `os.system()`, every place user input exists — and calls them all vulnerabilities.
+
+They're not. Most of them aren't. And the ones that are get buried in noise.
+
+We spent the last several weeks building a security audit skill for [Zaguán Blade](https://zaguan.ai) that tries to solve this problem properly. Not by adding more checks — but by teaching the model when *not* to flag something.
+
+This post is about the design reasoning, not just the artifact. The prompt itself is [published in full](https://github.com/ZaguanAI/security-audit-skill/blob/main/security-audit.md). What's interesting is *why* each piece exists.
+
+## What Most LLM Audit Prompts Get Wrong
+
+The failure modes are remarkably consistent across models and frameworks:
+
+1. **Severity inflation.** Every dangerous API is Critical. Every user-controlled input is "attacker-controlled." The model doesn't distinguish between a web-facing SQL injection and a local CLI flag that passes user input to `exec()` — they're both RCE, right?
+
+2. **Context blindness.** The model assumes everything is a web application. A desktop app executing commands from its own config file gets flagged the same as a server executing commands from an HTTP request body. The trust models are completely different, but the LLM doesn't know that.
+
+3. **Hallucinated paths.** The model constructs plausible-sounding attack chains that don't actually exist in the code. "An attacker could send a crafted payload to the `/api/process` endpoint..." — but that endpoint doesn't exist.
+
+4. **Generic advice.** "Sanitize all inputs." "Use parameterized queries." These are true but useless. A real audit tells you *which* input, *which* query, and *what exactly* to change.
+
+5. **Missing the real bugs.** While the model is busy flagging every `eval()` in your test suite, the actual vulnerability — a subtle authorization gap in a multi-tenant API, or a deserialization path through a parser the model didn't investigate — goes unnoticed.
+
+Our goal was to build something that produces audits a security engineer would actually want to read and act on.
+
+## The Key Insight: From "Find Scary Things" to "Decide What Matters"
+
+The single most important design decision was this: **a dangerous sink is not a vulnerability.**
+
+This sounds obvious, but it's the root of most false positives. The model sees `os.system(user_input)` and flags it. But the question isn't whether the sink is dangerous — it's whether the attacker *crosses a meaningful trust boundary* to reach it, and whether they *gain capability they didn't already have*.
+
+A desktop application executing commands from its own config file, which only the user can edit, running as that same user? That's not a vulnerability. That's the application working as designed. The user already has all the authority the "exploit" would give them.
+
+A web server executing commands from an HTTP request body? That's a completely different trust model, and it *is* a vulnerability.
+
+The skill has to reason about trust boundaries and privilege deltas, not just dangerous function calls.
+
+## The Exploit Value Test
+
+Early versions of the skill used "Exploit Gain" — does the attacker gain new capability? This was good but incomplete. We kept missing a class of issues that are quiet but dangerous:
+
+- A same-user persistence mechanism (autostart entry, shell hook, CI pipeline poisoning)
+- A config file that gets loaded from a remote sync rather than a static local path
+- A build step that executes code in a different context than the runtime
+
+These don't give the attacker *new privileges* in the traditional sense. But they give them **persistence**, **stealth**, **scope expansion**, or **context shifting**. That's valuable to an attacker even without privilege escalation.
+
+So we evolved "Exploit Gain" into **"Exploit Value"**:
+
+```
+Exploit Value = Capability Gain + Persistence + Stealth + Scope
+```
+
+This is now the gating test before any severity assignment. If Exploit Value is zero (no new capability, no persistence, no stealth, no scope expansion), the finding is capped at Low or Informational. Usually it belongs in a different category entirely.
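+
+As a rough illustration of that gate, the sketch below treats each term of the formula as a yes/no signal rather than a number. The `ExploitValue` class, its field names, and `cap_severity` are hypothetical names invented for this example; the published skill expresses the rule in prose, not code.
+
+```python
+from dataclasses import dataclass
+
+# Severity labels ordered from least to most severe.
+SEVERITIES = ["Informational", "Low", "Medium", "High", "Critical"]
+
+
+@dataclass(frozen=True)
+class ExploitValue:
+    """Hypothetical yes/no view of the four Exploit Value terms."""
+    capability_gain: bool = False
+    persistence: bool = False
+    stealth: bool = False
+    scope: bool = False
+
+    def is_positive(self) -> bool:
+        # Any single term is enough to make the exploit valuable.
+        return any((self.capability_gain, self.persistence, self.stealth, self.scope))
+
+
+def cap_severity(proposed: str, value: ExploitValue) -> str:
+    """Cap the proposed severity at Low when the attacker gains nothing."""
+    if value.is_positive():
+        return proposed
+    low = SEVERITIES.index("Low")
+    return SEVERITIES[min(SEVERITIES.index(proposed), low)]
+
+
+print(cap_severity("High", ExploitValue()))                  # Low
+print(cap_severity("High", ExploitValue(persistence=True)))  # High
+```
+
+The gate only caps severity. Deciding whether a capped item is better described as a design property or a hardening opportunity is a separate step.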
+
+## The Classification Taxonomy
+
+Most audit frameworks have findings and... that's it. We found this forced the model to either inflate borderline issues into "findings" or drop them entirely. Neither is correct.
+
+The skill now distinguishes five categories:
+
+- **Confirmed finding** — you can trace the vulnerable path end to end
+- **Likely risk** — dangerous pattern, needs one missing fact confirmed (but you must name a specific file/function, not an abstract category)
+- **Abuse primitive** — not a vulnerability, but a dangerous building block that could be chained in future attacks
+- **Hardening opportunity** — not currently exploitable, but weakens security posture
+- **Design property (by design)** — behavior intentionally exposed to a trusted actor; not a vulnerability, but deserves documentation
+
+The **Abuse Primitive** category is the one that surprises people. It captures things like "executes arbitrary shell from config" or "evaluates templates dynamically." These aren't vulnerabilities on their own — the config is trusted, the templates are trusted. But they're *perfect building blocks* for an attacker who finds a way to influence that config or those templates through a different path.
+
+This matters because LLMs and modern attackers are increasingly capable of combining multiple low-severity issues into critical impact. If you only track standalone vulnerabilities, you miss the chains.
+
+## The Same-User Exception
+
+Here's the subtlety that took us the longest to get right.
+
+Our false-positive downgrade heuristics say: "same-user, same-authority effects should not be Medium+." This correctly kills the most common class of false positives — desktop apps, local tools, trusted config execution.
+
+But it *over-corrects*. Same-user is NOT safe if the attacker gains:
+
+- **Persistence** across restarts or sessions
+- **Integrity impact** on future trusted execution (e.g., poisoning a CI pipeline, autostart, or shell hook)
+- **Context shifting** (triggering execution in a different context, like build time vs. run time)
+- **Silent hijacking** of trusted workflows
+
+This exception rule is critical. Without it, the skill would systematically miss persistence mechanisms, supply chain attacks, and developer tooling compromise — exactly the class of issues that are most valuable to modern attackers.
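+
+One way to picture the exception is as a guard in front of the downgrade heuristic. This is a minimal sketch under assumed names: `SameUserEffect` and `downgrade_applies` are hypothetical and are not part of the skill itself.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class SameUserEffect:
+    """Hypothetical description of an effect that stays within one user account."""
+    persistence: bool = False        # survives restarts or sessions
+    integrity_impact: bool = False   # poisons future trusted execution
+    context_shift: bool = False      # e.g. runs at build time instead of run time
+    silent_hijack: bool = False      # takes over a trusted workflow unnoticed
+
+
+def downgrade_applies(effect: SameUserEffect) -> bool:
+    """Return True only when none of the four exceptions is present."""
+    exceptions = (
+        effect.persistence,
+        effect.integrity_impact,
+        effect.context_shift,
+        effect.silent_hijack,
+    )
+    return not any(exceptions)
+
+
+print(downgrade_applies(SameUserEffect()))                  # True
+print(downgrade_applies(SameUserEffect(persistence=True)))  # False
+```
+
+A plain same-user command execution passes the guard and can be downgraded; an autostart entry or shell hook does not, so it keeps whatever severity the rest of the analysis assigns.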
+
+## Design Properties Need Guardrails
+
+The "Design property (by design)" category is useful, but it's also dangerous. In the real world, "that's by design" is how actual vulnerabilities get dismissed.
+
+So we require every Design Property to explicitly answer:
+
+- **Who is trusted?**
+- **Why are they trusted?**
+- **Can trust be violated in practice?** (e.g., config loaded from a remote sync vs. a static local file)
+
+If you can't answer these, it doesn't belong in this category. This prevents lazy classification and forces the model to reason about whether the trust assumption actually holds.
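+
+A validator for this requirement could look roughly like the sketch below. The `DesignProperty` record and its field names are hypothetical, chosen to mirror the three questions; the skill enforces this through its prompt text rather than through code.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class DesignProperty:
+    """Hypothetical record holding the three required answers."""
+    who_is_trusted: str
+    why_trusted: str
+    can_trust_be_violated: str
+
+    def is_complete(self) -> bool:
+        # An empty or whitespace-only answer counts as unanswered.
+        answers = (self.who_is_trusted, self.why_trusted, self.can_trust_be_violated)
+        return all(answer.strip() for answer in answers)
+
+
+entry = DesignProperty(
+    who_is_trusted="The local user who owns the config file",
+    why_trusted="The file is only writable by that user",
+    can_trust_be_violated="",  # left unanswered on purpose
+)
+print(entry.is_complete())  # False
+```
+
+An incomplete entry like this one would be rejected as a Design Property and sent back for reclassification.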
+
+## "What Would Prove Me Wrong?"
+
+This is the safety valve. The skill requires the model to state explicitly, both in its scratchpad reasoning and for every finding it reports: "What would prove me wrong?"
+
+This matters because the skill is now strong enough to be *convincingly wrong*. It reasons well, sounds authoritative, and filters aggressively. When it makes a mistake, it will be a confident mistake. Forcing it to articulate how its own hypothesis could be falsified is the best defense against that.
+
+## The Scratchpad as Execution Environment
+
+The `<security_scratchpad>` isn't a post-hoc summary — it's the model's live investigation workspace. The model must use it to:
+
+- Plan which files and routes to investigate
+- Trace data flows from ingress → trust boundary → sink → impact
+- State attacker capability before and after
+- Play Devil's Advocate against its own hypotheses
+- Evaluate exploit chaining
+- State what would prove it wrong
+- Conclude with exact classification
+
+The scratchpad template mirrors these instructions explicitly — Devil's Advocate and Exploit Chaining are required headers, not just internal reasoning steps. This means they actually appear in the output, not just silently influence it.
+
+## The 2025-2026 Threat Model
+
+The skill is explicitly grounded in the current threat landscape, not a 2021-era checklist. Key shifts it reflects:
+
+- **Broken access control** remains the top application risk
+- **Supply chain failures** are now a core appsec category (SHA pinning, OIDC token scope, artifact integrity, mutable action references)
+- **Mishandling of exceptional conditions** is a first-class security category (fail-open paths, partial transaction recovery, missing rollback)
+- **AI/Agent surfaces** are real attack surfaces (prompt injection, tool-output-to-tool-input leakage, excessive agency)
+- **AI-driven exploit chaining** — LLMs and modern attackers combine multiple low-severity issues to achieve critical impact
+
+The threat model has a version and a cutoff date (April 2026), so you know when it starts getting stale.
+
+## What We Learned From Testing
+
+We tested the skill against the source code of Openbox, an X11 window manager for Linux/BSD desktops. This was the perfect stress test because it's exactly the kind of codebase that produces false positives with naive audit prompts:
+
+- Config files that intentionally execute commands
+- Desktop entries that intentionally launch programs
+- IPC mechanisms that intentionally pass messages between same-user processes
+- Session management that intentionally restores state
+
+A naive audit flags all of these as Critical. Our early versions flagged most of them as Medium+. The final version correctly classified the majority as Design Properties, with specific reasoning about why trust holds (or doesn't) in each case.
+
+The real vulnerabilities — the subtle authorization gaps, the parser edge cases, the incomplete fixes — actually became *more* visible once the noise was gone.
+
+## The Full Skill
+
+The complete skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill). We're publishing it in full because:
+
+1. **It's defensive tooling.** Knowing the audit methodology doesn't help attackers bypass it — it tells them what we'll catch, which pushes them toward the gaps we want to find anyway.
+2. **The moat isn't the prompt.** The competitive advantage is the integration into the tool loop, the continuous refinement from real audits, and the execution environment. Anyone can copy the markdown; nobody can copy the flywheel.
+3. **Feedback accelerates quality.** We've already gotten massive improvements from having multiple frontier models review it. Opening it to security practitioners should produce another round of refinement.
+4. **It raises the bar.** Most "AI security audit" tools right now are glorified checklists. Publishing something rigorous forces the field to level up.
+
+## The One-Sentence Rule
+
+If you take nothing else from this post, take this:
+
+> Never declare a security finding unless you can trace attacker-controlled data across a trust boundary to a privileged sink with positive exploit value.
+
+That's the entire skill compressed into one sentence. Everything else is enforcement machinery.
+
+---
+
+*The Security Audit Skill is part of [Zaguán Blade](https://zaguan.ai), an AI-powered coding environment. The skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill) under the Apache 2.0 license. Feedback, issues, and contributions are welcome.*