---
title: "Building a Security Audit Skill for the LLM Age: What We Learned About Making AI Actually Useful for Security"
description: "How we built a production-grade security audit skill that fights false positives, severity inflation, and hallucination, and the design reasoning behind every decision."
pubDate: 2026-04-11
tags: ["security", "ai", "skills", "prompt-engineering", "false-positives"]
categories: ["Engineering"]
draft: false
---

## The Problem

Everyone has tried using an LLM for security auditing by now. The results are usually the same: a wall of Medium-severity findings that sound authoritative but don't survive contact with a real security engineer. The model finds every `eval()`, every `os.system()`, every place user input exists, and calls them all vulnerabilities.

Most of them aren't. And the ones that are get buried in noise.

We spent the last several weeks building a security audit skill for [Zaguán Blade](https://zaguan.ai) that tries to solve this problem properly: not by adding more checks, but by teaching the model when *not* to flag something.

This post is about the design reasoning, not just the artifact. The prompt itself is [published in full](https://github.com/ZaguanAI/security-audit-skill/blob/main/security-audit.md). What's interesting is *why* each piece exists.

## What Most LLM Audit Prompts Get Wrong

The failure modes are remarkably consistent across models and frameworks:

1. **Severity inflation.** Every dangerous API is Critical. Every user-controlled input is "attacker-controlled." The model doesn't distinguish between a web-facing SQL injection and a local CLI flag that passes user input to `exec()`. They're both RCE, right?

2. **Context blindness.** The model assumes everything is a web application. A desktop app executing commands from its own config file gets flagged the same as a server executing commands from an HTTP request body. The trust model is completely different, but the model doesn't know that.

3. **Hallucinated paths.** The model constructs plausible-sounding attack chains that don't actually exist in the code. "An attacker could send a crafted payload to the `/api/process` endpoint..." But that endpoint doesn't exist.

4. **Generic advice.** "Sanitize all inputs." "Use parameterized queries." These are true but useless. A real audit tells you *which* input, *which* query, and *what exactly* to change.

5. **Missing the real bugs.** While the model is busy flagging every `eval()` in your test suite, the actual vulnerability (a subtle authorization gap in a multi-tenant API, or a deserialization path through a parser the model didn't investigate) goes unnoticed.

Our goal was to build something that produces audits a security engineer would actually want to read and act on.

## The Key Insight: From "Find Scary Things" to "Decide What Matters"

The single most important design decision was this: **a dangerous sink is not a vulnerability.**

This sounds obvious, but it's the root of most false positives. The model sees `os.system(user_input)` and flags it. But the question isn't whether the sink is dangerous; it's whether the attacker *crosses a meaningful trust boundary* to reach it, and whether they *gain capability they didn't already have*.

A desktop application executing commands from its own config file, which only the user can edit, running as that same user? That's not a vulnerability. That's the application working as designed. The user already has all the authority the "exploit" would give them.

A web server executing commands from an HTTP request body? That's a completely different trust model, and it *is* a vulnerability.

The skill has to reason about trust boundaries and privilege deltas, not just dangerous function calls.
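
To make the two questions concrete, here is a minimal Python sketch. It is an illustration of the reasoning, not part of the skill (which is a prompt, not code). The `SinkReach` type, its field names, and the principal labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SinkReach:
    """One path from an input source to a dangerous sink (hypothetical model)."""
    sink: str                  # e.g. "os.system"
    source_principal: str      # who controls the input
    executing_principal: str   # whose authority the sink runs with
    new_capabilities: frozenset = frozenset()  # what the source gains by reaching the sink


def crosses_trust_boundary(reach: SinkReach) -> bool:
    # Simplifying assumption: a boundary exists when the party controlling
    # the input is not the party whose authority executes the sink.
    return reach.source_principal != reach.executing_principal


def is_candidate_vulnerability(reach: SinkReach) -> bool:
    # A dangerous sink alone is not enough: require both a boundary crossing
    # and a capability the source did not already have.
    return crosses_trust_boundary(reach) and bool(reach.new_capabilities)


# The two cases from the text above.
desktop_config = SinkReach("os.system", "local-user", "local-user")
http_body = SinkReach(
    "os.system", "remote-http-client", "service-account",
    frozenset({"command-execution"}),
)
print(is_candidate_vulnerability(desktop_config))
print(is_candidate_vulnerability(http_body))
```

Both calls hit the same sink. Only the second has a source that differs from the executing principal and gains something new.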

## The Exploit Value Test

Early versions of the skill used "Exploit Gain": does the attacker gain new capability? This was good but incomplete. We kept missing a class of issues that are quiet but dangerous:

- A same-user persistence mechanism (autostart entry, shell hook, CI pipeline poisoning)
- A config file that gets loaded from a remote sync rather than a static local path
- A build step that executes code in a different context than the runtime

These don't give the attacker *new privileges* in the traditional sense. But they give them **persistence**, **stealth**, **scope expansion**, or **context shifting**. That's valuable to an attacker even without privilege escalation.

So we evolved "Exploit Gain" into **"Exploit Value"**:

```
Exploit Value = Capability Gain + Persistence + Stealth + Scope
```

This is now the gating test before any severity assignment. If Exploit Value ≤ 0 (no new capability, no persistence, no stealthy scope expansion), the finding is capped at Low or Informational. Usually it belongs in a different category entirely.
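
The gate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the skill expresses the test in prose, so the 0/1 scoring of each term, the `ExploitValue` type, and the five-level severity scale (including "High", which this post never names) are hypothetical.

```python
from dataclasses import dataclass

# Assumed severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["Informational", "Low", "Medium", "High", "Critical"]


@dataclass(frozen=True)
class ExploitValue:
    # Each term is scored 0 (absent) or 1 (present); the scoring is an assumption.
    capability_gain: int = 0
    persistence: int = 0
    stealth: int = 0
    scope: int = 0

    def total(self) -> int:
        return self.capability_gain + self.persistence + self.stealth + self.scope


def gate_severity(proposed: str, value: ExploitValue) -> str:
    """Cap the proposed severity at Low when the exploit value is not positive."""
    if value.total() <= 0:
        # Keep whichever is lower: the proposed severity or the "Low" cap.
        return min(proposed, "Low", key=SEVERITY_ORDER.index)
    return proposed


print(gate_severity("Critical", ExploitValue()))
print(gate_severity("Medium", ExploitValue(persistence=1)))
```

The gate only ever lowers a severity. A positive exploit value lets the proposed severity through unchanged; it does not justify raising it.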

## The Classification Taxonomy

Most audit frameworks have findings and... that's it. We found this forced the model to either inflate borderline issues into "findings" or drop them entirely. Neither is correct.

The skill now distinguishes five categories:

- **Confirmed finding**: you can trace the vulnerable path end to end
- **Likely risk**: dangerous pattern, needs one missing fact confirmed (but you must name a specific file/function, not an abstract category)
- **Abuse primitive**: not a vulnerability, but a dangerous building block that could be chained in future attacks
- **Hardening opportunity**: not currently exploitable, but weakens security posture
- **Design property (by design)**: behavior intentionally exposed to a trusted actor; not a vulnerability, but deserves documentation

The **Abuse Primitive** category is the one that surprises people. It captures things like "executes arbitrary shell from config" or "evaluates templates dynamically." These aren't vulnerabilities on their own: the config is trusted, the templates are trusted. But they're *perfect building blocks* for an attacker who finds a way to influence that config or those templates through a different path.

This matters because LLMs and modern attackers are increasingly capable of combining multiple low-severity issues into critical impact. If you only track standalone vulnerabilities, you miss the chains.
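
For readers who post-process audit output, the taxonomy maps naturally onto an enum. The sketch below is an assumption about how one might model it in Python; the `Entry` fields and the `validate` and `chain_candidates` helpers are hypothetical and do not come from the skill itself.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    CONFIRMED_FINDING = "Confirmed finding"
    LIKELY_RISK = "Likely risk"
    ABUSE_PRIMITIVE = "Abuse primitive"
    HARDENING_OPPORTUNITY = "Hardening opportunity"
    DESIGN_PROPERTY = "Design property (by design)"


@dataclass
class Entry:
    title: str
    classification: Classification
    location: str = ""      # specific file/function, in any format you like
    missing_fact: str = ""  # the one fact still to be confirmed


def validate(entry: Entry) -> list[str]:
    """Return the reasons an entry breaks the taxonomy's rules (empty if none)."""
    problems = []
    if entry.classification is Classification.LIKELY_RISK:
        if not entry.location:
            problems.append("likely risk must name a specific file/function")
        if not entry.missing_fact:
            problems.append("likely risk must state the one missing fact")
    return problems


def chain_candidates(entries: list[Entry]) -> list[Entry]:
    # Abuse primitives are kept, not dropped, so they can be reviewed as
    # possible links in a multi-step attack chain.
    return [e for e in entries if e.classification is Classification.ABUSE_PRIMITIVE]
```

Keeping abuse primitives as a separately queryable group is the point of the category: they stay out of the findings list but remain available when evaluating chains.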

## The Same-User Exception

Here's the subtlety that took us the longest to get right.

Our false-positive downgrade heuristics say: "same-user, same-authority effects should not be Medium+." This correctly kills the most common class of false positives: desktop apps, local tools, trusted config execution.

But it *over-corrects*. Same-user is NOT safe if the attacker gains:

- **Persistence** across restarts or sessions
- **Integrity impact** on future trusted execution (e.g., poisoning a CI pipeline, autostart, or shell hook)
- **Context shifting** (triggering execution in a different context, like build time vs. run time)
- **Silent hijacking** of trusted workflows

This exception rule is critical. Without it, the skill would systematically miss persistence mechanisms, supply chain attacks, and developer tooling compromise, exactly the class of issues that are most valuable to modern attackers.
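
The interaction between the downgrade heuristic and its exceptions can be sketched as a single predicate. This is a hedged illustration only: the effect labels and the function name are hypothetical, and the real skill states the rule in prose.

```python
# Effects that void the same-user downgrade (labels are assumptions that
# mirror the four bullets above).
SAME_USER_EXCEPTIONS = frozenset({
    "persistence",
    "integrity_impact",
    "context_shifting",
    "silent_hijacking",
})


def may_downgrade_same_user(same_user: bool, effects: set) -> bool:
    """True when the same-user heuristic may keep a finding below Medium."""
    if not same_user:
        # The heuristic only applies to same-user, same-authority effects.
        return False
    # Any single exception effect is enough to block the downgrade.
    return not (set(effects) & SAME_USER_EXCEPTIONS)


print(may_downgrade_same_user(True, set()))
print(may_downgrade_same_user(True, {"persistence"}))
```

A same-user config that only runs a command passes the downgrade. The same config writing an autostart entry does not, because it buys persistence.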

## Design Properties Need Guardrails

The "Design property (by design)" category is useful, but it's also dangerous. In the real world, "that's by design" is how actual vulnerabilities get dismissed.

So we require every Design Property to explicitly answer:

- **Who is trusted?**
- **Why are they trusted?**
- **Can trust be violated in practice?** (e.g., config loaded from a remote sync vs. a static local file)

If you can't answer these, it doesn't belong in this category. This prevents lazy classification and forces the model to reason about whether the trust assumption actually holds.
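
If you consume the audit output programmatically, the same guardrail is easy to enforce as a structural check. The following Python sketch assumes a hypothetical `DesignProperty` record; the field names are ours, not the skill's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignProperty:
    behavior: str
    who_is_trusted: str
    why_trusted: str
    can_trust_be_violated: str  # a free-text answer, not a bare yes/no


def accept_design_property(prop: DesignProperty) -> bool:
    """Accept the classification only if all three questions have an answer."""
    answers = (prop.who_is_trusted, prop.why_trusted, prop.can_trust_be_violated)
    # Whitespace-only answers count as unanswered.
    return all(answer.strip() for answer in answers)
```

This only checks that the answers exist. Whether they are convincing still needs a reviewer, human or model.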

## "What Would Prove Me Wrong?"

This is the safety valve. The skill requires the model to explicitly state, for every finding and in its scratchpad reasoning: "What would prove me wrong?"

This matters because the skill is now strong enough to be *convincingly wrong*. It reasons well, sounds authoritative, and filters aggressively. When it makes a mistake, it will be a confident mistake. Forcing it to articulate how its own hypothesis could be falsified is the best defense against that.

## The Scratchpad as Execution Environment

The `<security_scratchpad>` isn't a post-hoc summary; it's the model's live investigation workspace. The model must use it to:

- Plan which files and routes to investigate
- Trace data flows from ingress → trust boundary → sink → impact
- State attacker capability before and after
- Play Devil's Advocate against its own hypotheses
- Evaluate exploit chaining
- State what would prove it wrong
- Conclude with exact classification

The scratchpad template mirrors these instructions explicitly: Devil's Advocate and Exploit Chaining are required headers, not just internal reasoning steps. This means they actually appear in the output rather than silently influencing it.
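
Because the required headers appear in the output, their presence can be checked mechanically. Here is a rough Python sketch of such a check. It assumes the scratchpad is closed with a matching `</security_scratchpad>` tag and that the header text matches the names used above; both are assumptions, so consult the published prompt for the exact template.

```python
import re

# Only the two headers this post names as required; the real template has more.
REQUIRED_HEADERS = ("Devil's Advocate", "Exploit Chaining")

SCRATCHPAD_RE = re.compile(
    r"<security_scratchpad>(.*?)</security_scratchpad>", re.DOTALL
)


def missing_scratchpad_sections(output: str) -> list[str]:
    """List the required headers that do not appear inside the scratchpad."""
    match = SCRATCHPAD_RE.search(output)
    if match is None:
        # No scratchpad at all: every required section is missing.
        return list(REQUIRED_HEADERS)
    body = match.group(1).lower()
    # Case-insensitive substring match; header formatting is not checked.
    return [header for header in REQUIRED_HEADERS if header.lower() not in body]
```

A check like this catches a model that skipped a step. It cannot tell whether the Devil's Advocate section contains a real counter-argument.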

## The 2025-2026 Threat Model

The skill is explicitly grounded in the current threat landscape, not a 2021-era checklist. Key shifts it reflects:

- **Broken access control** remains the top application risk
- **Supply chain failures** are now a core appsec category (SHA pinning, OIDC token scope, artifact integrity, mutable action references)
- **Mishandling of exceptional conditions** is a first-class security category (fail-open paths, partial transaction recovery, missing rollback)
- **AI/Agent surfaces** are real attack surfaces (prompt injection, tool-output-to-tool-input leakage, excessive agency)
- **AI-driven exploit chaining**: LLMs and modern attackers combine multiple low-severity issues to achieve critical impact

The threat model has a version and a cutoff date (April 2026), so you know when it starts getting stale.

## What We Learned From Testing

We tested the skill against the source code of Openbox, a window manager for Linux/BSD desktops. This was the perfect stress test because it's exactly the kind of codebase that produces false positives with naive audit prompts:

- Config files that intentionally execute commands
- Desktop entries that intentionally launch programs
- IPC mechanisms that intentionally pass messages between same-user processes
- Session management that intentionally restores state

A naive audit flags all of these as Critical. Our early versions flagged most of them as Medium+. The final version correctly classified the majority as Design Properties, with specific reasoning about why trust holds (or doesn't) in each case.

The real vulnerabilities (the subtle authorization gaps, the parser edge cases, the incomplete fixes) actually became *more* visible once the noise was gone.

## The Full Skill

The complete skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill). We're publishing it in full because:

1. **It's defensive tooling.** Knowing the audit methodology doesn't help attackers bypass it; it tells them what we'll catch, which pushes them toward the gaps we want to find anyway.
2. **The moat isn't the prompt.** The competitive advantage is the integration into the tool loop, the continuous refinement from real audits, and the execution environment. Anyone can copy the markdown; nobody can copy the flywheel.
3. **Feedback accelerates quality.** We've already gotten massive improvements from having multiple frontier models review it. Opening it to security practitioners will produce another round of refinement.
4. **It raises the bar.** Most "AI security audit" tools right now are glorified checklists. Publishing something rigorous forces the field to level up.

## The One-Sentence Rule

If you take nothing else from this post, take this:

> Never declare a security finding unless you can trace attacker-controlled data across a trust boundary to a privileged sink with positive exploit value.

That's the entire skill compressed into one sentence. Everything else is enforcement machinery.

---

*The Security Audit Skill is part of [Zaguán Blade](https://zaguan.ai), an AI-powered coding environment. The skill definition is [available on GitHub](https://github.com/ZaguanAI/security-audit-skill) under the Apache 2.0 license. Feedback, issues, and contributions are welcome.*