Within two weeks of each other in early 2026, Anthropic shipped Claude Code Security and OpenAI shipped Codex Security. Google DeepMind followed with CodeMender. All three promise the same thing: AI that finds vulnerabilities in your codebase the way a human security researcher would, not by pattern matching, but by reasoning about code.
The security industry reacted predictably. Cybersecurity stocks sold off. Twitter declared traditional auditing dead. Neither reaction was warranted. A week after we first published this comparison, DryRun Security released data showing that AI coding agents themselves introduce vulnerabilities in 87% of pull requests. The tools built to find bugs are shipping alongside tools that create them.
We run smart contract audits for a living. We also build Sentinel, our own AI-powered audit engine. We have a direct stake in understanding what these tools actually deliver, where they genuinely advance the state of the art, and where the marketing outruns the capability. Especially when the question our clients keep asking is: "does this replace a smart contract audit?" This is that breakdown.
What Each Tool Claims to Do
This piece focuses on Claude Code Security and Codex Security as the two most complete offerings. Both position themselves as reasoning-first vulnerability scanners. Both claim to go beyond static analysis. Both emphasize that they understand code the way humans do: tracing data flows, modeling interactions between components, and catching the bugs that rule-based tools miss.
The framing is nearly identical. The architectures are not. Google's CodeMender takes a different approach, combining static analysis, dynamic analysis, fuzzing, and SMT solvers with Gemini Deep Think models to both find and rewrite vulnerable code. It has upstreamed 72 security fixes to open-source projects over six months, but remains earlier-stage and less documented than the other two.
Claude Code Security: Reasoning-First, Human-in-the-Loop
Anthropic launched Claude Code Security in February 2026 as a research preview for Enterprise and Team customers, with expedited access for open-source maintainers.
How It Works
Claude Code Security scans a codebase by reasoning about it rather than matching against known vulnerability patterns. It models component interactions, traces data flows across files, and identifies logic flaws that require holistic understanding of the system.
Every finding goes through a multi-stage verification process. After the initial scan, Claude re-examines each result, attempting to prove or disprove its own findings and filter out false positives. Each surviving finding gets both a severity rating and a confidence rating, an honest acknowledgment that "these issues often involve nuances that are difficult to assess from source code alone."
Nothing ships without human approval. Claude Code Security identifies problems and suggests patches, but developers review everything through a dashboard before any change is applied.
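The severity-plus-confidence pairing changes how a team works the queue. Here is a minimal triage sketch; the `Finding` schema and field names are ours for illustration, not Claude Code Security's actual output format:

```python
from dataclasses import dataclass

# Hypothetical finding schema -- illustrates how severity and confidence
# combine during triage; the fields are illustrative, not the tool's API.
@dataclass
class Finding:
    title: str
    severity: str    # "critical" | "high" | "medium" | "low"
    confidence: str  # "high" | "medium" | "low"

SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "low": 0}
CONFIDENCE_RANK = {"high": 2, "medium": 1, "low": 0}

def triage_order(findings):
    """Order the review queue: severity dominates, confidence breaks ties.
    Low-confidence findings are leads to investigate, not noise to drop."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f.severity], CONFIDENCE_RANK[f.confidence]),
        reverse=True,
    )

queue = triage_order([
    Finding("SQL injection in /search", "high", "high"),
    Finding("Possible auth bypass in token refresh", "critical", "low"),
    Finding("Verbose error leaks stack trace", "low", "high"),
])
# Note the critical/low-confidence finding still lands at the top of the
# queue: uncertainty lowers priority within a severity band, not across one.
```

Whether confidence should ever outrank severity is a policy choice; the point is that having both signals makes the choice explicit.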
What It Found
Using Opus 4.6, Anthropic's team found more than 500 zero-day vulnerabilities in production open-source codebases: bugs that had gone undetected through years of expert review, some for decades.
The flagship example: a heap buffer overflow discovered by reasoning about the LZW compression algorithm. Traditional coverage-guided fuzzing couldn't catch it even with 100% code coverage. The vulnerability existed in the logic of how the algorithm handled edge cases, not in any pattern a static analyzer would flag.
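Why can a fuzzer with 100% coverage miss this class of bug? Because the overflow lives in program state, not in an unexecuted branch. The deliberately toy decoder below (not GIF's actual LZW code) makes the point: benign input executes every line, but only a crafted code sequence grows the dictionary past its fixed capacity, which in C would be a write past the end of a heap buffer.

```python
# Toy LZW-style decoder. CAPACITY stands in for a fixed-size C buffer.
CAPACITY = 4096

def decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    next_index = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # The "code not yet in table" case is the classic LZW edge case.
        entry = table[code] if code in table else prev + prev[:1]
        out.append(entry)
        # BUG: no check that next_index < CAPACITY before this write.
        # In C, the write would land past the end of the allocation.
        table[next_index] = prev + entry[:1]
        next_index += 1
        prev = entry
    return b"".join(out), next_index

# Benign input exercises every line, and the table stays in bounds --
# a coverage-guided fuzzer gets no new signal pointing at the bug:
_, used_benign = decode([65, 66, 256, 67])
assert used_benign <= CAPACITY

# A long crafted stream drives the very same lines into the overflow state:
crafted = [65, 66] + [256 + i for i in range(CAPACITY - 256 + 1)]
_, used_crafted = decode(crafted)
assert used_crafted > CAPACITY
```

Reasoning about the algorithm ("what bounds the dictionary index?") finds this directly; mutating inputs to maximize line coverage does not.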
Strengths
- Deep reasoning on complex logic flaws. Cross-file analysis, authentication bypasses, broken access control, business logic bugs. The types of issues that require understanding what the code is supposed to do, not just what it does.
- Confidence scoring. Acknowledging uncertainty is more useful than false certainty. A "high severity, medium confidence" finding tells you something different from "high severity, high confidence," and both are actionable.
- Human-in-the-loop by design. Not bolted on. The workflow assumes a human reviews every finding and approves every patch.
Limitations
- No sandbox validation. Findings are based on static reasoning over source code. There is no runtime environment to test whether a flagged vulnerability is actually exploitable in the running system.
- The code it generates is not reliably secure. This is the paradox Snyk documented: Claude Opus 4.5 produces secure code only 56% of the time without security prompting. The model that finds vulnerabilities can introduce new ones in its suggested patches. DryRun Security's March 2026 study confirmed this in practice: when Claude Code built applications from scratch, it produced the fewest total issues (13) but carried the longest-lived unresolved high-severity findings across multiple PRs, including a 2FA-disable bypass, an insecure direct object reference, and an unauthenticated destructive endpoint.
- The broader pattern holds. AI-generated code is 2.74x more likely to contain XSS vulnerabilities and 1.57x more likely to have security findings overall, according to CodeRabbit analysis. The patches need their own review.
- The tool itself had security vulnerabilities. Check Point Research disclosed two CVEs in Claude Code: CVE-2025-59536 (CVSS 8.7), a code injection flaw allowing arbitrary shell command execution via malicious repository config files, and CVE-2026-21852 (CVSS 5.3), an information disclosure vulnerability enabling API key exfiltration. Both are patched, but the irony is instructive: the security scanner needed its own security review.
Codex Security: Threat Models and Sandbox Validation
OpenAI launched Codex Security in March 2026 as a research preview, available to Pro, Enterprise, Business, and Edu customers through the Codex web interface. It was previously known as Aardvark during private beta.
How It Works
Codex Security operates in three stages:
1. System analysis and threat modeling. The agent analyzes repository structure and generates an editable threat model that captures what the system does, what it trusts, and where it is most exposed. Teams can edit the threat model to keep the agent's understanding aligned with reality.
2. Vulnerability identification. Using the threat model as context, Codex searches for vulnerabilities with agentic reasoning. Findings are classified based on real-world impact rather than theoretical severity.
3. Sandbox validation. Flagged issues are tested in sandboxed environments. When configured with a project-specific environment, Codex validates potential issues against the running system and produces working proofs-of-concept.
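To make stage 1 concrete, here is what an editable threat model might look like, and why editability matters. The structure, field names, and scenario below are ours for illustration; OpenAI has not published Codex Security's actual format:

```python
# A hypothetical threat model in the spirit of stage 1. Everything here
# is illustrative -- not Codex Security's real schema.
threat_model = {
    "system": "payments-api",
    "assets": ["card tokens", "customer PII", "webhook signing secret"],
    "trust_boundaries": [
        {"from": "public internet", "to": "API gateway", "auth": "OAuth2 bearer"},
        {"from": "API gateway", "to": "billing service", "auth": "mTLS"},
    ],
    "assumptions": [
        "webhook payloads are signed and verified before processing",
    ],
}

# "Editable" means a reviewer corrects the agent's understanding *before*
# the vulnerability search begins. Suppose signature verification was
# quietly removed last quarter -- the stale assumption would have steered
# the scan away from the webhook endpoint entirely:
threat_model["assumptions"].remove(
    "webhook payloads are signed and verified before processing"
)
threat_model["open_questions"] = [
    "webhook endpoint may accept unsigned payloads",
]
```

This is the false-positive (and false-negative) reduction "at the source" that the architecture is going for: bad context corrected before it compounds into bad findings.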
What It Found
During the 30-day beta period, Codex Security scanned over 1.2 million commits across external repositories, identifying 792 critical findings and 10,561 high-severity findings.
The tool discovered CVEs in OpenSSH, GnuTLS, GOGS, Thorium, libssh, PHP, and Chromium. Specific CVEs include vulnerabilities in GnuPG (CVE-2026-24881, CVE-2026-24882) and GnuTLS (CVE-2025-32988, CVE-2025-32989, CVE-2025-32990, a heap buffer overflow in certtool).
Strengths
- Editable threat models. Generating a project-specific threat model and letting teams edit it is a meaningful architectural choice. It means the scanner's context can be corrected before it starts looking for bugs, reducing false positives at the source.
- Sandbox validation with proof-of-concept generation. Testing findings in a runtime environment, and producing working exploits when possible, is the strongest differentiator from Claude Code Security. A finding with a working PoC is qualitatively different from a finding with a confidence score.
- False positive reduction at scale. OpenAI reports false positive rates dropped by more than 50% across all repositories during beta, and over-reported severity findings dropped by more than 90%.
Limitations
- Detection quality is baseline. OpenAI explicitly states that "detection quality and false-positive ratios should improve as adoption expands." The current numbers represent early capabilities, not optimized results.
- No confidence scoring on individual findings. Unlike Claude Code Security, Codex does not publish per-finding confidence ratings. You get severity, but not the tool's self-assessed certainty.
- Same patch reliability problem. The code generation limitations apply equally. AI-suggested fixes require the same human validation as Claude's — the scanner and the patch generator share a model, and that model's secure-code rate hasn't kept pace with its detection capabilities.
Head-to-Head
| Capability | Claude Code Security | Codex Security |
|---|---|---|
| Approach | Reasoning-first, multi-stage self-verification | Threat model + agentic search + sandbox validation |
| Threat modeling | Implicit (model reasons about code context) | Explicit, editable threat model per project |
| Sandbox validation | No | Yes, with proof-of-concept generation |
| Severity ratings | Yes | Yes |
| Confidence ratings | Yes, per finding | No |
| Notable findings | 500+ zero-days in open-source projects (curated) | 792 critical + 10,561 high-severity across 1.2M commits (raw volume) |
| False positive handling | Multi-stage self-verification | Sandbox testing, 50%+ reduction reported |
| Patch generation | Yes, human-approved | Yes, architecture-aligned |
| Availability | Enterprise + Team (research preview) | Pro, Enterprise, Business, Edu (research preview) |
| Human-in-the-loop | Mandatory for all changes | Mandatory for all changes |
| DryRun Security test (March 2026) | 13 issues in web app build, fewest total but longest-lived unresolved high-severity flaws | Fewest remaining issues in web app test |
The pattern: Claude Code Security is stronger on reasoning depth and honest uncertainty quantification. Codex Security is stronger on structured threat modeling and runtime validation. Both require human review. Neither produces reliably secure patches. The DryRun Security data below quantifies this in detail.
What Neither Tool Solves
Both tools represent genuine progress. Finding a heap buffer overflow by reasoning about compression algorithms is not something static analyzers do. Producing working proofs-of-concept in sandboxed environments is not something traditional SAST delivers. The era of AI-powered vulnerability discovery is real.
But there are two gaps that neither tool closes.
Gap 1: Detection Is Not Remediation
Finding vulnerabilities has never been the bottleneck. The bottleneck is fixing them at scale, validating the fixes, and ensuring the fixes don't introduce new issues. Both tools generate patches, but both generate patches with the same models that produce secure code roughly half the time. The remediation loop, from finding to verified fix to deployment, still requires human security engineers.
DryRun Security's Agentic Coding Security Report (March 13, 2026) quantified this precisely. Across 30 pull requests generated by Claude Code, OpenAI Codex, and Google Gemini, 87% introduced at least one vulnerability — 143 total security issues across 38 scans. Broken access control appeared across all three agents. Every OAuth implementation across every agent contained exploitable flaws. The models that power vulnerability scanners are simultaneously the models generating vulnerable code at scale.
As Snyk's analysis put it: "The future of AppSec isn't about building better scanners. It's about closing the loop between detection and remediation automatically, at scale." Neither tool closes that loop yet. The DryRun data suggests the loop may actually be widening: as AI coding agents ship more code faster, the volume of vulnerabilities requiring remediation grows alongside detection capabilities.
Gap 2: Smart Contracts Are a Different Attack Surface
Both tools scan traditional codebases: Python, JavaScript, C, C++, Go, Rust. They reason about web applications, APIs, server infrastructure, and system-level code.
Neither tool audits smart contracts the way smart contracts need to be audited.
Smart contract vulnerabilities operate under fundamentally different constraints. Execution is deterministic and public. Transactions are irreversible. Composability means a vulnerability in one contract can be exploited through interactions with other contracts that didn't exist when the code was written. Economic attack vectors (flash loan manipulation, oracle exploitation, governance attacks) require understanding financial system design, not just code logic.
A Codex Security threat model built from a Solidity repository will identify some access control issues and some reentrancy patterns. It will not model flash loan attack paths, cross-chain bridge validation failures, or economic exploits that require understanding how AMMs, lending protocols, and liquidation engines interact under adversarial conditions.
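To see why these economic paths differ from ordinary code bugs, here is a deliberately simplified sketch of one of them: a flash loan shifts a constant-product AMM's spot price, and any lending protocol reading that spot price as its oracle mis-values collateral within a single atomic transaction. Pool sizes and numbers are illustrative; real pools charge fees and real attacks chain more steps.

```python
class ConstantProductPool:
    """Toy x*y = k AMM. Spot price of token X in Y is reserve_y / reserve_x."""
    def __init__(self, reserve_x, reserve_y):
        self.reserve_x = reserve_x
        self.reserve_y = reserve_y

    def spot_price(self):
        return self.reserve_y / self.reserve_x

    def swap_x_for_y(self, dx):
        # Fee omitted for clarity. Output amount keeps x*y constant.
        k = self.reserve_x * self.reserve_y
        new_x = self.reserve_x + dx
        dy = self.reserve_y - k / new_x
        self.reserve_x, self.reserve_y = new_x, k / new_x
        return dy

pool = ConstantProductPool(reserve_x=1_000, reserve_y=1_000_000)
honest_price = pool.spot_price()  # 1,000 Y per X

# Attacker flash-borrows 9,000 X (9x the pool's X reserve) and dumps it all.
pool.swap_x_for_y(9_000)
manipulated_price = pool.spot_price()

# honest_price is 1000.0; manipulated_price is 10.0. A lending protocol
# using pool.spot_price() as its oracle now values X collateral at 1% of
# its honest price -- within one transaction, before the loan is repaid,
# so positions collateralized in X become unfairly liquidatable.
```

Note that every individual line of both contracts can be correct in isolation; the vulnerability is the composition of an unmanipulated-price assumption with atomic, uncollateralized borrowing. That is not a pattern a general-purpose scanner's threat model is built around.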
OpenAI and Paradigm acknowledge this gap implicitly. Their EVMbench benchmark (February 2026) evaluates AI agents on 117 curated smart contract vulnerabilities across three tasks: detecting, patching, and exploiting flaws. The results are telling. GPT-5.3-Codex can now exploit over 70% of critical Code4rena bugs — up from under 20% when the project started. But patching remains unreliable: fixing contracts while preserving correct behavior across edge cases is where models still fail. Detection and exploitation are advancing. Remediation is not. And this benchmark only covers known vulnerability patterns from past audits, not the novel economic attack vectors that cause the largest losses.
This is the layer where $10.77 billion in exploit losses have occurred. It is not a layer that general-purpose code scanners are built to cover.
Where We Fit
This is the gap we exist to close.
Sentinel is our AI-powered audit engine, purpose-built for smart contract security. Not adapted from a general-purpose code scanner. Built from the ground up for Solidity, Rust/Solana, and the vulnerability classes that matter on-chain: reentrancy, flash loan vectors, oracle manipulation, economic exploits, cross-protocol composability risks. The same AI-driven reasoning that Claude Code Security and Codex Security bring to traditional codebases, but trained on DeFi-specific threat models and validated by auditors who have seen these exploits in production.
Sentinel augments our auditors. It doesn't replace them. The same limitation applies: AI accelerates detection, but final judgment on exploitability and remediation comes from engineers who have seen these attacks in production.
After the audit, Tripwire provides continuous on-chain monitoring. Real-time anomaly detection seeded from your audit findings, running 24/7 against live contract state. Not periodic scans. Continuous surveillance.
Use Claude Code Security and Codex Security for your off-chain infrastructure. They're good at what they do. But if your system touches smart contracts, that's a different attack surface with different stakes. That surface is ours.
What Your Security Team Should Do Now
Regardless of which AI scanner you adopt:
1. Use AI scanners for off-chain code — all of them. Claude Code Security, Codex Security, and CodeMender use different architectures and will catch different things. They're complementary, not redundant. Run them on your backend, API, and infrastructure code. But recognize that none of them cover your on-chain attack surface — that requires a dedicated smart contract audit.
2. Never auto-merge AI-generated patches. Both tools require human review for a reason. Treat every suggested fix as a pull request that needs security review, because the patch itself may introduce vulnerabilities.
3. Layer your scanning stack. AI scanners complement deterministic tools (Semgrep, Slither, Mythril). They don't replace them. Use rule-based tools for known patterns. Use AI tools for the logic flaws and cross-component bugs that rules can't catch.
4. Audit your smart contracts separately. If your application interacts with on-chain systems, no amount of off-chain code scanning covers that surface. Scope an engagement.
5. Monitor continuously. Vulnerability scanning is point-in-time. Continuous monitoring catches what changes after the scan.
SigIntZero provides smart contract security audits, AI-powered code analysis via Sentinel, and continuous on-chain monitoring via Tripwire. If your AI agents touch the chain, talk to us.