Within two weeks of each other in early 2026, Anthropic shipped Claude Code Security and OpenAI shipped Codex Security. Google DeepMind followed with CodeMender. All three promise the same thing: AI that finds vulnerabilities in your codebase the way a human security researcher would, not by pattern matching, but by reasoning about code.
The security industry reacted predictably. Cybersecurity stocks sold off. Twitter declared traditional auditing dead. Neither reaction was warranted. A week after we first published this comparison, DryRun Security released data showing that AI coding agents themselves introduce vulnerabilities in 87% of pull requests. The tools built to find bugs are shipping alongside tools that create them.
We run smart contract audits for a living. We also build Sentinel, our own AI-powered audit engine. We have a direct stake in understanding what these tools actually deliver, where they genuinely advance the state of the art, and where the marketing outruns the capability. Especially when the question our clients keep asking is: "does this replace a smart contract audit?" This is that breakdown.
What Each Tool Claims to Do
This piece focuses on Claude Code Security and Codex Security as the two most complete offerings. Both position themselves as reasoning-first vulnerability scanners. Both claim to go beyond static analysis. Both emphasize that they understand code the way humans do: tracing data flows, modeling interactions between components, and catching the bugs that rule-based tools miss.
The framing is nearly identical. The architectures are not. Google's CodeMender takes a different approach, combining static analysis, dynamic analysis, fuzzing, and SMT solvers with Gemini Deep Think models to both find and rewrite vulnerable code. It has upstreamed 72 security fixes to open-source projects over six months, but remains earlier-stage and less documented than the other two.
Claude Code Security: Reasoning-First, Human-in-the-Loop
Anthropic launched Claude Code Security in February 2026 as a research preview for Enterprise and Team customers, with expedited access for open-source maintainers.
How It Works
Claude Code Security scans a codebase by reasoning about it rather than matching against known vulnerability patterns. It models component interactions, traces data flows across files, and identifies logic flaws that require holistic understanding of the system.
Every finding goes through a multi-stage verification process. After the initial scan, Claude re-examines each result, attempting to prove or disprove its own findings and filter out false positives. Each surviving finding gets both a severity rating and a confidence rating, an honest acknowledgment that "these issues often involve nuances that are difficult to assess from source code alone."
Nothing ships without human approval. Claude Code Security identifies problems and suggests patches, but developers review everything through a dashboard before any change is applied.
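The severity-plus-confidence pairing changes how a team works the queue. Here is a minimal triage sketch; the `Finding` schema and field names are ours for illustration, not Claude Code Security's actual output format:

```python
from dataclasses import dataclass

# Hypothetical finding schema -- illustrates how severity and confidence
# combine during triage; the fields are illustrative, not the tool's API.
@dataclass
class Finding:
    title: str
    severity: str    # "critical" | "high" | "medium" | "low"
    confidence: str  # "high" | "medium" | "low"

SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "low": 0}
CONFIDENCE_RANK = {"high": 2, "medium": 1, "low": 0}

def triage_order(findings):
    """Order the review queue: severity dominates, confidence breaks ties.
    Low-confidence findings are leads to investigate, not noise to drop."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f.severity], CONFIDENCE_RANK[f.confidence]),
        reverse=True,
    )

queue = triage_order([
    Finding("SQL injection in /search", "high", "high"),
    Finding("Possible auth bypass in token refresh", "critical", "low"),
    Finding("Verbose error leaks stack trace", "low", "high"),
])
# Note the critical/low-confidence finding still lands at the top of the
# queue: uncertainty lowers priority within a severity band, not across one.
```

Whether confidence should ever outrank severity is a policy choice; the point is that having both signals makes the choice explicit.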
What It Found
Using Opus 4.6, Anthropic's team found more than 500 zero-day vulnerabilities in production open-source codebases: bugs that had gone undetected through years of expert review, some for decades.
The flagship example: a heap buffer overflow discovered by reasoning about the LZW compression algorithm. Traditional coverage-guided fuzzing couldn't catch it even with 100% code coverage. The vulnerability existed in the logic of how the algorithm handled edge cases, not in any pattern a static analyzer would flag.
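Why can a fuzzer with 100% coverage miss this class of bug? Because the overflow lives in program state, not in an unexecuted branch. The deliberately toy decoder below (not GIF's actual LZW code) makes the point: benign input executes every line, but only a crafted code sequence grows the dictionary past its fixed capacity, which in C would be a write past the end of a heap buffer.

```python
# Toy LZW-style decoder. CAPACITY stands in for a fixed-size C buffer.
CAPACITY = 4096

def decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    next_index = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # The "code not yet in table" case is the classic LZW edge case.
        entry = table[code] if code in table else prev + prev[:1]
        out.append(entry)
        # BUG: no check that next_index < CAPACITY before this write.
        # In C, the write would land past the end of the allocation.
        table[next_index] = prev + entry[:1]
        next_index += 1
        prev = entry
    return b"".join(out), next_index

# Benign input exercises every line, and the table stays in bounds --
# a coverage-guided fuzzer gets no new signal pointing at the bug:
_, used_benign = decode([65, 66, 256, 67])
assert used_benign <= CAPACITY

# A long crafted stream drives the very same lines into the overflow state:
crafted = [65, 66] + [256 + i for i in range(CAPACITY - 256 + 1)]
_, used_crafted = decode(crafted)
assert used_crafted > CAPACITY
```

Reasoning about the algorithm ("what bounds the dictionary index?") finds this directly; mutating inputs to maximize line coverage does not.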
Strengths
- Deep reasoning on complex logic flaws. Cross-file analysis, authentication bypasses, broken access control, business logic bugs. The types of issues that require understanding what the code is supposed to do, not just what it does.
- Confidence scoring. Acknowledging uncertainty is more useful than false certainty. A "high severity, medium confidence" finding tells you something different from "high severity, high confidence," and both are actionable.
- Human-in-the-loop by design. Not bolted on. The workflow assumes a human reviews every finding and approves every patch.
Limitations
- No sandbox validation. Findings are based on static reasoning over source code. There is no runtime environment to test whether a flagged vulnerability is actually exploitable in the running system.
- The code it generates is not reliably secure. This is the paradox Snyk documented: Claude Opus 4.5 produces secure code only 56% of the time without security prompting. The model that finds vulnerabilities can introduce new ones in its suggested patches. DryRun Security's March 2026 study confirmed this in practice: when Claude Code built applications from scratch, it produced the fewest total issues (13) but carried the longest-lived unresolved high-severity findings across multiple PRs, including a 2FA-disable bypass, an insecure direct object reference, and an unauthenticated destructive endpoint.
- The broader pattern holds. AI-generated code is 2.74x more likely to contain XSS vulnerabilities and 1.57x more likely to have security findings overall, according to CodeRabbit analysis. The patches need their own review.
- The tool itself had security vulnerabilities. Check Point Research disclosed two CVEs in Claude Code: CVE-2025-59536 (CVSS 8.7), a code injection flaw allowing arbitrary shell command execution via malicious repository config files, and CVE-2026-21852 (CVSS 5.3), an information disclosure vulnerability enabling API key exfiltration. Both are patched, but the irony is instructive: the security scanner needed its own security review.
Codex Security: Threat Models and Sandbox Validation
OpenAI launched Codex Security in March 2026 as a research preview, available to Pro, Enterprise, Business, and Edu customers through the Codex web interface. It was previously known as Aardvark during private beta.
How It Works
Codex Security operates in three stages:
1. System analysis and threat modeling. The agent analyzes repository structure and generates an editable threat model that captures what the system does, what it trusts, and where it is most exposed. Teams can edit the threat model to keep the agent's understanding aligned with reality.
2. Vulnerability identification. Using the threat model as context, Codex searches for vulnerabilities with agentic reasoning. Findings are classified based on real-world impact rather than theoretical severity.
3. Sandbox validation. Flagged issues are tested in sandboxed environments. When configured with a project-specific environment, Codex validates potential issues against the running system and produces working proofs-of-concept.
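To make stage 1 concrete, here is what an editable threat model might look like, and why editability matters. The structure, field names, and scenario below are ours for illustration; OpenAI has not published Codex Security's actual format:

```python
# A hypothetical threat model in the spirit of stage 1. Everything here
# is illustrative -- not Codex Security's real schema.
threat_model = {
    "system": "payments-api",
    "assets": ["card tokens", "customer PII", "webhook signing secret"],
    "trust_boundaries": [
        {"from": "public internet", "to": "API gateway", "auth": "OAuth2 bearer"},
        {"from": "API gateway", "to": "billing service", "auth": "mTLS"},
    ],
    "assumptions": [
        "webhook payloads are signed and verified before processing",
    ],
}

# "Editable" means a reviewer corrects the agent's understanding *before*
# the vulnerability search begins. Suppose signature verification was
# quietly removed last quarter -- the stale assumption would have steered
# the scan away from the webhook endpoint entirely:
threat_model["assumptions"].remove(
    "webhook payloads are signed and verified before processing"
)
threat_model["open_questions"] = [
    "webhook endpoint may accept unsigned payloads",
]
```

This is the false-positive (and false-negative) reduction "at the source" that the architecture is going for: bad context corrected before it compounds into bad findings.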
What It Found
During the 30-day beta period, Codex Security scanned over 1.2 million commits across external repositories, identifying 792 critical findings and 10,561 high-severity findings.
The tool discovered CVEs in OpenSSH, GnuTLS, GOGS, Thorium, libssh, PHP, and Chromium. Specific CVEs include vulnerabilities in GnuPG (CVE-2026-24881, CVE-2026-24882) and GnuTLS (CVE-2025-32988, CVE-2025-32989, CVE-2025-32990, a heap buffer overflow in certtool).
Strengths
- Editable threat models. Generating a project-specific threat model and letting teams edit it is a meaningful architectural choice. It means the scanner's context can be corrected before it starts looking for bugs, reducing false positives at the source.
- Sandbox validation with proof-of-concept generation. Testing findings in a runtime environment, and producing working exploits when possible, is the strongest differentiator from Claude Code Security. A finding with a working PoC is qualitatively different from a finding with a confidence score.
- False positive reduction at scale. OpenAI reports false positive rates dropped by more than 50% across all repositories during beta, and over-reported severity findings dropped by more than 90%.
Limitations
- Detection quality is baseline. OpenAI explicitly states that "detection quality and false-positive ratios should improve as adoption expands." The current numbers represent early capabilities, not optimized results.
- No confidence scoring on individual findings. Unlike Claude Code Security, Codex does not publish per-finding confidence ratings. You get severity, but not the tool's self-assessed certainty.
- Same patch reliability problem. The code generation limitations apply equally. AI-suggested fixes require the same human validation as Claude's — the scanner and the patch generator share a model, and that model's secure-code rate hasn't kept pace with its detection capabilities.
Head-to-Head
| Capability | Claude Code Security | Codex Security |
|---|---|---|
| Approach | Reasoning-first, multi-stage self-verification | Threat model + agentic search + sandbox validation |
| Threat modeling | Implicit (model reasons about code context) | Explicit, editable threat model per project |
| Sandbox validation | No | Yes, with proof-of-concept generation |
| Severity ratings | Yes | Yes |
| Confidence ratings | Yes, per finding | No |
| Notable findings | 500+ zero-days in open-source projects (curated) | 792 critical + 10,561 high-severity across 1.2M commits (raw volume) |
| False positive handling | Multi-stage self-verification | Sandbox testing, 50%+ reduction reported |
| Patch generation | Yes, human-approved | Yes, architecture-aligned |
| Availability | Enterprise + Team (research preview) | Pro, Enterprise, Business, Edu (research preview) |
| Human-in-the-loop | Mandatory for all changes | Mandatory for all changes |
| DryRun Security test (March 2026) | 13 issues in web app build, fewest total but longest-lived unresolved high-severity flaws | Fewest remaining issues in web app test |
The pattern: Claude Code Security is stronger on reasoning depth and honest uncertainty quantification. Codex Security is stronger on structured threat modeling and runtime validation. Both require human review. Neither produces reliably secure patches. The DryRun Security data below quantifies this in detail.
What Neither Tool Solves
Both tools represent genuine progress. Finding a heap buffer overflow by reasoning about compression algorithms is not something static analyzers do. Producing working proofs-of-concept in sandboxed environments is not something traditional SAST delivers. The era of AI-powered vulnerability discovery is real.
But there are two gaps that neither tool closes.
Gap 1: Detection Is Not Remediation
Finding vulnerabilities has never been the bottleneck. The bottleneck is fixing them at scale, validating the fixes, and ensuring the fixes don't introduce new issues. Both tools generate patches, but both generate patches with the same models that produce secure code roughly half the time. The remediation loop, from finding to verified fix to deployment, still requires human security engineers.
DryRun Security's Agentic Coding Security Report (March 13, 2026) quantified this precisely. Across 30 pull requests generated by Claude Code, OpenAI Codex, and Google Gemini, 87% introduced at least one vulnerability — 143 total security issues across 38 scans. Broken access control appeared across all three agents. Every OAuth implementation across every agent contained exploitable flaws. The models that power vulnerability scanners are simultaneously the models generating vulnerable code at scale.
As Snyk's analysis put it: "The future of AppSec isn't about building better scanners. It's about closing the loop between detection and remediation automatically, at scale." Neither tool closes that loop yet. The DryRun data suggests the loop may actually be widening: as AI coding agents ship more code faster, the volume of vulnerabilities requiring remediation grows alongside detection capabilities.
Gap 2: Smart Contracts Are a Different Attack Surface
Both tools scan traditional codebases: Python, JavaScript, C, C++, Go, Rust. They reason about web applications, APIs, server infrastructure, and system-level code.
Neither tool audits smart contracts the way smart contracts need to be audited.
Smart contract vulnerabilities operate under fundamentally different constraints. Execution is deterministic and public. Transactions are irreversible. Composability means a vulnerability in one contract can be exploited through interactions with other contracts that didn't exist when the code was written. Economic attack vectors (flash loan manipulation, oracle exploitation, governance attacks) require understanding financial system design, not just code logic.
A Codex Security threat model built from a Solidity repository will identify some access control issues and some reentrancy patterns. It will not model flash loan attack paths, cross-chain bridge validation failures, or economic exploits that require understanding how AMMs, lending protocols, and liquidation engines interact under adversarial conditions.
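To see why these economic paths differ from ordinary code bugs, here is a deliberately simplified sketch of one of them: a flash loan shifts a constant-product AMM's spot price, and any lending protocol reading that spot price as its oracle mis-values collateral within a single atomic transaction. Pool sizes and numbers are illustrative; real pools charge fees and real attacks chain more steps.

```python
class ConstantProductPool:
    """Toy x*y = k AMM. Spot price of token X in Y is reserve_y / reserve_x."""
    def __init__(self, reserve_x, reserve_y):
        self.reserve_x = reserve_x
        self.reserve_y = reserve_y

    def spot_price(self):
        return self.reserve_y / self.reserve_x

    def swap_x_for_y(self, dx):
        # Fee omitted for clarity. Output amount keeps x*y constant.
        k = self.reserve_x * self.reserve_y
        new_x = self.reserve_x + dx
        dy = self.reserve_y - k / new_x
        self.reserve_x, self.reserve_y = new_x, k / new_x
        return dy

pool = ConstantProductPool(reserve_x=1_000, reserve_y=1_000_000)
honest_price = pool.spot_price()  # 1,000 Y per X

# Attacker flash-borrows 9,000 X (9x the pool's X reserve) and dumps it all.
pool.swap_x_for_y(9_000)
manipulated_price = pool.spot_price()

# honest_price is 1000.0; manipulated_price is 10.0. A lending protocol
# using pool.spot_price() as its oracle now values X collateral at 1% of
# its honest price -- within one transaction, before the loan is repaid,
# so positions collateralized in X become unfairly liquidatable.
```

Note that every individual line of both contracts can be correct in isolation; the vulnerability is the composition of an unmanipulated-price assumption with atomic, uncollateralized borrowing. That is not a pattern a general-purpose scanner's threat model is built around.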
OpenAI and Paradigm acknowledge this gap implicitly. Their EVMbench benchmark (February 2026) evaluates AI agents on 117 curated smart contract vulnerabilities across three tasks: detecting, patching, and exploiting flaws. The results are telling. GPT-5.3-Codex can now exploit over 70% of critical Code4rena bugs — up from under 20% when the project started. But patching remains unreliable: fixing contracts while preserving correct behavior across edge cases is where models still fail. Detection and exploitation are advancing. Remediation is not. And this benchmark only covers known vulnerability patterns from past audits, not the novel economic attack vectors that cause the largest losses.
This is the layer where $10.77 billion in exploit losses have occurred. It is not a layer that general-purpose code scanners are built to cover.
Where We Fit
This is the gap we exist to close.
Sentinel is our AI-powered audit engine, purpose-built for smart contract security. Not adapted from a general-purpose code scanner. Built from the ground up for Solidity, Rust/Solana, and the vulnerability classes that matter on-chain: reentrancy, flash loan vectors, oracle manipulation, economic exploits, cross-protocol composability risks. The same AI-driven reasoning that Claude Code Security and Codex Security bring to traditional codebases, but trained on DeFi-specific threat models and validated by auditors who have seen these exploits in production.
Sentinel augments our auditors. It doesn't replace them. The same limitation applies: AI accelerates detection, but final judgment on exploitability and remediation comes from engineers who have seen these attacks in production.
After the audit, Tripwire provides continuous on-chain monitoring. Real-time anomaly detection seeded from your audit findings, running 24/7 against live contract state. Not periodic scans. Continuous surveillance.
Use Claude Code Security and Codex Security for your off-chain infrastructure. They're good at what they do. But if your system touches smart contracts, that's a different attack surface with different stakes. That surface is ours.
What Your Security Team Should Do Now
Regardless of which AI scanner you adopt:
1. Use AI scanners for off-chain code — all of them. Claude Code Security, Codex Security, and CodeMender use different architectures and will catch different things. They're complementary, not redundant. Run them on your backend, API, and infrastructure code. But recognize that none of them cover your on-chain attack surface — that requires a dedicated smart contract audit.
2. Never auto-merge AI-generated patches. Both tools require human review for a reason. Treat every suggested fix as a pull request that needs security review, because the patch itself may introduce vulnerabilities.
3. Layer your scanning stack. AI scanners complement deterministic tools (Semgrep, Slither, Mythril). They don't replace them. Use rule-based tools for known patterns. Use AI tools for the logic flaws and cross-component bugs that rules can't catch.
4. Audit your smart contracts separately. If your application interacts with on-chain systems, no amount of off-chain code scanning covers that surface. Scope an engagement.
5. Monitor continuously. Vulnerability scanning is point-in-time. Continuous monitoring catches what changes after the scan.
SigIntZero provides smart contract security audits, AI-powered code analysis via Sentinel, and continuous on-chain monitoring via Tripwire. If your AI agents touch the chain, talk to us.