What 3,500 Scans Reveal About AI-Generated Code

538,860 findings across 3,518 automated security scans. The top vulnerability is not injection or XSS. It is missing idempotency defense in webhook handlers.

A few weeks ago I wrote a piece arguing that the era of vibe coding needs a verification layer. That piece was a thesis. The argument was straightforward: AI-generated code ships fast, looks polished, passes smoke tests, and still carries structural vulnerabilities that traditional review processes were never designed to catch at volume.

That was March. The dataset behind this follow-up is 3,518 automated security scans, run between February and April 2026 across multiple production codebases. The scans covered Python, TypeScript, JavaScript, CI/CD configuration, and infrastructure-as-code. They were performed by orchestrated multi-agent review systems using models from OpenAI (GPT-5.3 Codex, GPT-5.2 Codex, GPT-4.1) and Google (Gemini 3.1 Pro), operating in deep-scan and PR-diff modes against real repositories maintained by teams using AI assistants alongside human engineers.

The reviewed code was a mix of human-authored, AI-assisted, and fully AI-generated changes. We did not isolate AI-generated code from human-written code in this analysis. What we measured instead was the security posture of codebases where AI is part of the workflow — which, increasingly, describes most active codebases.

The results are worth examining closely.

The Dataset

3,518 scans. 538,860 findings flagged before merge, deduplicated within each run but not across runs. Five codebases ranging from a FastAPI backend to a React frontend to a Node.js CLI tool to a full-stack identity platform. Scan depth split roughly 77% deep scan (full-codebase analysis), 22% PR-diff (changed-files-only), 1% baseline. Model distribution: 91.6% OpenAI Codex family (GPT-5.2 and GPT-5.3), 5.7% Google Gemini 3.1 Pro, 2.6% OpenAI GPT-4.1.

Deep scans analyzed entire repositories, not just changed files. PR-diff scans were scoped to the files modified in a pull request. Both modes produced findings with file paths, line ranges, and reproducible verification commands. Each finding was deduplicated by canonical finding ID and payload hash within a run; repeated scans of the same codebase could surface the same underlying issue across runs.
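A minimal sketch of the within-run deduplication described above, assuming each finding carries a canonical ID and a JSON-serializable payload. The field names (`finding_id`, `payload`) and the choice of SHA-256 are illustrative assumptions, not the scanners' actual schema.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """Hash a payload deterministically (sorted keys, compact separators)."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def dedupe_within_run(findings: list[dict]) -> list[dict]:
    """Keep the first finding seen for each (finding_id, payload hash) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for finding in findings:
        # "finding_id" and "payload" are hypothetical field names.
        key = (finding["finding_id"], payload_hash(finding["payload"]))
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```

Because the key is scoped to one run, two runs against the same repository would each report the same underlying issue, which is why the totals count findings rather than unique defects.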

The average deep scan surfaced approximately 908 findings per run. The average PR-diff scan surfaced approximately 47. Across all scan types combined, the overall average was 153 findings per scan.

Figure: Severity distribution across 538,860 blocked findings

Where the Findings Land

The severity distribution across all 538,860 findings:

  • **19,344 (3.6%)** were P0 — critical severity. Hard merge blocks.
  • **6,797 (1.3%)** were P1 — high severity. Escalation-worthy.
  • **282,826 (52.5%)** were P2 — medium severity. The bulk of the dataset.
  • **229,893 (42.7%)** were P3 — low severity. Informational or hygiene-level.
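The percentages above follow directly from the counts. A short check, using only the numbers in the list:

```python
# Finding counts per severity tier, copied from the list above.
counts = {"P0": 19_344, "P1": 6_797, "P2": 282_826, "P3": 229_893}

total = sum(counts.values())

# Share of all findings per tier, rounded to one decimal place.
shares = {tier: round(100 * n / total, 1) for tier, n in counts.items()}

for tier, share in shares.items():
    print(f"{tier}: {counts[tier]:>7,} ({share}%)")
print(f"total: {total:,}")
```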

The instinct is to focus on P0. That is the wrong instinct. The interesting signal lives in P2.

P2 findings are the ones that pass every functional test. The app boots. The endpoints respond. The buttons work. The CI pipeline is green. And somewhere inside that green pipeline, a webhook handler accepts replayed events without an idempotency check, or a rate limiter defaults to fail-open instead of fail-closed, or a deploy script calls a rollback hook without verifying the rollback actually succeeded.

These are not theoretical risks. They are the specific findings our scanners flagged, with file paths, line numbers, and reproducible verification steps attached.
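To make the fail-open versus fail-closed distinction concrete, here is a small sketch of a rate-limit check. Everything in it is hypothetical: `get_count` stands in for whatever counter backend a real limiter would query, and `CounterUnavailable` is an invented exception for the case where that backend cannot be reached.

```python
from typing import Callable


class CounterUnavailable(Exception):
    """Raised by the hypothetical counter backend when it cannot be reached."""


def allow_request(
    client_id: str,
    get_count: Callable[[str], int],
    limit: int = 100,
    fail_closed: bool = True,
) -> bool:
    """Return True if the request may proceed."""
    try:
        count = get_count(client_id)
    except CounterUnavailable:
        # Fail-closed: deny when the limiter cannot make a decision.
        # Fail-open (fail_closed=False) would admit unlimited traffic
        # for as long as the backend stays down.
        return not fail_closed
    return count < limit
```

Both variants behave identically while the backend is healthy, which is why a functional test suite that never simulates an outage cannot tell them apart.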

Figure: A single scan output: 4 P2 findings, 1,282 P3 findings, gate status PASSED

Here is what a single scan output looks like. This run passed. Twelve P2 findings, zero P0 or P1, 1,282 informational items. The gate opened and the merge went through. The top findings shown include an authentication gate that relies on fail-open behavior, a publish step using npm token-based auth instead of OIDC-federated identity, a Playwright test that scrapes response headers without asserting security-relevant ones, and a manual patch workflow with no idempotency guard against duplicate deploys — and those are just the first few of twelve. Every one is the kind of issue that works perfectly in a demo and breaks quietly in production. The gate passed because none of them crossed the P0/P1 severity threshold. The findings still exist.

The Shape of the Mistake

This is the part that matters most for anyone trying to understand where code in AI-assisted teams actually breaks.

Figure: Where code in AI-assisted teams breaks: category distribution

The top five finding categories, ranked by share of P0-P2 findings:

  1. **CI/CD Integrity (31%)** — Gate ordering gaps, missing dependency chains between pipeline stages, jobs that can execute before prerequisite quality checks complete.
  2. **Backend Reliability (27%)** — Rollback hooks that fire without result validation, deploy webhooks missing idempotency keys, health checks that report ready before downstream dependencies are confirmed.
  3. **Security Overlay (24%)** — Supply chain trust gaps where binaries are checksum-verified from files fetched over the same unverified channel, dependency installations that diverge between CI and scheduled workflows.
  4. **Supply Chain Provenance (11%)** — Missing signature verification for downloaded tooling, provenance attestation gaps in build artifacts.
  5. **Data Layer Integrity (7%)** — Write paths without retry safety, query patterns that behave differently under concurrent load than under sequential test execution.
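The rollback item under Backend Reliability can be sketched as follows. This is an illustration under stated assumptions, not code from the scanned repositories: `deploy`, `rollback`, and `rollback_verified` are hypothetical callables supplied by the caller.

```python
from typing import Callable


class RollbackFailed(Exception):
    """The rollback hook ran, but the system is not back in a known-good state."""


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    rollback_verified: Callable[[], bool],
) -> str:
    """Deploy; on failure, roll back and confirm the rollback took effect."""
    try:
        deploy()
        return "deployed"
    except Exception as deploy_error:
        rollback()
        # The flagged pattern stops here and assumes success.
        # Checking the result is the missing step.
        if not rollback_verified():
            raise RollbackFailed(
                "rollback hook returned, but verification failed"
            ) from deploy_error
        return "rolled_back"
```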

Notice what is *not* at the top. SQL injection is not there. XSS is not there. The classic OWASP list that dominates most security conversations barely registers in this dataset. The findings that dominate in codebases maintained by AI-assisted teams in 2026 are subtler and more structural. They live in the spaces between systems — the handshake between a CI gate and a deploy step, the contract between a webhook sender and a webhook receiver, the assumption that a health check endpoint actually checks health.

Figure: The invisible contracts: what gets flagged most often

The Invisible Contract Problem

The single most common P0-P2 finding, at 18% of all findings in those three tiers, was missing idempotency keys and replay defense.

Think about what that means. A webhook handler that correctly parses the payload, validates the schema, updates the database, and returns a 200. Tests that cover the happy path. A code review that looks clean. Two approvals on the PR. And the first time a payment provider retries a delivery — which every payment provider does, by design — the system processes the same event twice. Silently.
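One common way to close this gap is to record each event ID under a uniqueness constraint, in the same transaction as the business write. The sketch below uses SQLite for brevity. The table names and the event fields (`id`, `amount_cents`) are assumptions made for illustration, not any particular provider's payload format.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the illustrative tables (names are hypothetical)."""
    conn.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS payments (event_id TEXT, amount_cents INTEGER)")


def handle_webhook(conn: sqlite3.Connection, event: dict) -> str:
    """Apply a webhook event at most once, keyed on the sender's event ID."""
    try:
        with conn:  # one transaction: dedupe record and write commit together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO payments (event_id, amount_cents) VALUES (?, ?)",
                (event["id"], event["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        # The event ID is already recorded, so this delivery is a retry or replay.
        return "duplicate"
    return "processed"
```

A handler built this way can still return a success status to the sender on a duplicate delivery, so the provider stops retrying while the database is written only once.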

The second most common finding was CI/CD gate ordering (15%). Pipeline configurations where security checks and quality gates run in parallel rather than in sequence, allowing a build step to proceed before a vulnerability scan completes. The pipeline looks fast. The logs look green. The gates are technically present. They just do not gate anything.
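A gate ordering check of this kind can be expressed as a graph question: does the protected job transitively wait for every gate? The sketch below assumes the pipeline has already been parsed into a mapping from job name to the jobs it depends on, similar in spirit to a `needs`-style declaration. The job names are hypothetical.

```python
def transitive_needs(jobs: dict[str, list[str]], job: str) -> set[str]:
    """Collect every job that must finish before `job` can start."""
    seen: set[str] = set()
    stack = list(jobs.get(job, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(jobs.get(current, []))
    return seen


def ungated(jobs: dict[str, list[str]], protected: str, gates: set[str]) -> set[str]:
    """Return the gates that `protected` does not wait for."""
    return gates - transitive_needs(jobs, protected)


# Hypothetical pipeline: the scan exists, but nothing depends on it.
pipeline = {"test": [], "vuln_scan": [], "build": ["test"], "deploy": ["build"]}
print(ungated(pipeline, "deploy", {"test", "vuln_scan"}))
```

In this example `deploy` waits for `test` through `build`, but `vuln_scan` runs in parallel with no downstream dependency, so it is reported as a gate that does not gate.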

Third was supply chain provenance gaps (13%). Installation of build tools via checksum-only verification where the checksum file itself is fetched from the same unauthenticated channel as the binary. This class of vulnerability shows up frequently in our dataset because AI assistants tend to generate installation scripts that closely follow documentation examples — and documentation examples often skip signature verification for brevity.
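One partial mitigation is to compare the download against a digest pinned in the repository, so the expected value does not travel over the same channel as the binary. The sketch below shows only that comparison. It is not signature verification, which additionally authenticates the publisher, and the function name is hypothetical.

```python
import hashlib
import hmac


def verify_pinned_sha256(data: bytes, pinned_hex: str) -> None:
    """Raise ValueError if `data` does not match a digest pinned in version control."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest performs a constant-time comparison of the two hex strings.
    if not hmac.compare_digest(actual, pinned_hex.lower()):
        raise ValueError(f"checksum mismatch: expected {pinned_hex}, got {actual}")
```

The pinned value changes only through a reviewed commit, which moves the trust decision from install time to code review.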

What the File Types Tell Us

Figure: File types that produce the most critical findings

YAML and configuration files (.yml, .yaml) produced more P0 and P1 findings than any other file type in our dataset. This is counterintuitive until you think about what lives in YAML: CI/CD pipeline definitions, deployment manifests, infrastructure configuration. These files define the trust boundaries of the entire system, and they receive less review scrutiny than application code because they look like configuration rather than logic.

Python files (.py) ranked second for critical findings, driven primarily by backend reliability issues — async connection handling, retry behavior, error propagation in middleware chains.

TypeScript and JavaScript files ranked lower on the P0-P1 axis but dominated P2 findings, largely due to frontend state management issues, stale closure patterns, and uncleared async side effects.

The Model Question

Figure: LLM providers across 3,518 scans

A note on what this data does and does not show. The scans were performed *by* these models acting as reviewers, not exclusively *on* code generated by these models. The Codex family (GPT-5.2 and 5.3) performed 91.6% of scans. We observed that the average deep scan across a full codebase surfaced approximately 908 findings per run — a number that suggests even mature production codebases carry significant residual risk when examined at depth.

Of the 100 most recent sampled runs, 4 resulted in hard merge blocks (P0-level findings triggering automatic rejection). A 96% pass rate does not mean 96% of code is clean. It means 96% of the time, the findings fell below the P0/P1 severity threshold for automatic blocking. The P2 findings still existed. They were just below the gate.

The External Evidence

The public evidence is now strong enough that this argument no longer depends on a single dataset.

Veracode's 2025 GenAI code security report found security flaws in 45% of AI-generated code samples across 100+ models, and its Spring 2026 longitudinal update — covering 150+ models — says secure-code pass rates are still hovering around 55% even as syntax correctness exceeds 95%. Sonar's 2026 developer survey found that 96% of developers do not fully trust AI-generated code to be functionally correct, yet only 48% always review AI-assisted code before committing — and only 18% of enterprise respondents report having distinct guidelines or automated checks for AI-generated code. IBM's 2025 Cost of a Data Breach report found that organizations with high levels of shadow AI paid $670,000 more on average per breach, that 63% of breached organizations lacked AI governance policies, and that 97% of breached organizations with AI-related security incidents lacked proper AI access controls.

The pattern is consistent: code generation is improving faster than verification and governance.

And the frontier labs are now treating this as a frontline cybersecurity problem, not a side quest. Anthropic launched Project Glasswing this week, committing up to $100M in credits and $4M in direct open-source donations, with launch partners including AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Anthropic says Claude Mythos Preview has already found thousands of high-severity vulnerabilities — including bugs in every major operating system and web browser, a 27-year-old vulnerability in OpenBSD, and a 16-year-old bug in FFmpeg — scoring 83.1% on CyberGym vulnerability reproduction benchmarks versus 66.6% for Opus 4.6. Separately, Anthropic's Claude Code Security applies multi-stage verification, confidence ratings, and human approval to findings before fixes are accepted — a productized version of the verification layer concept.

Glasswing is also useful for a narrower reason: it shows how a frontier lab is starting to validate cyber capability more seriously. Anthropic says that by late 2025, Claude Opus 4.5 was already approaching saturation on CyberGym, so it worked with Mozilla to build a harder evaluation set from prior Firefox CVEs. Glasswing then moved the validation story beyond public benchmark scores to real-world vulnerability discovery, coordinated disclosure, patched examples, and restricted access for vetted partners rather than a broad public release. That is closer to what production security validation should look like: benchmark evidence, real-world evidence, and operational safeguards together.

Google says Big Sleep found a real SQLite vulnerability in 2024, and that CodeMender has since upstreamed 72 security fixes to open-source projects. Microsoft has open-sourced CTI-REALM as an end-to-end benchmark for detection engineering. Meta's CyberSecEval 4 now includes defensive benchmarks like CyberSOCEval and AutoPatchBench. And OpenAI explicitly treats GPT-5.3-Codex as High capability in Cybersecurity under its Preparedness Framework — the first OpenAI deployment to carry that designation.

The center of gravity has shifted. The question is no longer whether models can write code quickly. The question is whether the systems around them can prove that code is safe enough to merge.

What This Means for Security Research

There is a growing body of academic and technical work on AI code security. Recent preprints and technical reports are pushing the field forward: SecureAgentBench evaluates 105 realistic repository-level tasks with multi-file edits, functional tests, and PoC-based vulnerability checking. SEC-bench focuses on authentic security engineering tasks with reproducible artifacts and gold patches. SecRepoBench covers 318 repository-level tasks from 27 real C/C++ repos spanning 15 CWEs. SafeGenBench benchmarks security-vulnerability detection in LLM-generated code using SAST plus LLM-based judging. SecCodeBench-V2 covers 98 industrial scenarios across 22 CWE types with executable PoCs and expert-authored, double-reviewed cases.

This work is advancing the field significantly. These benchmarks now use repo-level tasks, reproducible artifacts, exploit validation, and expert review. What they still do not fully capture is the merge-boundary approval problem: the moment where a specific change in a specific repository under a specific organizational policy must be approved or rejected with evidence, severity mapping, and an audit trail. That is a different kind of evaluation — less about whether code *can* be generated securely, and more about whether a team *can prove* that a given change *was* generated securely enough to ship.

Standards bodies are converging on this distinction. The OWASP Top 10 2025, OWASP ASVS 5.0, and OWASP's 2025 GenAI Top 10 provide framework-level guidance. NIST's Generative AI Profile and Secure Software Development Framework (SSDF) provide the compliance anchors. The question these standards collectively raise is the same one our data points to: what does "safe enough to merge" actually mean, and who is responsible for proving it?

The Emerging Stack

What does adequate verification actually require? Based on what we observed across these 3,518 scans, the minimum viable verification infrastructure looks like this:

Deterministic ingest. Every file in every PR must be scanned. Sampling misses the YAML configuration files that carry the highest severity findings in our dataset.

Multi-persona review. A single model scanning for "security issues" misses the CI/CD gate ordering problems and the backend reliability patterns. Specialized review agents — one focused on security overlay, one on release engineering, one on data layer integrity — consistently surfaced findings that a general-purpose scan missed.

Evidence-bearing findings. Every finding must include the file path, line range, a reproducible verification command, and an impact statement. Findings without evidence are noise. Our data shows that findings with verification steps attached are reviewed and acted on at significantly higher rates than bare line-flagging.
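As a rough illustration, the four evidence fields named above could be enforced at ingest time with a record type like the following. The class and field names are assumptions for this sketch, not the schema our scanners emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A finding plus the evidence needed to act on it (hypothetical schema)."""

    file_path: str
    line_start: int
    line_end: int
    verification_command: str
    impact: str

    def missing_evidence(self) -> list[str]:
        """Name the evidence fields that are empty or invalid."""
        problems = []
        if not self.file_path.strip():
            problems.append("file_path")
        if self.line_start < 1 or self.line_end < self.line_start:
            problems.append("line_range")
        if not self.verification_command.strip():
            problems.append("verification_command")
        if not self.impact.strip():
            problems.append("impact")
        return problems
```

A review queue could reject or down-rank any finding for which `missing_evidence()` returns a non-empty list.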

Severity gates with policy mapping. Not every finding should block a merge. But P0 and P1 findings in payment paths, authentication flows, and deployment pipelines should always block. The gate must be configurable per repository, per team, per risk surface.
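A configurable gate of this kind might look like the sketch below, which maps each risk surface to the severity at which it blocks. The surface names, the default threshold, and the class itself are illustrative assumptions. Only the P0 to P3 tiers come from the dataset.

```python
from dataclasses import dataclass, field

# Lower rank means more severe.
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


@dataclass
class GatePolicy:
    """Blocking threshold per risk surface (names and defaults are hypothetical)."""

    default_block_at: str = "P0"
    surface_block_at: dict[str, str] = field(default_factory=dict)

    def blocks(self, severity: str, surface: str) -> bool:
        """True if a finding of this severity on this surface should block."""
        threshold = self.surface_block_at.get(surface, self.default_block_at)
        return SEVERITY_RANK[severity] <= SEVERITY_RANK[threshold]


def gate_status(policy: GatePolicy, findings: list[tuple[str, str]]) -> str:
    """Evaluate (severity, surface) pairs against the policy."""
    blocked = any(policy.blocks(sev, surf) for sev, surf in findings)
    return "BLOCKED" if blocked else "PASSED"


# Example policy: P1 also blocks on the three surfaces named above.
policy = GatePolicy(surface_block_at={"payments": "P1", "auth": "P1", "deploy": "P1"})
print(gate_status(policy, [("P2", "payments"), ("P3", "frontend")]))
```

Keeping the policy as data rather than code lets each repository or team tighten a threshold without touching the gate logic.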

Human-in-the-loop checkpoints. Automated review catches the volume. Human review catches the context. The highest-value pattern we observed was automated triage followed by human review of P0-P1 findings, with structured verdicts (true positive, false positive, severity adjustment) feeding back into the review system.

The organizations that will handle AI-assisted code safely are not the ones with the best models. They are the ones that build this verification infrastructure early — the intake pipelines, the policy engines, the review queues, the audit trails — and treat it as production-critical rather than optional tooling.

Conclusion

538,860 findings across 3,518 scans. The top finding is not injection or XSS. It is missing idempotency defense in webhook handlers. The most dangerous file type is not .py or .ts. It is .yml — the CI/CD configurations that define trust boundaries and receive the least review attention.

Code in AI-assisted teams does not fail where people expect. The app loads. The tests pass. The demo is convincing. The failure lives downstream, in the contract between a payment webhook and a database write, in the ordering of CI gates that determines whether a vulnerability scan actually gates anything, in the gap between "this tool was downloaded" and "this tool's provenance was verified."

These are not the kinds of failures that get caught by running the app and clicking around. They are the kinds of failures that get caught by systematic, evidence-based verification at the merge boundary — or they do not get caught at all.

The data says the verification layer is not optional. It says the shape of vulnerability in codebases maintained by AI-assisted teams in 2026 is structural, inter-system, and concentrated in the configurations and contracts that hold production software together. And it says the industry is generating this code faster than it is building the infrastructure to verify it.

That gap is the story. The 538,860 findings are the evidence.


Methodology note: Scans were performed between February and April 2026 using orchestrated multi-agent review systems across five production codebases. Models used include OpenAI GPT-5.3 Codex (59.1%), GPT-5.2 Codex (32.5%), GPT-4.1 (2.6%), and Google Gemini 3.1 Pro Preview (5.7%). Deep scans (77% of runs) analyzed full repositories; PR-diff scans (22%) analyzed only changed files in pull requests. Findings were deduplicated within each run by canonical finding ID and payload hash (version: v1_material). The same underlying issue could appear across multiple runs of the same codebase. Severity classifications follow a four-tier system: P0 (critical, automatic merge block), P1 (high, escalation-required), P2 (medium, tracked), P3 (informational). The block rate of 4% reflects the share of sampled runs where P0/P1 findings triggered automatic merge rejection. The 3,518 scan count reflects individual review executions across five codebases, not unique repositories.