The AI adoption maturity matrix

A CEO told me yesterday: “Proudly saying ‘I am using AI’ is no longer cool or acceptable.”

He’s right. Of course we’re all all-in on AI. Ninety percent of us, to be AI-precise. The questions your board and your team actually ask are different. Where do we stand? What does “good” look like? Where do we boldly go from here?

This post hands you the instrument to answer them: my AI adoption maturity matrix. Seven dimensions, five color-coded levels, a scoring ritual that surfaces the conversations your team has been avoiding. And a roadmap for your CEO to follow.

Why you can’t answer it by feel

Ask your engineers where the team stands and you’ll get confident answers. Trouble is, confidence doesn’t survive measurement. METR ran a randomized controlled trial with experienced open-source developers working in their own mature codebases. With AI they were 19% slower, while estimating that AI had sped them up by 20%. Go figure.

The system-level numbers point the same way. DORA’s 2025 report found AI adoption correlates with higher delivery throughput and lower delivery stability. GitClear’s analysis of 211 million changed lines shows refactoring collapsing while copy-paste takes over. Duh. Perceived speed is not a metric. License counts and token burn are not maturity either. To know where you stand, you have to look at the whole system around the code: who verifies, who owns, and what gets measured. That’s exactly where things usually go sideways.

Throwing it over the wall

As a fractional CTO, I get to watch the same movie at several companies in parallel. Same plot, different logo. And the plot is never about the tools. It’s about what AI does to a team.

Scene one: an enthusiastic engineer discovers that agents write architecture decision records on demand. Minutes later, a watertight-looking document sails over the wall into review. The poor reviewers now have to judge pages of blabbering the author never fully read himself. Generation took minutes, review takes days. Nobody can tell which trade-offs the engineer actually weighed and which ones the model padded in. Unhappy faces on both sides of the wall.

Scene two is the entrepreneurial cowboy business variant:

The vibe-coded wonder app
A business colleague vibe-codes a complete web app. Demos it to a customer. The customer is thrilled, the CEO ecstatic: why does engineering need months for what one person built in days? Then engineering gets to wrestle the wonder app into the design system, infrastructure as code, secure authentication and authorization, the integrations nobody demoed, continuous integration, logging, and monitoring. The demo was the easy part. The product is everything the demo skipped.

Did I mention security? DryRun Security let three frontier coding agents build two applications, pull request by pull request. 26 of the 30 PRs introduced at least one security vulnerability. Eighty-seven percent. Guess how many your wonder app ships.

Both scenes are the same failure: one individual’s autonomy racing ahead of the team’s ability to verify. Generation is no longer the bottleneck. Trust is.

The teams that do get real gains treat AI as a team sport. They learn together, and they capture and share what they learn. The most striking practice: teams that pair- and even mob-program with AI in the loop, architecting, debugging, and coding together. With ADRs and shared context engineering baked into the daily routine.

Dialog-driven development
Equal Experts distilled this pattern into Dialog Driven Delivery: the conversation, not the backlog, is the primary source of truth. When product, design, and engineering work through a problem together, assumptions surface and constraints become explicit. AI turns that shared context into specifications and code on the spot. Mob programming with AI in the loop makes context engineering a team practice instead of a private prompt library. Teams that learn and grow together, and systematically capture the learning, “lead the revolution”.

Seven dimensions, one weakest link

So how do you find out which movie your team is in? I built a maturity matrix for exactly this. Seven independent dimensions: autonomy, lifecycle coverage, codebase and platform readiness, scope, team process, governance, and measurement. Each scored on the same five levels. With some nice colors for those of us who don’t like reading. From red (ad hoc) to blue (optimized). TL;DR: scroll down to the full matrix.

Why seven dimensions instead of one magic score? Because the failure modes are mismatches between dimensions. A team running multi-hour autonomous agent sessions scores green on autonomy. The same team with no review standards, no budget owner, and no baseline scores red on governance and measurement. That team isn’t mature. It’s a Ferrari racing headlong into a brick wall.

A team is only as AI-mature as its weakest dimension.

The assessment bakes this in with verification gates. The dominant stall point in practice is the verification tax: code generation speeds up first, while review capacity lags. So the time saved generating gets re-spent auditing. Clear that constraint and the next one surfaces upstream: QA. Then the quality of your requirements themselves. Sound familiar? It’s the theory of constraints wearing an AI badge.
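A minimal sketch of that constraint logic, assuming made-up stage capacities in PRs per week (the stage names and every number are illustrative, not measured): end-to-end throughput is capped by the slowest stage, so tripling generation changes nothing until review catches up.

```python
# Hypothetical weekly capacities in PRs per week; the numbers are illustrative only.
def bottleneck(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the stage that caps end-to-end throughput, and that cap."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]

before_ai = {"requirements": 30, "generation": 20, "review": 25, "qa": 28}
after_ai = {**before_ai, "generation": 60}       # generation triples
after_review_fix = {**after_ai, "review": 40}    # review capacity added

print(bottleneck(before_ai))         # ('generation', 20)
print(bottleneck(after_ai))          # ('review', 25)
print(bottleneck(after_review_fix))  # ('qa', 28)
```

In this toy model the constraint moves from generation to review to QA, and requirements would be next in line.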

And the readiness dimension settles an old score. I argued back in 2016 (was too busy CTO-ing to write blogs) that you must bootstrap your software quality while you still can, because retrofitting quality means changing process, mindset, and people. Agents turn that dial to eleven. Trustworthy tests, fast and visible feedback, and tribal knowledge encoded in the repo are now the difference between an agent that ships and an agent that guesses.

Score it planning-poker style

Now the part that makes it work: don’t fill in the matrix yourself. Don’t delegate it to the most enthusiastic engineer either. Have every individual on the team score all seven dimensions privately. Oh, and have the leaders also score the teams and their members. Full 360 truth. Then sit together and discuss the outliers, planning-poker style.

You’d be surprised. The tech lead scores team process green: we have a guild, we have a channel! The junior scores it red: nobody really reviews my agent PRs, they just sigh and approve. That gap is the conversation your team has needed to have for months. The discussion is the deliverable. The matrix is just the excuse.
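If you collect the private scores in a spreadsheet or a form, finding the outliers could look roughly like this. It is a sketch under assumptions: the names, the scores, and the gap threshold of two levels are all hypothetical, and you should pick whatever threshold starts the right arguments.

```python
from statistics import median

# Hypothetical private scores (1 = red ... 5 = blue) from one team.
scores = {
    "tech_lead": {"autonomy": 4, "team_process": 4, "governance": 3},
    "senior":    {"autonomy": 4, "team_process": 3, "governance": 2},
    "junior":    {"autonomy": 3, "team_process": 1, "governance": 2},
}

def discussion_agenda(scores, min_gap=2):
    """List dimensions whose highest and lowest score differ by min_gap or more."""
    dimensions = next(iter(scores.values())).keys()
    agenda = []
    for dim in dimensions:
        by_person = {person: s[dim] for person, s in scores.items()}
        gap = max(by_person.values()) - min(by_person.values())
        if gap >= min_gap:
            agenda.append((dim, gap, by_person))
    # Largest disagreement first: that is where the conversation starts.
    return sorted(agenda, key=lambda item: item[1], reverse=True)

for dim, gap, by_person in discussion_agenda(scores):
    print(f"{dim}: gap {gap}, median {median(by_person.values())}, scores {by_person}")
```

With these sample scores only team process lands on the agenda: the tech lead’s 4 against the junior’s 1.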

Do write down today’s colors and the colors you want two quarters from now. That’s your roadmap. It’s also your shield: use it to keep your trigger-happy CEO apprised of where you really are. In colors a board deck can absorb.

And whatever you do, don’t put a meter on token consumption and dock the pay of developers who don’t burn enough. Just, don’t. Measure outcomes, not input. Pair every throughput metric with a stability twin: PR throughput with revert rate, deployment frequency with change-failure rate, and cost per merged PR against your pre-AI baseline. Otherwise you’re measuring how fast the team digs, not whether the hole is in the right place.

AI doesn’t skip the boring phases

Your board whispers that AI-native competitors are breathing down your neck. Your C-level gets more impatient by the week. So why isn’t the AI initiative live yet?

Because every project still grows up the same way, AI or not. Resourcing: finding someone to actually run with it. Requirements: talking to the business, understanding the legacy code, disambiguating, and aligning on the architecture choices while everyone has their say. Building the infrastructure. Then a first MVP or proof of concept, tested internally or with a few friendly customers. Baking in the learnings. Productizing. And only then is there an AI-enabled business outcome you can brag about.

There’s a second trap hiding in that MVP phase: staying in the lab. You can adopt the latest autonomous-agentic-workflow-orchestration platform (insert this quarter’s jargon), but if it only ever runs in the comfort of your own laptop, safely behind the corporate firewall, you’ll have a hard time bringing it out into the real, scary world. Simple technology in a complex environment is still a giant leap. Take baby steps, but take them out there in the real world.

ING’s not-so-simple chatbot
ING adopted a customer-facing chatbot. Building the bot took a few weeks. Getting it through corporate security and compliance not only took months, it bloated the code base by multiple orders of magnitude. A simple chatbot. Duh. But a simple chatbot under a banking license, at a big corporate with lots to lose, stops being simple technology. That’s rocket-ship engineering.

The matrix doesn’t let you skip phases. It shows which friction you can remove and which impatience you simply have to manage. Sharing your colors and your next gate with the C-level beats promising magic. Magic slips. Checklists don’t.

The full matrix

Yes, my inner consultant built a seven-dimension, five-color matrix. At least this one comes without the invoice. Steal it, adapt it, and run it quarterly.

A team can be advanced on one dimension and immature on another; the assessment exists to surface exactly that mismatch. Aim to lift the lowest color, not the highest.

Dimension	Question it answers
Autonomy	How much does the agent do unattended?
Lifecycle Coverage	How much of the delivery lifecycle beyond code authoring does AI cover — tickets, review, tests, dependencies, incidents, customer issues?
Codebase & Platform Readiness	Can the codebase and the platform around it support agent work: encoded knowledge, trustworthy tests, fast feedback, live environments to verify in, observability to debug with?
Scope	How far do the practices reach — one dev, the team, the org?
Team Process	Do shared rituals, norms, and ownership turn individual usage into team practice?
Governance	Are guardrails, ownership, and policy in place?
Measurement	Can we prove impact, track cost, show ROI, and catch regressions?

Core principle: a team is only as mature as its weakest dimension. Tool count does not predict outcomes; the most common failure mode is autonomy running ahead of governance and measurement, producing unauditable, unmeasurable output.
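The core principle fits in a few lines. A minimal sketch, assuming hypothetical scores for one team and dimension keys of my own naming: the overall level is the minimum across dimensions, and the common failure mode shows up as autonomy scoring above governance or measurement.

```python
# Hypothetical scores for one team (1 = red ... 5 = blue).
team = {
    "autonomy": 4, "lifecycle_coverage": 3, "readiness": 3, "scope": 3,
    "team_process": 2, "governance": 1, "measurement": 1,
}

def overall_maturity(scores: dict[str, int]) -> int:
    """The team's level is its weakest dimension."""
    return min(scores.values())

def autonomy_overreach(scores: dict[str, int]) -> list[str]:
    """Which of governance and measurement has autonomy outrun?"""
    return [d for d in ("governance", "measurement") if scores[d] < scores["autonomy"]]

print(overall_maturity(team))    # 1
print(autonomy_overreach(team))  # ['governance', 'measurement']
```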

Placement rule for organizational structures: structures are means, not dimensions. Score them by purpose: structures that spread practice (champions, guilds, SIGs) count under Team Process; structures that exercise decision rights (AI board, steering body, harness owner) count under Governance.

Vocabulary: the “agent harness” is the shared setup around the agents themselves: context files, skills (reusable, versioned task instructions), permissions, CI hooks, and review automation. A “flow” is a recurring agent task that runs as part of delivery, such as ticket enhancement, dependency updates, or CI failure triage.

The master assessment matrix

Assess each dimension independently. Circle one cell per row. The table is dense by design: it is the summary. The checklists after it unpack every cell into things you can verify instead of feel.

Dimension	🔴 1 — Ad hoc	🟠 2 — Emerging	🟡 3 — Established	🟢 4 — Managed	🔵 5 — Optimized
Autonomy	Code completion + Q&A chat, used occasionally	Interactive agents (IDE/CLI), human approves every step	Context-engineered: repo context files, skills, spec-driven + test-driven development make agents reliable; model and effort level chosen per task	Supervised autonomous: multi-hour unattended sessions (features, refactors) with human review at checkpoints; token-optimized flows (caching, routing, right-sized models)	Orchestrated: agents self-source work from tickets and change requests; humans write requirements, not code; cost-per-outcome tuned continuously
Lifecycle Coverage	Code authoring and Q&A only	Dev-adjacent assists: commit messages, docs, unit tests, PR descriptions; ad-hoc CLI use	Integrated with team systems via MCP/CLI: ticket enhancement, AI code review on every PR, automated dependency-update PRs	Pipeline-embedded: e2e test creation & repair, CI failure triage with fix PRs, auto-updated docs and release notes	Closed loop with production: agents diagnose incidents and open fix PRs; customer issues reproduced and fixed by flows
Codebase & Platform Readiness	Knowledge is tribal; tests cover lines, not behavior; setup is undocumented; deploys are manual console clicks	Root context file as a map (not an encyclopedia); reproducible dev environment; lint, types, and pre-commit enforced; CI on every PR; scripted, repeatable deploys	Behavioral tests agents can treat as ground truth; architectural invariants enforced by build/CI; tacit knowledge encoded (ADRs, scoped context files, do-not-touch paths); infrastructure as code changed via reviewed PRs; e2e tests on every PR; structured, centralized logging	Fast CI sized for agent iteration; flaky tests quarantined; codebase queryable through tools (code graph, find-callers) instead of prompt-stuffing; ephemeral preview environment per PR, torn down on merge; distributed tracing across services; visual regression tests guard UI changes	Self-describing repo: every agent task leaves artifacts (tests, invariants, context) that make the next run more reliable; agents verify their own changes against preview environments and telemetry; readiness tracked per repo
Scope	One or two enthusiasts, personal setups	Pockets of practice; informal tips circulate	Team-wide: shared skills, context files, and permissions versioned in the repo	Multi-team: shared standards, common agent harness, reusable patterns	Org-wide: skill/context registry, standard harness with a named owner, onboarding includes AI practices
Team Process	No shared rituals; AI-generated code/docs thrown over the wall into review	AI is a recurring retrospective topic; tips and prompts shared informally; early champions visible	Written “good PR” norms for AI work (author = first reviewer); dedicated learning time; learnings documented	Guild / SIG / community of practice; named process owner; PR templates require evidence; team-based (not individual) incentives	Process changes trialed and measured before rollout; learnings institutionalized into the harness, skills, and onboarding
Governance	No policy; shadow usage on personal accounts	Written acceptable-use guidance; tool inventory exists	Guardrails: security scanning on AI output; “no AI slop” norm enforced in PRs and docs	Owned: AI board / steering body with decision rights; agent harness has a named owner; permissions are identity-based; review standards for AI code are explicit; cloud-org guardrails bound what agents can provision	Policy-as-code: automated quality and security gates; compliance enforced at every stage of the pipeline; agent execution sandboxed from production credentials
Measurement	No baseline; productivity is anecdotal; AI spend unknown	Usage tracked: licenses, active users, acceptance rate; tool/subscription spend visible	Outcomes: pre-rollout baseline; cycle time, review time, defect rate attributable to AI; token spend attributed to team/feature	Paired metrics on a live dashboard; unit economics (cost per merged PR / completed task); budgets and alerts per use case	ROI proven: business value (time saved, defects avoided, outcomes shipped) vs full cost drives go/no-go decisions

Level descriptors and checklists

Check every box in a level before claiming it, or write down why a box does not apply (e.g. dependency-update PRs in a repo without external dependencies). The highest fully-checked level is the score. Levels build on each other, but a higher practice supersedes the lower one it replaces: checkpoint review at level 4 satisfies the level 2 expectation of human oversight even though nobody approves every step anymore.
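One way to apply that rule mechanically, as a sketch rather than an official scoring tool. The representation is my assumption: each box is True (checked), False (open), or None (does not apply, with the reason written down). A lower-level box superseded by a higher practice would simply be recorded as checked. I also read “levels build on each other” as cumulative: an open box at level 3 caps the score at 2, whatever level 4 looks like.

```python
from typing import Optional

# Each level maps to its boxes: True = checked, False = open,
# None = does not apply (with the reason written down elsewhere).
Checklist = dict[int, list[Optional[bool]]]

def dimension_score(checklist: Checklist) -> int:
    """Highest level whose boxes, and those of every level below it, are all cleared."""
    score = 1  # level 1 requires nothing
    for level in sorted(checklist):
        if any(box is False for box in checklist[level]):
            break
        score = level
    return score

# Hypothetical Autonomy checklist: level 3 has one open box.
autonomy = {
    2: [True, True],
    3: [True, True, False, True],
    4: [True, True, True, True],
}
print(dimension_score(autonomy))  # 2
```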

Autonomy

Autonomy is the dimension everyone obsesses over: how much does the agent do without you? The road runs from autocomplete, through agents you approve step by step, to context-engineered setups you can trust with a whole feature, and finally to agents that pick up their own work from the ticket queue. Resist the urge to sprint up this ladder. Every level you skip comes back later as review pain.

🔴 1 — Ad hoc

(Nothing required; this is the default state: occasional code completion and Q&A chat, or no use at all)

🟠 2 — Emerging

Interactive agents are a normal way code gets written
Humans review and approve each agent step

🟡 3 — Established

Repo contains agent context files (e.g. CLAUDE.md / AGENTS.md / rules)
Reusable skills exist for recurring tasks
Spec-driven development and TDD anchor agent work
Model and effort level are chosen deliberately per task (small/fast for boilerplate, frontier/high-effort for design), not left on one default

🟢 4 — Managed

Multi-hour unattended sessions deliver real features or refactors
Human review happens at defined checkpoints, not every step
At least one flow runs non-interactively end-to-end without prompting (which flows is scored under Lifecycle Coverage)
Flows are token-optimized: prompt caching, context trimming, and model routing are applied where they cut cost without hurting quality

🔵 5 — Optimized

Agents pick up work directly from tickets/change requests
Engineering effort centers on requirements and review, not code entry
Model/effort routing is tuned continuously against cost-per-outcome, judged by total token efficiency (tokens to complete the task), not price-per-token

Lifecycle Coverage

Code authoring is the obvious use of AI, and the smallest prize. Most engineering hours go to everything around the code: tickets, reviews, tests, dependencies, incidents, and customer issues. This dimension scores breadth across that lifecycle. Score what AI touches; how unattended each flow runs is the Autonomy score, and whether it’s owned and affordable is Governance/Measurement.

🔴 1 — Ad hoc

(Nothing required; this is the default state: AI touches code authoring and Q&A at most)

🟠 2 — Emerging

AI assists dev-adjacent tasks interactively: commit messages, documentation, unit test generation, PR descriptions
Developers use AI from the CLI against local tooling (logs, git history, build output), not just inside the editor

🟡 3 — Established

Agents are integrated with team systems via MCP/CLI: issue tracker, wiki, CI/CD (e.g. Jira, Confluence, GitHub)
Ticket enhancement runs as a flow: acceptance criteria, repro steps, and linked context added automatically
AI code review runs as a first pass on every PR
Dependency/version-update PRs are generated and pre-validated automatically

🟢 4 — Managed

End-to-end test creation and repair is agent-driven: missing coverage proposed, failing or flaky tests triaged and fixed by a flow
CI/build failures receive automated triage and a proposed fix PR
Documentation and release notes update automatically from merged changes

🔵 5 — Optimized

Production incident monitoring feeds agents that diagnose and open fix PRs with supporting evidence
Customer-reported issues are reproduced and fixed (or a fix proposed) by automated flows
The loop is closed: production and customer signals route to agents without a human dispatcher; humans approve merges, not initiate work

Codebase & Platform Readiness

Whether the codebase and the platform it runs on can support agent work at all. AI is an amplifier: teams with loosely coupled architectures, trustworthy tests, and fast feedback see gains, while teams on tightly coupled systems with tribal knowledge see little, regardless of how mature their practices are. The intelligence of an agent-driven setup lives less in the model than in the instructions, tests, and feedback loops around it. The platform half follows the same logic (CNCF platform engineering maturity model): an agent can only be trusted unattended if its change can be deployed somewhere disposable, verified mechanically, and observed in production. Infrastructure as code matters doubly here: infra defined as text is infra agents can read, modify, and have reviewed like any other PR. Score this per repository/service where agents actually work; the weakest one agents touch unattended is the score.

🔴 1 — Ad hoc

(Nothing required; this is the default state: knowledge is tribal, tests cover lines rather than behavior, environment setup is passed around by word of mouth, deploys are manual console clicks)

🟠 2 — Emerging

A README and a root context file exist and work as a map, not an encyclopedia: what this repo is, how to build and test it, where to look next
The dev environment is reproducible (devcontainer, nix, or equivalent): an agent or new hire gets from clone to a passing test run without human help
Linting, type checks, and pre-commit hooks are enforced, so agents get mechanical feedback before review
CI runs on every PR, and deployment is scripted and repeatable, not a sequence of manual console steps in one person’s head

🟡 3 — Established

Test suites verify behavior, not just line coverage; a green run is evidence an agent can rely on
Architectural invariants are enforced mechanically: forbidden imports, layer violations, and contract breaks fail the build rather than depending on reviewer memory
Tacit knowledge is encoded in the repo: ADRs capture why decisions were made, per-directory context files cover local conventions, do-not-touch paths are declared
Infrastructure is defined as code (Terraform/OpenTofu, Pulumi, CloudFormation, or equivalent) and changed through reviewed PRs, with no hand-edited cloud resources
End-to-end tests run on every PR, not just nightly or on main
Logging is structured and centralized; “what happened” is answerable from a log query, not from SSH-ing into machines

🟢 4 — Managed

CI feedback is fast enough for agent iteration loops; slow suites are split, parallelized, or staged
Flaky tests are detected and quarantined automatically, so agents don’t chase phantom failures
Agents query the codebase through tools (code graph, find-callers, deterministic cross-repo search) rather than loading whole files into context
Reviewing an agent PR does not require tribal knowledge that is absent from the repo
Every PR gets an ephemeral preview environment, spun up automatically and torn down on merge/close, so reviewers and agents verify against a live deployment instead of a description
Requests are traced end-to-end across services (e.g. OpenTelemetry); a developer or agent debugging a failure can follow one request through the system
UI changes are guarded by visual regression tests (pixel-level snapshots compared against a baseline on every PR), so “looks right” is checked mechanically, not eyeballed

🔵 5 — Optimized

Every agent task cycle leaves the repo more legible: characterization tests, captured invariants, and updated context files are part of the definition of done
Agents verify their own changes the way a reviewer would: deploy to the preview environment, run e2e and visual checks, read logs and traces, and attach that evidence to the PR
Readiness is scored per repository and tracked over time; gaps block autonomy expansion on that repo

Scope

Scope measures reach: one enthusiast, a team, several teams, or the whole org. The anti-pattern here is the hero setup, one developer’s lovingly tuned private toolbox that walks out the door with them. Maturity means the setup lives in the repo and every new joiner inherits it by default.

🔴 1 — Ad hoc

(Nothing required; this is the default state: one or two enthusiasts on personal setups, or no use at all)

🟠 2 — Emerging

Multiple developers share tips and configurations informally

🟡 3 — Established

Skills, context files, and permissions are shared and versioned in the repo
New team members inherit the setup by default

🟢 4 — Managed

Multiple teams use a common harness and shared standards
Patterns proven on one team transfer to others deliberately

🔵 5 — Optimized

Org-level registry of skills and context
The harness has a named owner and a roadmap
AI practices are part of engineering onboarding

Team Process

The over-the-wall scenes all trace back to this dimension, and it’s the one most teams skip. Tools spread in days; norms take quarters. Team process is where individual usage turns into shared practice: rituals, written norms, and learning time that make the whole team better instead of one enthusiast faster.

🔴 1 — Ad hoc

(Nothing required; this is the default state: no shared rituals, AI output goes straight into review unexamined)

🟠 2 — Emerging

AI adoption is a recurring retrospective topic: what’s working, what’s failing, where AI saves time vs creates friction
Effective prompts and failure modes are shared informally (channel, doc, demos)
At least one named AI champion per team experiments publicly and teaches peers

🟡 3 — Established

A written “good PR” definition for AI-assisted work exists and is enforced (see box below)
Dedicated, recurring learning time exists (e.g. a monthly AI learning day or guild hour)
Learnings — good and bad — are documented in a shared place rather than only discussed
Training is peer-to-peer, demo-driven; champions show, not mandate
Juniors run investigations and write designs themselves; AI may type the fix, but the diagnosis and the design doc are the junior’s deliverable
The team is fluent with revert/rollback and uses it routinely; it is the safety net that makes higher change volume survivable

🟢 4 — Managed

A guild / SIG / community of practice connects champions across teams, with a regular cadence and visible backlog
A named process owner is accountable for AI workflow norms and their evolution
PR templates require evidence over assertion: links to tests, scans, dashboards, rollback plan
Incentives are team-based, not individual usage quotas (competition kills knowledge sharing)
Norms are revisited on a cadence as tools and practices change
Role expectations and onboarding are updated for AI-era work; mentorship is a measured, funded responsibility rather than something seniors fit in around delivery pressure

🔵 5 — Optimized

Process changes are trialed on one team and measured before broad rollout
Retro learnings flow into the harness: context files, skills, and review rules get updated, not just discussed
AI workflow norms are part of onboarding; new joiners inherit them on day one
Cross-team learning loop runs on its own (writeups, demos, guild backlog) without a manager pushing it
The team can show that engineers still grow into seniors under heavy AI use: investigation skill, design ownership, and debugging depth are developing, not atrophying

What “good PR” means for AI-assisted work: the anti-over-the-wall standard

Author is the first reviewer. Every generated line is read, understood, and owned before requesting review. “AI wrote it” is not an abstraction boundary.
Explainability test. The author can answer: why this approach, what breaks if it’s wrong, how it’s observed in production, how it’s reverted. An author who can’t answer is not ready to merge.
Evidence over assertion. The PR links to passing tests, scan results, and a rollback path — not just a description.
Small and scoped. Generation speed is not a license for novel-sized PRs; the size norms tighten, not loosen.
AI reviews first, humans decide. Automated review catches style and boilerplate issues before a human spends time; humans judge design, correctness, and risk.

Governance

Guardrails should be proportionate to blast radius: an interactive pilot needs lighter controls than an unattended flow touching production. Applying the full org-wide policy stack to every experiment kills adoption. Letting unattended flows run on pilot-level controls is the danger zone. Tighten controls as autonomy and coverage grow.

🔴 1 — Ad hoc

(Nothing required; this is the default state)

🟠 2 — Emerging

Written acceptable-use policy exists and is findable
Inventory of approved tools is maintained

🟡 3 — Established

Security scanning runs on AI-generated output
Low-effort AI output (“slop”) is rejected in PRs and documents, and the norm is enforced
Data-handling rules for prompts/context are defined

🟢 4 — Managed

An AI board / steering body exists with decision rights over tool approval, risk appetite, and investment
The agent harness has a single named owner
Agent permissions are identity-based and least-privilege
Review standards for AI-authored code are written down
AI-authored changes are tagged at commit/PR level so they can be traced and audited; this is also what makes the AI-vs-non-AI comparisons under Measurement possible
Budgets and spending authorization are assigned per team/use case, with alerts on threshold breach (Governance decides who may spend what; Measurement tracks the spend against it)
A model-selection policy defines which models are approved for which task classes (and which data may reach them)
Flows that feed untrusted content to agents (tickets, customer reports, web pages) are threat-modeled for prompt injection, and what those agents can reach is bounded accordingly
Cloud-org-level guardrails (a landing zone with SCPs / Azure Policy / GCP org policies) bound what anyone, human or agent, can provision: regions, encryption, security services that cannot be disabled
Agents get scoped, short-lived cloud credentials per task, never standing admin access; infra changes flow through the IaC pipeline, not direct console/API access

🔵 5 — Optimized

Quality and security gates are automated (policy-as-code, e.g. OPA / Sentinel / Cloud Custodian), ratified by the AI board
Compliance checks run at every pipeline stage
Unattended agent execution is sandboxed (isolated runtime, no reach into production credentials or data beyond the task’s grant), with its actions logged and auditable
Incidents involving AI output have a defined response path
Cost governance is continuous: spend reviews on a regular cadence with the teams that own the agents, not a year-end accounting exercise

Measurement

Measurement is where most adoption stories quietly fall apart: no baseline, no attribution, no idea what the real spend is. Two habits carry this whole dimension.

Capture a baseline before you roll anything out, and never track a throughput metric without its stability twin. Everything else builds on those two.

🔴 1 — Ad hoc

(Nothing required; this is the default state: no baseline, anecdotal productivity claims, unknown AI spend)

🟠 2 — Emerging

License count, active usage, and acceptance rate are tracked
Total tool/subscription spend is visible in one place (including shadow usage on personal licenses, or an honest estimate of it)

🟡 3 — Established

A pre-rollout baseline was captured, ideally per engineer, against their own pre-AI history
Cycle time, PR review time, and defect rate are attributable to AI vs non-AI work
Token/API spend is attributed: every agent call tagged with team, feature/flow, and model

🟢 4 — Managed

Every throughput metric is paired with a stability metric (see table below)
The dashboard is live and reviewed on a cadence
Unit economics are tracked: cost per merged PR, cost per completed task/flow, tokens per active developer
Spend is tracked against the budgets Governance assigns, per use case, and optimization changes show up in the dashboard

🔵 5 — Optimized

Workflow changes are trialed and measured before scaling
ROI is computed against business outcomes: (value generated − full cost) / full cost, where value = baselined time savings + defects avoided + outcomes shipped
Full cost includes the hidden parts: added review time, rework, and harness maintenance, not just subscriptions and tokens
ROI and cost-per-outcome drive go/no-go decisions on expanding autonomy and on model/tool choices
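To see why the hidden parts matter, here is the ROI formula above with numbers plugged in. Every figure is made up for illustration, and the category names are my own labels for the value and cost components listed in this level.

```python
def roi(value: dict[str, float], cost: dict[str, float]) -> float:
    """(value generated - full cost) / full cost."""
    total_value = sum(value.values())
    full_cost = sum(cost.values())
    return (total_value - full_cost) / full_cost

# Hypothetical quarterly figures in one currency; every number is made up.
value = {"time_savings": 90_000, "defects_avoided": 20_000, "outcomes_shipped": 40_000}
visible_cost = {"subscriptions": 15_000, "tokens": 25_000}
hidden_cost = {"added_review_time": 35_000, "rework": 15_000, "harness_maintenance": 10_000}

print(roi(value, visible_cost))                     # 2.75
print(roi(value, {**visible_cost, **hidden_cost}))  # 0.5
```

Same value, same team: counting only subscriptions and tokens, the ROI looks like 275%. With review time, rework, and harness maintenance included, it drops to 50%.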

Throughput ↔ stability pairs. Never track the left column alone:

Throughput	Pair with
PR throughput per developer	PR revert rate
Deployment frequency	Change-failure / rework rate
Lead time for changes	PR review cycle time (AI vs non-AI)
Tasks completed per developer	Defect rate of AI code vs human baseline
Lines of code added	Duplication % and complexity trend

The last pair catches slow-burn debt that defect rates miss: cloning code is the cheapest way for AI to inflate output, “lines added” and “tasks completed” alike, and the cost lands in maintenance years later.
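
The pairing rule from the table above can be turned into two small checks. This is a sketch under assumptions: the metric keys are names I invented for the table rows, and I assume every stability metric is lower-is-better, with lead time the only throughput metric where lower is better too.

```python
# Throughput metric -> stability twin, mirroring the table rows (hypothetical key names).
PAIRS = {
    "pr_throughput_per_dev": "pr_revert_rate",
    "deployment_frequency": "change_failure_rate",
    "lead_time_for_changes": "pr_review_cycle_time",
    "tasks_completed_per_dev": "ai_defect_rate",
    "lines_of_code_added": "duplication_pct",
}

# Assumption: lead time improves when it goes down; the other throughput metrics when they go up.
LOWER_IS_BETTER_THROUGHPUT = {"lead_time_for_changes"}


def unpaired(tracked: set) -> list:
    """Throughput metrics on the dashboard whose stability twin is missing."""
    return sorted(m for m in tracked if m in PAIRS and PAIRS[m] not in tracked)


def hollow_gains(before: dict, after: dict) -> list:
    """Throughput metrics that improved while their stability twin got worse."""
    flagged = []
    for throughput, stability in PAIRS.items():
        needed = (throughput, stability)
        if not all(k in before and k in after for k in needed):
            continue  # pair not fully tracked in both periods
        delta = after[throughput] - before[throughput]
        improved = delta < 0 if throughput in LOWER_IS_BETTER_THROUGHPUT else delta > 0
        degraded = after[stability] > before[stability]  # assumption: lower is better
        if improved and degraded:
            flagged.append(throughput)
    return flagged


# Made-up dashboard snapshots.
before = {"pr_throughput_per_dev": 4.0, "pr_revert_rate": 0.02,
          "lines_of_code_added": 10_000, "duplication_pct": 5.0}
after = {"pr_throughput_per_dev": 6.0, "pr_revert_rate": 0.02,
         "lines_of_code_added": 18_000, "duplication_pct": 9.0}

print(unpaired({"deployment_frequency", "pr_throughput_per_dev", "pr_revert_rate"}))
print(hollow_gains(before, after))
```

In this made-up snapshot, PR throughput rose with a flat revert rate, so it passes. Lines added rose while duplication climbed, so it gets flagged as a gain worth questioning.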

Cost & ROI ladder. How cost measurement matures alongside:

Level	Cost question you can answer
🟠 2	“What do we spend on AI tools in total?”
🟡 3	“Which team, flow, and model is the spend going to?”
🟢 4	“What does a merged PR / completed flow cost, and is it trending right?”
🔵 5	“Is the value generated worth the full cost and where do we invest or cut next?”
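
Moving from the level 3 question to the level 4 question is mostly a matter of tagging and dividing. A rough sketch, assuming every agent call is already logged with its tags. `AgentCall`, the team and flow names, and the amounts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCall:
    """One tagged agent call (hypothetical record shape)."""
    team: str
    flow: str
    model: str
    cost: float  # in your billing currency


def spend_by(calls, *keys):
    """Total spend grouped by any combination of tags, e.g. ('team', 'flow')."""
    totals = defaultdict(float)
    for call in calls:
        totals[tuple(getattr(call, k) for k in keys)] += call.cost
    return dict(totals)


def cost_per_merged_pr(calls, merged_prs_by_team):
    """Level 4 unit economics: team spend divided by that team's merged PRs."""
    spend = spend_by(calls, "team")
    return {
        team: spend.get((team,), 0.0) / merged
        for team, merged in merged_prs_by_team.items()
        if merged > 0  # skip teams with nothing merged rather than divide by zero
    }


# Made-up log entries.
calls = [
    AgentCall("payments", "code-review", "model-a", 12.0),
    AgentCall("payments", "ticket-enrichment", "model-b", 3.0),
    AgentCall("search", "code-review", "model-a", 10.0),
]

print(spend_by(calls, "team", "flow"))                         # level 3 question
print(cost_per_merged_pr(calls, {"payments": 5, "search": 4}))  # level 4 question
```

Track the second number over time and you also have the "is it trending right?" half of the level 4 question.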

Four cost warnings from the trenches. Activity is not impact: more PRs is not more business. So tie value to outcomes, never to output counts. Don’t shop on price-per-token: a cheap model that burns triple the tokens or produces rework is the expensive one, so judge by tokens-to-done and cost-per-outcome. Know that the subscription is just the visible tip: the real money leaks into review time, rework cycles, and harness babysitting. All booked in other budgets. And treat cost as a live dial, not something finance discovers in December.
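
The price-per-token warning is easy to check with arithmetic. A toy comparison, with invented prices, token counts, and rework costs; only the shape of the calculation matters.

```python
def cost_per_outcome(price_per_million_tokens: float,
                     tokens_to_done: int,
                     rework_cost: float = 0.0) -> float:
    """Token spend to get one task done, plus the rework that task causes afterwards."""
    return price_per_million_tokens * tokens_to_done / 1_000_000 + rework_cost


# Hypothetical models: the cheap one burns triple the tokens and causes more rework.
cheap = cost_per_outcome(price_per_million_tokens=1.0, tokens_to_done=600_000, rework_cost=2.0)
pricey = cost_per_outcome(price_per_million_tokens=2.5, tokens_to_done=200_000, rework_cost=0.5)

print(f"cheap model:  {cheap:.2f} per task")
print(f"pricey model: {pricey:.2f} per task")
```

With these invented inputs, the model that costs 2.5 times more per token ends up cheaper per finished task.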

Why so paranoid about measurement? Because the evidence all points the same way. You met METR’s slower-but-feeling-faster developers and DORA’s throughput-up-stability-down pattern earlier in this post (the same DORA survey: three in ten developers don’t trust the output they ship anyway). It gets worse. Repositories that adopt coding agents pick up roughly 18% more static-analysis warnings and 39% more cognitive complexity. And it sticks for months (MSR 2026). When DryRun Security let three frontier coding agents build two applications, 87% of the pull requests shipped at least one security vulnerability. Your gut says you’re flying. The instruments say: measure.

Verification gates

The dominant stall point in practice is the verification tax: code generation speeds up first, while review, governance, and deployment capacity lag. Time saved generating is re-spent auditing. Clearing one constraint surfaces the next, moving upstream: review capacity first, then QA, then the quality of requirements themselves. Three gates must hold before advancing Autonomy:

Gate	Before advancing to	Must be true
G1 — Review capacity	Autonomy 🟡 3	Review time per PR is flat or falling even as PRs are produced faster
G2 — Proven outcome	Autonomy 🟢 4	A business outcome (not a developer feeling) has measurably changed because of AI
G3 — Governance first	Autonomy 🔵 5	Governance is at 🟢 4+ before scaling, not retrofitted after
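
The gate table reads like a checklist, so it can be written as one. A minimal sketch: the `Evidence` fields are hypothetical, and treating the gates as cumulative (G1 must still hold when you aim for level 5) is my reading of the table, not something it spells out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """What you can actually show (hypothetical field names)."""
    review_hours_per_pr_before: float
    review_hours_per_pr_now: float
    business_outcome_changed: bool  # measured, not a developer feeling
    governance_level: int           # 1 (red) to 5 (blue)


# Target autonomy level -> (gate name, check that must hold before advancing).
GATES = {
    3: ("G1 review capacity",
        lambda e: e.review_hours_per_pr_now <= e.review_hours_per_pr_before),
    4: ("G2 proven outcome", lambda e: e.business_outcome_changed),
    5: ("G3 governance first", lambda e: e.governance_level >= 4),
}


def blocking_gate(target_autonomy: int, e: Evidence) -> Optional[str]:
    """First gate that blocks advancing to target_autonomy, or None if the way is clear."""
    for level in sorted(GATES):
        if level > target_autonomy:
            break
        name, holds = GATES[level]
        if not holds(e):
            return name
    return None


# Made-up team: review time is falling, but no proven outcome yet and governance at 2.
team = Evidence(review_hours_per_pr_before=3.0, review_hours_per_pr_now=2.5,
                business_outcome_changed=False, governance_level=2)

print(blocking_gate(3, team))  # None
print(blocking_gate(4, team))  # G2 proven outcome
```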

Reading the profile

The score itself is not the point; the pattern is. Write the seven colors side by side (e.g. Autonomy 🟢 · Coverage 🟠 · Readiness 🟠 · Scope 🟡 · Team Process 🟠 · Governance 🔴 · Measurement 🔴) and match the pattern:

Profile pattern	Diagnosis	Action
Autonomy ahead of Governance + Measurement	Danger zone: unauditable, unmeasurable autonomous output	Freeze autonomy expansion; raise Governance and Measurement first
Autonomy deep, Coverage narrow	AI is a coding tool, not a delivery system; gains are capped by everything around the code	Expand sideways: code review, tests, tickets, dependencies before pushing autonomy further
Coverage broad, Governance + Measurement low	Flow sprawl: automations multiply without owners, budgets, or success metrics	Inventory every flow; assign an owner and a cost/outcome metric to each, retire the rest
Autonomy ahead of Team Process	Over-the-wall mode: generation outpaces shared norms, and review friction and resentment build	Adopt the “good PR” standard and start AI retros before expanding autonomy
Autonomy ahead of Codebase & Platform Readiness	Agents are guessing: output compiles and passes review, then breaks on undocumented invariants weeks later, and nobody sees it break because logging and tracing are absent	Encode invariants, behavioral tests, and tacit knowledge into the repo, give agents a live environment to verify in and telemetry to observe, then expand autonomy
Governance + Measurement ahead of Autonomy	Safe but slow	Push autonomy; the org can absorb the output
Autonomy ahead of Scope	Knowledge concentrated in individuals	Institutionalize: move personal skills/context into the repo before the knowledge walks out the door
Team Process ahead of Autonomy	Rituals without substance; meetings about AI exceed use of AI	Channel the guild’s energy into hands-on adoption targets
All dimensions level and low	Healthy early state	Clear the next verification gate before adding tools
All dimensions 🟢/🔵	Leading	Shift focus to experimentation and stability optimization
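
The table above is a set of rules over seven numbers, so matching can be mechanical. A sketch under loud assumptions: levels are 1 to 5, "ahead of" means strictly higher, "Governance + Measurement" means the lower of the two, and "deep", "narrow", "broad", and "low" use thresholds I picked (3 and up, 2 and down). The table itself fixes none of that.

```python
def weakest_control(p: dict) -> int:
    """Lower of Governance and Measurement (assumption: the weaker one decides)."""
    return min(p["governance"], p["measurement"])


# Pairwise patterns from the table, in table order. Thresholds are assumptions.
PAIRWISE = [
    ("Danger zone", lambda p: p["autonomy"] > weakest_control(p)),
    ("Coding tool, not a delivery system", lambda p: p["autonomy"] >= 3 and p["coverage"] <= 2),
    ("Flow sprawl", lambda p: p["coverage"] >= 3 and weakest_control(p) <= 2),
    ("Over-the-wall mode", lambda p: p["autonomy"] > p["team_process"]),
    ("Agents are guessing", lambda p: p["autonomy"] > p["readiness"]),
    ("Safe but slow", lambda p: weakest_control(p) > p["autonomy"]),
    ("Knowledge concentrated in individuals", lambda p: p["autonomy"] > p["scope"]),
    ("Rituals without substance", lambda p: p["team_process"] > p["autonomy"]),
]


def diagnose(p: dict) -> list:
    """All matching diagnoses; the two whole-profile patterns are checked first."""
    levels = list(p.values())
    if min(levels) >= 4:
        return ["Leading"]
    if max(levels) <= 2 and len(set(levels)) == 1:
        return ["Healthy early state"]
    return [name for name, matches in PAIRWISE if matches(p)]


# The example profile from the text, as numbers (red = 1 ... blue = 5).
example = {
    "autonomy": 4, "coverage": 2, "readiness": 2, "scope": 3,
    "team_process": 2, "governance": 1, "measurement": 1,
}

for diagnosis in diagnose(example):
    print(diagnosis)
```

The example profile matches five patterns at once, which is typical for a team where autonomy ran ahead of everything else. Treat the output as a conversation starter for the scoring session, not as a verdict.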

So which tools should we adopt?

Wrong question, but everybody asks it. Tools change by the quarter, and owning a tool is no more maturity than owning running shoes is fitness. Treat what follows as orientation. And remember: scope and team process are practice dimensions. No purchase order gets you there.

The autonomy ladder starts innocent: Copilot completions and a ChatGPT or Claude tab. Then interactive agents in the editor, Cursor and friends. Level 3 is where it gets interesting: CLI agents like Claude Code, Codex CLI, or Google Antigravity, made reliable by context engineering. That means CLAUDE.md or AGENTS.md files, reusable skills, spec-driven development with GitHub Spec Kit or the like, and a TDD harness. Level 4 adds agent workflow platforms, custom AI code review, automated ticket enrichment, MCP integrations into Jira and Confluence, and scheduled agents such as Claude Code routines. Level 5, agents that resolve production issues through cloud orchestration platforms like Warp Oz, is real but rare. Don’t pretend you live there.

Does the top of the ladder exist outside conference keynotes? It does.

Warp’s bold move: prompts are the new pull requests
Warp open-sourced their agentic terminal, with a twist. Contributors submit prompts, not code. AI implements and reviews; the community shapes direction and verifies the result. Non-developers can finally contribute meaningfully to open source. How’s that for full-fledged AI adoption: not at project level, but changing the way open-source projects collaborate. Humans write requirements. Agents write code (and OpenAI pays the bill in this case).

Lifecycle coverage is about wiring AI deeper into delivery. Start with MCP integrations into your tracker and wiki, AI review as the first pass on every PR, and Renovate- or Dependabot-style update agents that pre-validate their own PRs. Next come agentic e2e test creation and repair, CI triage bots that propose fixes, and docs and release notes that update themselves. The end game routes incidents and support tickets straight to agents. Approve the PR on Sunday morning and go back to bed.

Codebase and platform readiness is the least sexy list and the biggest lever. Start with devcontainers or Nix, a context file that works as a map, pre-commit hooks, and CI on every PR. Then add ADRs, architecture tests (ArchUnit, dependency-cruiser), per-directory context files, infrastructure as code (Terraform/OpenTofu, Pulumi), Playwright on every PR, and centralized structured logging. The serious money comes later: repo readiness scoring (think Factory.ai-style agent-readiness reports), code-graph and LSP tools served to agents over MCP, flaky-test detection, preview environments per PR (ArgoCD ApplicationSets, Vercel- or Northflank-style platforms), distributed tracing with OpenTelemetry, and visual regression testing with Chromatic, Percy, or Playwright snapshots.

Governance tooling gets real at level 4: a cloud landing zone (AWS Control Tower, Azure Landing Zones), org policies nobody can switch off (SCPs, Azure Policy, GCP org policies), and short-lived credentials via OIDC federation. At level 5 you graduate to policy-as-code engines (OPA, Sentinel, Cloud Custodian) and sandboxed agent runtimes on microVM or gVisor isolation.

And measurement? Start embarrassingly simple: the vendor’s usage dashboard and a shared spend spreadsheet. That’s level 2, and it already beats most teams. Level 3 wants engineering-intelligence platforms (DX, Jellyfish, LinearB) and LLM spend observability (Langfuse, Helicone, or your cloud cost tools), tagged per team and per flow. Level 4 builds the unit-economics dashboard and budget alerts on top. By then you’ll know your cost per merged PR better than your cloud bill. Which, let’s be honest, nobody knows either.

Say what?

The CEO was right: proudly saying you use AI is no longer cool or acceptable. Everyone does. Knowing where you stand is the new flex: seven colors, a named weakest link, and the next gate to clear. Run the matrix with your whole team this month. Lift the lowest color. Repeat.

What’s your team’s weakest dimension? Enter the matrix and tell me what surprised you.

Martijn Rutten
June 11, 2026 30 min read
