The AI adoption maturity matrix

Martijn Rutten
30 min read

A CEO told me yesterday: “Proudly saying ‘I am using AI’ is no longer cool or acceptable.”

He’s right. Of course we’re all all-in on AI. Ninety percent of us are, to be AI-precise. The questions your board and your team actually ask are different. Where do we stand? What does “good” look like? Where do we boldly go from here?

This post hands you the instrument to answer them: my AI adoption maturity matrix. Seven dimensions, five color-coded levels, a scoring ritual that surfaces the conversations your team has been avoiding. And a roadmap for your CEO to follow.

Why you can’t answer it by feel

Ask your engineers where the team stands and you’ll get confident answers. Trouble is, confidence doesn’t survive measurement. METR ran a randomized controlled trial with experienced open-source developers working in their own mature codebases. With AI they were 19% slower, yet they estimated that AI had sped them up by 20%. Go figure.

The system-level numbers point the same way. DORA’s 2025 report found AI adoption correlates with higher delivery throughput and lower delivery stability. GitClear’s analysis of 211 million changed lines shows refactoring collapsing while copy-paste takes over. Duh. Perceived speed is not a metric. License counts and token burn are not maturity either. To know where you stand, you have to look at the whole system around the code: who verifies, who owns, and what gets measured. That’s exactly where things usually go sideways.

Throwing it over the wall

As fractional CTO, I get to watch the same movie at several companies in parallel. Same plot, different logo. And the plot is never about the tools. It’s about what AI does to a team.

Scene one: an enthusiastic engineer discovers that agents write architecture decision records on demand. Minutes later, a watertight-looking document sails over the wall into review. The poor reviewers now have to judge pages of blabbering the author never fully read himself. Generation took minutes, review takes days. Nobody can tell which trade-offs the engineer actually weighed and which ones the model padded in. Unhappy faces on both sides of the wall.

Scene two is the entrepreneurial cowboy business variant:

The vibe-coded wonder app
A business colleague vibe-codes a complete web app. Demos it to a customer. The customer is thrilled, the CEO ecstatic: why does engineering need months for what one person built in days? Then engineering gets to wrestle the wonder app into the design system, infrastructure as code, secure authentication and authorization, the integrations nobody demoed, continuous integration, logging, and monitoring. The demo was the easy part. The product is everything the demo skipped.

Did I mention security? DryRun Security let three frontier coding agents build two applications, pull request by pull request. 26 of the 30 PRs introduced at least one security vulnerability. Eighty-seven percent. Guess how many your wonder app ships.

Both scenes are the same failure: one individual’s autonomy racing ahead of the team’s ability to verify. Generation is no longer the bottleneck. Trust is.

The teams that do get real gains treat AI as a team sport. They learn together, and they capture and share what they learn. The most striking practice: teams that pair- and even mob-program with AI in the loop, architecting, debugging, and coding together. With ADRs and shared context engineering baked into the daily routine.

Dialog-driven development
Equal Experts distilled this pattern into Dialog Driven Delivery: the conversation, not the backlog, is the primary source of truth. When product, design, and engineering work through a problem together, assumptions surface and constraints become explicit. AI turns that shared context into specifications and code on the spot. Mob programming with AI in the loop makes context engineering a team practice instead of a private prompt library. Teams that learn and grow together, and systematically capture the learning, “lead the revolution”.

So how do you find out which movie your team is in? I built a maturity matrix for exactly this. Seven independent dimensions: autonomy, lifecycle coverage, codebase and platform readiness, scope, team process, governance, and measurement. Each scored on the same five levels. With some nice colors for those of us who don’t like reading. From red (ad hoc) to blue (optimized). TL;DR: scroll down to the full matrix.

Why seven dimensions instead of one magic score? Because the failure modes are mismatches between dimensions. A team running multi-hour autonomous agent sessions scores green on autonomy. The same team with no review standards, no budget owner, and no baseline scores red on governance and measurement. That team isn’t mature. It’s a Ferrari racing headlong into a brick wall.

A team is only as AI-mature as its weakest dimension.

The assessment bakes this in with verification gates. The dominant stall point in practice is the verification tax: code generation speeds up first, while review capacity lags. So the time saved generating gets re-spent auditing. Clear that constraint and the next one surfaces upstream: QA. Then the quality of your requirements themselves. Sound familiar? It’s the theory of constraints wearing an AI badge.

And the readiness dimension settles an old score. I argued back in 2016 (was too busy CTO-ing to write blogs) that you must bootstrap your software quality while you still can, because retrofitting quality means changing process, mindset, and people. Agents turn that dial to eleven. Trustworthy tests, fast and visible feedback, and tribal knowledge encoded in the repo are now the difference between an agent that ships and an agent that guesses.

Score it planning-poker style

Now the part that makes it work: don’t fill in the matrix yourself. Don’t delegate it to the most enthusiastic engineer either. Have every individual on the team score all seven dimensions privately. Oh, and have the leaders also score the teams and their members. Full 360 truth. Then sit together and discuss the outliers, planning poker style.

You’d be surprised. The tech lead scores team process green: we have a guild, we have a channel! The junior scores it red: nobody really reviews my agent PRs, they just sigh and approve. That gap is the conversation your team needed to have for months. The discussion is the deliverable. The matrix is just the excuse.
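If you collect the private scores in a spreadsheet or a form, finding the outliers is a few lines of code. Here is a minimal sketch, not a prescribed tool: the participant names, the dimension keys, and the "two levels apart" threshold are all my assumptions for illustration.

```python
from statistics import median

# Levels run from 1 (red, ad hoc) to 5 (blue, optimized), as in the matrix.
DIMENSIONS = [
    "autonomy", "lifecycle_coverage", "readiness", "scope",
    "team_process", "governance", "measurement",
]


def find_outliers(scores, min_spread=2):
    """Return the dimensions whose private scores differ by at least min_spread.

    scores maps a (hypothetical) participant name to {dimension: level}.
    min_spread=2 is an arbitrary threshold, not part of the matrix.
    """
    outliers = {}
    for dim in DIMENSIONS:
        votes = {person: s[dim] for person, s in scores.items() if dim in s}
        if not votes:
            continue
        spread = max(votes.values()) - min(votes.values())
        if spread >= min_spread:
            outliers[dim] = {
                "spread": spread,
                "median": median(votes.values()),
                "lowest": min(votes, key=votes.get),
                "highest": max(votes, key=votes.get),
            }
    return outliers


# Made-up example: the tech lead and the junior disagree on team process.
scores = {
    "tech_lead": {"team_process": 4, "governance": 3},
    "junior": {"team_process": 1, "governance": 3},
    "senior": {"team_process": 3, "governance": 2},
}
result = find_outliers(scores)
print(result)
```

The output is an agenda, not a verdict: it tells you which dimensions to talk about first and who should open the conversation.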

Do write down today’s colors and the colors you want two quarters from now. That’s your roadmap. It’s also your shield: use it to keep your trigger-happy CEO apprised of where you really are. In colors a board deck can absorb.

And whatever you do, don’t put a meter on token consumption and dock the pay of developers who don’t burn enough. Just, don’t. Measure outcomes, not input. Pair every throughput metric with a stability twin: PR throughput with revert rate, deployment frequency with change-failure rate, and cost per merged PR against your pre-AI baseline. Otherwise you’re measuring how fast the team digs, not whether the hole is in the right place.

AI doesn’t skip the boring phases

Your board whispers that AI-native competitors are breathing down your neck. Your C-level gets more impatient by the week. So why isn’t the AI initiative live yet?

Because every project still grows up the same way, AI or not. Resourcing: finding someone to actually run with it. Requirements: talking to the business, understanding the legacy code, disambiguating, and aligning on the architecture choices while everyone has their say. Building the infrastructure. Then a first MVP or proof of concept, tested internally or with a few friendly customers. Baking in the learnings. Productizing. And only then is there an AI-enabled business outcome you can brag about.

There’s a second trap hiding in that MVP phase: staying in the lab. You can adopt the latest autonomous-agentic-workflow-orchestration platform (insert this quarter’s jargon), but if it only ever runs in the comfort of your own laptop, safely behind the corporate firewall… You’ll have a hard time bringing it out into the real, scary world. Simple technology in a complex environment is still a giant leap. Take baby steps, but take them out there in the real world.

ING’s not-so-simple chatbot
ING adopted a customer-facing chatbot. Building the bot took a few weeks. Getting it through corporate security and compliance not only took months, it bloated the code base by multiple orders of magnitude. A simple chatbot. Duh. But a simple chatbot under a banking license, at a big corporate with lots to lose, stops being simple technology. That’s rocket-ship engineering.

The matrix doesn’t let you skip phases. It shows which friction you can remove and which impatience you simply have to manage. Sharing your colors and your next gate with the C-level beats promising magic. Magic slips. Checklists don’t.

The full matrix

Yes, my inner consultant built a seven-dimension, five-color matrix. At least this one comes without the invoice. Steal it, adapt it, and run it quarterly.

A team can be advanced on one dimension and immature on another; the assessment exists to surface exactly that mismatch. Aim to lift the lowest color, not the highest.

| Dimension | Question it answers |
|---|---|
| Autonomy | How much does the agent do unattended? |
| Lifecycle Coverage | How much of the delivery lifecycle beyond code authoring does AI cover: tickets, review, tests, dependencies, incidents, customer issues? |
| Codebase & Platform Readiness | Can the codebase and the platform around it support agent work: encoded knowledge, trustworthy tests, fast feedback, live environments to verify in, observability to debug with? |
| Scope | How far do the practices reach: one dev, the team, the org? |
| Team Process | Do shared rituals, norms, and ownership turn individual usage into team practice? |
| Governance | Are guardrails, ownership, and policy in place? |
| Measurement | Can we prove impact, track cost, show ROI, and catch regressions? |

Core principle: a team is only as mature as its weakest dimension. Tool count does not predict outcomes; the most common failure mode is autonomy running ahead of governance and measurement, producing unauditable, unmeasurable output.

Placement rule for organizational structures: structures are means, not dimensions. Score them by purpose: structures that spread practice (champions, guilds, SIGs) count under Team Process; structures that exercise decision rights (AI board, steering body, harness owner) count under Governance.

Vocabulary: the “agent harness” is the shared setup around the agents themselves: context files, skills (reusable, versioned task instructions), permissions, CI hooks, and review automation. A “flow” is a recurring agent task that runs as part of delivery, such as ticket enhancement, dependency updates, or CI failure triage.

The master assessment matrix

Assess each dimension independently. Circle one cell per row. The table is dense by design: it is the summary. The checklists after it unpack every cell into things you can verify instead of feel.

| Dimension | 🔴 1 - Ad hoc | 🟠 2 - Emerging | 🟡 3 - Established | 🟢 4 - Managed | 🔵 5 - Optimized |
|---|---|---|---|---|---|
| Autonomy | Code completion + Q&A chat, used occasionally | Interactive agents (IDE/CLI), human approves every step | Context-engineered: repo context files, skills, spec-driven + test-driven development make agents reliable; model and effort level chosen per task | Supervised autonomous: multi-hour unattended sessions (features, refactors) with human review at checkpoints; token-optimized flows (caching, routing, right-sized models) | Orchestrated: agents self-source work from tickets and change requests; humans write requirements, not code; cost-per-outcome tuned continuously |
| Lifecycle Coverage | Code authoring and Q&A only | Dev-adjacent assists: commit messages, docs, unit tests, PR descriptions; ad-hoc CLI use | Integrated with team systems via MCP/CLI: ticket enhancement, AI code review on every PR, automated dependency-update PRs | Pipeline-embedded: e2e test creation & repair, CI failure triage with fix PRs, auto-updated docs and release notes | Closed loop with production: agents diagnose incidents and open fix PRs; customer issues reproduced and fixed by flows |
| Codebase & Platform Readiness | Knowledge is tribal; tests cover lines, not behavior; setup is undocumented; deploys are manual console clicks | Root context file as a map (not an encyclopedia); reproducible dev environment; lint, types, and pre-commit enforced; CI on every PR; scripted, repeatable deploys | Behavioral tests agents can treat as ground truth; architectural invariants enforced by build/CI; tacit knowledge encoded (ADRs, scoped context files, do-not-touch paths); infrastructure as code changed via reviewed PRs; e2e tests on every PR; structured, centralized logging | Fast CI sized for agent iteration; flaky tests quarantined; codebase queryable through tools (code graph, find-callers) instead of prompt-stuffing; ephemeral preview environment per PR, torn down on merge; distributed tracing across services; visual regression tests guard UI changes | Self-describing repo: every agent task leaves artifacts (tests, invariants, context) that make the next run more reliable; agents verify their own changes against preview environments and telemetry; readiness tracked per repo |
| Scope | One or two enthusiasts, personal setups | Pockets of practice; informal tips circulate | Team-wide: shared skills, context files, and permissions versioned in the repo | Multi-team: shared standards, common agent harness, reusable patterns | Org-wide: skill/context registry, standard harness with a named owner, onboarding includes AI practices |
| Team Process | No shared rituals; AI-generated code/docs thrown over the wall into review | AI is a recurring retrospective topic; tips and prompts shared informally; early champions visible | Written “good PR” norms for AI work (author = first reviewer); dedicated learning time; learnings documented | Guild / SIG / community of practice; named process owner; PR templates require evidence; team-based (not individual) incentives | Process changes trialed and measured before rollout; learnings institutionalized into the harness, skills, and onboarding |
| Governance | No policy; shadow usage on personal accounts | Written acceptable-use guidance; tool inventory exists | Guardrails: security scanning on AI output; “no AI slop” norm enforced in PRs and docs | Owned: AI board / steering body with decision rights; agent harness has a named owner; permissions are identity-based; review standards for AI code are explicit; cloud-org guardrails bound what agents can provision | Policy-as-code: automated quality and security gates; compliance enforced at every stage of the pipeline; agent execution sandboxed from production credentials |
| Measurement | No baseline; productivity is anecdotal; AI spend unknown | Usage tracked: licenses, active users, acceptance rate; tool/subscription spend visible | Outcomes: pre-rollout baseline; cycle time, review time, defect rate attributable to AI; token spend attributed to team/feature | Paired metrics on a live dashboard; unit economics (cost per merged PR / completed task); budgets and alerts per use case | ROI proven: business value (time saved, defects avoided, outcomes shipped) vs full cost drives go/no-go decisions |

Level descriptors and checklists

Check every box in a level before claiming it, or write down why a box does not apply (e.g. dependency-update PRs in a repo without external dependencies). The highest fully-checked level is the score. Levels build on each other, but a higher practice supersedes the lower one it replaces: checkpoint review at level 4 satisfies the level 2 expectation of human oversight even though nobody approves every step anymore.
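For those who prefer the scoring rule spelled out, here is a rough sketch of how I would compute it. It is one possible reading, not an official scorer: I assume levels are cumulative (a gap at level 3 caps the score at 2), that a written reason counts as a cleared box, and the sample checklists are invented.

```python
def dimension_level(checklists):
    """Return the highest fully-checked level for one dimension.

    checklists maps a level (2..5) to a list of boxes. A box is True (checked),
    False (unchecked), or a non-empty string: the written reason why the box
    does not apply or is superseded by a higher practice.
    """
    level = 1  # level 1 is the default state and requires nothing
    for lvl in (2, 3, 4, 5):
        boxes = checklists.get(lvl, [])
        cleared = bool(boxes) and all(
            box is True or (isinstance(box, str) and box.strip() != "")
            for box in boxes
        )
        if not cleared:
            break  # assumption: levels are cumulative, so stop at the first gap
        level = lvl
    return level


def team_maturity(dimensions):
    """Score every dimension and name the weakest one."""
    levels = {name: dimension_level(boxes) for name, boxes in dimensions.items()}
    weakest = min(levels, key=levels.get)
    return levels, weakest


# Invented example data for three of the seven dimensions.
team = {
    "autonomy": {
        2: [True, True],
        3: [True, True, True, True],
        4: [True, True, True, False],  # one open box: level 4 not claimed
    },
    "lifecycle_coverage": {
        2: [True, True],
        3: [True, True, True, "repo has no external dependencies"],
    },
    "governance": {
        2: [True, True],
        3: [True, False, True],
    },
}
levels, weakest = team_maturity(team)
print(levels, "-> lift this one first:", weakest)
```

The second return value is the one that matters: it names the dimension to lift first, which is rarely the one the team wants to talk about.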

Autonomy

Autonomy is the dimension everyone obsesses over: how much does the agent do without you? The road runs from autocomplete, through agents you approve step by step, to context-engineered setups you can trust with a whole feature, and finally to agents that pick up their own work from the ticket queue. Resist the urge to sprint up this ladder. Every level you skip comes back later as review pain.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: occasional code completion and Q&A chat, or no use at all)

🟠 2 - Emerging

  • Interactive agents are a normal way code gets written
  • Humans review and approve each agent step

🟡 3 - Established

  • Repo contains agent context files (e.g. CLAUDE.md / AGENTS.md / rules)
  • Reusable skills exist for recurring tasks
  • Spec-driven development and TDD anchor agent work
  • Model and effort level are chosen deliberately per task (small/fast for boilerplate, frontier/high-effort for design), not left on one default

🟢 4 - Managed

  • Multi-hour unattended sessions deliver real features or refactors
  • Human review happens at defined checkpoints, not every step
  • At least one flow runs non-interactively end-to-end without prompting (which flows is scored under Lifecycle Coverage)
  • Flows are token-optimized: prompt caching, context trimming, and model routing are applied where they cut cost without hurting quality

🔵 5 - Optimized

  • Agents pick up work directly from tickets/change requests
  • Engineering effort centers on requirements and review, not code entry
  • Model/effort routing is tuned continuously against cost-per-outcome, judged by total token efficiency (tokens to complete the task), not price-per-token

Lifecycle Coverage

Code authoring is the obvious use of AI, and the smallest prize. Most engineering hours go to everything around the code: tickets, reviews, tests, dependencies, incidents, and customer issues. This dimension scores breadth across that lifecycle. Score what AI touches; how unattended each flow runs is the Autonomy score, and whether it’s owned and affordable is Governance/Measurement.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: AI touches code authoring and Q&A at most)

🟠 2 - Emerging

  • AI assists dev-adjacent tasks interactively: commit messages, documentation, unit test generation, PR descriptions
  • Developers use AI from the CLI against local tooling (logs, git history, build output), not just inside the editor

🟡 3 - Established

  • Agents are integrated with team systems via MCP/CLI: issue tracker, wiki, CI/CD (e.g. Jira, Confluence, GitHub)
  • Ticket enhancement runs as a flow: acceptance criteria, repro steps, and linked context added automatically
  • AI code review runs as a first pass on every PR
  • Dependency/version-update PRs are generated and pre-validated automatically

🟢 4 - Managed

  • End-to-end test creation and repair is agent-driven: missing coverage proposed, failing or flaky tests triaged and fixed by a flow
  • CI/build failures receive automated triage and a proposed fix PR
  • Documentation and release notes update automatically from merged changes

🔵 5 - Optimized

  • Production incident monitoring feeds agents that diagnose and open fix PRs with supporting evidence
  • Customer-reported issues are reproduced and fixed (or a fix proposed) by automated flows
  • The loop is closed: production and customer signals route to agents without a human dispatcher; humans approve merges, not initiate work

Codebase & Platform Readiness

Whether the codebase and the platform it runs on can support agent work at all. AI is an amplifier: teams with loosely coupled architectures, trustworthy tests, and fast feedback see gains, while teams on tightly coupled systems with tribal knowledge see little, regardless of how mature their practices are. The intelligence of an agent-driven setup lives less in the model than in the instructions, tests, and feedback loops around it. The platform half follows the same logic (CNCF platform engineering maturity model): an agent can only be trusted unattended if its change can be deployed somewhere disposable, verified mechanically, and observed in production. Infrastructure as code matters doubly here: infra defined as text is infra agents can read, modify, and have reviewed like any other PR. Score this per repository/service where agents actually work; the weakest one agents touch unattended is the score.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: knowledge is tribal, tests cover lines rather than behavior, environment setup is passed around by word of mouth, deploys are manual console clicks)

🟠 2 - Emerging

  • A README and a root context file exist and work as a map, not an encyclopedia: what this repo is, how to build and test it, where to look next
  • The dev environment is reproducible (devcontainer, nix, or equivalent): an agent or new hire gets from clone to a passing test run without human help
  • Linting, type checks, and pre-commit hooks are enforced, so agents get mechanical feedback before review
  • CI runs on every PR, and deployment is scripted and repeatable, not a sequence of manual console steps in one person’s head

🟡 3 - Established

  • Test suites verify behavior, not just line coverage; a green run is evidence an agent can rely on
  • Architectural invariants are enforced mechanically: forbidden imports, layer violations, and contract breaks fail the build rather than depending on reviewer memory
  • Tacit knowledge is encoded in the repo: ADRs capture why decisions were made, per-directory context files cover local conventions, do-not-touch paths are declared
  • Infrastructure is defined as code (Terraform/OpenTofu, Pulumi, CloudFormation, or equivalent) and changed through reviewed PRs, with no hand-edited cloud resources
  • End-to-end tests run on every PR, not just nightly or on main
  • Logging is structured and centralized; “what happened” is answerable from a log query, not from SSH-ing into machines
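To make "fail the build rather than depending on reviewer memory" concrete, here is a small sketch of a forbidden-import check for a Python codebase, using only the standard library. The layer names and the rule table are hypothetical; a real setup would more likely use a dedicated architecture-linting tool and run it in CI.

```python
import ast

# Hypothetical layering rule: code in the "domain" layer must not import
# from the "infrastructure" or "web" packages.
FORBIDDEN = {"domain": ("infrastructure", "web")}


def forbidden_imports(source, layer):
    """Return sorted (line, module) pairs for imports that break the layer rule."""
    banned = FORBIDDEN.get(layer, ())
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            if name.split(".")[0] in banned:
                hits.append((node.lineno, name))
    return sorted(hits)


# Made-up module contents standing in for a file in the domain layer.
sample = (
    "import os\n"
    "from infrastructure.db import Session\n"
    "import web.handlers\n"
)
violations = forbidden_imports(sample, "domain")
for line, module in violations:
    print(f"line {line}: domain code must not import {module}")
```

Wire a check like this into CI with a non-zero exit code on any violation, and the agent gets the same mechanical "no" a human would get, minutes after it tried.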

🟢 4 - Managed

  • CI feedback is fast enough for agent iteration loops; slow suites are split, parallelized, or staged
  • Flaky tests are detected and quarantined automatically, so agents don’t chase phantom failures
  • Agents query the codebase through tools (code graph, find-callers, deterministic cross-repo search) rather than loading whole files into context
  • Reviewing an agent PR does not require tribal knowledge that is absent from the repo
  • Every PR gets an ephemeral preview environment, spun up automatically and torn down on merge/close, so reviewers and agents verify against a live deployment instead of a description
  • Requests are traced end-to-end across services (e.g. OpenTelemetry); a developer or agent debugging a failure can follow one request through the system
  • UI changes are guarded by visual regression tests (pixel-level snapshots compared against a baseline on every PR), so “looks right” is checked mechanically, not eyeballed

🔵 5 - Optimized

  • Every agent task cycle leaves the repo more legible: characterization tests, captured invariants, and updated context files are part of the definition of done
  • Agents verify their own changes the way a reviewer would: deploy to the preview environment, run e2e and visual checks, read logs and traces, and attach that evidence to the PR
  • Readiness is scored per repository and tracked over time; gaps block autonomy expansion on that repo

Scope

Scope measures reach: one enthusiast, a team, several teams, or the whole org. The anti-pattern here is the hero setup, one developer’s lovingly tuned private toolbox that walks out the door with them. Maturity means the setup lives in the repo and every new joiner inherits it by default.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: one or two enthusiasts on personal setups, or no use at all)

🟠 2 - Emerging

  • Multiple developers share tips and configurations informally

🟡 3 - Established

  • Skills, context files, and permissions are shared and versioned in the repo
  • New team members inherit the setup by default

🟢 4 - Managed

  • Multiple teams use a common harness and shared standards
  • Patterns proven on one team transfer to others deliberately

🔵 5 - Optimized

  • Org-level registry of skills and context
  • The harness has a named owner and a roadmap
  • AI practices are part of engineering onboarding

Team Process

The over-the-wall scenes all trace back to this dimension, and it’s the one most teams skip. Tools spread in days; norms take quarters. Team process is where individual usage turns into shared practice: rituals, written norms, and learning time that make the whole team better instead of one enthusiast faster.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: no shared rituals, AI output goes straight into review unexamined)

🟠 2 - Emerging

  • AI adoption is a recurring retrospective topic: what’s working, what’s failing, where AI saves time vs creates friction
  • Effective prompts and failure modes are shared informally (channel, doc, demos)
  • At least one named AI champion per team experiments publicly and teaches peers

🟡 3 - Established

  • A written “good PR” definition for AI-assisted work exists and is enforced (see box below)
  • Dedicated, recurring learning time exists (e.g. a monthly AI learning day or guild hour)
  • Learnings, good and bad, are documented in a shared place rather than only discussed
  • Training is peer-to-peer, demo-driven; champions show, not mandate
  • Juniors run investigations and write designs themselves; AI may type the fix, but the diagnosis and the design doc are the junior’s deliverable
  • The team is fluent with revert/rollback and uses it routinely; it is the safety net that makes higher change volume survivable

🟢 4 - Managed

  • A guild / SIG / community of practice connects champions across teams, with a regular cadence and visible backlog
  • A named process owner is accountable for AI workflow norms and their evolution
  • PR templates require evidence over assertion: links to tests, scans, dashboards, rollback plan
  • Incentives are team-based, not individual usage quotas (competition kills knowledge sharing)
  • Norms are revisited on a cadence as tools and practices change
  • Role expectations and onboarding are updated for AI-era work; mentorship is a measured, funded responsibility rather than something seniors fit in around delivery pressure

🔵 5 - Optimized

  • Process changes are trialed on one team and measured before broad rollout
  • Retro learnings flow into the harness: context files, skills, and review rules get updated, not just discussed
  • AI workflow norms are part of onboarding; new joiners inherit them on day one
  • Cross-team learning loop runs on its own (writeups, demos, guild backlog) without a manager pushing it
  • The team can show that engineers still grow into seniors under heavy AI use: investigation skill, design ownership, and debugging depth are developing, not atrophying

What “good PR” means for AI-assisted work: the anti-over-the-wall standard
  1. Author is the first reviewer. Every generated line is read, understood, and owned before requesting review. “AI wrote it” is not an abstraction boundary.
  2. Explainability test. The author can answer: why this approach, what breaks if it’s wrong, how it’s observed in production, how it’s reverted. An author who can’t answer is not ready to merge.
  3. Evidence over assertion. The PR links to passing tests, scan results, and a rollback path, not just a description.
  4. Small and scoped. Generation speed is not a license for novel-sized PRs; the size norms tighten, not loosen.
  5. AI reviews first, humans decide. Automated review catches style and boilerplate issues before a human spends time; humans judge design, correctness, and risk.

Governance

Guardrails should be proportionate to blast radius: an interactive pilot needs lighter controls than an unattended flow touching production. Applying the full org-wide policy stack to every experiment kills adoption. Letting unattended flows run on pilot-level controls is the danger zone. Tighten controls as autonomy and coverage grow.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state)

🟠 2 - Emerging

  • Written acceptable-use policy exists and is findable
  • Inventory of approved tools is maintained

🟡 3 - Established

  • Security scanning runs on AI-generated output
  • Low-effort AI output (“slop”) is rejected in PRs and documents, and the norm is enforced
  • Data-handling rules for prompts/context are defined

🟢 4 - Managed

  • An AI board / steering body exists with decision rights over tool approval, risk appetite, and investment
  • The agent harness has a single named owner
  • Agent permissions are identity-based and least-privilege
  • Review standards for AI-authored code are written down
  • AI-authored changes are tagged at commit/PR level so they can be traced and audited; this is also what makes the AI-vs-non-AI comparisons under Measurement possible
  • Budgets and spending authorization are assigned per team/use case, with alerts on threshold breach (Governance decides who may spend what; Measurement tracks the spend against it)
  • A model-selection policy defines which models are approved for which task classes (and which data may reach them)
  • Flows that feed untrusted content to agents (tickets, customer reports, web pages) are threat-modeled for prompt injection, and what those agents can reach is bounded accordingly
  • Cloud-org-level guardrails (a landing zone with SCPs / Azure Policy / GCP org policies) bound what anyone, human or agent, can provision: regions, encryption, security services that cannot be disabled
  • Agents get scoped, short-lived cloud credentials per task, never standing admin access; infra changes flow through the IaC pipeline, not direct console/API access

🔵 5 - Optimized

  • Quality and security gates are automated (policy-as-code, e.g. OPA / Sentinel / Cloud Custodian), ratified by the AI board
  • Compliance checks run at every pipeline stage
  • Unattended agent execution is sandboxed (isolated runtime, no reach into production credentials or data beyond the task’s grant), with its actions logged and auditable
  • Incidents involving AI output have a defined response path
  • Cost governance is continuous: spend reviews on a regular cadence with the teams that own the agents, not a year-end accounting exercise

Measurement

Measurement is where most adoption stories quietly fall apart: no baseline, no attribution, no idea what the real spend is. Two habits carry this whole dimension.

Capture a baseline before you roll anything out, and never track a throughput metric without its stability twin. Everything else builds on those two.

🔴 1 - Ad hoc

  • (Nothing required; this is the default state: no baseline, anecdotal productivity claims, unknown AI spend)

🟠 2 - Emerging

  • License count, active usage, and acceptance rate are tracked
  • Total tool/subscription spend is visible in one place (including shadow usage on personal licenses, or an honest estimate of it)

🟡 3 - Established

  • A pre-rollout baseline was captured, ideally per engineer, against their own pre-AI history
  • Cycle time, PR review time, and defect rate are attributable to AI vs non-AI work
  • Token/API spend is attributed: every agent call tagged with team, feature/flow, and model
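Attribution only works if the tags are actually there. The sketch below shows the idea with invented call records: the field names (`team`, `flow`, `model`, `cost_cents`) and the numbers are my assumptions, not any vendor's billing format.

```python
from collections import defaultdict

REQUIRED_TAGS = ("team", "flow", "model")


def untagged(calls):
    """Return the calls that are missing at least one required tag."""
    return [c for c in calls if any(not c.get(tag) for tag in REQUIRED_TAGS)]


def spend_by(calls, *keys):
    """Sum cost per combination of the given tag keys (costs in integer cents)."""
    totals = defaultdict(int)
    for call in calls:
        group = tuple(call.get(k, "UNTAGGED") for k in keys)
        totals[group] += call["cost_cents"]
    return dict(totals)


# Invented agent-call log.
calls = [
    {"team": "payments", "flow": "ticket-enhancement", "model": "small", "cost_cents": 40},
    {"team": "payments", "flow": "ci-triage", "model": "frontier", "cost_cents": 310},
    {"team": "search", "flow": "ci-triage", "model": "frontier", "cost_cents": 220},
    {"team": "search", "model": "small", "cost_cents": 15},  # missing flow tag
]
by_team = spend_by(calls, "team")
by_flow_model = spend_by(calls, "flow", "model")
missing = untagged(calls)
print(by_team, by_flow_model, len(missing))
```

Treat the untagged list as a defect count: every call that lands there is spend nobody can defend in a budget review.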

🟢 4 - Managed

  • Every throughput metric is paired with a stability metric (see table below)
  • The dashboard is live and reviewed on a cadence
  • Unit economics are tracked: cost per merged PR, cost per completed task/flow, tokens per active developer
  • Spend is tracked against the budgets Governance assigns, per use case, and optimization changes show up in the dashboard

🔵 5 - Optimized

  • Workflow changes are trialed and measured before scaling
  • ROI is computed against business outcomes: (value generated - full cost) / full cost, where value = baselined time savings + defects avoided + outcomes shipped
  • Full cost includes the hidden parts: added review time, rework, and harness maintenance, not just subscriptions and tokens
  • ROI and cost-per-outcome drive go/no-go decisions on expanding autonomy and on model/tool choices
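To make the level 5 formula concrete, here is a minimal sketch in Python. The field names follow the bullets above; every number is made up for illustration (think one quarter, in euros), and how you put a price on "outcomes shipped" is an assumption you will have to defend yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """Value generated, per the definition above (hypothetical amounts)."""
    baselined_time_savings: float
    defects_avoided: float
    outcomes_shipped: float

    def total(self) -> float:
        return self.baselined_time_savings + self.defects_avoided + self.outcomes_shipped


@dataclass(frozen=True)
class FullCost:
    """Visible cost plus the hidden parts (hypothetical amounts)."""
    subscriptions: float
    tokens: float
    added_review_time: float
    rework: float
    harness_maintenance: float

    def total(self) -> float:
        return (self.subscriptions + self.tokens + self.added_review_time
                + self.rework + self.harness_maintenance)


def roi(value_generated: float, cost: float) -> float:
    """(value generated - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_generated - cost) / cost


# Illustrative numbers only.
value = Value(baselined_time_savings=60_000, defects_avoided=15_000, outcomes_shipped=45_000)
cost = FullCost(subscriptions=12_000, tokens=18_000,
                added_review_time=25_000, rework=15_000, harness_maintenance=10_000)

full_roi = roi(value.total(), cost.total())                          # 0.5
visible_only_roi = roi(value.total(), cost.subscriptions + cost.tokens)  # 3.0

print(f"ROI on full cost:         {full_roi:.0%}")
print(f"ROI on subscriptions+tokens only: {visible_only_roi:.0%}")
```

With these invented numbers, counting only subscriptions and tokens reports a 300% return, while the full-cost view reports 50%. The gap between the two is exactly the hidden cost the bullets warn about.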

Throughput ↔ stability pairs. Never track the left column alone:

| Throughput | Pair with |
| --- | --- |
| PR throughput per developer | PR revert rate |
| Deployment frequency | Change-failure / rework rate |
| Lead time for changes | PR review cycle time (AI vs non-AI) |
| Tasks completed per developer | Defect rate of AI code vs human baseline |
| Lines of code added | Duplication % and complexity trend |

The last pair catches slow-burn debt that defect rates miss: AI inflates “tasks completed” most cheaply by cloning code, and the cost lands in maintenance years later.
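One way to enforce "never track the left column alone" is to make the reporting code refuse. The sketch below assumes your metrics arrive as a plain dictionary; the metric keys are hypothetical names for the pairs listed above, not the identifiers of any particular tool.

```python
# Throughput metric -> its stability twin. Keys are hypothetical names
# for the pairs listed above.
STABILITY_TWIN = {
    "pr_throughput_per_dev": "pr_revert_rate",
    "deployment_frequency": "change_failure_rate",
    "lead_time_for_changes": "pr_review_cycle_time",
    "tasks_completed_per_dev": "ai_vs_human_defect_rate",
    "lines_of_code_added": "duplication_pct",
}


def paired_report(metrics: dict[str, float]) -> list[tuple[str, float, str, float]]:
    """Return (throughput, value, twin, twin_value) rows.

    Raises ValueError if a throughput metric shows up without its twin.
    """
    rows = []
    for name, value in metrics.items():
        twin = STABILITY_TWIN.get(name)
        if twin is None:
            # Not a known throughput metric; stability metrics are
            # reported through their throughput partner instead.
            continue
        if twin not in metrics:
            raise ValueError(f"{name} reported without its stability twin {twin}")
        rows.append((name, value, twin, metrics[twin]))
    return rows


report = paired_report({"deployment_frequency": 12.0, "change_failure_rate": 0.08})
for throughput, t_value, stability, s_value in report:
    print(f"{throughput}={t_value} | {stability}={s_value}")
```

Passing only `{"deployment_frequency": 12.0}` raises instead of rendering a lonely, flattering number.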

Cost & ROI ladder. How cost measurement matures alongside:

| Level | Cost question you can answer |
| --- | --- |
| 🟠 2 | “What do we spend on AI tools in total?” |
| 🟡 3 | “Which team, flow, and model is the spend going to?” |
| 🟢 4 | “What does a merged PR / completed flow cost, and is it trending right?” |
| 🔵 5 | “Is the value generated worth the full cost and where do we invest or cut next?” |
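The level 3 question is only answerable if every agent call carries its tags. A minimal sketch, assuming you log one record per call: the record shape, the team and flow names, and the prices per million tokens are all hypothetical. Swap in your provider's real price sheet.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; not any vendor's real rates.
PRICE_PER_MTOK = {
    "model-large": {"input": 3.00, "output": 15.00},
    "model-small": {"input": 0.25, "output": 1.25},
}


@dataclass(frozen=True)
class AgentCall:
    """One logged agent call, tagged with team, flow, and model."""
    team: str
    flow: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(call: AgentCall) -> float:
    """Cost of a single call in USD, from token counts and the price table."""
    price = PRICE_PER_MTOK[call.model]
    return (call.input_tokens * price["input"]
            + call.output_tokens * price["output"]) / 1_000_000


def spend_by(calls, *keys):
    """Sum cost per combination of tag values, e.g. spend_by(calls, "team", "model")."""
    totals = defaultdict(float)
    for call in calls:
        totals[tuple(getattr(call, key) for key in keys)] += call_cost(call)
    return dict(totals)


calls = [
    AgentCall("payments", "code-review", "model-large", 1_000_000, 100_000),
    AgentCall("payments", "ticket-enrichment", "model-small", 2_000_000, 400_000),
    AgentCall("platform", "code-review", "model-large", 500_000, 50_000),
]

by_team = spend_by(calls, "team")
by_flow_and_model = spend_by(calls, "flow", "model")
print(by_team)
print(by_flow_and_model)
```

Divide any of these totals by merged PRs or completed flows over the same period and you have the level 4 unit economics.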

Four cost warnings from the trenches. Activity is not impact: more PRs is not more business. So tie value to outcomes, never to output counts. Don’t shop on price-per-token: a cheap model that burns triple the tokens or produces rework is the expensive one, so judge by tokens-to-done and cost-per-outcome. Know that the subscription is just the visible tip: the real money leaks into review time, rework cycles, and harness babysitting. All booked in other budgets. And treat cost as a live dial, not something finance discovers in December.
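The price-per-token trap is plain arithmetic. In the sketch below, both models and all prices and token counts are invented for illustration; "tokens to done" means every token burned until the task is actually finished, retries included.

```python
def cost_per_outcome(price_per_mtok: float, tokens_to_done: int) -> float:
    """Cost of one finished task in USD, given price per million tokens."""
    return price_per_mtok * tokens_to_done / 1_000_000


# Hypothetical models: the cheap one costs 60% less per token
# but needs triple the tokens to get the same task done.
premium = cost_per_outcome(price_per_mtok=10.0, tokens_to_done=200_000)
cheap = cost_per_outcome(price_per_mtok=4.0, tokens_to_done=600_000)

print(f"premium model: ${premium:.2f} per finished task")
print(f"cheap model:   ${cheap:.2f} per finished task")
```

Here the "cheap" model ends up 20% more expensive per finished task, before counting a single minute of extra review or rework.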

Why so paranoid about measurement? Because the evidence all points the same way. You met METR’s slower-but-feeling-faster developers and DORA’s throughput-up-stability-down pattern earlier in this post (same survey: three in ten developers don’t trust the output they ship anyway). It gets worse. Repositories that adopt coding agents pick up roughly 18% more static-analysis warnings and 39% more cognitive complexity. And it sticks for months (MSR 2026). When DryRun Security let three frontier coding agents build two applications, 87% of the pull requests shipped at least one security vulnerability. Your gut says you’re flying. The instruments say: measure.

Verification gates

The dominant stall point in practice is the verification tax: code generation speeds up first, while review, governance, and deployment capacity lag. Time saved generating is re-spent auditing. Clearing one constraint surfaces the next, moving upstream: review capacity first, then QA, then the quality of requirements themselves. Three gates must hold before advancing Autonomy:

| Gate | Before advancing to | Must be true |
| --- | --- | --- |
| G1 - Review capacity | Autonomy 🟡 3 | Review time per PR is flat or falling even as PRs are produced faster |
| G2 - Proven outcome | Autonomy 🟢 4 | A business outcome (not a developer feeling) has measurably changed because of AI |
| G3 - Governance first | Autonomy 🔵 5 | Governance is at 🟢 4+ before scaling, not retrofitted after |
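The gates are simple enough to write down as a check. This is a sketch, not a policy engine: the evidence fields are hypothetical, and it assumes the gates are cumulative, so moving to Autonomy 4 also requires G1 to still hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateEvidence:
    """Hypothetical evidence a team brings to the gate review."""
    review_minutes_per_pr_before: float  # baseline
    review_minutes_per_pr_now: float
    business_outcome_changed: bool       # measured and attributable to AI
    governance_level: int                # 1..5 on the Governance dimension


def blocking_gates(target_autonomy: int, evidence: GateEvidence) -> list[str]:
    """Return the gates that block advancing to target_autonomy (empty = go)."""
    blockers = []
    # G1: review time per PR must be flat or falling.
    if (target_autonomy >= 3
            and evidence.review_minutes_per_pr_now > evidence.review_minutes_per_pr_before):
        blockers.append("G1 review capacity")
    # G2: a business outcome must have measurably changed.
    if target_autonomy >= 4 and not evidence.business_outcome_changed:
        blockers.append("G2 proven outcome")
    # G3: Governance must already be at level 4 or higher.
    if target_autonomy >= 5 and evidence.governance_level < 4:
        blockers.append("G3 governance first")
    return blockers


evidence = GateEvidence(
    review_minutes_per_pr_before=45.0,
    review_minutes_per_pr_now=40.0,
    business_outcome_changed=False,
    governance_level=2,
)
print(blocking_gates(3, evidence))
print(blocking_gates(5, evidence))
```

With this evidence the team may move to Autonomy 3, but levels 4 and 5 stay closed until an outcome is proven and Governance catches up.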

Reading the profile

The score itself is not the point; the pattern is. Write the seven colors side by side (e.g. Autonomy 🟢 · Coverage 🟠 · Readiness 🟠 · Scope 🟡 · Team Process 🟠 · Governance 🔴 · Measurement 🔴) and match the pattern:

| Profile pattern | Diagnosis | Action |
| --- | --- | --- |
| Autonomy ahead of Governance + Measurement | Danger zone: unauditable, unmeasurable autonomous output | Freeze autonomy expansion; raise Governance and Measurement first |
| Autonomy deep, Coverage narrow | AI is a coding tool, not a delivery system; gains are capped by everything around the code | Expand sideways: code review, tests, tickets, dependencies before pushing autonomy further |
| Coverage broad, Governance + Measurement low | Flow sprawl: automations multiply without owners, budgets, or success metrics | Inventory every flow; assign an owner and a cost/outcome metric to each, retire the rest |
| Autonomy ahead of Team Process | Over-the-wall mode: generation outpaces shared norms, and review friction and resentment build | Adopt the “good PR” standard and start AI retros before expanding autonomy |
| Autonomy ahead of Codebase & Platform Readiness | Agents are guessing: output compiles and passes review, then breaks on undocumented invariants weeks later, and nobody sees it break because logging and tracing are absent | Encode invariants, behavioral tests, and tacit knowledge into the repo, give agents a live environment to verify in and telemetry to observe, then expand autonomy |
| Governance + Measurement ahead of Autonomy | Safe but slow | Push autonomy; the org can absorb the output |
| Autonomy ahead of Scope | Knowledge concentrated in individuals | Institutionalize: move personal skills/context into the repo before the knowledge walks out the door |
| Team Process ahead of Autonomy | Rituals without substance; meetings about AI exceed use of AI | Channel the guild’s energy into hands-on adoption targets |
| All dimensions level and low | Healthy early state | Clear the next verification gate before adding tools |
| All dimensions 🟢/🔵 | Leading | Shift focus to experimentation and stability optimization |
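If you score several teams, the pattern matching can be roughed out in code. Treat this as a conversation starter rather than an oracle: the dimension keys are hypothetical, and so is the threshold that "ahead" means at least two levels apart. The matrix itself does not prescribe a number.

```python
DIMENSIONS = ("autonomy", "coverage", "readiness", "scope",
              "team_process", "governance", "measurement")


def diagnose(profile: dict[str, int], gap: int = 2) -> list[str]:
    """Match a 1..5 profile against the patterns above.

    `gap` is an assumed threshold for "ahead of": at least this many levels.
    """
    missing = [d for d in DIMENSIONS if d not in profile]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")

    autonomy = profile["autonomy"]
    # Governance and Measurement are read together; the weaker one counts.
    control = min(profile["governance"], profile["measurement"])

    found = []
    if autonomy - control >= gap:
        found.append("Danger zone")
    if autonomy - profile["coverage"] >= gap:
        found.append("Coding tool, not a delivery system")
    if profile["coverage"] - control >= gap:
        found.append("Flow sprawl")
    if autonomy - profile["team_process"] >= gap:
        found.append("Over-the-wall mode")
    if autonomy - profile["readiness"] >= gap:
        found.append("Agents are guessing")
    if control - autonomy >= gap:
        found.append("Safe but slow")
    if autonomy - profile["scope"] >= gap:
        found.append("Knowledge concentrated in individuals")
    if profile["team_process"] - autonomy >= gap:
        found.append("Rituals without substance")

    if not found:
        levels = [profile[d] for d in DIMENSIONS]
        if max(levels) <= 2:
            found.append("Healthy early state")
        elif min(levels) >= 4:
            found.append("Leading")
    return found


# The example profile from the text: green=4, yellow=3, orange=2, red=1.
example = {"autonomy": 4, "coverage": 2, "readiness": 2, "scope": 3,
           "team_process": 2, "governance": 1, "measurement": 1}
print(diagnose(example))
```

For the example profile this flags the danger zone first, which matches the order of urgency: freeze autonomy, then fix the rest.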

So which tools should we adopt?

Wrong question, but everybody asks it. Tools change by the quarter, and owning a tool is no more maturity than owning running shoes is fitness. Treat what follows as orientation. And remember: scope and team process are practice dimensions. No purchase order gets you there.

The autonomy ladder starts innocent: Copilot completions and a ChatGPT or Claude tab. Then interactive agents in the editor, Cursor and friends. Level 3 is where it gets interesting: CLI agents like Claude Code, Codex CLI, or Google Antigravity, made reliable by context engineering. That means CLAUDE.md or AGENTS.md files, reusable skills, spec-driven development with GitHub Spec Kit or the like, and a TDD harness. Level 4 adds agent workflow platforms, custom AI code review, automated ticket enrichment, MCP integrations into Jira and Confluence, and scheduled agents such as Claude Code routines. Level 5, agents that resolve production issues through cloud orchestration platforms like Warp Oz, is real but rare. Don’t pretend you live there.

Does the top of the ladder exist outside conference keynotes? It does.

Warp’s bold move: prompts are the new pull requests
Warp open-sourced their agentic terminal, with a twist. Contributors submit prompts, not code. AI implements and reviews; the community shapes direction and verifies the result. Non-developers can finally contribute meaningfully to open source. How’s that for full-fledged AI adoption: not at project level, but changing the way open-source projects collaborate. Humans write requirements. Agents write code (and OpenAI pays the bill in this case).

Lifecycle coverage is about wiring AI deeper into delivery. Start with MCP integrations into your tracker and wiki, AI review as the first pass on every PR, and Renovate- or Dependabot-style update agents that pre-validate their own PRs. Next come agentic e2e test creation and repair, CI triage bots that propose fixes, and docs and release notes that update themselves. The end game routes incidents and support tickets straight to agents. Approve the PR on Sunday morning and go back to bed.

Codebase and platform readiness is the least sexy list and the biggest lever. Devcontainers or nix, a context file that works as a map, pre-commit hooks, CI on every PR. Then ADRs, architecture tests (ArchUnit, dependency-cruiser), per-directory context files, infrastructure as code (Terraform/OpenTofu, Pulumi), Playwright on every PR, and centralized structured logging. The serious money comes later: repo readiness scoring (think Factory.ai-style agent-readiness reports), code-graph and LSP tools served to agents over MCP, flaky-test detection, preview environments per PR (ArgoCD ApplicationSets, Vercel- or Northflank-style platforms), distributed tracing with OpenTelemetry, and visual regression testing with Chromatic, Percy, or Playwright snapshots.

Governance tooling gets real at level 4: a cloud landing zone (AWS Control Tower, Azure Landing Zones), org policies nobody can switch off (SCPs, Azure Policy, GCP org policies), and short-lived credentials via OIDC federation. At level 5 you graduate to policy-as-code engines (OPA, Sentinel, Cloud Custodian) and sandboxed agent runtimes on microVM or gVisor isolation.

And measurement? Start embarrassingly simple: the vendor’s usage dashboard and a shared spend spreadsheet. That’s level 2, and it already beats most teams. Level 3 wants engineering-intelligence platforms (DX, Jellyfish, LinearB) and LLM spend observability (Langfuse, Helicone, or your cloud cost tools), tagged per team and per flow. Level 4 builds the unit-economics dashboard and budget alerts on top. By then you’ll know your cost per merged PR better than your cloud bill. Which, let’s be honest, nobody knows either.

Say what?

The CEO was right: proudly saying you use AI is no longer cool or acceptable. Everyone does. Knowing where you stand is the new flex: seven colors, a named weakest link, and the next gate to clear. Run the matrix with your whole team this month. Lift the lowest color. Repeat.

What’s your team’s weakest dimension? Enter the matrix and tell me what surprised you.

Author

About Martijn Rutten

Fractional CTO & technology entrepreneur with a long history in challenging software projects. CTO at LUMO Labs impact VC. Former CTO of scale-up Insify, changing the insurance space for SMEs. Former CTO of fintech scale-up Othera, deep in the world of securitized digital assets. Coached many tech startups and corporate innovation teams at HighTechXL. Co-founded Vector Fabrics on parallelization of embedded software. PhD in hardware/software co-design at Philips Research & NXP Semiconductors.