Measuring What Matters: Code Review Velocity Without Goodhart's Law

Engineering managers who want to improve their code review process face a measurement problem almost immediately. The outcome we care about is review quality: the degree to which review catches real problems before they ship. That outcome is hard to measure directly. What's easy to measure is review speed and activity: time from PR opened to first review, time from PR opened to merge, number of review comments, comment resolution rate. These are proxies. The question worth asking carefully is whether optimizing for the proxy improves the underlying outcome, or whether it just teaches the team to move the number in ways that don't help.

Goodhart's Law states, roughly, that when a measure becomes a target, it ceases to be a good measure. Code review metrics are particularly susceptible to this failure mode, because the behaviors that improve the number are often exactly the behaviors that reduce review quality. Fast merge times can mean better process, or they can mean people stopped actually reading the diff. More comments per PR can mean thorough reviews, or it can mean low-signal nitpicking. Knowing which is which requires looking at a constellation of signals rather than optimizing a single metric.

Time-to-first-review: useful with major caveats

Time-to-first-review (TTFR) measures how long a PR sits waiting for its first substantive reviewer interaction. This is one of the more defensible metrics because it directly measures a failure mode with real cost: PRs that sit for 24-48 hours before anyone looks at them create context-switching overhead for the author, introduce the possibility of merge conflicts, and slow down the development cycle in a way that's directly felt by the team.

The problem comes when teams start optimizing TTFR directly. Consider an engineering team at a growing infrastructure software company with 12 developers. They implemented a TTFR target of under four hours. Within a month, they'd hit the target reliably. They'd also developed a new pattern: reviewers would open a PR, leave a single "LGTM — looks good, a few minor things inline" comment, and approve. Comments were superficial. The TTFR number looked great. The review quality had degraded substantially, because reviewers were providing a response rather than a review.

The fix is pairing TTFR with a quality signal: post-merge defect rate (bugs filed within 30 days of merge that trace back to a specific PR), or the ratio of blocking vs. non-blocking comments. Neither is trivial to collect, but they make the TTFR number meaningful rather than gameable.
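As a rough illustration of that pairing, the sketch below reports median TTFR next to the share of comments marked blocking, so one number is never read without the other. The `PullRequest` record and its field names are hypothetical; a real export from your SCM will look different, and deciding which comments count as blocking is left to your own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical record shape; real fields depend on your SCM export.
    opened_at: datetime
    first_review_at: datetime
    blocking_comments: int
    non_blocking_comments: int


def median_ttfr_hours(prs):
    """Median hours from PR opened to first review."""
    waits = [(pr.first_review_at - pr.opened_at) / timedelta(hours=1) for pr in prs]
    return median(waits)


def blocking_ratio(prs):
    """Share of all review comments that were marked blocking."""
    blocking = sum(pr.blocking_comments for pr in prs)
    total = blocking + sum(pr.non_blocking_comments for pr in prs)
    return blocking / total if total else 0.0


# Made-up sample data, for illustration only.
t0 = datetime(2024, 1, 8, 9, 0)
sample = [
    PullRequest(t0, t0 + timedelta(hours=2), 1, 3),
    PullRequest(t0, t0 + timedelta(hours=3), 0, 2),
    PullRequest(t0, t0 + timedelta(hours=10), 2, 2),
]

ttfr = median_ttfr_hours(sample)   # 3.0 hours
ratio = blocking_ratio(sample)     # 0.3
print(f"median TTFR: {ttfr:.1f}h, blocking ratio: {ratio:.2f}")
```

A falling TTFR alongside a blocking ratio that collapses toward zero is the pattern described above: responses are getting faster while reviews are getting thinner.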

Review depth metrics and what they actually measure

Review depth metrics — comments per PR, comment resolution cycles, lines reviewed per hour — try to measure how thoroughly a reviewer engaged with the code. They're useful as diagnostic signals for individual outliers but dangerous as team targets.

Comments per PR has an obvious failure mode: engineers write more comments to hit the number, and the additional comments are noise. Nit-level style feedback ("I'd rename this variable to be clearer") is easy to produce at volume. The comments that matter ("this race condition will trigger when two requests hit this path simultaneously") are rare and don't scale with effort. More comments do not mean a better review.

Lines reviewed per hour is a metric that sounds useful but has almost no validity as an absolute number. A reviewer working through a 50-line change to a complex concurrent system will rightfully take an hour. A reviewer working through a 500-line mechanical refactor can move quickly without missing anything important. The relationship between lines-per-hour and review quality runs in opposite directions depending on the nature of the code being reviewed.

What does carry signal is trend over time for the same reviewer on the same codebase. If a reviewer's average engagement time per 100 lines of logic change drops substantially over a quarter, that's worth investigating. Is the code getting simpler? Is the reviewer moving faster because they know the codebase better (good) or because they've stopped engaging carefully (not good)? The metric is a prompt for a conversation, not a verdict.
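One way to turn that trend into a prompt is sketched below. It assumes you can obtain, per review, the minutes a reviewer spent and the number of logic lines changed; both inputs are assumptions, since most tooling does not record engagement time directly. The 30% threshold is an arbitrary illustrative value, not a recommendation.

```python
def minutes_per_100_lines(reviews):
    """reviews: list of (minutes_spent, logic_lines_changed) tuples."""
    total_minutes = sum(minutes for minutes, _ in reviews)
    total_lines = sum(lines for _, lines in reviews)
    return 100 * total_minutes / total_lines if total_lines else 0.0


def relative_drop(previous, current):
    """Fractional decrease from previous to current; negative means an increase."""
    return (previous - current) / previous if previous else 0.0


# Hypothetical threshold: flag a drop of 30% or more between quarters.
DROP_THRESHOLD = 0.3

# Made-up data for one reviewer on one codebase.
q1_reviews = [(30, 120), (45, 180)]   # 75 minutes over 300 lines
q2_reviews = [(20, 150), (25, 150)]   # 45 minutes over 300 lines

q1_rate = minutes_per_100_lines(q1_reviews)   # 25.0
q2_rate = minutes_per_100_lines(q2_reviews)   # 15.0
drop = relative_drop(q1_rate, q2_rate)        # 0.4

needs_conversation = drop >= DROP_THRESHOLD
print(f"drop: {drop:.0%}, worth a conversation: {needs_conversation}")
```

The flag deliberately says nothing about why the rate fell. It only identifies where to ask the question.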

Cycle time: the executive dashboard number that misleads most

Cycle time — usually measured as time from first commit to production deploy — is the DORA-adjacent metric that engineering leadership most frequently asks to improve. There's nothing wrong with caring about cycle time; long cycle times compound the costs of batch delivery, context switching, and feedback delay. The problem is that code review latency is only one component of cycle time, and it's a component with genuine quality trade-offs that get obscured when the whole number is the target.

A team that reduces code review latency by lowering review standards will show improved cycle time in the short term. The regressions this produces will show up in incident count, rollback frequency, and engineer time spent on post-merge fixes — metrics that aren't typically attributed back to the review process change that caused them. This creates a classic delayed-feedback problem: the intervention looks successful, the root cause of the regression is misdiagnosed as implementation bugs rather than review gaps, and the process change that caused the problem is never reversed.

The Accelerate research on elite software delivery organizations (the DORA State of DevOps work) is useful context here. High performers have both faster cycle times and lower change failure rates. The two metrics move together for high performers because the underlying enablers — small batch sizes, good test coverage, deployment automation — improve both simultaneously. Teams that try to improve cycle time by cutting review quality will diverge from this pattern: faster deploys with higher failure rates, which is not the destination.

Metrics that have genuine signal

The metrics worth tracking in a code review process are those that measure outcomes at the system level rather than reviewer behavior at the interaction level. Post-merge defect rate, the number of bugs filed within a defined window after merge that are attributable to a specific PR, is the most direct measure of review effectiveness. It's imperfect (not every defect surfaces as a bug report, and attributing a defect to a specific PR requires tracing), but it points in the right direction. A team whose post-merge defect rate is trending down while cycle time is stable has evidence that their review process is improving. A team whose cycle time is improving while post-merge defect rate is climbing has evidence of a quality-speed trade-off that will eventually catch up with them.
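A minimal sketch of the defect-rate calculation follows, assuming the attribution work has already been done and each defect is tagged with the PR it traces back to. The input shapes (`merges` as a dict of PR id to merge date, `defects` as a list of pairs) are hypothetical, and the 30-day default mirrors the window mentioned earlier.

```python
from datetime import date, timedelta


def post_merge_defect_rate(merges, defects, window_days=30):
    """Defects filed within the window after merge, per merged PR.

    merges:  {pr_id: merge_date}
    defects: list of (pr_id, filed_date), already attributed to a PR
    """
    window = timedelta(days=window_days)
    in_window = sum(
        1
        for pr_id, filed in defects
        if pr_id in merges and timedelta(0) <= filed - merges[pr_id] <= window
    )
    return in_window / len(merges) if merges else 0.0


# Made-up sample data, for illustration only.
merges = {
    101: date(2024, 3, 1),
    102: date(2024, 3, 5),
    103: date(2024, 3, 10),
    104: date(2024, 3, 12),
}
defects = [
    (101, date(2024, 3, 15)),  # 14 days after merge: counted
    (103, date(2024, 5, 1)),   # 52 days after merge: outside the window
    (104, date(2024, 3, 20)),  # 8 days after merge: counted
]

rate = post_merge_defect_rate(merges, defects)  # 2 defects / 4 PRs = 0.5
print(f"post-merge defect rate: {rate:.2f} per merged PR")
```

Computing this per month or per quarter and plotting it next to cycle time makes the divergence described above visible.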

Review coverage — the percentage of PRs that received at least one substantive reviewer comment (not just a drive-by LGTM) — is a proxy for engagement quality that's harder to game than comment count. It doesn't measure whether the comments were good, but it identifies the failure mode of perfunctory approval without engagement.
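The sketch below computes review coverage under a deliberately crude assumption about what "substantive" means: a comment counts if it is not a stock approval phrase and has at least five words. Both the phrase list and the word threshold are hypothetical stand-ins; a team would likely want manual labels or a better classifier.

```python
# Hypothetical list of stock approval phrases; extend to match your team's habits.
RUBBER_STAMPS = {"lgtm", "looks good", "ship it", "+1"}


def is_substantive(comment, min_words=5):
    """Crude heuristic: not a stock approval phrase and at least min_words long."""
    text = comment.strip().lower().rstrip(".!")
    return text not in RUBBER_STAMPS and len(text.split()) >= min_words


def review_coverage(prs):
    """prs: list of PRs, each a list of reviewer comment strings."""
    if not prs:
        return 0.0
    covered = sum(1 for comments in prs if any(is_substantive(c) for c in comments))
    return covered / len(prs)


# Made-up sample data, for illustration only.
sample_prs = [
    ["LGTM!"],
    ["LGTM", "This loop skips the last element when n is even."],
    [],
    ["Looks good."],
]

coverage = review_coverage(sample_prs)  # 1 of 4 PRs has a substantive comment
print(f"review coverage: {coverage:.0%}")
```

Because the heuristic only asks whether any one comment clears the bar, padding a PR with extra nits does not raise the score, which is what makes coverage harder to game than comment count.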

Reviewer assignment concentration is underused as a signal. When the same two engineers are reviewing the majority of PRs from a large team, you have a knowledge bottleneck and a sustainability risk. Spreading review responsibility across the team builds shared ownership of the codebase and reduces the cognitive load on any individual reviewer. Tracking it as a team health metric is low-cost and surfaces a problem that's invisible in productivity dashboards.
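Concentration can be approximated with a very small calculation: the share of all reviews handled by the busiest k reviewers. The sketch below assumes one reviewer name per completed review as input; the names and the choice of k = 2 are illustrative only.

```python
from collections import Counter


def top_reviewer_share(reviewers, k=2):
    """Fraction of all reviews done by the k most active reviewers.

    reviewers: list with one reviewer name per completed review
    """
    if not reviewers:
        return 0.0
    counts = Counter(reviewers)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(reviewers)


# Made-up sample data: ten reviews across four reviewers.
review_log = ["ana"] * 5 + ["bo"] * 3 + ["cy", "dee"]

share = top_reviewer_share(review_log)  # 8 of 10 reviews = 0.8
print(f"top-2 reviewer share: {share:.0%}")
```

What counts as too concentrated depends on team size, so the number is more useful as a trend than as a threshold.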

The measurement trap to avoid

The most common mistake in engineering process measurement is picking the metric that's easy to instrument and working backward to justify it as meaningful. Dashboards are built around what's available in the SCM API: PR count, merge time, comment count, approval count. These are available because they're generated by the tooling, not because they're the right things to optimize for.

The right approach is to start from the question: what are we actually trying to prevent or improve? If the answer is "we want fewer production incidents caused by code that shouldn't have been merged," the metric is post-merge defect attribution, which requires some manual work to collect accurately. If the answer is "we want authors to spend less time waiting for reviews so they can iterate faster," the metric is TTFR with a reviewer engagement qualifier. If the answer is "we want to prevent one or two engineers from becoming bottlenecks," the metric is review assignment distribution.

Tracking the wrong metric precisely is not better than tracking nothing. Whatever behavior moves the metric is the behavior you're selecting for, and that behavior will drift toward the metric rather than the underlying goal. That drift is the core of Goodhart's Law, and it's particularly costly in code review because the failure modes (bugs in production, engineer burnout from reviewing too much, knowledge silos) are real and expensive.

Metrics are most useful when they're treated as diagnostic probes rather than performance targets. Point them at the outcomes that matter. Watch for divergence between the metric and the outcome. And when they diverge, trust the outcome.

