Closure rate is a vanity metric — TandemTrace [DRAFT]

Every vendor pitch deck has a closure rate slide. "Our agent closes 78% of alerts autonomously." "Ours closes 84%." "Ours closes 91%." The numbers are presented like they're comparable. They aren't, and the comparison is the wrong question anyway.

Closure rate measures one thing: how often the agent is willing to commit to a verdict. It does not measure whether the verdict was right. It does not measure what the rest of the system did with the work the agent didn't close. It does not measure whether the team is better off having deployed the agent at all.

Four metrics do. Here's the scorecard we run internally and the formulas behind each one.

1. Verdict reversal rate

Formula: (cases closed by the agent that were later reopened by a human or by automation) / (total cases closed by the agent), rolling 30-day window.

This is the closest thing to a true "was the agent right" signal that exists in production. A closed case that gets reopened — because a human reviewed the daily sample, because the same indicator fired again three days later, because a customer pushed back, because a quarterly audit flagged it — is the strongest evidence available that the original verdict was wrong.

The healthy operating range we see across deployments is between 0.3% and 2%. Above 5%, the agent is closing too aggressively and the confidence floor needs to come up. Under 0.3%, the agent is closing too conservatively: it's leaving easy verdicts on the queue for humans to handle, and the team isn't getting the lift they paid for.
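As a rough sketch of how the formula and the bands above could be computed: the ClosedCase record and its field names are hypothetical, the window is assumed to select cases by close time, and the "watch" label for the 2% to 5% range is our own placeholder rather than a band named in the scorecard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class ClosedCase:
    # Hypothetical record shape; field names are assumptions, not a real schema.
    closed_at: datetime
    reopened_at: Optional[datetime] = None  # None while the verdict still holds


def reversal_rate(cases: Sequence[ClosedCase], now: datetime,
                  window_days: int = 30) -> float:
    """Share of agent-closed cases in the trailing window that were reopened."""
    cutoff = now - timedelta(days=window_days)
    # Assumption: a case belongs to the window if it was closed inside it.
    in_window = [c for c in cases if cutoff <= c.closed_at <= now]
    if not in_window:
        return 0.0
    reopened = sum(1 for c in in_window if c.reopened_at is not None)
    return reopened / len(in_window)


def classify(rate: float) -> str:
    """Map a reversal rate (as a fraction) onto the bands described above."""
    if rate > 0.05:
        return "too aggressive"
    if rate < 0.003:
        return "too conservative"
    if rate < 0.02:
        return "healthy"
    return "watch"  # assumed label for the unnamed 2%-5% range
```

Because reopenings arrive late, the rate for any given window keeps rising after the window ends, so it is worth recomputing past windows rather than freezing them.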

The trap is that reversal rate is a lagging indicator. A case can sit closed for weeks before someone flags it. So we pair it with a sampled quality review: 1% of closed cases are pulled at random every week and re-investigated by a Tier 2 analyst. The "manual reversal rate" on that sample is the leading indicator that the lagging metric is about to move.
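The weekly pull can be sketched in a few lines. This is a minimal illustration, assuming case IDs are available as strings and that a review produces a simple overturned/upheld flag; the function names are ours, and the "at least one case" floor is an assumption for low-volume weeks.

```python
import math
import random
from typing import Optional, Sequence


def weekly_sample(case_ids: Sequence[str], rate: float = 0.01,
                  seed: Optional[int] = None) -> list[str]:
    """Pick a uniform random sample of closed cases for Tier 2 re-investigation."""
    if not case_ids:
        return []
    # Assumption: always review at least one case, even in a quiet week.
    k = max(1, math.ceil(len(case_ids) * rate))
    return random.Random(seed).sample(list(case_ids), k)


def manual_reversal_rate(overturned: Sequence[bool]) -> float:
    """Fraction of sampled cases where the analyst overturned the agent's verdict."""
    if not overturned:
        return 0.0
    return sum(overturned) / len(overturned)
```

Passing a fixed seed makes a given week's sample reproducible for audit; leaving it unset gives a fresh draw each run.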

2. Handoff fidelity score

Formula: for each escalated case, give the human only the agent's handoff packet and ask them to reach a verdict from that alone. Score 1 if it matches the eventual ground-truth resolution, 0 otherwise. Average across all escalated cases in a rolling 30-day window.

This is the metric that nobody publishes because it's painful to measure honestly. It asks: if your agent escalates to me, can I do the job from what you've handed me, or do I have to redo the investigation?

A handoff fidelity of 0.9+ means the agent's packet contains enough signal that the human is genuinely picking up where it left off. Below 0.7 means the human is starting over, and the agent's escalations are creating work rather than concentrating it.
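The scoring itself is simple once the packet-only verdicts have been collected. A minimal sketch, assuming verdicts are comparable strings; the HandoffTrial type is hypothetical, and the "in between" label for scores from 0.7 up to 0.9 is a placeholder of ours.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HandoffTrial:
    # Hypothetical fields: the verdict a human reached from the packet alone,
    # and the eventual ground-truth resolution of the same case.
    packet_only_verdict: str
    ground_truth: str


def handoff_fidelity(trials: Sequence[HandoffTrial]) -> float:
    """Mean of per-case scores: 1 if the packet-only verdict matched, else 0."""
    if not trials:
        raise ValueError("no escalated cases scored in this window")
    matches = sum(1 for t in trials if t.packet_only_verdict == t.ground_truth)
    return matches / len(trials)


def interpret(score: float) -> str:
    """Read a fidelity score against the 0.9 and 0.7 marks described above."""
    if score >= 0.9:
        return "human picks up where the agent left off"
    if score < 0.7:
        return "human is starting over"
    return "in between"  # assumed label; the text does not name this range
```

The expensive part is not this arithmetic but the discipline of withholding everything except the packet from the reviewer.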

The fidelity score is the single most diagnostic metric for whether your AI SOC is actually integrated with your team or running in parallel to them. Closure rate goes up when the agent gets more aggressive. Fidelity score goes up when the agent gets more honest.

3. Coverage gap latency

Formula: for each newly-published detection rule, MITRE technique, or vendor advisory, measure the time from publication until the agent can detect it in your environment. Median across the trailing quarter.

Closure rate tells you about the alerts you saw. Coverage gap latency tells you about the alerts you didn't see, and how long that condition persisted.

This is a hard metric to put a number on, so we publish it as a distribution rather than a single value. Our internal target is a P50 under 7 days and a P90 under 21 days. Past 30 days, the technique is functionally invisible in your environment, no matter what your other metrics look like.
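One way to turn a quarter's worth of gaps into the P50 and P90 figures is sketched below. It assumes latencies are measured in whole days from publication date to first-coverage date and uses a nearest-rank percentile, which is one of several common definitions. Items that are still uncovered have no latency yet and would need separate handling, which this sketch leaves out.

```python
import math
from datetime import date
from typing import Sequence


def gap_days(published: date, covered: date) -> int:
    """Days between an item being published and the agent first covering it."""
    return (covered - published).days


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no latencies to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(latencies: Sequence[float],
                 p50_max: float = 7, p90_max: float = 21) -> bool:
    """Check the P50-under-7-days and P90-under-21-days targets from the text."""
    return (percentile(latencies, 50) < p50_max
            and percentile(latencies, 90) < p90_max)
```

Reporting both percentiles matters because a single slow technique family can leave the median untouched while pushing the P90 well past target.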

This metric also catches a class of failure that no closure-rate dashboard ever catches: the agent silently has no detection for an entire technique family, the alerts never fire, and the dashboard looks healthy because nothing escalates from a thing that never happens.

4. Long-tail accuracy

Formula: per-class verdict reversal rate, restricted to alert classes seen fewer than 10 times in the trailing quarter.

The agent's overall accuracy is dominated by the alert classes it sees thousands of times a day. The classes that actually matter — credential stuffing against a SaaS app you onboarded last month, the new MITRE technique that landed in February, the OAuth abuse pattern that's specific to one of your three identity providers — are the rare ones.

Overall closure rate hides this completely. Long-tail accuracy doesn't. If the rare-class reversal rate is much worse than the common-class rate, the agent has learned to handle the queue's center of gravity and is fumbling the cases where errors are most expensive. This is the opposite of what you want from an autonomous layer.

Our internal alarm threshold is a long-tail reversal rate that is more than 3× the overall reversal rate. Above that, we treat the agent as untrustworthy on rare classes and route them through human review automatically.
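A sketch of the 3× check, under two assumptions that are ours: "seen" is approximated by the number of agent-closed cases per class in the quarter, and the rare classes are pooled into one long-tail rate rather than reported class by class. The function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def long_tail_report(closed: Iterable[Tuple[str, bool]],
                     rare_below: int = 10,
                     alarm_multiple: float = 3.0) -> dict:
    """closed: (alert_class, was_reopened) per agent-closed case, trailing quarter."""
    totals = defaultdict(int)
    reopened = defaultdict(int)
    for alert_class, was_reopened in closed:
        totals[alert_class] += 1
        reopened[alert_class] += int(was_reopened)

    n_all = sum(totals.values())
    if n_all == 0:
        raise ValueError("no closed cases in the window")
    overall = sum(reopened.values()) / n_all

    # Rare classes: fewer than `rare_below` cases in the window.
    rare = sorted(c for c, n in totals.items() if n < rare_below)
    n_rare = sum(totals[c] for c in rare)
    long_tail = sum(reopened[c] for c in rare) / n_rare if n_rare else 0.0

    return {
        "overall": overall,
        "long_tail": long_tail,
        "rare_classes": rare,
        # Alarm when the pooled rare-class rate exceeds 3x the overall rate.
        "alarm": long_tail > alarm_multiple * overall,
    }
```

With so few cases per rare class, the long-tail rate is noisy, so the returned list of rare classes is what would drive the automatic routing to human review.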

What we put on the dashboard

These four metrics sit on one row at the top of the SOC dashboard, each with its comparison to last quarter. Everything else (closure rate, MTTR, alert volume, analyst utilisation) sits below the fold. The four answer the only four questions that matter:

Are the agent's verdicts holding up? (reversal rate)
Is the agent making the human's job easier or harder? (fidelity)
Is the agent's blind spot growing? (coverage gap latency)
Is the agent good where it counts? (long-tail accuracy)

We also publish these to customers monthly. The closure rate is in the appendix.

What to ask your vendor

If you're evaluating an AI SOC product, ask for these four metrics by name. Not their definitions — ask to see the numbers, on their actual customer data, anonymised. A vendor that has these instrumented can pull them up in minutes. A vendor that doesn't will offer to follow up; they're going to invent the measurement after the call.

Closure rate is the most-cited number in this category and the least useful. The four above are how a serious SOC lead actually scores an autonomous layer. They're also how the agent's own engineering team should be scoring itself — if they aren't, the agent will drift in directions that look good on the wrong scoreboard.

See also "What an SLO for a SOC actually looks like" for the broader SLO framework these four sit inside.