TandemTrace
// OPS · FIELD NOTE · 6 minute read

What an SLO for a SOC actually looks like.

Most SOCs measure themselves on MTTR (mean time to resolve) and MTTD (mean time to detect). Both are easy to game and neither tracks what you actually care about. We've spent the last year building a different scorecard with our customers; here's what it looks like.

Why MTTR is a bad metric

Three reasons:

  1. Resolution is ambiguous. Was the case resolved, or just closed because the analyst stopped looking at it? MTTR doesn't distinguish a real fix from a graceful punt.
  2. Averages mask the tail. High-cost incidents always take longer. Averaging them with the routine cases produces a number that looks healthy while the cases you actually care about are getting worse.
  3. Optimising MTTR creates the wrong pressure. Analysts get rewarded for closing fast, not for closing right, and the faster path is often the worse one.
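The second point is easy to see with numbers. A minimal sketch, using entirely made-up triage durations: a queue dominated by routine cases produces a mean that looks healthy even while the handful of high-cost incidents blow out.

```python
# Hypothetical triage durations in hours: 95 routine closes, 5 high-cost incidents.
routine = [0.5] * 95                # routine alerts close in ~30 minutes
high_cost = [40, 48, 60, 72, 96]    # the incidents you actually care about

durations = routine + high_cost
mean = sum(durations) / len(durations)

ranked = sorted(durations)
p95 = ranked[int(0.95 * len(ranked))]  # simple index-based P95, no interpolation

print(f"mean MTTR: {mean:.1f}h, P95: {p95}h")  # → mean MTTR: 3.6h, P95: 40h
```

A sub-4-hour mean and a 40-hour P95 from the same data: the average is the number that goes on the dashboard, and the tail is the number that describes the risk.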

Why MTTD is worse

MTTD is detection time, which sounds important. The problem is what it measures: the gap between the data landing in your tools and the SOC seeing the alert. Most of that delay is in your pipeline, not in the world.

MTTD also says nothing about the alerts you didn't generate. If your detections never fire for a class of attack, MTTD on that class is undefined, and undefined hides in the average. The metric tells you nothing about the most dangerous case in your environment.
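To make the undefined-hides-in-the-average point concrete, here's a sketch with invented per-class detection delays. The class with no detections at all contributes nothing to the mean, so the headline number stays healthy.

```python
# Hypothetical detection delays in minutes, grouped by alert class.
detections = {
    "commodity_malware": [3, 5, 4],
    "credential_stuffing": [8, 7],
    "living_off_the_land": [],   # detections never fired -- the dangerous case
}

all_delays = [d for delays in detections.values() for d in delays]
mttd = sum(all_delays) / len(all_delays)

missing = [cls for cls, delays in detections.items() if not delays]

print(f"MTTD: {mttd:.1f}m")      # → MTTD: 5.4m -- looks healthy
print(f"no signal at all for: {missing}")
```

The aggregate MTTD is 5.4 minutes; the class that should keep you up at night simply isn't in the denominator.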

A better frame

Distinguish two things:

  • Target SLOs. What you commit to. Small in number, hard to game, meaningful enough that missing them triggers a real response.
  • Health SLIs. What you watch. Many of these, organised by alert class. They feed the SLOs but they aren't what you report up.

Four target SLOs we've seen work in practice:

  • P95 time-to-verdict, per alert class.
    What it measures: how long a verdict actually takes for each class, not in aggregate.
    Why it's hard to game: per-class P95 surfaces the bad classes; averages can't hide them.
  • % of high-cost alerts closed with full evidence chain.
    What it measures: whether the alerts that matter are actually being investigated, not just touched.
    Why it's hard to game: evidence chains are auditable; closure alone isn't.
  • % of closes that survive analyst re-review.
    What it measures: whether triage decisions hold up when looked at fresh, days later.
    Why it's hard to game: re-review catches over-aggressive closes; close-fast pressure shows up immediately in the number.
  • Time-to-coverage-gap-detected after a telemetry change.
    What it measures: whether the SOC notices when a data source breaks before someone exploits it.
    Why it's hard to game: coverage gaps don't generate alerts; this is the only metric that surfaces them.
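The first SLO is mechanically simple once verdict timestamps are grouped by class. A minimal sketch, with hypothetical alert classes and minutes-to-verdict values:

```python
from collections import defaultdict

# Hypothetical (alert_class, minutes_to_verdict) records.
verdicts = [
    ("phishing", 12), ("phishing", 9), ("phishing", 15), ("phishing", 11),
    ("lateral_movement", 240), ("lateral_movement", 900),
]

def p95(values):
    """Index-based P95: good enough for a scorecard sketch."""
    ranked = sorted(values)
    return ranked[min(int(0.95 * len(ranked)), len(ranked) - 1)]

by_class = defaultdict(list)
for cls, minutes in verdicts:
    by_class[cls].append(minutes)

for cls, minutes in sorted(by_class.items()):
    print(f"{cls}: P95 time-to-verdict = {p95(minutes)}m")
```

Phishing reports 15 minutes; lateral movement reports 900. One aggregate P95 over all six records would report the same 900 but give you no idea which class it came from or whether the routine classes are fine.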

Per-class, not aggregate

The single most important shift in this frame is doing it per alert class. Aggregate metrics let you game the easy classes to lift the average. Per-class metrics force you to look at the high-cost, low-volume classes where the actual risk lives.
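The gaming dynamic is worth seeing directly. With made-up close counts, a high-volume easy class drags the aggregate up while the class carrying the risk sits at 10%:

```python
# Hypothetical on-time-close counts per class: the easy class dominates volume.
classes = {
    "commodity_phish": {"closed_on_time": 980, "total": 1000},
    "insider_exfil":   {"closed_on_time": 2,   "total": 20},
}

on_time = sum(c["closed_on_time"] for c in classes.values())
total = sum(c["total"] for c in classes.values())
print(f"aggregate on-time close rate: {on_time / total:.0%}")  # → 96%

for name, c in classes.items():
    print(f"{name}: {c['closed_on_time'] / c['total']:.0%}")
```

96% aggregate, 10% on insider exfiltration. Reporting only the first number is how a SOC looks healthy right up until it isn't.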

Most SOCs run on tickets and queues. Tickets don't tell you what the analyst was thinking when they closed them, which is most of what you actually need to know.

This is also why these metrics are hard to run without an autonomous layer doing the bookkeeping. Tracking time-to-verdict per class, evidence-chain completeness per close, and re-review survival rate requires structured per-alert state that most SOCs simply don't have today.

What to stop and start

// STOP REPORTING

  • Aggregate MTTR to your board.
  • Aggregate MTTD on a leadership dashboard.
  • Any single "closure rate" number without an evidence-chain qualifier.

// START REPORTING

  • P95 time-to-verdict on the top three highest-cost alert classes.
  • % of high-cost alerts closed with full evidence chain.
  • Re-review survival rate.
  • Time-to-coverage-gap-detected.

The numbers will look worse on paper at first. That's the right outcome: the old number was a vanity metric, and the new ones describe the reality you've been quietly aware of all along.


If you're shaping a SOC scorecard for the next planning cycle and want to compare notes, we'd be glad to talk.