The first 100 verdicts: ramp behavior of an autonomous SOC

Most of what gets written about AI SOC products skips the part where you actually deploy one. The marketing arrives clean and steady-state: agent runs, agent closes alerts, analysts have more time. In the real world, the first thirty days are messy in a way that is predictable, fixable, and almost never described.

I've watched the first 100 verdicts of nineteen deployments now. The shape of week one is consistent enough to draw. Here's the curve.

The verdicts-vs-confidence picture

Plot two series across the first 100 verdicts in chronological order. The first is the agent's verdict (close, suspend, escalate). The second is the agent's confidence on each verdict. You expect to see calm equilibrium: roughly 70% close, 10% suspend, 20% escalate, with confidence clustering above the policy floor on the closes and below it on the escalates.

That's not what you see in week one.

What you see is a three-phase ramp. The verdicts and the confidence values both move, in opposite directions, and the equilibrium they're heading for isn't visible until verdict ~70 or so.

// Phase 01

Over-escalation

verdicts 1–25 · escalation 40–60%

// Phase 02

FP surge

verdicts 25–55 · reversal spike

// Phase 03

Equilibrium

verdicts 55–100 · steady-state

Phase one — over-escalation (verdicts 1–25)

The agent has no baseline yet. It doesn't know which OAuth grants are normal for this customer. It doesn't know which service principals run nightly jobs. It doesn't know that the IT team uses an obscure remote-management tool every Tuesday. So almost everything looks slightly suspicious.

Confidence in this phase clusters in the 0.55–0.75 range — below the confidence floor for autonomous closure on most of the alert classes the agent sees. The result is an escalation surge. We routinely see 40–60% escalation rates in the first 25 verdicts, against a steady-state of around 15–25%.

The SOC lead's instinct here is to call this a failure and turn the agent off. The right move is the opposite: every escalation in this phase is a training event for the baseline, and the alert classes that are escalating most often are also the ones the agent is most uncertain about. Every escalation that comes back as "expected, benign, here's why" is worth more than ten correct autonomous closes in this phase.

Phase two — false-positive surge (verdicts 25–55)

The agent has now built enough baseline to be brave. Confidence climbs into the 0.75–0.88 range. Closure rate climbs from 30% to 60% across this window. The escalation rate falls.

This is the phase where the failure mode is exactly the opposite of phase one. The agent has learned the shape of normal but not the exceptions to normal, so it confidently closes things that are actually anomalous-but-benign — and, occasionally, anomalous-and-malicious. The verdict reversal rate spikes in this window: not by much in absolute terms, but the trajectory is steep, and if the team isn't watching for it, they'll only notice the elevated reversal rate at the end of the month.

This is the dangerous phase. It's the one that gets vendors fired, because the team sees autonomous closures landing on cases they would have escalated, and they lose trust.

The mitigation is mechanical. For the first 60 verdicts, the agent runs with a confidence floor about 0.08 higher than its steady-state policy. That puts more borderline cases into the suspend or escalate buckets, generating the explicit human signal the agent needs to calibrate, before it earns the right to act on its own judgment.

Phase three — equilibrium (verdicts 55–100)

Sometime between verdict 50 and verdict 70, the curves flatten. Confidence settles into the 0.85–0.94 band on closures and 0.55–0.75 on escalations. Closure rate stabilises wherever the policy floor places it (in our deployments, 70–82%, depending on the customer's alert mix). The verdict reversal rate drops to its steady-state range, usually under 2%.

By verdict 100, the agent is doing what the demo promised. The team has stopped watching every closure and started watching the dashboard. The conversation in the standup shifts from "what did the agent do last night" to "what's still on the human queue and why."

This is the equilibrium most vendors describe as the day-one experience. It is, in fact, the verdict-70-plus experience.

Three things we instrument from verdict 1

Baseline saturation curve. A running count of unique service principals, user identities, source IPs, parent processes, and SaaS apps the agent has seen. The slope of this curve flattens around verdict 40–60; that flattening is the leading indicator that the agent has enough baseline to ease the confidence-floor uplift.

Phase-aware reversal rate. We compute the reversal rate over a rolling 25-verdict window, not a 30-day window, for the first 100 verdicts. The 30-day window hides phase two; the 25-verdict window doesn't.

Per-class first-touch flag. Every time the agent sees an alert class for the first time, the verdict is flagged for human review regardless of confidence. This is non-negotiable for the first 100 verdicts and gradually relaxes after. The cost is a small amount of human work in the first month; the benefit is no class-level surprises baked into the long-tail accuracy three months later.

What customers should expect

Three things, in order of likelihood that the team will get caught off guard by them.

Verdict 25 will look worse than verdict 1. Closure rate will be lower, escalation rate will be higher, and the team will be doing more work than they did before the agent shipped. This is not a regression. It is the agent recording its first 25 expensive lessons in public.

The phase-two FP spike is real. Plan for a 0.5–1.5% reversal rate over verdicts 30–55, against a steady-state of under 2%. If your SOC dashboard rolls this up monthly, you won't see the spike at all; you'll just see the elevated average and miss the temporal pattern.

Equilibrium takes longer than the pilot. A two-week pilot ends in phase two, not phase three. Most vendor pilots end in phase two, which is one reason "we tried it and it was noisy" is the most common piece of feedback in the category. The honest answer is: you stopped watching at verdict 47.

Why this matters for vendor evaluation

If your vendor doesn't have a ramp playbook — confidence-floor uplift, baseline saturation curve, per-class first-touch flag, phase-aware reversal rate — they don't deploy enough customers to know that the first 100 verdicts are different. That's a real signal.

Ask them. If the answer is "we don't really see a ramp, it just works," they're describing the demo. The honest version of this product is: it works after about 100 verdicts. The first 100 are an investment. Plan for them.

Companion to Closure rate is a vanity metric and What an SLO for a SOC actually looks like.