# Tool-use stability is the real moat, not the model

**Category:** Engineering · **Author:** A. Tomé · **Date:** 2026-05-17 · **Reading time:** 6 min · **Tags:** engineering, integrations, broker, reliability

> *Hidden draft. Not yet published.*

Every demo of an AI SOC looks the same. The agent reads an alert, fans out to three integrations, comes back with a verdict, writes a tidy paragraph. Everyone in the room nods. Everyone in the room is also being shown a happy path that we, the engineers, spent a quarter making look effortless.

The actual work is in the broker layer between the agent and the integrations. The model gets the credit; the broker carries the weight. If you're evaluating a vendor and the conversation never leaves the model, you're being shown a demo, not a product.

## Where production actually breaks

Three real failure modes from the last six months. Vendors anonymised; the shape is the point.

### The field rename

A major EDR vendor renamed a field in their alert API. The old key was `process.parent.commandLine`. The new key was `process.parent.command_line`. They shipped both for a release, then deprecated the old one quietly in a minor version.

Every agent in the field that flattens EDR alerts into its prompt context lost the parent command line for two weeks. The agents kept running. Verdicts kept being issued. Closure rates stayed stable. The only signal anything was wrong was a slow, almost imperceptible drift in the *quality* of escalations — they were arriving with thinner evidence than usual, but no individual case looked broken.

The fix is not "tell the model to handle both keys." The fix is a contract test in the broker that runs against the vendor's sandbox every hour and fails loudly when the schema changes. The fix is also a freshness check on every integration call: if a known field is suddenly null on 100% of recent responses, page the on-call engineer, not the SOC analyst.

### The pagination drift

An identity provider changed the default page size on their audit log API from 1000 to 100. The breaking change was in the docs. The migration guide gave 90 days notice. We were not on the announcement list because the integration was registered to an engineer who had left a year prior.

The agent's hunting workflows kept running, but on roughly 10% of the historical window they thought they had. Coverage gaps opened silently. The detection that used to cover the last 24 hours of OAuth grants was covering the last 2.4 hours.

The fix is to never trust an integration to define its own pagination. The broker pages explicitly, asserts the total record count against a `Has-More` header, and refuses to claim "complete" if the totals don't match.

### The null vs empty array

A widely-used EDR's process search endpoint changed its behaviour for zero-result queries. Old: returned `[]`. New: returned `null`. JSON parsers handled both fine. The agent's prompt template included a line "Here are the matching processes: {{results}}." When `results` was `null`, the template rendered as "Here are the matching processes: null."

Half the time, the model interpreted "null" as "I don't know" and triggered an escalation. The other half, it interpreted it as "no results" and closed. The behaviour was non-deterministic across runs. Two thousand alerts went through this code path before a customer noticed the inconsistency in a quarterly review.

The fix is that no integration response touches the model's prompt window directly. Every response goes through a typed shim that normalises absences, empties, and errors into one of three explicit states — `not_found`, `not_applicable`, `tool_error` — and the prompt template handles each state explicitly.

## Why this is the moat

Anyone with a credit card can call a frontier model. The differentiation in agentic SOC products is not which model you use; it's how robust your broker layer is to the things the model never sees. Schema drift, partial pages, silent timeouts, rate limits, ambiguous nulls, dual-write inconsistencies — these are the daily reality of running an agent against fifteen vendor APIs, and they are the daily reality even before you consider that half of those vendors are going to ship a breaking change in the next quarter.

The team that figures this out is the team that survives a year in production. The team that doesn't ships a beautiful demo, gets to first revenue, then watches their closure rate erode silently as the world drifts underneath them.

## What good looks like

A few concrete patterns we run internally. None of this is novel; all of it is unglamorous, and the absence of any of it is a tell.

**Contract tests against every integration.** Run hourly against the vendor's sandbox. Assert the schema, the pagination behaviour, the rate limit, the auth flow, the timeout. Fail loudly. The test fixtures are versioned alongside the integration code; when a vendor renames a field, the test fails before the customer alert pipeline does.

**Typed response shims.** Every integration response is normalised into a strict internal type before it touches the agent. Absences are explicit. Errors are explicit. Empty results are explicit. The prompt template never sees a raw vendor response, ever.

**Replay harness.** Every case the agent has ever worked is replayable against the current broker. When you upgrade the EDR integration, you replay the last 90 days of cases through it and look at the diff. If the diff is non-zero, you know exactly which cases changed verdict, and you can decide whether the change is improvement or regression *before* you push to production.

**A freshness floor on every integration.** If a known-good field is suddenly null on 99% of recent responses, the broker raises a P1. The agent doesn't get the option to silently start producing degraded verdicts.

**Per-tenant tool versions.** When a vendor ships a breaking change, we move one tenant at a time. Customers do not all hit the new integration on the same day. The blast radius of a bad upgrade is one customer for one hour, not every customer for a week.

## What to ask a vendor

If you're evaluating an AI SOC product, three questions cut to whether they've done this work or not.

1. **"When did your largest integration last ship a breaking change, and how did you find out?"** A confident answer cites a specific recent event, the contract test that caught it, and the hours of lag between vendor change and broker fix. A nervous answer pivots to talking about the model.

2. **"Show me your replay harness."** If they don't have one, they cannot tell you with confidence what the agent would do today on a case it handled six months ago. That gap matters more than benchmarks.

3. **"What happens to my verdicts when your integration to EDR Vendor X goes down for an hour?"** The right answer is "every affected case is marked as `tool_error` and held, none are silently auto-closed." The wrong answer is hand-waving about fallbacks.

The model is the part of the system that gets discussed. The broker is the part of the system that determines whether the agent is trustworthy in production. Get this right and the model upgrades feel like free improvements. Get it wrong and the agent is a lottery ticket whose payout silently changes when its dependencies move.

---

*Engineering questions: <a href="/contact">hello@tandemtrace.ai</a>.*
