A few weeks ago I ran one of the AI automations I use to speed up a chunk of my own work. It started generating immediately. The output was structured, the tone was close, the length was right. For the first forty seconds I thought: this is it.

I should have squinted sooner. The reasoning in the third section was circular. The example I'd needed it to anchor on was gesturally present but never actually used. A phrase appeared three times in slightly different forms, which is what a model does when it doesn't have enough concrete material and fills the gap with variations on the same idea. I was five or six iterations away from something I'd feel comfortable sending to someone who would actually scrutinise it. The output had the shape of good work. It wasn't good work yet.

I've been as guilty as anyone of assuming AI could run before it had learned to walk.


A&O Shearman cut twenty business services roles in London in May, framing it as part of a "tech-driven operations" push following years of AI investment. They're not alone. Headcount reductions and AI investment announcements are arriving in the same breath, as though deploying the tool were sufficient proof that the tool works. The confidence is striking. The evidence underpinning it is thinner than the press releases suggest.

ACC data from early 2026 puts corporate legal AI adoption at 52%, up from 23% the prior year. Only 7% of teams have seen any reduction in total matter cost. That is not a technology problem. I think it's a misunderstanding about what AI actually delivers, and about when in its deployment arc it starts delivering it. Adoption has arrived without the outcomes following. Rather than investigating that gap, a significant portion of the market has responded by cutting headcount on the assumption that the AI will cover it, then announcing the next deployment.

The demos don't help. They're genuinely compelling. When you see a configured agent process your NDA stack and the output looks like something your best paralegal would have produced, the temptation is to assume that quality will materialise across your entire contracts function the moment you say go. It won't. Not without the iterative configuration, stress-testing, and oversight that nobody puts in the press release.


The "just deploy it and see" approach fails in a predictable way that has nothing to do with the technology being bad.

An AI agent doesn't start performing at its ceiling the moment it's switched on. It starts at a baseline shaped entirely by how well you've encoded your own context into it: your playbooks, your preferred positions, your risk tolerances, the questions your legal team would ask before anyone else thought to. Without that encoding, the model answers the question in front of it using general legal knowledge. Which is often impressive, often right, and sometimes subtly wrong in a way that only becomes visible when you're familiar enough with the territory to notice. The output passes a casual read but not scrutiny, and nobody flags the gap because the gap looks like a good answer.

The Icertis survey of 1,000+ corporate legal practitioners published this May found that 47% would not detect an incorrect AI action until after it had occurred, sometimes days or weeks later. Nearly 60% claimed they were "prepared" to govern AI agents. Those two numbers are not compatible. The gap between them is the most honest description of where the market actually sits.


What I see in legal ops is a pattern that reliably produces the worst of both worlds. A head of legal ops comes back from an industry event energised. They build a list: NDAs, DPAs, intake triage, vendor agreements, policy queries, outside counsel briefings. The list isn't wrong. The problem is sequencing. Configuring every agent on the list at once means none of them gets the depth it needs, which produces a set of workflows that are "promising but not quite there," and a team of lawyers who've been told to trust the AI and are now quietly not trusting it. You fail at multiple things simultaneously, and it's hard to learn from any of them.

The teams that reach actual production quality got there by doing one thing well before they tried to do two. Pick the highest-volume, most rule-governed, politically visible workflow. NDAs are usually the right starting point: high volume, bounded legal logic, and business users with strong feelings about turnaround time, which means the before-and-after is legible to people outside legal. Deploy a single agent. Give it a dedicated email address. Route real requests through it from week one, not test documents, not sanitised examples. Watch what breaks. Fix it. Watch what breaks again.


What that process actually requires is worth being honest about, because the usual advice stops at "get your hands dirty" without describing what dirty looks like.

Send the agent adversarial inputs, not just the clean documents you expect to see. An NDA where the indemnity clause is buried in a schedule rather than the main body. Counterparty paper with no numbered headings, obligations scattered across multiple sections, a clause that looks standard but imports additional terms from an attached exhibit. The same agreement as a scanned PDF rather than a clean Word document. These are the edge cases that destroy trust at the worst possible moment, when a lawyer reviews the agent's output for the first time and finds it's missed something obvious.
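One way to make that habit repeatable is to keep the awkward documents as a small regression suite and re-run it after every configuration change. The sketch below is illustrative only: the `AdversarialCase` structure, the `run_suite` helper, and the idea that an agent is a function returning a set of flagged issue labels are all assumptions of mine, not any vendor's interface. The `naive_agent` is a deliberately weak stand-in that only reads the main body, so the suite has something to catch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class AdversarialCase:
    """One awkward document plus the issues a reviewer must see flagged (hypothetical schema)."""
    name: str
    document_text: str
    must_flag: FrozenSet[str]


# Assumption: an agent takes document text and returns the issue labels it flagged.
Agent = Callable[[str], Set[str]]


def run_suite(agent: Agent, cases: Iterable[AdversarialCase]) -> Dict[str, List[str]]:
    """Return {case name: issues the agent missed}; an empty dict means a clean run."""
    failures: Dict[str, List[str]] = {}
    for case in cases:
        missed = case.must_flag - agent(case.document_text)
        if missed:
            failures[case.name] = sorted(missed)
    return failures


def naive_agent(text: str) -> Set[str]:
    """Stand-in agent that ignores everything after the first schedule heading."""
    main_body = text.split("SCHEDULE")[0]
    return {"indemnity"} if "indemnif" in main_body.lower() else set()


CASES = [
    AdversarialCase(
        name="indemnity_in_main_body",
        document_text="4. The Recipient shall indemnify the Discloser.",
        must_flag=frozenset({"indemnity"}),
    ),
    AdversarialCase(
        name="indemnity_in_schedule",
        document_text="1. Confidentiality.\nSCHEDULE 2\nThe Recipient shall indemnify the Discloser.",
        must_flag=frozenset({"indemnity"}),
    ),
]

failures = run_suite(naive_agent, CASES)
print(failures)  # {'indemnity_in_schedule': ['indemnity']}
```

The point of the design is that every document that ever embarrassed the agent becomes a permanent case, so a fix for one counterparty's paper cannot silently break another.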

Configure against your own playbook, not a generic one. If your team's position on limitation of liability deviates from market standard for a specific supplier category, that deviation needs to be in the configuration as an explicit rule, with the reasoning and an escalation trigger for the ambiguous cases. The difference between an agent running on your actual positions and one running on a general playbook is often invisible in the output until the edge case where your position would have changed the call and the agent's didn't. That's the moment a lawyer stops trusting the system.
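To show what "an explicit rule, with the reasoning and an escalation trigger" might look like when written down, here is a minimal sketch. Everything in it is hypothetical: the `LiabilityCapRule` fields, the supplier category, and the thresholds (expressed as a multiple of annual fees) are invented for illustration and are not a recommended position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LiabilityCapRule:
    """A playbook position for one supplier category (hypothetical schema)."""
    supplier_category: str
    accept_at_or_above: float  # cap, as a multiple of annual fees, that we accept outright
    reject_below: float        # cap below which we push back without asking a lawyer
    reasoning: str             # why this category deviates from the general position


def decide(rule: LiabilityCapRule, proposed_cap: Optional[float]) -> Tuple[str, str]:
    """Return (action, reason), where action is 'accept', 'reject' or 'escalate'."""
    if proposed_cap is None:
        # The agent could not find or parse the cap: never guess, always escalate.
        return ("escalate", "cap could not be extracted from the document")
    if proposed_cap >= rule.accept_at_or_above:
        return ("accept", rule.reasoning)
    if proposed_cap < rule.reject_below:
        return ("reject", rule.reasoning)
    # Anything between the two thresholds is the ambiguous band.
    return ("escalate", "cap falls in the ambiguous band; a lawyer decides")


# Illustrative values only.
DATA_PROCESSOR_RULE = LiabilityCapRule(
    supplier_category="data_processors",
    accept_at_or_above=2.0,
    reject_below=1.0,
    reasoning="Suppliers handling personal data carry higher exposure than the general playbook assumes.",
)

for cap in (2.5, 1.5, 0.5, None):
    print(cap, decide(DATA_PROCESSOR_RULE, cap)[0])
```

Keeping the reasoning next to the thresholds matters as much as the thresholds themselves: it is what lets the supervising lawyer see why the agent made the call, and what lets the next owner of the configuration change it safely.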

Build measurement in from week one. The escalation rate, the correction rate, the first-pass approval rate, how those numbers move over time. Not because the CFO will ask, though they will, but because those numbers are how you find out whether the agent is improving or quietly getting worse on a specific document type. An agent processing 400 requests per month with a 7% correction rate is a fundamentally different thing from one with a 35% correction rate, and the difference is almost always a configuration gap, not a model limitation.
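For scale, at 400 requests a month a 7% correction rate means 28 outputs a lawyer had to fix, and 35% means 140. The three rates can be computed from a very simple review log. The sketch below assumes a hypothetical `ReviewedRequest` record with one row per request a lawyer has looked at; the field names and the definition of "first pass" (neither escalated nor corrected) are my assumptions and should be adapted to however your team actually records reviews.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ReviewedRequest:
    """One agent output after lawyer review (hypothetical schema)."""
    week: int        # weeks since launch
    doc_type: str    # e.g. "NDA"
    escalated: bool  # agent handed the request to a lawyer
    corrected: bool  # lawyer had to change the agent's output


def weekly_rates(requests: Iterable[ReviewedRequest]) -> Dict[int, Dict[str, float]]:
    """Return escalation, correction and first-pass approval rates per week."""
    by_week: Dict[int, List[ReviewedRequest]] = defaultdict(list)
    for r in requests:
        by_week[r.week].append(r)
    rates: Dict[int, Dict[str, float]] = {}
    for week in sorted(by_week):
        items = by_week[week]
        n = len(items)
        rates[week] = {
            "escalation": sum(r.escalated for r in items) / n,
            "correction": sum(r.corrected for r in items) / n,
            # Assumed definition: approved as-is, with no escalation and no correction.
            "first_pass": sum(not r.escalated and not r.corrected for r in items) / n,
        }
    return rates


def correction_rate_by_doc_type(requests: Iterable[ReviewedRequest]) -> Dict[str, float]:
    """Return the share of requests corrected, split by document type."""
    totals: Dict[str, int] = defaultdict(int)
    corrected: Dict[str, int] = defaultdict(int)
    for r in requests:
        totals[r.doc_type] += 1
        corrected[r.doc_type] += int(r.corrected)
    return {doc_type: corrected[doc_type] / totals[doc_type] for doc_type in totals}


# Made-up sample log, purely to exercise the functions.
LOG = [
    ReviewedRequest(week=1, doc_type="NDA", escalated=False, corrected=True),
    ReviewedRequest(week=1, doc_type="NDA", escalated=False, corrected=False),
    ReviewedRequest(week=1, doc_type="DPA", escalated=True, corrected=False),
    ReviewedRequest(week=1, doc_type="NDA", escalated=False, corrected=False),
    ReviewedRequest(week=2, doc_type="NDA", escalated=False, corrected=False),
    ReviewedRequest(week=2, doc_type="DPA", escalated=False, corrected=False),
]

print(weekly_rates(LOG))
print(correction_rate_by_doc_type(LOG))
```

Splitting the correction rate by document type is what surfaces an agent that is quietly getting worse on one kind of paper while the headline number stays flat.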

Before adding the second use case, talk to the lawyers supervising the first. Not a survey. A conversation. Which document types are generating the most corrections? Are there counterparties whose paper reliably causes problems? Is the escalation trigger firing at the right threshold? The answers tell you whether the first deployment is ready to be a foundation or needs six more weeks of tuning. Most need the six weeks. Most teams move on before they're ready, because the first one looks like it's working, and "looks like it's working" is not the same as "working well enough to build on safely."

This is a substantial amount of ongoing work. It requires someone whose job it actually is, not a legal ops leader fitting it in alongside everything else, not a lawyer who became the AI champion because they seemed enthusiastic in a meeting. The organisations I've seen cycle through pilots without ever converting them into structural outcomes share a pattern: they treated deployment as a project with an end date rather than a system requiring ownership. When the project nominally concluded, nobody was left holding the configuration, the agent drifted as edge cases accumulated uncorrected, and at some point the lawyers quietly stopped trusting it. By the time anyone looked closely, the AI budget window had closed, and the headcount that might have funded a proper deployment had already been cut on the assumption that the AI would cover it.


The deployments that actually produce outcomes, where a year in the agent is processing more volume at higher accuracy than at launch, tend to share one thing: the people running them had learned these lessons on someone else's deployment before they ran yours. They had already seen what breaks, where the general playbook fails, which edge cases the escalation trigger misses. The result is a deployment designed from the start around specific outcomes, with measurement, supervision, and iteration built in rather than bolted on when things go wrong.

The gap between 52% AI adoption and the 7% of teams seeing any reduction in matter cost is not a technology gap. I don't think it's even primarily a sequencing gap, though sequencing matters. It's an expertise gap. The tools work when the work of deploying them has been done properly. That work is learnable, but learning on your own budget means the organisation absorbs the cost of every mistake. The organisations getting outcomes, quietly and without press releases, are the ones who found a shorter path to the expertise they needed rather than building it from scratch while the CFO watches the clock.

Squint at the detail. The distance is deceiving.