A few weeks ago I watched a general counsel run a clause past Claude over breakfast. She had a redline from a counterparty, an indemnity carve-out that did not sit right, and forty-five minutes before her first call. She pasted the clause in, asked the model to stress-test it against the original commercial intent, and got something useful back. Not a finished answer. A list of three plausible interpretations of the carve-out, ranked by which one the counterparty was most likely to be reaching for, and a suggested redraft of the version she should accept. She got on with her day.
That afternoon, she spent twenty-five minutes on the phone with her bank, trying to dispute a charge through an AI agent that could not work out, after she had explained it twice, that she was the account holder and not someone called "yesterday."
Two AI interactions, one day, the same handful of underlying models in the background of both. One was sharp enough to change how she spent her morning. The other was bad enough to make her wonder whether anyone at the bank had ever used the thing they had shipped.
What separated those two experiences was not the model. It was everything wrapped around it.
This, I think, is the central question for the next twelve months of legal AI procurement. The interesting variable is no longer the model. It is everything the vendor decided to put around the model, and whether the judgment behind those decisions is any good.
The harness is no longer the hard part
Engineers have a word for the scaffolding that wraps a model and turns it into a product. They call it the harness. The model is the engine. The harness is everything else: the prompts, the routing, the tools the model can call, the guardrails, the logging, the retry logic, the rules that decide when the system asks a human and when it acts on its own.
A year ago, building a harness was a serious engineering project. It took a team and several months. Today, an experienced engineer with the right coding assistant can put together a working harness over a long weekend. Not a toy. A real harness that takes an email, classifies it, drafts a response, and either sends it or queues it for review. I have watched colleagues do it. I have done it myself.
That fact has not really sunk in, certainly not in legal AI procurement. The mechanics of getting an AI product to look impressive in a demo are no longer hard. They are not free, but they are no longer where the value sits. If the harness was the moat, the moat has been drained.
Which leaves the buyer with a question the industry has not started answering openly. If the harness is cheap and the model is shared, what is the buyer actually paying for?
The answer, I find, is judgment. Every AI product is, underneath, an accumulated set of small choices the vendor's team has made about how the system should behave when the easy path runs out: how to handle a request the model is not confident about, when to escalate rather than answer, when to stop entirely, what to remember after the inbox has emptied. None of these choices surface in a demo. All of them surface by week six.
The Monday inbox
It is five o'clock on a Monday morning. The lawyer will be at her desk at nine. By the time she sits down, the inbox of sixty-something requests that accumulated over the weekend will either have been worked through or it will not, and the state of the inbox when she opens it is the first real test of the AI system the team installed six weeks ago.
The usual mix is in there: a few NDAs, a couple of vendor agreements, three queries about template language, an urgent regulatory question from product, a long thread that has been running for two weeks where someone is still chasing a contract back from outside counsel, two thank-you emails, one phishing attempt, and a quiet, well-disguised request from finance that needs to be read carefully because the commercial terms have moved since the team last looked at this customer.
The vendor demoed the system impressively six weeks ago. The team approved it. The pilot group has been running it for a fortnight. Monday morning is the first time the system has met the full weekend backlog at scale, and it has four hours to do something useful with it before anyone is around to help. This is where the small decisions begin to matter, and where two systems built on identical underlying models will produce wildly different products.
It begins with the question of attention. Sixty emails. Some short, some long, several with attachments, two with attachments that are themselves zip files. A naive build reads everything with the same care, spends a small fortune in tokens on the thank-you emails and the phishing attempt, and arrives at the urgent regulatory question with the same flat attention it gave to "thanks!" A considered build burns tokens disproportionately on the inputs that warrant it. It reads the two-week-old thread twice, because the thread length and the named participants tell it that this is the kind of request where misclassifying is expensive. It skims the obvious noise.
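To make the attention point concrete, here is a minimal sketch of how a harness might decide how much reading an email deserves before any model is called. Everything in it is an assumption for illustration: the `Email` fields, the `reading_budget` function, the base of 500 tokens and the multipliers are hypothetical, and a real build would tune them against reviewed cases.

```python
from dataclasses import dataclass


@dataclass
class Email:
    subject: str
    body: str
    thread_length: int = 1      # number of messages in the thread
    has_attachments: bool = False


def reading_budget(email: Email, base_tokens: int = 500) -> int:
    """Return a rough token budget for reading one email (illustrative only)."""
    body = email.body.strip().lower()

    # Obvious noise: a very short acknowledgement gets a skim.
    if len(body) < 40 and any(w in body for w in ("thanks", "thank you")):
        return base_tokens // 5

    multiplier = 1
    if email.thread_length >= 5:
        # Long-running threads are where misclassifying is expensive.
        multiplier += 2
    if email.has_attachments:
        multiplier += 1
    return base_tokens * multiplier
```

Under these assumed numbers, a bare "Thanks!" gets a fifth of the base budget, while a twelve-message thread with an attachment gets four times the base. The numbers matter less than the fact that someone decided them on purpose.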
Half an hour in, the system meets the request from finance, the one with the shifted commercial terms. The model can produce a plausible classification, but its confidence is low. A naive system commits to its first answer. A considered system has been built to recognise its own uncertainty and route accordingly. It might run a second pass with a different prompt. It might quietly flag the email for the lawyer's attention rather than respond. The user-visible difference, by the time the lawyer opens the inbox at nine, is that one system has already sent a brisk, slightly wrong response to finance overnight, and the other has set that email aside in her queue with a one-line note explaining why.
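The routing decision described above can be sketched in a few lines. This is a sketch under stated assumptions, not any vendor's implementation: the two thresholds, the `Classification` type and the `route` function are hypothetical, and the confidence score is assumed to come from whatever classifier sits upstream.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical thresholds, chosen for illustration only.
AUTO_RESPOND = 0.85   # at or above this, act on the first pass
WORTH_RETRY = 0.60    # between the two, a second pass may settle it


@dataclass
class Classification:
    label: str
    confidence: float  # 0.0 to 1.0, from whatever classifier sits upstream


def route(first: Classification,
          second: Optional[Classification] = None) -> Tuple[str, str]:
    """Return (action, one-line note) for a classified email."""
    if first.confidence >= AUTO_RESPOND:
        return ("respond", f"classified as {first.label}")

    # Middling confidence: accept only if an independent second pass agrees.
    if (first.confidence >= WORTH_RETRY and second is not None
            and second.label == first.label
            and second.confidence >= WORTH_RETRY):
        return ("respond", f"two passes agreed on {first.label}")

    # Everything else waits for a human, with the reason attached.
    note = f"held: low confidence ({first.confidence:.2f}) on '{first.label}'"
    return ("queue_for_lawyer", note)
```

With these assumed thresholds, `route(Classification("pricing_query", 0.41))` returns `("queue_for_lawyer", "held: low confidence (0.41) on 'pricing_query'")`. That second element is the one-line note the lawyer sees at nine.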
A little later, buried three lines into the body of a routine vendor agreement, is a question about whether the company would consider an unusual payment-term arrangement on a different contract entirely. The vendor agreement itself is fine. The question is not. A naive system answers the agreement and ignores the question, or answers both and gets the second one subtly wrong. A considered system has been built to notice that the email is doing two things, decouple them, and answer the easy one while escalating the hard one. The decoupling is the judgment. The model does not require it. The system has to be designed to want it.
Now and then the queue surfaces something the system really should not touch. A subpoena draft, attached to an email that looks routine. A trademark question that turns out to involve a counterparty in active litigation. A request from a junior employee that, on closer reading, would commit the company to something the legal team did not authorise. A naive system completes its task. A considered system has a notion of "stop": this is no longer my problem, this is a human problem, I am going to surface it and pause. Whether the system has that notion is not the model's call. It is whether the people who built it have seen enough of these cases to know that the worst failure mode is a system that confidently helps when it should have stopped.
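A notion of "stop" can be as plain as a check that runs before the system drafts anything. The sketch below is deliberately crude and entirely hypothetical: the trigger words, the `Email` fields and the `stop_reason` function are assumptions, and keyword matching stands in for whatever detection a real system would need. The point is the shape, a check that can halt the pipeline and say why.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical triggers; a real list would come from cases the team has seen.
STOP_TRIGGERS = {
    "subpoena": "possible legal process; a lawyer must handle this",
    "litigation": "counterparty may be in active litigation",
    "we hereby agree": "may commit the company without authorisation",
}


@dataclass
class Email:
    subject: str
    body: str
    attachment_names: List[str] = field(default_factory=list)


def stop_reason(email: Email) -> Optional[str]:
    """Return why the system should pause, or None if it may proceed."""
    # Search subject, body and attachment names together, case-insensitively.
    haystack = " ".join(
        [email.subject, email.body, *email.attachment_names]
    ).lower()
    for trigger, reason in STOP_TRIGGERS.items():
        if trigger in haystack:
            return reason
    return None
```

A routine-looking email with `subpoena_draft.pdf` attached returns a reason and goes to a human. An ordinary NDA request returns `None` and carries on down the pipeline.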
By eight, the inbox is mostly clear, and the question becomes what to keep. The borderline classification on the finance email is a learning opportunity. So is the moment the system stopped on the subpoena. So is the moment it nearly answered the unusual payment-term question. A naive system forgets these moments by the end of the day. A considered system logs them, reviews them, and adjusts. Over a quarter the gap between the two compounds into something you can measure on a graph.
By nine o'clock, when the lawyer sits down, what ought to be true is this: a meaningful proportion of the sixty emails has already been handled cleanly, a smaller proportion is waiting in her queue with a one-line note about why, and her morning starts with judgment work rather than triage. What actually happens depends on whose harness has been working through the small hours. Two products, the same model, two completely different inboxes.
What the small decisions are actually worth
The reason any of this matters, beyond the visible difference in user experience, is that the small decisions are what determine whether the underlying economic problem in a legal department actually moves.
The problem in most in-house teams, when you look at the budget, is not that the team is slow. It is that work which does not require a senior lawyer is consistently consuming a senior lawyer's attention. The contract that follows a template. The vendor agreement that has been turned twenty times. The email that asks for something the team can answer in two paragraphs. For every pound spent on legal software in a department, somewhere between ten and fifty are spent on legal services. The tools have been competing for the small line. The services have been doing the work.
An AI system that gets the small decisions right takes that work off the senior lawyer in a way that holds. The inbox is shorter at ten. The morning starts with the question the lawyer is uniquely paid to answer. Over a quarter, the budget line on outside counsel for routine work starts to move, because the routine work is no longer being routed up. The economic problem moves.
An AI system that gets the small decisions wrong does the opposite. The first wave of misclassifications puts the wrong things in front of the lawyer, the right things in front of nobody, and the team develops a sceptical relationship with the system within a fortnight. The lawyer ends up doing the work the system was supposed to do, and the work of correcting the system on top. The budget line does not move. The team blames AI. The CFO concludes AI does not work in legal. None of that is the model's fault. The model worked. The harness did not.
This is the test, I think, that will start to matter more than any demo. It is also the only test that horizontal AI products, the ones not built specifically for the work a legal team actually receives, cannot reliably pass. Not because their models are worse. Because the judgment required to make those small decisions correctly is domain-specific, and you only build it by sitting with the work long enough to know which kinds of failure are tolerable and which kinds end up in front of the regulator.
The next twelve months
Twelve months ago, the legal buyer's question about AI was "does this work?" Six months ago it was "where does this work?" This year, increasingly, it is "how does this work when something goes wrong?" That last question is the one the harness has to answer.
A useful exercise, if you are evaluating an AI vendor right now, is to ignore the demo and ask for a walkthrough of three things: a request the system would refuse to act on, a request the system would handle but flag, and a request the system handled last week that turned out to have been wrong. The first answer tells you whether the vendor has a notion of "stop." The second tells you whether they have a notion of uncertainty. The third tells you whether they have a notion of memory. Between them, the three answers tell you whether the vendor has any judgment under the demo, or whether the demo is the whole product.
The reason this kind of question will start to land is that buyers themselves are getting much sharper. The GC who used Claude over breakfast and ran into the bank chatbot in the afternoon is, in a small way, already running the experiment in her own life. She knows what it feels like when the AI is good and when it is bad. She knows the model is not the variable. She has not yet articulated what is, but she will, because every day she is being given more data.
The next twelve months are going to sort AI vendors in legal more harshly than the last twelve did, and I think for a good reason. The harness is no longer a moat. Anyone can build it. The product is now the judgment built into the harness, and judgment cannot be vibe-coded over a weekend. It comes from being deep enough in the work to know when to spend tokens, when to escalate, when to stop, and when, in the small hours of a Monday morning, the right answer to an email is no answer at all.


