The harness matters more than the model
When my agent produces bad output, I don't switch models. I open a markdown file.
I don't comparison-shop. ChatGPT might be better than Claude this week — I genuinely don't know, because checking would cost me more than it would earn me. Same with Gemini, same with whatever ships next quarter.
Here's why I think the model question is mostly a distraction. In my experience, output quality is maybe 30% the model and 70% everything around it — how context loads, what rules fire automatically, how memory persists, which tools are wired in, when a second model double-checks the first. That layer barely exists for most users. They paste a prompt, take what comes back, and judge the model.
The industry has started calling that layer the harness. The gap I see widening in AI right now isn't between people on different models. It's between people who have invested in their harness and people who haven't.
What "harness engineering" actually means
The vocabulary has caught up to the practice over the past year.
Andrej Karpathy popularized context engineering in mid-2025 — the work of constructing the entire context window the model sees on each turn, rather than crafting one perfect prompt. He framed the LLM as a CPU and the context window as RAM, and once you internalize that mental model it's hard to go back to thinking of prompts as one-shot inputs.
Martin Fowler put the harness framing in cleaner form: agent = model + harness. The harness is everything else — guides, sensors, toolchain management, memory systems, lifecycle management. Mitchell Hashimoto (HashiCorp co-founder) coined harness engineering for the actual discipline: editing the harness when the agent makes the same mistake twice, instead of arguing with the model in chat. By early 2026 the term was everywhere — Addy Osmani, LangChain, multiple Substack newsletters, all using it the same way.
So this isn't a term I made up. It's a vocabulary that emerged because enough people doing real work with these systems noticed the same thing: the model is one component in a larger system, and the larger system is what determines whether the output is good.
For me, on Claude Code, the harness includes:
- Identity files that auto-load every session — global rules, working philosophy, public-writing policy, who I am
- A persistent memory system — ~30 fact files that the agent itself reads, updates, and prunes
- Hooks that fire on tool use — formatters, dangerous-command blockers, file protectors (one is sketched below)
- Skills the model can invoke — bundles of procedure plus context, triggered by intent
- A knowledge graph of project documents in Obsidian, traversed when context demands
- A two-phase review system — one pass before drafting catches scope and architecture mistakes, one pass after catches correctness bugs
- Cross-model review — Codex CLI running independent checks on Claude's output
None of that is the model. All of it shapes what the model produces.
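To make the hooks item concrete, here's roughly what a dangerous-command blocker looks like. This is a sketch, not my production config: it assumes Claude Code's documented hook contract (the pending tool call arrives as JSON on stdin; a non-zero exit blocks it and feeds stderr back to the model), and the field names and patterns are illustrative.

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: refuse obviously destructive shell commands.

Assumes the hook receives the pending tool call as JSON on stdin with
`tool_name` and `tool_input.command`, and that exiting with code 2 blocks
the call and returns stderr to the model. Check your tool's hook docs
before relying on that contract.
"""
import json
import re
import sys

BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",           # recursive delete from the root
    r"\bgit\s+push\s+--force",   # force-push over shared history
    r"\bdrop\s+table\b",         # raw destructive SQL
]

def main() -> int:
    event = json.load(sys.stdin)
    if event.get("tool_name") != "Bash":
        return 0  # only gate shell commands
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            print(f"Blocked by harness rule: {pattern}", file=sys.stderr)
            return 2  # block the call and tell the model why
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Registered under a PreToolUse hook in the settings file, a dozen lines like this retire a whole category of "please never do that again" conversations.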
Why I don't model-shop
There's a cost to switching models that almost nobody talks about: every harness is tuned to a model.
Claude reads markdown rules differently than GPT does. Claude's tool-calling has different failure modes than Gemini's. The way you write a CLAUDE.md is not the way you'd write an instruction file for ChatGPT's harness, even if both are nominally markdown. Hooks, skills, MCP servers, subagents, deferred-tool loading — most of this ecosystem was built around Claude Code first, with cleaner integration than the alternatives. Anthropic ships the cleanest agent infrastructure I've used, full stop.
When someone tweets "I switched to GPT, it's better now," what they often mean is: "I left my Claude harness behind and started fresh with ChatGPT's defaults." That's not a fair comparison. They're putting months of accumulated configuration on one side and zero on the other.
But there's a sharper point. Both labs are shipping at a pace that makes "which model is better today" almost meaningless. Anthropic and OpenAI are releasing new models, new agent features, new tooling on a cadence measured in weeks. Whatever gap exists today between Claude and ChatGPT will probably close — in either direction — before I'd be done adapting to a new ecosystem. Even if I were convinced Codex was structurally better right now, I'd be betting on a window that's already closing by the time my migration is finished.
I have direct evidence of this from inside one model family. When Claude Opus 4.7 shipped in mid-April 2026, the new model interpreted instructions noticeably more literally than 4.6 did. Skills I'd written for 4.6's looser style broke quietly — not in obvious ways, but as subtly worse output: an extra clarifying question I didn't need, a slight drift from the spec, the model resolving an ambiguity differently than I'd have wanted. I had to rewrite my CLAUDE.md, my SOUL.md, every user-authored skill, and two memory files to be literal-compatible with 4.7. That migration took the better part of a day. Same model family. Same provider. Same product line.
Now imagine that cost paid not on a point release, but on a full provider switch — different prompt-format conventions, different tool-calling shapes, different agent infrastructure, different system-prompt idioms, different defaults around memory and context. By the time I'd finished, both labs would have shipped two more model versions, and I'd be doing it again.
So I commit. I bet on the model whose harness I'm willing to invest in, and I trust that whichever lab is "ahead" right now will lose that lead and reclaim it three more times before my next harness rewrite.
Why Anthropic, specifically
There's a pragmatic case for staying on Claude — the harness math I just walked through. There's also a values case, and it weighs more for me than the harness math does. I'm going to start with the messy parts because they matter for whether you should trust the rest of the argument.
In late February 2026, Anthropic published an updated Responsible Scaling Policy (v3.0). The headline change was a real walkback: the original 2023 RSP committed Anthropic to pausing model training if capabilities outstripped its ability to control them — a hard, public commitment that was the cornerstone of the policy. Version 3.0 replaced that hard pause with non-binding "safety roadmaps" — public documents describing ambitious goals, but not committing the company to actually stop. CNN's headline was accurate: "Anthropic ditches its core safety promise." TIME called it dropping the "flagship safety pledge." That's not spin. It's what happened.
Anthropic's stated reason was competitive: pausing while less careful labs accelerated would, they argued, produce a less safe world overall. I find that argument plausible but not entirely convincing — voluntary safety commitments only mean something if they survive their first real pressure test, and this one did not. So if you came in expecting me to claim Anthropic is the responsible AI company that holds every line, I'd rather lose you here than oversell.
Now the rest.
A week before the RSP walkback, Anthropic refused something genuinely costly. The Pentagon offered Anthropic a contract that would have permitted Claude to be used for "any lawful purpose" — the same blanket language OpenAI would sign onto a few days later. Anthropic refused on two specific carve-outs: no mass domestic surveillance of Americans, and no fully autonomous weapons systems. Dario Amodei said publicly that frontier AI is not reliable enough to drive autonomous weapons today, and that surveillance of U.S. citizens was a line Anthropic would not cross.
The cost was immediate and severe. On February 27, 2026, Trump directed all federal agencies to "IMMEDIATELY CEASE" use of Anthropic technology. Defense Secretary Pete Hegseth designated Anthropic a "supply chain risk to national security" — the first U.S. company ever given that designation, previously reserved for foreign adversaries. Trump publicly called Anthropic an "out-of-control, Radical Left AI company." Anthropic sued the administration, won a temporary injunction in late March, and as of late April 2026 the White House is drafting guidance to reverse the ban after introductory meetings between Dario Amodei, White House Chief of Staff Susie Wiles, and Treasury Secretary Scott Bessent.
OpenAI signed the same kind of "all lawful purposes" Pentagon deal less than 24 hours after Anthropic was blacklisted. Their February 28 agreement uses careful language about safeguards — autonomous weapons not used "where law, regulation, or Department policy requires human control"; no surveillance of U.S. persons "intentionally" — but it's the contract Anthropic wouldn't sign, with the carve-outs Anthropic insisted on as policy floors converted into qualifiers a Pentagon lawyer can interpret. Sam Altman has spent the months since defending the deal against backlash.
So the honest summary, all in one place: Anthropic is not a saint. They took a $200M Department of Defense CDAO contract in July 2025. They have a Palantir/AWS GovCloud partnership delivering Claude to defense and intelligence customers (announced November 2024). They walked back the central pause commitment of their flagship safety policy in February 2026. They are imperfect on multiple visible fronts.
But when their values bumped up against the largest contract opportunity in their company's history — the federal market, end to end — they held the meaningful lines and ate the cost. They got blacklisted. They sued the government. They are still fighting that fight.
OpenAI, faced with the same offer, signed.
That is the direction-of-drift comparison I can actually observe in the public record. Both labs do defense work. Both labs have walked things back at various points. The difference is that one of them, when it counted most, refused the deal that would have let the model be turned on Americans.
If I'm going to spend months building a harness around someone's model, I want it to be the company that took the L on the federal contract rather than the one that signed the paper. Right now, that's Anthropic — imperfect, walked-back-the-pause, sued-the-government, still-blacklisted-pending-appeal Anthropic. Not because they're saints, but because their direction of drift is one I can live with, and OpenAI's isn't.
The skill gap is in the harness
The discourse around AI is fixated on benchmarks, leaderboards, and which lab shipped what. The actual gap I see widening — between people who get useful output and people who don't — has almost nothing to do with model choice.
It has everything to do with whether you've:
- Written down what you actually want, in rules the model reads every time, in language literal enough that it can't drift
- Set up your environment so the model can't make the obvious mistakes — deleting things you'd never approve, ignoring conventions you've stated fifty times, hallucinating file paths
- Built a memory layer so you don't repeat the same context every session (a minimal sketch follows this list)
- Wired in the tools the model needs to verify its own work, instead of letting it bluff
- Created a review loop where another agent — or you, with a checklist — catches what the first pass missed
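The memory-layer item is the one people overestimate. Here's a minimal sketch of the shape it can take. This is not Claude Code's built-in memory, and the `memory/` folder layout is an assumption; the idea is just plain files the model is told to read at session start and append to when something worth keeping comes up.

```python
#!/usr/bin/env python3
"""Tiny memory layer sketch: plain markdown fact files.

Illustrative only. `memory/` and the file layout are assumptions; the
point is that "memory" can be a folder of small files the agent reads at
session start and updates when a decision gets made.
"""
from datetime import date
from pathlib import Path
import sys

MEMORY_DIR = Path("memory")  # e.g. memory/project.md, memory/decisions.md

def remember(topic: str, fact: str) -> None:
    """Append a dated fact to the topic's file, creating it if needed."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{topic}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {date.today().isoformat()}: {fact}\n")

def recall() -> str:
    """Concatenate every fact file into one block for the context window."""
    sections = []
    for path in sorted(MEMORY_DIR.glob("*.md")):
        sections.append(f"## {path.stem}\n{path.read_text(encoding='utf-8')}")
    return "\n".join(sections)

if __name__ == "__main__":
    if len(sys.argv) >= 3:
        remember(sys.argv[1], " ".join(sys.argv[2:]))
    else:
        print(recall())
```

A rule in the instruction file that says "read the memory folder before doing anything, write to it when we decide something" closes the loop.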
This work is unsexy. It doesn't trend. There is no benchmark for "well-tuned harness." But it's where almost all the leverage is.
Two builders with access to the same model: one ships 3-5x more useful output than the other, because one has built a harness and the other is still typing prompts one at a time. That gap will keep widening over the next five years, and I think it'll be bigger than the gap between any two frontier models.
What harness engineering looks like in practice
The single discipline that matters: when your output drops, don't re-prompt — edit the harness.
When the model does something I didn't want, I don't argue with it in chat. I open the relevant config file. I find the rule that wasn't literal enough — usually some instruction the model interpreted differently than I meant — and I rewrite it more precisely. Or I find there was no rule for the situation, and I add one. Or I notice I keep doing the same thing manually that should be a hook.
Every fix is small. Together they compound.
After two months of treating my agent's config as software, my harness has:
- A global rules file with destructive-operation gates, code standards, and project-specific workflow triggers
- A working-philosophy file that defines my quality bar
- An auto-memory system with ~30 fact files the agent maintains itself
- Custom skills for the workflows I do weekly — session start, session handover, weekly setup audit, model-migration drills
- A two-phase review doctrine that auto-prompts a second session window when I touch migrations, auth, or row-level security (the trigger for it is sketched below)
- A native macOS menu bar app I built so I can see context-window usage in real time and avoid hitting compaction at the wrong moment
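The two-phase review doctrine is less magic than it sounds; the auto-prompt is a hook watching file paths. Here's a sketch of that trigger, with placeholder directory names, assuming a PostToolUse-style hook that sees the finished tool call as JSON on stdin. How the reminder reaches the model depends on your tool's hook contract.

```python
#!/usr/bin/env python3
"""PostToolUse hook sketch: flag edits to sensitive areas for review.

The directory names are placeholders for wherever migrations, auth code,
and row-level-security policies live in your repo. Field names follow
Claude Code's hook docs as I understand them; verify against your tool.
"""
import json
import sys

SENSITIVE_MARKERS = ("migrations/", "auth/", "policies/")

def main() -> int:
    event = json.load(sys.stdin)
    if event.get("tool_name") not in ("Edit", "Write"):
        return 0
    path = event.get("tool_input", {}).get("file_path", "")
    if any(marker in path for marker in SENSITIVE_MARKERS):
        # A non-zero exit with a message on stderr is one way to surface
        # the reminder; check how your tool feeds hook output back.
        print(
            f"{path} touches a sensitive area: run the second-phase "
            "review before merging.",
            file=sys.stderr,
        )
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())
```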
That's not because I'm clever. It's because I treat every bad output as a signal that the harness has a gap.
Most people treat bad output as a signal to switch models.
What to do
If you're using AI seriously and you've never edited an instruction file:
- Start a CLAUDE.md, or whatever your tool's equivalent is. Write down five non-negotiables: code standards, language preferences, things you never want the model to do.
- The next time the model does something you didn't want, don't argue with it in chat. Open the file. Add a rule. Make it literal.
- Add a memory layer. Make the model write things down — what your project is, who's involved, what you decided last time.
- Wire one hook. Block one dangerous command. Auto-format one language (a formatting hook is sketched after this list).
- Build one skill — a procedure you'd otherwise re-explain every conversation.
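For the "auto-format one language" step, the whole hook can be this small. A sketch assuming a PostToolUse-style hook that receives the finished tool call as JSON on stdin, with black standing in for whatever formatter your stack actually uses:

```python
#!/usr/bin/env python3
"""PostToolUse hook sketch: auto-format Python files the agent edits.

Assumes `black` is installed and that the edited path arrives under
`tool_input.file_path`; swap in prettier, gofmt, or rustfmt as needed.
"""
import json
import subprocess
import sys

def main() -> int:
    event = json.load(sys.stdin)
    path = event.get("tool_input", {}).get("file_path", "")
    if path.endswith(".py"):
        # Format in place; never fail the hook over a formatter error.
        subprocess.run(["black", "--quiet", path], check=False)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```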
Do that for two weeks. You will feel the model getting better.
It's not the model. It's you, finally telling it what you want.
The hardest part of harness engineering is that it's invisible. You won't post a screenshot of your CLAUDE.md to LinkedIn. There's no benchmark for it. There's no leaderboard. Nobody will tweet about your hook configuration.
But it's where the gap is opening.