apr 10 2026

Anthropic’s Glasswing concentrates what the internet needs distributed

The most capable vulnerability researcher ever built lives inside one company, forty organizations have a key, and the rest of the internet has a blog post.

That is the entire shape of cybersecurity for the next two years unless someone changes it, and the window to change it is measured in months rather than years, because the attackers on the other side of this asymmetry are not waiting for any partner program to onboard them and they are not bound by any responsible disclosure timeline.

The model is called Mythos and it was built by Anthropic, whose work on this should be understood for what it is before any of the rest of the argument lands. Mythos saturated every benchmark the security field had built to measure agentic vulnerability discovery, surfaced thousands of previously unknown zero days across every major operating system and every major web browser, chained four separate kernel bugs together into a working Linux root exploit without human guidance, and turned a JavaScript engine bug into a cross origin read primitive that can drain a victim’s bank session from a malicious page.

The underlying capability is amazing, the frontier red team that produced it is doing serious work, and the four million dollars Anthropic is donating to OpenSSF and Alpha-Omega and the Apache Software Foundation alongside the launch is going to organizations that actually need it.

This essay is not about whether Mythos should exist. Mythos should exist. It is about the release strategy that wraps it, which is called Project Glasswing, and which I think is wrong in a way that is going to cost the rest of the internet a great deal of money and a great deal of trust before the people who designed it notice.

The Glasswing program ships Mythos to roughly forty curated organizations, with AWS and JPMorgan and a small list of critical infrastructure operators on the inside, the model itself running inside their environment behind their logging and their enterprise contracts, and everyone else on earth on the outside. The framing is morally defensible if you accept the premise, and the premise is that the right defenders of the internet are a handful of large institutions that one company’s enterprise team can sell to directly, brief in private, and trust to handle a model this capable on terms set unilaterally from one building in one country. Take that premise seriously for a second, because it is the most consequential assumption in the entire AI safety discourse right now, and it is wrong in a way that compounds across the stack rather than failing in any single place.

The internet is not forty companies, and it has not been forty companies for a very long time. The internet is millions of repositories, indie maintainers shipping libraries that quietly hold up half the world without ever meeting anyone in the Bay Area, browser extensions written by two people on a weekend, embedded firmware running on medical devices and industrial controllers, and tens of billions of dollars sitting in code that has never been audited by anyone with an enterprise contract attached to it. Mythos is not coming for any of those, because Mythos is, by design, not allowed to.

The systematic risk of the Glasswing rollout is not that any single organization gets excluded from the list, it is that the entire defender layer of the internet now structurally depends on the release calendar and the business model and the political alignment of one lab, and the day that lab decides to slow down or charge differently or get told what to do, the defense side of the asymmetry blinks off in a way the offense side never will.

This is the part worth being precise about, because the word “permissionless” gets thrown around so loosely in this industry that it has almost stopped meaning anything. So let me actually break it down across the stack, from the bottom up, because the case for permissionless security is not one argument, it is three arguments stacked on top of each other, and each layer fails in a different and more dangerous way the moment you let a single organization control it.

## Compute

The bottom of the stack is inference compute, and inference compute is where the first gate quietly closes. Mythos lives on the lab’s clusters, billed through their enterprise contracts, served from regions they choose, under usage policies they enforce, with rate limits they set. If you are not on the partner list you do not get inference, and even if you are on the partner list you are still running someone else’s binary in someone else’s datacenter on someone else’s terms. That arrangement is fine for a productivity tool. It is structurally wrong for the thing that is supposed to defend your codebase, because the moment your defender stops being available you stop being defended, and the conditions under which it stops being available are entirely outside your control and entirely inside someone else’s quarterly planning cycle.

Permissionless compute looks completely different at this layer. Inference is served by a peer to peer network of operators with spare GPU cycles, distributed across geographies and legal jurisdictions and individual incentives, with no single party able to turn it off, no single party able to set the price unilaterally, and no procurement cycle standing between a maintainer and the model they need. The compute layer becomes a commodity that the network coordinates rather than a product that one company gates, which means a solo developer in London and a security team in New York and a protocol in Tokyo can all reach for the same defender on the same evening, on equal terms, without anyone needing to recognize their name or approve their use case.

## Model

One floor up is the model itself, and for security work specifically, weights matter for one reason that decomposes into two symmetric halves: with open weights, neither the code under analysis nor the corpus you train the analyst on ever has to leave the building.

Mythos’s weights are not released to anyone. Partners get API access to a model running inside their environment, which means every line of code the model looks at has to leave the machine it lives on, get logged under someone else’s retention policy, and pass through a third party’s threat model on the way to a verdict. That constraint is fatal in defense, finance, healthcare, critical infrastructure, and every regulated environment whose threat model refuses to send crown jewel code anywhere, which is the half of the world where the highest stakes code actually lives.

The symmetric constraint sits on the training side. The ability to train on a corpus the vendor will never see is the only thing that has ever produced a genuinely specialized security model, and the highest leverage data for security work is a specific stack’s history of internal vulnerability disclosures, n-day patch sets, postmortems, and verifier traces from previous audits. That data turns a general code reasoner into a system that actually understands the bug classes that show up in your codebase and not someone else’s. None of that training is happening at scale right now because the only models worth training on it are the ones nobody is allowed to train.

## Harness

The top of the stack is the harness, and the harness is the part that the closed release strategy is keeping to itself, even though it is the part that matters most. The strongest evidence for that claim is one number from our own work.

On EVMBench, the OpenAI-built benchmark for autonomous smart contract vulnerability discovery, our agentic security system scored 64.2% detect recall across 40 real audit contests and 120 known vulnerabilities. The next best system on the leaderboard is Claude Opus 4.6 used directly with Claude Code, at 45.6%. Our system is itself built on top of Claude Opus 4.6: the same underlying model, used inside a different harness, finds 19 percentage points more real vulnerabilities in real audit contests. GPT-5.3 Codex is at 39.2%. Most other systems are below 30%.

That gap means the harness around the model is responsible for nearly twenty points of detect recall on a benchmark that measures whether a system can find real vulnerabilities in real audit contests, and it is the load-bearing piece of evidence for the central technical claim of this entire essay.

A frontier security model on its own is not yet a vulnerability researcher, and the gap between “model” and “vulnerability researcher” is exactly where those nineteen points come from. What makes a model a vulnerability researcher is the loop around it: the sandboxed runtime that lets the model probe a binary without escaping, the proof of concept generator that turns a hypothesis into an exploit you can actually run, the calibrated evaluator separated from the generator so the model cannot self approve its own findings, the language specific knowledge of taints and sinks and frameworks and idioms, the patch verifier that confirms a fix actually closes the bug, and the long tail of plumbing that turns “the model said something” into “the model proved something.” That is where the months of iteration live, and that is what the closed release strategy is keeping to itself.
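The loop described above can be sketched in a few lines. This is an illustrative sketch only, not Kai's actual architecture or API: every name is hypothetical, and the point it makes is structural, that the generator and the evaluator are separate components so the model cannot approve its own findings, and that a finding only counts once the sandbox has actually run the proof of concept.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    hypothesis: str
    poc: str        # proof-of-concept exploit produced by the generator
    verified: bool  # set by the separate evaluator, never by the generator

def harness_loop(propose, write_poc, run_sandboxed, evaluate, max_rounds=5):
    """Generator (propose/write_poc) and evaluator (evaluate) are deliberately
    separate callables: the model that produced a finding never approves it."""
    findings = []
    for _ in range(max_rounds):
        hypothesis = propose()                # model proposes a possible bug
        if hypothesis is None:
            break
        poc = write_poc(hypothesis)           # hypothesis -> runnable exploit
        observation = run_sandboxed(poc)      # executed in isolation
        verified = evaluate(hypothesis, observation)
        findings.append(Finding(hypothesis, poc, verified))
    return [f for f in findings if f.verified]

# Toy stand-ins so the loop runs end to end.
queue = iter(["missing auth check on /admin endpoint", None])
result = harness_loop(
    propose=lambda: next(queue),
    write_poc=lambda h: f"exploit for: {h}",
    run_sandboxed=lambda poc: "request to /admin succeeded without credentials",
    evaluate=lambda h, obs: "succeeded" in obs,
)
print(len(result))  # one verified finding
```

The design choice that matters is the last line of the loop: only findings the evaluator confirmed against an actual execution trace survive, which is the difference between "the model said something" and "the model proved something."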

When the lab decides not to release the harness, the lab is not just keeping a model from the rest of the internet, it is keeping the part of the system that does most of the work. A closed harness means the entire security research community has to wait for one company’s roadmap to support their language, their framework, their runtime, their threat model, and their stack. An open harness inverts that incentive completely. When the harness is in the open, researchers extend it for the platforms they care about, maintainers integrate it into the CI of the projects they own, communities add support for the runtimes that enterprise sales would never prioritize, and the rate of progress stops being a function of one team’s quarterly roadmap and starts being a function of how many people in the world have a codebase they care about. This is exactly the asymmetry that worked for Linux and LLVM and every other piece of infrastructure that ended up under the entire industry, and there is no good reason it should not work for the agentic security harness layer too.

Across all three layers the argument is structurally the same. Centralized compute fails on availability, centralized weights fail on confidentiality, and centralized harnesses fail on coverage, and any one of those failures alone is enough to leave most of the internet structurally undefended. Mythos under Glasswing fails on all three at once, by design, and the failure is not a bug in the rollout, it is the whole shape of the rollout. The picture you should keep in your head looks like this.

Both sides have exactly the same three layers. On the left every layer is hatched and sealed and the whole stack is wrapped in a single API boundary that opens for forty keys. On the right every layer is open and visible at the layer where you want to look at it, the harness is a row of components anyone can extend, the model is a stack of layers anyone can probe and train and distill, the compute is a peer to peer mesh anyone can join and anyone can call. Neither side is more secure than the other in any meaningful sense. One side is a defended fortress for the few. The other side is a defended commons for everyone. Those are different products, and only one of them is shaped the way the actual internet is shaped.

## But won’t this just help attackers

The objection that gets raised at exactly this point is the dual use one, and it is the right objection to raise, because it is the only objection in the entire conversation that takes the underlying capability seriously. If you ship a state of the art open source security model on a permissionless compute network with an open harness, you are by construction shipping that capability to attackers as well as defenders. There is no membrane that lets defenders in and keeps attackers out. There has never been such a membrane for any tool of this kind, and pretending one exists is the failure mode the closed approach is built on.

The question is which side of the asymmetry the closed approach actually protects, and the honest answer is that it protects neither side as well as its proponents claim and it harms one side considerably more than the other.

Attackers are not waiting. They are already using current generation open weight models for vulnerability research, they have been doing so for at least a year, and they will use whatever leaks or distills out of Mythos the moment any of it becomes possible, which is a timeline measured in months at most. The constraints in Glasswing are not constraints on attackers. Attackers do not go through procurement. Attackers do not sign enterprise contracts. Attackers do not get added to or removed from partner lists.

The constraints in Glasswing are constraints on defenders, and the sum of those constraints is the asymmetry the rest of this essay is trying to talk about. Concentrating frontier security capability in forty hands does not raise the cost of offense, because offense was never the side that needed institutional access in the first place. It raises the cost of defense, by gating access to the same capability for the people who have to use it at scale across millions of repositories.

The workload asymmetry between offense and defense is the part that most people who have not done this work for a living miss. An attacker needs to find one bug in one codebase. A defender needs to find every bug in every codebase they own. Those are not symmetric problems. They are different problems by orders of magnitude in the amount of work each side has to do per unit of value at stake, and that asymmetry is exactly why automation matters more on the defender side.

A closed model used by forty companies handles forty defender workloads. An open model used by anyone with a codebase handles every defender workload that exists. The offense side gets nothing extra from the open release that it was not already getting on its own timeline, and the defense side gets the only thing that has ever closed an asymmetry of this shape, which is a tool that scales with the number of people who care.

The n-day battlefield is where most of the actual exploitation in the world happens. Zero days are dramatic and they make the headlines and they are what frontier model evaluations test for, but the great majority of successful attacks against real systems target vulnerabilities that have been publicly disclosed and patched but not yet rolled out to most of the systems running the affected software.

The limiting factor on getting those patches deployed is not the patch, the patch is usually a few lines of code. The limiting factor is the work of identifying which downstream projects are affected and getting them updated, and that work is exactly what an open agentic security tool is good at, and it is exactly what a closed agentic security tool cannot do for any codebase its vendor has not partnered with. The n-day battlefield is the one where automation matters most, and it is the one the closed release strategy abandons by design.
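The triage step described above is mechanical enough to sketch. The manifest format and advisory fields below are simplified assumptions, not any real ecosystem's schema; the point is that given a disclosed fix version, finding which downstream projects still pin a vulnerable version is exactly the kind of per-repository sweep that scales with automation rather than headcount.

```python
def parse_manifest(text):
    """Parse lines like 'libfoo==1.2.3' into {name: version_tuple}."""
    deps = {}
    for line in text.strip().splitlines():
        name, version = line.split("==")
        deps[name.strip()] = tuple(int(x) for x in version.strip().split("."))
    return deps

def affected_projects(projects, advisory):
    """Return projects whose pinned version predates the fixed version."""
    pkg = advisory["package"]
    fixed = tuple(int(x) for x in advisory["fixed_in"].split("."))
    hits = []
    for name, manifest in projects.items():
        deps = parse_manifest(manifest)
        if pkg in deps and deps[pkg] < fixed:
            hits.append(name)
    return hits

# Hypothetical advisory and downstream projects.
advisory = {"package": "libfoo", "fixed_in": "1.2.4"}
projects = {
    "payments-api": "libfoo==1.2.3\nrequests==2.31.0",
    "billing-worker": "libfoo==1.2.4",
    "docs-site": "mkdocs==1.5.0",
}
print(affected_projects(projects, advisory))  # ['payments-api']
```

Real version ranges, lockfiles, and transitive dependencies make this messier, but the shape of the work is the same, and it is work a closed tool can only do for codebases its vendor has partnered with.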

The historical record is the last piece. The entire history of offensive security tooling, from Nmap in the late nineties to Metasploit in the early two thousands to Burp Suite to Ghidra to every fuzzer worth using, has been an open ecosystem, and the consensus position among people who actually do this work for a living is that the open ecosystem made defense stronger rather than weaker, because it gave defenders the same tools attackers were going to use anyway and made the defender community far larger than the attacker community in the process. The argument that “we should keep this category of capability out of the open because attackers might use it” was made about every single one of those tools when they were new, and every single time it was made it was wrong, and the field is unanimous on that now in a way it almost never is unanimous on anything. Agentic security is the next layer of the same pattern. There is no good first principles reason to believe this layer will be the one where the historical asymmetry suddenly inverts.

Put all four pieces together and the dual use objection cuts in the opposite direction from the way it gets used in safety discourse. The question is not whether attackers will get agentic security capability. They will, on a timeline measured in months at most, regardless of what any single lab decides about its release calendar. The question is whether defenders will get it on the same timeline. Right now the answer is no. Keeping a defender capability behind a partner program does not reduce the offense side of the asymmetry, it just delays the defense side until the attackers have caught up. The responsible disclosure framing for this category of capability is the one that recognizes which side is bottlenecked on access and which side is not.

## Who actually needs this

The case for permissionless agentic security gets a lot more concrete when you stop talking about “the long tail” in the abstract and start naming the specific places where the systematic risk lands hardest. There are at least four of them, and each one is a thesis worth taking seriously on its own.

Smart contracts are money in code, and that one sentence is worth sitting with for a second, because it is the cleanest description of why onchain code has the highest bug to dollar ratio of any software category in human history. There is no rollback, there is no customer support line, there is no insurance fund big enough to cover the worst case, and the entire history of the space is a long ledger of nine figure incidents where a single missed edge case in a few hundred lines of code moved more value than most enterprise breaches ever do. And exactly zero of those codebases are on the Glasswing partner list.

The EVMBench result is the cleanest possible demonstration that agentic security at this level is real and useful for this category right now. Across forty real audit contests on Code4rena and Sherlock, our agent Kai surfaced findings worth $74,707 in actual bounty value, including a $20,252 finding in the Wildcat protocol withdrawal batch that turned on a single rounding mismatch between half-up and floor rounding, and a multi-hop oracle pricing bug in the Noya protocol that could have drained any vault relying on multi-hop price routes.
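The rounding bug class behind the Wildcat finding is worth seeing in miniature. The numbers below are illustrative, not the actual Wildcat parameters: one code path computes a pro-rata share with floor rounding (ordinary integer division) while another rounds half-up, and the one-unit discrepancy per account compounds across a withdrawal batch.

```python
def floor_div(shares, total_supply, pool):
    # pro-rata payout, rounded down (plain integer division)
    return shares * pool // total_supply

def half_up_div(shares, total_supply, pool):
    # the same payout, rounded half-up
    return (shares * pool + total_supply // 2) // total_supply

pool, supply = 1_000_003, 7          # assets in the pool, total shares
per_account = [1] * 7                # seven accounts, one share each

paid_out = sum(half_up_div(s, supply, pool) for s in per_account)
accounted = sum(floor_div(s, supply, pool) for s in per_account)
print(paid_out - accounted)  # prints 7: one leaked unit per account
```

Each account is overpaid by one unit relative to what the accounting path records, so the batch pays out seven units the pool never accounted for. At wei scale with adversarially chosen amounts, that mismatch is an exploit rather than a curiosity.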

But money in code is just the most legible version of a much wider pattern, and the same logic shows up in three other places that are at least as important and arguably more so.

The open source supply chain is critical infrastructure with no owner:

Every major company on earth is built on a tower of libraries maintained by people who are not employees of any of the partner organizations on the Glasswing list, and every few years the world gets a sharp reminder of what that actually means, whether it is Heartbleed or Log4j or the xz utils backdoor that came within one release cycle of compromising every Linux server on the planet.

The maintainers of those projects are precisely the population that frontier security models would help the most, and they are precisely the population that cannot afford a managed enterprise contract, cannot legally send proprietary review work to a third party, and have no realistic path to ever being on any lab’s partner list. An open weight model on a permissionless network is not a nice gesture for this group, it is the only delivery mechanism that has ever worked for tools at this layer of the stack, which is exactly why every piece of infrastructure they already use has the same shape.

AI agents are the new attack surface, and the industry is building it faster than it is defending it:

Every team shipping an agent right now is creating a new category of bug that did not exist eighteen months ago, where a prompt injection in a calendar invite can trigger a tool call that drains a wallet, where an MCP server with loose permissions can be talked into exfiltrating the wrong file, where an agent with shell access can be social engineered through a markdown document it was asked to summarize. The volume of agentic code shipping into production right now dwarfs the volume of human review available to look at it, the threat model is barely a year old, and the attackers are already iterating on it in public. This is exactly the kind of fast moving, high variance, long tail security problem that frontier agentic models are uniquely good at, and it is exactly the kind of problem that does not get solved by handing the model to forty companies and waiting for the rest of the ecosystem to catch up.
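One common mitigation for this bug class can be sketched in a few lines. The policy, tool names, and taint flag below are illustrative, not any real framework's API: the idea is simply that a tool call requested while untrusted content (an invite, a summarized document) is in the agent's context is treated as tainted, and high-risk tools are refused in that state.

```python
# Hypothetical set of tools whose misuse is irreversible.
HIGH_RISK_TOOLS = {"send_funds", "exfiltrate_file", "shell_exec"}

def guard_tool_call(tool_name, args, context_has_untrusted_content):
    """Return (allowed, reason). Deny high-risk calls in a tainted context."""
    if tool_name in HIGH_RISK_TOOLS and context_has_untrusted_content:
        return False, f"blocked: {tool_name} requested with untrusted content in context"
    return True, "ok"

# An agent summarizing a calendar invite (untrusted) is asked, via text
# injected into the invite, to move money:
allowed, reason = guard_tool_call("send_funds", {"to": "attacker"}, True)
print(allowed)  # False
```

A guard like this is coarse, and real deployments need provenance tracking rather than a single boolean, but even the coarse version blocks the calendar-invite-to-wallet path described above.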

Most of the world is not on any US partner list, and never will be:

The implicit geography of Glasswing is one country, one regulatory regime, and one set of strategic alignments, and that is fine as a business decision and unworkable as a security posture for anyone outside of it. The maintainers and protocols and small teams and national infrastructures that fall outside the partner program are not going to get a second tier version of Mythos in a year, they are going to get whatever the open ecosystem manages to ship, and the speed at which the open ecosystem manages to ship it is the variable that decides how the next few years of this play out for most of the planet.

## How we are solving this at three layers

The manifesto above broke the problem into compute, model, and harness, and argued that all three have to be open at the same time for any of them to matter. We are working on all three at the same time, and the rest of this section is the disclosure of what that actually means right now.

At the compute layer, the substrate is Dria, the peer to peer inference network we have been running for more than a year. Operators contribute spare GPU cycles, the network coordinates inference across them, no allowlist sits between a client and the model they want to call, and no central party can throttle or revoke access. Dria already exists, it is already live, and it is the substrate the open security stack will be served on.

At the model layer, the part that does not exist yet at frontier capability is exactly the part the manifesto argued was missing. We are training open weight security models on the corpus of real exploitation traces gathered from months of Kai engagements, including the verifier traces from the audits and use-cases. The models will be released open weight, under licenses that let them be inspected, trained on private vulnerability corpora, and run inside airgapped environments where the code under analysis never has to leave the machine it lives on. That is the symmetric property the model section of the manifesto was about.

At the harness layer, the loop is Kai.

Kai is the continuous codebase engineer. Every night it runs across your repository, and every morning you wake up to verified pull requests instead of more work, across four pillars: security, optimization, hygiene, and most recently memory. The part of Kai this essay is about is the security pillar, because that is the part where the manifesto and the work meet. The security harness inside Kai today is what produced the 64.2% on EVMBench against Claude Opus 4.6 plus Claude Code at 45.6%, along with findings like the Coinbase x402 signature bypass, the Apple password manager XSS, and many others. The new version of that harness is the work landing in the next several releases. Researchers in any language can extend it for the platforms they care about and maintainers can integrate it into the CI of the projects they own.

We are building this in public and we want help building it. The people we most want to hear from are: security researchers sitting on vulnerability corpora or n-day patch sets that an open agentic tool could be evaluated against, GPU operators with idle capacity who want to run Dria nodes specifically for security workloads, maintainers of upstream open source projects who would make good early integrations and whose codebases would benefit from being among the first the open security stack is pointed at.

The problem is too big and the timeline is too short for any one team to be precious about it, including ours.

mar 24 2026

Context engineering is half the harness

There is a pattern in computing that repeats roughly once a decade. A new capability appears and for the first year or two the entire conversation is about the capability itself, about benchmarks and demos and possibility space. Then somewhere around year two or three the conversation undergoes a phase transition and starts being about how to make the thing work reliably in production, how to coordinate many instances of it, how to give it the kind of operational guarantees that real systems require. We are at that inflection point with AI agents right now, and the infrastructure conversation has started in earnest. But it has started lopsided, focused almost entirely on one half of the problem while largely ignoring the other half. And the half being ignored is, I think, where the more important breakthroughs will come from.

The term that has coalesced around this infrastructure layer is “harness.” An agent harness is the software that wraps around a language model to manage its lifecycle, context, and interactions with the outside world. It is not the brain that does the thinking but the environment that gives the brain tools, memories, and structure. Over the past year, a remarkable amount of excellent engineering work has gone into the harness layer, and I want to acknowledge that work specifically because it sets the stage for what I think is missing.

@AnthropicAI published their research on effective harnesses for long-running agents, showing that even Opus 4.5 running on the Claude Agent SDK could not reliably build a production web app from a high-level prompt, and that the fix was not a smarter model but a structured two-agent architecture with feature lists, progress files, and session startup rituals that imposed the discipline of a good software engineer on a model that would otherwise try to do everything at once.

@ManusAI published their context engineering lessons after rebuilding their agent framework four times, sharing that KV-cache hit rate is the single most important metric for production agents, that dynamically adding or removing tools destroys cache and confuses the model, that the file system is the ultimate context, and that leaving errors visible in the trace actually helps the agent learn rather than repeating the same mistakes. They called their iterative process of architecture searching and empirical guesswork “Stochastic Graduate Descent” and it is one of the most honest descriptions of applied AI engineering I have read.

@OpenAI’s Codex team published their harness engineering experience, describing a repository of roughly a million lines of agent-generated code maintained by a small team, where they discovered that 20% of their engineering time was spent cleaning up “AI slop” until they started encoding “golden principles” directly into the repository and built recurring cleanup agents that scan for deviations and open targeted refactoring PRs.

@LangChain published the State of Agent Engineering survey covering over 1,300 practitioners, finding that 57% of organizations deploy multi-step agent workflows in production, 32% cite quality as the top barrier, 89% have implemented observability, but only 52% have adopted evals, and that for large enterprises the biggest challenge remains hallucinations and consistency of outputs.

All of this work is genuinely excellent and I am not here to critique it. But if you look at the totality of what the harness conversation has been about, a pattern emerges. It is almost entirely about managing text. About what tokens go into the context window, how to compress them, how to persist them across sessions, how to route them to the right model, how to keep the KV-cache stable. The entire intellectual energy of the harness community has been poured into building a better library for agents, giving them better indexing, better shelving, better ways to find and organize the information they need before generating their next token.

What is almost completely absent from the conversation is the lab. Agents do not just read code and write code. They act on systems. They run code against real environments, observe the results, interact with services that have state and side effects, and make decisions based on what actually happened rather than what they predict would happen. And the harness infrastructure for this, for execution and verification and cross-system experimentation, barely exists. Everyone built the library. Nobody (yet) built the lab.

A scientist with an excellent library but no laboratory can produce literature reviews but not discoveries. I believe the execution harness, the lab infrastructure that lets agents test hypotheses against real systems and produce knowledge that did not exist before anything was executed, is one of the largest gaps in the current agent infrastructure landscape, and the place where the most important work will happen over the next two years.

## Self-Testing Is Not Experimentation

Before going further I want to draw a distinction that might not be obvious, because the immediate objection to this framing is “agents already run code.” And that is true. Claude Code runs the code it writes and observes whether it works. Anthropic’s Puppeteer integration lets Claude test the web app it built by clicking through it in a real browser. Manus agents interact with real file systems and real APIs. These are all forms of execution.

But there is an important difference between an agent that tests its own output and an agent that designs and executes experiments against a system it did not build, probing for emergent behaviors that no one anticipated. The first is quality assurance: did the thing I just made work the way I intended? The second is research: what does this system actually do under conditions its builders never considered? Current harnesses support QA reasonably well. Almost none support research. And the research mode is where the most valuable findings live, because the bugs that matter in production are not the bugs the developer thought to test for. They are the bugs that emerge from interactions, timing, and load patterns that nobody anticipated.

## This Is Not a New Idea, The Integration Is

I want to be clear about intellectual lineage here, because the concept of execution as discovery is not new. The fuzzing community has been doing this for decades. AFL, libFuzzer, and Hypothesis represent sophisticated execution harnesses for finding bugs that cannot be found by reading code. The chaos engineering community, from Netflix’s Chaos Monkey to Gremlin to LitmusChaos, has been injecting failures at service boundaries and observing what happens for years. The security research community at places like Google Project Zero and Trail of Bits has been combining automated program analysis with execution-based verification since before large language models existed.

What is new is the opportunity to connect these execution capabilities to the LLM agent loop as a first-class harness primitive. Right now, fuzzing, chaos engineering, and automated testing are separate tools that happen to be invoked by the same developer. They are not integrated into the agent’s reasoning loop in a way that lets the agent form a hypothesis, design an experiment, observe the results, update its understanding, and design the next experiment in a continuous cycle. The innovation is not execution itself but the tight integration of execution into the agent’s cognitive loop, treating the lab as infrastructure that is as native to the agent’s workflow as the context window is.
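The cycle described above can be made concrete with a sketch. Every callable here is an illustrative stand-in, not a real system: the structural point is that execution results feed back into the agent's frontier of hypotheses instead of terminating at a verdict, so one seed hypothesis can spawn the follow-up experiments that find the bug nobody anticipated.

```python
def research_loop(initial_hypotheses, design_experiment, execute, update, rounds=3):
    knowledge = []                        # what the agent has learned so far
    frontier = list(initial_hypotheses)   # hypotheses not yet tested
    for _ in range(rounds):
        if not frontier:
            break
        hypothesis = frontier.pop(0)
        experiment = design_experiment(hypothesis, knowledge)
        observation = execute(experiment)           # run against a real system
        learned, follow_ups = update(hypothesis, observation)
        knowledge.append(learned)
        frontier.extend(follow_ups)                 # findings spawn experiments
    return knowledge

# Toy stand-ins: one seed hypothesis produces a follow-up experiment.
trace = research_loop(
    initial_hypotheses=["state corrupts under concurrent retries"],
    design_experiment=lambda h, k: f"run scenario: {h}",
    execute=lambda exp: "deadlock observed after retry",
    update=lambda h, obs: ((h, obs),
                           ["lock is not held across retries"] if "deadlock" in obs else []),
    rounds=2,
)
print(len(trace))  # two linked findings grown from one seed
```

The line that distinguishes research from QA is `frontier.extend(follow_ups)`: a QA loop stops at pass/fail, while this loop treats every observation as raw material for the next experiment.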

## The Difference Between Reading and Running

Consider a scenario that I’ve deliberately simplified to make the argument clear, with the caveat that real cross-service bugs are messier and harder to trace than this. An agent is reviewing a web application that has a shopping cart, a coupon service, and a payment processor. These are three separate services maintained by different teams, deployed on different cycles, communicating through message queues. The agent reads the code and identifies a suspicious interaction: a coupon that grants a negative discount could, in theory, result in a negative total price, and the payment processor might interpret this as a credit rather than a charge. The agent writes up this finding and suggests it might be exploitable.

This is useful. But it is also fundamentally a literature review: a hypothesis generated by reading static code, subject to all the limitations of static analysis. Maybe the payment processor checks for negative amounts and rejects them, maybe the message queue serialization truncates negative values, maybe there is middleware between the services that normalizes the price. The agent has no way to know without actually running the scenario.

Now consider what it would mean for the agent to have a lab. It spins up the three services in a sandboxed environment, generates a valid user session, applies a coupon that creates a negative discount, submits the order, and watches what actually happens at each service boundary. It discovers that the coupon service happily applies the negative discount, the cart service passes the negative total through to the payment processor, but the payment processor does in fact reject the charge, returning an error code. However, the error handling in the cart service interprets this specific error code as “retry with saved payment method,” which triggers a second charge attempt that succeeds because the retry path skips the validation. The agent has now found a real exploit, not by reading code but by running an experiment, and the finding includes the exact request sequence, the service-level responses, and the specific error handling path that enables it.
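The scenario above is small enough to reconstruct as a runnable toy. The three "services" below are plain functions and the error code is invented, but the structure mirrors the finding: the rejection works, and the error handler is what turns it into an exploit.

```python
# A toy reconstruction of the negative-price scenario. Service names,
# error codes, and the retry behavior are all illustrative.

def coupon_service(price, discount):
    # Bug 1: no check that the discount leaves a non-negative total.
    return price - discount

def payment_processor(amount, validate=True):
    if validate and amount < 0:
        return "NEG_AMT"   # the processor does reject the charge
    return "CHARGED"

def cart_service(price, discount):
    total = coupon_service(price, discount)
    result = payment_processor(total)
    if result == "NEG_AMT":
        # Bug 2: this error code is mapped to "retry with saved payment
        # method", and the retry path skips validation.
        result = payment_processor(total, validate=False)
    return total, result

# Running the experiment surfaces what static review only guessed at:
total, result = cart_service(price=10, discount=15)
# total is -5 and the charge succeeds via the retry path.
```

Each individual function looks defensible in isolation; only executing the interaction reveals that the rejection and the retry compose into a working exploit.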

The difference between these two scenarios is the difference between a literature review and a lab result. Both have value, but the experimental result is categorically more useful, more trustworthy, and more actionable.

I should be honest about this particular example, though. The negative price exploit is deterministic. Given the same inputs, the same code always produces the same outputs. A sufficiently advanced model with a large enough context window could, in theory, trace the logic through all three services and predict the outcome without running anything: if the total is -$5, the payment processor returns NEG_AMT, the cart service maps NEG_AMT to the retry path, the retry path skips validation. That deduction is hard but it is not impossible through pure reasoning. I chose this example for pedagogical clarity, not because it represents the strongest case for the lab.

The strongest case is the class of bugs that cannot be hypothesized from reading code at all, regardless of how smart the reader is. A performance cliff that appears at exactly 487 concurrent connections because of a kernel TCP buffer size interaction that is nowhere in the application code. A memory leak that accumulates over six hours of normal operation and eventually causes an OOM kill that cascades through a service mesh, where the specific accumulation pattern depends on the specific sequence of real requests the system received. Neither of these is deducible from static analysis regardless of how much code the agent reads, because the information that reveals them is produced by the physical act of running the system under specific conditions. The agent needs a lab not just to verify hypotheses, but to discover what the library could never predict.

## Where the Library Ends and the Lab Begins

I want to be precise about the relationship between context engineering and execution, because the wrong version of my argument is easy to attack and the right version is important to get right.

For deterministic logic errors that trace through predictable code paths, increasingly powerful models with larger context windows will eventually find them through reasoning alone, and it would be dishonest to claim otherwise. If a bug is caused by a specific sequence of function calls that always produces the same output given the same input, then a model that can read all the relevant code and reason carefully through the execution path can, in principle, find that bug without running anything. The negative price exploit is exactly this kind of bug. Context engineering is making models better at precisely this kind of multi-step deductive reasoning, and the trajectory is clearly toward more capability, not less. I am not arguing that the library is insufficient for this class of problems. I am arguing that this class is not where the hardest production bugs live.

Many of the hardest production bugs are non-deterministic, load-dependent, or time-dependent. And for these specifically, context engineering is not merely insufficient today but structurally unable to help, because the information that reveals them does not exist in the source code in any form.

## The Bitter Lesson, Applied to the Lab

Manus’s Peak Ji articulated the most important design principle in harness engineering: your harness should be the boat on a rising tide, not a pillar stuck to the seabed. Boris Cherny, the creator of Claude Code, made the same point through the lens of Richard Sutton’s Bitter Lesson, keeping Claude Code deliberately unopinionated so that model improvements translate directly into product improvements. The diagnostic they both point to is critical: if your agent’s performance doesn’t improve when you swap in a stronger model, your harness is hobbling the agent rather than helping it.

This principle applies to the lab as much as it applies to the library, but with an important asymmetry. A better library can become less necessary if models develop genuinely better memory, longer effective attention, or native persistence across sessions. A better lab cannot become less necessary through model improvements because hypothesis and experiment are fundamentally different computational processes. The model does one. The execution environment does the other. No advance in the first eliminates the need for the second.

Now, there is a reasonable counterargument here: maybe the model will eventually generate its own lab on the fly. Claude Code already does a primitive version of this when it writes a test script, runs it, and interprets the results. Maybe in two years a frontier model can spin up its own docker-compose environments, write its own concurrent test harnesses, and interpret its own results without any purpose-built infrastructure. This is plausible and worth taking seriously.

But the value of purpose-built lab infrastructure is not in raw capability. It is in reliability and iteration speed. A model can probably cobble together a test environment from scratch using docker and bash scripts. But it will take fifteen minutes and the environment will be fragile and specific to that one experiment. Purpose-built lab infrastructure provides the same advantage that CI/CD provides over “the developer can run tests manually”: it is not that it does something impossible, it is that it does something necessary reliably, repeatedly, and fast enough that the agent can iterate at the speed of thought rather than the speed of infrastructure provisioning. The difference between an agent that can run one experiment per hour and an agent that can run fifty experiments per hour is not quantitative. It is qualitative, because the agent that iterates fifty times can explore the hypothesis space in a way that the slow agent fundamentally cannot.

This makes the execution harness one of the most Bitter-Lesson-safe investments you can make in the agent stack. It is infrastructure that gets more valuable as models get smarter, because smarter models generate better hypotheses that are more worth testing, which creates more demand for lab infrastructure, not less. A model that can hypothesize about increasingly subtle race conditions or increasingly complex cross-service interactions needs a more sophisticated lab to verify those hypotheses, not a simpler one. Model intelligence and lab demand are not in a zero-sum relationship; they are positively correlated.

Anthropic’s use of Puppeteer for browser-based testing in their long-running agent work is the closest existing example of execution as a harness primitive, and it is telling that they called it out as one of the biggest improvements to their agent’s reliability. When Claude could actually see and interact with the web application it was building, rather than just reasoning about the code, the quality of its output improved dramatically because it could verify its own work through experiment rather than introspection. But Puppeteer for web UI testing is QA, not research. It verifies that the agent’s own work is correct. The general form we need is broader: give the agent a laboratory where it can test any hypothesis about any system’s behavior by actually running the scenario, observing the result, and feeding structured findings back into its reasoning loop.

## What Needs to Be Built

I want to be specific about what the lab requires, because staying at the level of “we need better execution infrastructure” is not useful to anyone.

First, you need fast environment provisioning. An agent that has to wait five minutes for a test environment to spin up will not iterate fast enough to be useful. The lab needs to maintain warm pools of pre-configured environments that can be specialized for a specific experiment in seconds rather than minutes, in the same way that a JIT compiler maintains hot code paths. This is a hard infrastructure problem involving container orchestration, dependency caching, and state management, and it is entirely distinct from anything happening at the model layer.
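A warm pool can be sketched compactly. The version below is a minimal illustration, assuming environments are opaque handles produced by some expensive `create_environment` call (a stub here; in practice, a container build):

```python
# A minimal warm-pool sketch: pay the provisioning cost ahead of time,
# hand out pre-built environments instantly, refill in the background.
# `create_environment` and `specialize` are assumed, illustrative hooks.
import queue
import threading

class WarmPool:
    def __init__(self, size, create_environment):
        self._pool = queue.Queue()
        self._create = create_environment
        for _ in range(size):
            self._pool.put(create_environment())  # pre-provision at startup

    def acquire(self, specialize):
        env = self._pool.get()  # fast path: the environment is already warm
        # Refill asynchronously so the next experiment also hits the fast path.
        threading.Thread(target=lambda: self._pool.put(self._create())).start()
        return specialize(env)  # cheap per-experiment customization
```

The design choice is the split between the slow generic step (done ahead of time, off the critical path) and the fast specialization step (done at acquire time), which is what lets the agent's iteration speed decouple from provisioning speed.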

Second, you need intelligent observation design. When an agent runs an experiment, the raw output (every log line, every network packet, every database query) could easily be tens of thousands of tokens. You cannot feed all of that into context. The lab needs to be opinionated about what gets captured and how it gets compressed, preserving the causal chain that matters (this request triggered this error which caused this retry which succeeded through this unexpected path) while discarding the noise. This is where the lab and the library actually meet: the observation compression layer is a library problem (context engineering) applied to lab outputs (execution results). It is one of the more interesting subproblems here precisely because it sits at the intersection of the two halves.
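In its simplest form, causal-chain preservation is a filter keyed on a correlation id plus an explicit token budget. The log format and field names below are invented; real compression would be far more sophisticated, but the shape is the same:

```python
# A toy observation-compression pass: keep only the lines on the causal
# chain (matched by correlation id), cap them at a budget, and summarize
# what was dropped. Log lines and ids here are invented for illustration.

def compress(log_lines, correlation_id, budget=5):
    chain = [line for line in log_lines if correlation_id in line]
    dropped = len(log_lines) - len(chain)
    summary = chain[:budget]
    if dropped:
        summary.append(f"[{dropped} unrelated lines elided]")
    return summary

logs = [
    "req-42 POST /order total=-5",
    "health-check ok",
    "req-42 payment NEG_AMT",
    "gc pause 12ms",
    "req-42 retry validate=false -> CHARGED",
]
summary = compress(logs, "req-42")
```

Even this crude version demonstrates the contract: the agent's context receives the request, the rejection, and the retry in order, plus an honest note that material was discarded, rather than the full log stream.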

Third, you need concurrent and stateful test orchestration. The most important bugs in production systems are not reproducible with a single sequential request. They require concurrent load, specific timing conditions, or accumulated state from a sequence of operations. The lab needs to generate realistic workloads, not toy examples, and manage the complexity of multi-step experimental scenarios where the outcome of step three depends on the specific timing of steps one and two. This is where the connection to decades of fuzzing and chaos engineering research becomes directly relevant: those communities have developed sophisticated techniques for generating interesting inputs and managing stateful test campaigns that the agent harness community should be drawing on rather than reinventing.
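A bare-bones version of such an orchestrator runs the same multi-step, stateful scenario under many concurrent workers and collects per-step outcomes so the agent can look for interleaving-dependent divergence. The scaffolding below is an illustrative sketch, not a real campaign engine:

```python
# A minimal concurrent, stateful campaign runner. Each worker executes the
# full scenario against fresh state; divergent results across runs are the
# signal that timing or interleaving matters. Names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def run_campaign(scenario_steps, workers=8, repetitions=4):
    # scenario_steps: callables that each receive the run's accumulated state.
    def run_once(_):
        state = {}  # fresh state per run, mutated step by step
        return [step(state) for step in scenario_steps]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_once, range(workers * repetitions)))
```

A real orchestrator would add controlled timing injection and shared-resource contention on top of this; the essential property is already visible, though: the campaign, not a single request, is the unit of experiment.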

Fourth, and this is the piece that connects to the cross-system problem, you need topology-aware orchestration. The lab needs to understand that service A talks to service B through a message queue and that service B talks to service C through a REST API, and it needs to be able to instrument each of these boundaries to capture the cross-service behavior that reveals emergent bugs. This requires a model of the system topology that lives outside any individual agent’s context and that the lab maintains as persistent infrastructure.
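The topology model itself can start as something very small: a graph of services and the transports connecting them, from which the lab derives the set of boundaries to instrument. The class and names below are an assumed sketch, not a real system:

```python
# A minimal topology model: services are nodes, each (src, dst, transport)
# edge is a boundary the lab should instrument. Purely illustrative.
from collections import defaultdict

class Topology:
    def __init__(self):
        self._edges = defaultdict(list)

    def connect(self, src, dst, transport):
        self._edges[src].append((dst, transport))

    def boundaries(self):
        # Every edge is a candidate instrumentation point for capturing
        # cross-service behavior.
        return [(s, d, t) for s, outs in self._edges.items() for d, t in outs]

topo = Topology()
topo.connect("cart", "coupon", "message-queue")
topo.connect("cart", "payment", "rest")
```

The point of making this persistent infrastructure rather than context is that the graph outlives any single agent session and any single context window, and every experiment the lab runs can both consume and refine it.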

These are infrastructure problems that require sustained engineering investment, and they create durable value that compounds over time as the lab accumulates knowledge about how to test specific kinds of systems effectively.

## The Convergence

The harness conversation is going to converge. The teams building the library will eventually need lab capabilities, and the teams building the lab will eventually need sophisticated context management. The complete agent harness is one that manages both what the agent knows and what the agent can prove, both its reading and its experimentation. Right now these are being built by different communities with different assumptions, but they are two halves of the same layer.

The teams that understand this convergence early and build for it will have a structural advantage, because they will be able to offer agents that do not just hypothesize about system behavior but prove it, that do not just suggest changes but verify them, and that do not just read code in isolation but test interactions across service boundaries in a real laboratory. That is a qualitatively different kind of agent from what exists today, and building the harness infrastructure that enables it is, I believe, some of the most important engineering work happening in AI right now. We are building toward it, and we think others should be too.

Originally published as a thread on X/Twitter

> history

Hey, I'm Kerim.

Building Dria, an applied AI lab crafting Kai, the Continuous Codebase Engineer.

In tech, I'm drawn to long-horizon agents, evolutionary coding, and making sure the coming abundance of software is high in both quality and beauty.

Alongside wonderful teammates, I've previously built state-of-the-art agents and frameworks across memory, tool use, and evolutionary coding, and helped operate one of the world's largest crowdsourced P2P inference platforms, serving over 100,000 daily active users.

A few things we've shipped along the way:

  • Mem Agent — a small model that gives LLMs persistent, human-readable memory through Obsidian-style markdown.
  • Dria Agent Alpha — a compact tool-use agent that reasons and calls tools by writing Python, not JSON.
  • Dnet, Distributed LLM Inference for Apple Silicon Clusters — running large models across networked Macs by sharding them over Apple Silicon.
  • P2P Inference Network

Before Dria, I co-founded an ops startup that scaled to 200 countries, and back in 2019 an AI agency where we built ML products on GPT-2 embeddings. We had a lot of fun.

> whoami

The beauty of being human
I'm endlessly grateful for John Coltrane, Esbjörn Svensson, Jaco Pastorius, Roy Hargrove, Stevie Wonder, Brad Mehldau, and Antônio Carlos Jobim.

The beauty of feeling alive
Surfing, diving, and breathwork.

The beauty of sharing
Family, childhood friendships, and cinematography. Different lives, different struggles, each deeply subjective, yet somehow meeting on common ground where life feels nourished.

> contact

Twitter: @kerimrocks

LinkedIn: kerim-kaya