Designing Loops - A Practitioner's Short Field Guide

A field census of agentic engineering’s founding practitioners — their workflows, anchor documents, and the truth about what still lives inside every loop

Jun 11, 2026

I. The Six-Word Sentence That Started a Discourse

“My job is to write loops.”

Six words. Boris Cherny — Head of Claude Code at Anthropic — said them, at the tail of a longer statement that ignited discussions across developer communities worldwide:

“I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.”
~ Boris Cherny

The first two sentences set context. The last one became the discourse. It was amplified almost immediately by Addy Osmani (Google Engineering Director) and Peter Steinberger (OpenClaw founder, now at OpenAI): “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.”

Addy Osmani - Loop Engineering

Addy Osmani@addyosmani

https://t.co/hIe0UX7z6T

11:30 PM · Jun 8, 2026 · 1.76M Views

292 Replies · 1.08K Reposts · 7.04K Likes

By the weekend, the discourse had fractured cleanly. One camp read this as a paradigm shift demanding immediate workflow overhaul. Another camp read it as irresponsible hype. A third camp — the practitioners who’d been quietly building agentic systems for twelve months — barely noticed, because they were already three architectural generations past the argument.

This article does not adjudicate the discourse. It goes past it. I profile six practitioners whose names appear in the primary sources of 2025–2026 agentic engineering, reconstruct their actual architectures from the record, and benchmark them against the last few months of controlled experiments in the ASCRS harness lab. The goal is not celebration. It is calibration.

The central finding, stated plainly: the prompt never left the room. It got promoted.

The loop is not a replacement for the prompt. It is a container for the prompt. Every loop in every architecture examined here is ultimately anchored by a document someone spent time brainstorming or wrote by hand.

II. The Field Census: Six Architects

The practitioners below represent the primary sources cited across the 2025–2026 agentic engineering literature. Each profile documents role, background, core technique, anchor documents, and — critically — the actual prompts or prompt structures they depend on.

Boris Cherny · Head of Claude Code · Anthropic

↳ Fact-check: Cherny — @bcherny June 7, 2026 (five tips post, SWE-Marathon context) · Threads mirror of the five tips post · How Boris Uses Claude Code (full tip archive) · OfficeChai — Cherny interview (loop-writing background) · InfoQ: Claude Code Creator Workflow (January 2026 setup) · Fortune Brainstorm Tech (Yahoo) — thousands of agents quote

What Cherny Actually Runs: The Five-Tip Framework

The misreading of Cherny’s claim is that he eliminated prompting. He did not. His June 7, 2026 post — five practical tips for running Opus autonomously for hours or days — is the most operationally precise statement he has made about his actual workflow. Mapped onto the hierarchy, the five tips are not advice about productivity. They are a configuration checklist for the L2–L4 architecture (details further below):

CHERNY — Five-Tip Framework (@bcherny, June 7 2026, in response to SWE-Marathon benchmarks)

↳ Fact-check: @bcherny — June 7 2026 post (five tips, SWE-Marathon context) · Threads mirror · /goal vs /loop distinction — Claude Code docs · Avi Chawla — four autonomy modes explained [/goal and /loop are verbatim Claude Code command names. The five tips are verbatim from Cherny’s post. Hierarchy level annotations (L2, L3) are part of the analysis.]

Tip 5 — self-verification end-to-end — is the one practitioners skip and then wonder why long autonomous runs drift. The loop can produce technically correct code that fails in the actual running environment. Cherny’s solution is to put the environment itself inside the verification loop: boot the web server, open the browser extension, run the mobile simulator. The judge is not a model reading a transcript. It is the product running.

The earlier PR management loop fits within this framework as a Tip 3 example — using /loop (cadence-based, checking the queue on schedule) rather than /goal (which would be used when the task has a definable completion state like ‘all tests pass’). The distinction matters at scale: /loop is for monitoring and polling; /goal is for bounded work with a finish line.

Geoffrey Huntley · Creator, Ralph Wiggum Technique · Open Source

↳ Fact-check: ghuntley.com/ralph/ — original technique post · Anthropic official ralph-wiggum plugin README · Dev Interrupted — Inventing the Ralph Wiggum Loop (interview) · paddo.dev — Ralph Wiggum autonomous loops deep dive

The Ralph Architecture

↳ Fact-check: ghuntley.com — ‘Ralph is a bash loop’ · Anthropic plugin README (Stop hook mechanism) · github.com/wiggumdev/ralph (implementation) · Syntax+Glitter — Ralph Wiggum Loop explainer · Sondera — Supervising Ralph (stateless resampling analysis) [Architecture is reconstructed from Huntley’s public posts and the official Anthropic plugin README]

What Huntley Actually Prompts

The Ralph technique’s power is inseparable from PROMPT.md — the anchor document that gets re-injected every iteration. A typical PROMPT.md is not a sentence; it is a structured specification containing goal, constraints, success criteria, and current state pointer. The loop resamples it fresh each time, avoiding the “context rot” failure mode where a growing conversation history causes the model to navigate toward its own prior attempts rather than toward the objective.

↳ Fact-check: ghuntley.com/ralph/ — PROMPT.md discipline · Anthropic plugin README — completion-promise mechanism · wiggumdev/ralph — full docs · Addyosmani.com — self-improving agents (PROMPT.md usage) [PROMPT.md template is a reconstruction based on Huntley’s documented technique; specific field names are inferred from the plugin spec and ghuntley.com]

Steve Yegge · Creator, Gas Town / Gas City · Ex-Amazon, Google, Sourcegraph

↳ Fact-check: Steve Yegge — Welcome to Gas Town (original) · Steve Yegge — Welcome to Gas City (SDK) · Cloud Native Now — Gas Town deep dive · steveyegge/gastown (GitHub)

The Gas Town Architecture

↳ Fact-check: Steve Yegge — Welcome to Gas Town (roles defined) · Cloud Native Now — ‘Kubernetes for AI’ analysis + role breakdown · Codex Blog — Gas Town roles + Beads analysis · paddo.dev — GasTown and the Two Kinds of Multi-Agent · Reading list — Gas Town architecture summary [Role names (Mayor, Polecat, Refinery, Witness, Deacon) and GUPP protocol directly from Yegge’s original Medium post]

Yegge’s core insight aligns with Geoffrey Huntley’s: the AI’s memory is the version-controlled repository, not the chat history. Twenty to thirty agents can work in parallel without colliding because their shared state is code — immutable, auditable, mergeable. What often gets lost in descriptions of Gas Town is that each worker role is defined by a role prompt. The Mayor has system-level orchestration instructions. Polecats have task-execution instructions. Refinery has merge-management instructions. Yegge didn’t stop prompting; he wrote seven distinct prompt architectures and then automated who calls them.

Peter Steinberger · Creator, OpenClaw · Former CEO PSPDFKit · OpenAI (from Feb 2026)

↳ Fact-check: steipete.me — Just Talk To It (workflow) · OpenClaw VISION.md analysis · Fortune — Who is Peter Steinberger? · Wikipedia — Peter Steinberger (programmer) · Grokipedia — Steinberger agentic engineering

VISION.md: The Document That Runs Before Every Loop

Steinberger’s most significant contribution to the practitioner literature is not a tool or a framework. It is a practice: writing VISION.md before writing any loop or agent configuration. For OpenClaw, the VISION.md is 94 lines — no marketing language, no growth projections. Its first sentence is: ‘OpenClaw is the AI that actually does things.’

↳ Fact-check: OpenClaw VISION.md — full analysis · steipete.me — VISION.md discipline (workflow post) · steipete.me — full personal site / posts [VISION.md template headings are reconstructed from Steinberger’s public OpenClaw VISION.md (94 lines). Section labels are paraphrased, not verbatim]

Addy Osmani · VP Engineering · Google Chrome · Author, Learning Patterns

↳ Fact-check: addyo.substack.com — Loop Engineering (Jun 8 2026) · addyosmani.com — Self-Improving Coding Agents · addyo.substack.com — The 80% Problem in Agentic Coding · addyosmani.com — Code Agent Orchestra

Osmani’s Five Loop Building Blocks

↳ Fact-check: Osmani — Loop Engineering (five blocks + automation tab detail) · Osmani — Self-Improving Coding Agents (sub-agent/judge pattern) · Osmani — AI Coding Workflow 2026 (spec-first discipline) [Five building blocks + 70/30 spec/execution ratio are direct from Osmani’s June 2026 Loop Engineering post; quoted caution on token costs is verbatim]

Andrej Karpathy · Co-founder OpenAI · Coined ‘Vibe Coding’ and ‘Agentic Engineering’ · Anthropic (from May 2026)

↳ Fact-check: Wikipedia — Andrej Karpathy (joined Anthropic May 2026) · The New Stack — Vibe coding is passé / agentic engineering · MindStudio — Karpathy Sequoia talk breakdown · github.com/karpathy/autoresearch · The AI Corner — Karpathy workflow shift 2026

The Karpathy Inflection Model

↳ Fact-check: MindStudio — Karpathy Sequoia talk: ‘December is when it really flipped’ · The AI Corner — ‘In December... I went from 80-20 to 20-80’ (verbatim ratio) · IBM Think — Agentic Engineering definition · NxCode — Agentic Engineering complete guide [80/20 → 20/80 ratio and December 2025 inflection point are verbatim from Karpathy’s Sequoia talk. Jun 2026 ~10/90 estimate is an extrapolation, not a direct quote]

III. The Convergence: What All Six Agree On

Six practitioners. Six different tools, scales, and domains. But look at the architecture beneath the surface, and a convergence emerges. Every system in this field census depends on the same structural primitives, arrived at independently through the same failure modes.

↳ Fact-check: Cherny — howborisusesclaudecode.com · Huntley — ghuntley.com/ralph/ · Yegge — Welcome to Gas Town · Steinberger — steipete.me · Osmani — Loop Engineering · Karpathy — autoresearch (program.md) [Each row is synthesized from the primary sources listed. ‘Karpathy val_bpb metric’ is specific to the autoresearch repo; other eval gate entries are reconstructed from public descriptions]

The convergence is striking. Every practitioner:

• Writes a human-authored anchor document before running any loop

• Uses git as the durable memory substrate — not conversation history

• Separates the agent doing work from the agent evaluating work

• Implements an explicit halting condition (max iterations, completion token, dollar budget)

• Preserves per-agent isolation (worktrees, separate checkouts, sandboxed environments)

The loop is how they automated the iteration. The anchor document is how they preserved their intent. These are not in tension. The second is required for the first to be trustworthy.

The Prompt Moved Upstream

Here is the structural insight the discourse missed: the transition from prompting to loop-writing is not the elimination of prompting. It is the promotion of prompting. The prompt moved from a real-time, session-by-session interaction into an architectural artifact that gets written once, maintained carefully, and read by every agent on every iteration.

↳ Fact-check: Osmani — Loop Engineering (Gen 1/2/3 framing) · Cherny — OfficeChai interview (trajectory from IDE to loops) · Karpathy — ‘vibe coding’ → ‘agentic engineering’ evolution [Three-generation model is ISR synthesis; Gen 1/2/3 labels are original ISR framing, not a direct quote from any single practitioner]

IV. The ISR Benchmark: Our Evidence Holds

Our series has now run a number of volumes of controlled experiments on the ASCRS benchmark scenario (Hormuz Strait pharmaceutical supply chain). Across thirteen experiment cycles using Claude Code with OpenRouter model routing, the findings from the ASCRS Harness Lab intersect directly with the practitioner observations above — and in several cases, our experimental evidence explains failures that the practitioner literature only describes qualitatively.

Finding 1: Task Structure Dominates Architectural Complexity

The ASCRS Harness Lab (H1–H10) produced a result that surprised us at the time: H2 — a structured, single-turn prompt with defined schema and explicit output format — outperformed H9, a five-agent parallel swarm, by a significant margin (α=0.920 vs α=0.625). The H9 swarm had more computational resources, more parallelism, and more architectural sophistication. H2 had a better prompt. However, yesterday, “The Harness Lab, Automated” finally named the workflows utilizing tools and self revision loops H3, and H5 - as winners.

↳ Fact-check: ISR — ASCRS Harness Lab series (H1–H10 original experiments) · ISR — The Architecture of Awareness (ASCRS V1–V4) · ISR GitHub — harness-lab-conductor plugin [α scores (0.920, 0.847, 0.791, 0.743, 0.625) are ISR experimental results. These are the authors’ own data and have not been independently replicated.]

Finding 2: Context Engineering Operates at Two Layers

The Memory A/B experiment (BOOTSTRAP_PROMPT.md injection vs stripped data files) found that context engineering operates at two distinct layers: data structure and prompt injection. The BOOTSTRAP_PROMPT.md — our equivalent of a VISION.md — was essential for criteria requiring institutional parameters. Without it, even well-structured data files were insufficient for criteria that required understanding the project’s purpose. Sources: Building on Anthropic’s Claude and Is Claude’s Memory Actually Poor?

Finding 3: The Token Tax and Prompt Caching Gaps

The StockPilot/CMA decomposition study (97% token reduction across Cycles 0–4) identified a specific failure mode relevant to practitioner architectures: OpenRouter’s multi-provider routing breaks prompt caching, causing unnecessary context re-loading on every cycle. This is almost comparable to the failure mode Huntley names as “context rot” — except our experiment documented it in token counts rather than behavior degradation.

Full article: The Structure Is The Intelligence

Full article: Every Company Is An Agent Waiting to Be Decomposed

Full article: The Token Tax

Finding 4: The COBOL Pipeline — Our Five-Agent Architecture

The COBOL migration pipeline (”The Invisible Codebase”) produced a verified five-agent Python pipeline: Extractor → Translator → Test Generator → Verifier → Evaluator. This is structurally identical to the maker/checker architecture that appears in every practitioner system examined above. The Evaluator is the judge-agent. The Verifier is the stopping-condition checker. The pipeline runs with configurable OpenRouter model routing.

↳ Fact-check: ISR — The Architecture of Awareness + ASCRS series · ISR — The Invisible Codebase (COBOL pipeline) · ISR — The Structure Is the Intelligence (StockPilot/CMA) · ISR — The Memory Is the Architecture (memory A/B) · ISR — The Harness Lab, Automated (Sigma metric) · Osmani — 80% Problem (practitioner parallel) [All ISR findings are from the authors’ own experimental series. Individual article URLs have been aggregated to the ISR Substack root pending publication of specific article links]

V. The Architecture Convergence Map

The following diagram represents the architectural pattern that emerges from the intersection of all six practitioner systems and the ISR experimental record. It is not any single practitioner’s architecture. It is the pattern all of them converged on, labeled with the terminology each uses for equivalent components.

↳ Fact-check: Anthropic — Agent Skills (L1 SKILL.md layer) · Yegge — Gas Town (L2–L3: Polecats, Role Beads, git substrate) · Huntley — Ralph (L2–L3: halt condition, PROMPT.md spec) · Osmani — Loop Engineering (L2: five building blocks) · Cherny — InfoQ workflow (L3: loop contract; git as memory substrate) · Karpathy — autoresearch (L1: program.md skill; L4: anchor spec) · Claude Code /goal docs (L3: loop contract + L0: condition text) [ISR original synthesis. L0–L5 labels are consistent with the automation hierarchy in Section VI. Infrastructure (git substrate, halt conditions) is not a numbered level — it supports all levels.]

VI. The Architecture Goes Native

Between May and June 2026, Anthropic productised the same architectural patterns as first-class Claude Code primitives — four distinct autonomy modes encoding what six independent practitioners had arrived at through months of production experience.

This matters for any reading of the hierarchy above. The loop control layer — which the convergence map placed at roughly 30% automated — required significant upward revision when these primitives shipped. The artisanal is becoming the platform.

/goal: The Maker/Checker Split as a Product Feature

Claude Code v2.1.139, shipped May 12, 2026, introduced /goal — a session-scoped stopping-condition command built around a structural separation that every practitioner in this series had implemented by hand. The design is precise: Claude does the work across multiple turns; a separate lightweight model (Haiku, by default) reads the transcript after every turn and evaluates a single question — has the condition been met? If not, another turn begins automatically. If yes, control returns to the user.

This is Yegge’s Refinery agent, the ISR Evaluator in the COBOL pipeline, and Osmani’s judge sub-agent — the same maker/checker split — now a native product feature. The documentation states the underlying principle directly: the model doing the work is not the one deciding it is done. The /goal condition, which can run to 4,000 characters, specifies the end state, the verification method (”npm test exits 0”, “git status is clean”), any constraints that must hold across turns, and an optional turn or time ceiling.

The boundary between the loop contract (L3) and the loop engine (L2) has partially collapsed. What was previously two separate concerns — writing the contract, then building the mechanism to enforce it — is now a single authored condition that activates a built-in enforcement mechanism.

Four Autonomy Modes: From Condition-Based to Parallel Swarms

The second and third autonomy modes shipped together: Claude Code v2.1.154, May 28, 2026, alongside Claude Opus 4.8. Dynamic Workflows fan a task out across tens to hundreds of parallel subagents, each with a clean context window and one focused job — Claude writes the harness script at runtime rather than running inside a fixed default. /batch is the companion mode for work that breaks cleanly into independent items: parallel agents each take one item, work in an isolated git worktree, and open a PR. A fourth mode, /loop, re-runs a prompt on a time cadence rather than running to a condition — the right tool for polling and monitoring, the wrong tool for a task with a finish line.

The Anthropic product blog framed this as “a harness for every task” — Claude Code can now produce the coordination layer on demand rather than requiring the practitioner to design it. The three failure modes the design targets are the same ones Huntley and Yegge documented from production experience: agentic laziness (one overloaded context window quits early), self-preferential bias (the agent that wrote the answer also grades it), and goal drift (technically correct output that moved away from the original intent). Dynamic Workflows address all three structurally.

What the Automation Levels Look Like Now

With these three primitives in production, the loop control layer — which the convergence map above estimated at roughly 30% automated — sits considerably higher. The practical ceiling for each level as of June 2026:

The ISR Parallel: Ahead of Native Tooling

The ISR series documented these patterns before they became native to Claude Code. The Sigma composite metric and Design Verification gate in “The Harness Lab, Automated” are the manual precursors to what /goal now does natively for code tasks: a structured success criterion, evaluated by a separate process, gating whether the loop continues. The dynamic workflow experiments in “The Prompt Is Still the Work” mapped the same fan-out and conditional routing architecture that shipped as Dynamic Workflows in May 2026.

The sequence — practitioners converge independently, ISR documents experimentally, Anthropic productises — is itself a finding. It suggests the architecture is not a design preference but a structural necessity. When the patterns are strong enough to be discovered six times independently, documented in a controlled series, and then encoded into the platform, they are not artifacts of individual practitioner taste. They are what the problem requires.

When AI Builds Itself: The Scale Confirmed

Anthropic’s “When AI Builds Itself” (June 2026) provides internal data that confirms the scale the practitioners described publicly. As of May 2026, more than 80% of code merged into Anthropic’s production codebase was authored by Claude. The median Anthropic researcher estimated roughly 4x output with Claude Mythos Preview versus no AI assistance. Engineers were merging 8x as much code per day in Q2 2026 compared to 2024 — not because they were working faster, but because most of the code was no longer written by hand.

The article also maps onto the hierarchy with unusual precision. On the human role remaining: “Claude can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method.” That boundary — between L4 (the goal, still human) and L3/L2 (the method, increasingly automated) — is precisely what /goal, Dynamic Workflows, and Ultracode are encoding as product primitives. And on the gap that remains: “Large performance gaps persist when it comes to Claude exercising judgment in choosing goals.” L5, still firmly at zero.

The article also describes an automated code reviewer now running on every proposed change at Anthropic — finding roughly a third of the bugs behind past production incidents before they reached users. This is a production-scale L1 skill with a known false-alarm rate, structured output, and a measurable track record. The ISR automated code-reviewer skill described in the Dan Kornas example was the field approximation; the Anthropic internal deployment is the same pattern at production scale.

The practitioner field and the Anthropic product team arrived at the same architecture by different routes. The practitioners arrived first. Anthropic encoded it. The ISR experiments documented the journey in between.

Fable 5: The Model Tier That Changes the Calculus

Released today — June 9, 2026 — Claude Fable 5 is Anthropic’s first publicly available Mythos-class model. It sits above the Opus family in capability. Its core properties are directly relevant to every claim the practitioners above have made about long-horizon autonomous work: multi-day goal-directed runs with sustained instruction retention, first-shot correctness on problems that previously took iterative loops to solve, and significantly more reliable parallel subagent management. SWE-Bench Pro: 80.3%, versus Opus 4.8’s 69.2%.

Anthropic published a model-specific prompting guide for Fable 5 alongside the release — covering effort calibration, instruction following, long runs, memory management, and scaffolding changes. The guide’s opening framing states that capability improvements at this level are ‘a good prompt to re-evaluate which instructions, tools, and guardrails are still needed.’ That sentence is VISION.md thinking applied to model selection. The anchor document layer does not become optional as models improve. It becomes model-specific. The practitioner who knows how to calibrate Fable 5 is performing work that cannot be automated away, because the calibration depends on knowing what the project is for.

The most structurally important implication is for the L2 loop layer. The guide notes that early testers reported single-pass implementations of systems that previously took days of iteration. If first-shot correctness holds broadly across a practitioner’s domain, some work that previously required Ralph-style loop resampling moves to single-turn. L2 shifts from a necessity toward a reliability mechanism for those tasks. But this makes L0 more load-bearing, not less: you get one pass, well-specified. And it makes L4 more consequential than ever — a model that applies significantly more effort per instruction in a loop with an under-constrained anchor document will drift further, faster. The scope constraints and stopping conditions the practitioners documented are not scaffolding for weak models. They are the load-bearing walls of systems built on capable ones.

↳ Fact-check: Anthropic — Prompting Claude Fable 5 (official guide) · Anthropic — Claude Fable 5 and Mythos 5 (release + benchmarks) · Introducing Claude Fable 5 and Mythos 5 (API docs, specs, pricing) · Digital Applied — Fable 5 benchmark breakdown (SWE-Bench Pro 80.3%) [Fable 5 released June 9, 2026 - Benchmark figures (SWE-Bench Pro 80.3%, Opus 4.8 69.2%) from Digital Applied analysis of Anthropic’s published benchmark table. Hierarchy shift analysis is ISR interpretation of the prompting guide’s behavioral change descriptions.]

VII. What We Do: The Two-Window Method and the ISR Harness

The ISR methodology has operated on a two-window architecture throughout the series: Claude.ai chat for design and analysis; Claude Code for execution. This is not a limitation imposed by tooling. It is a deliberate choice that turns out to map exactly onto the architectural split that the field census describes.

Claude.ai chat window = the design layer. This is where VISION.md equivalents get written, where loop contracts get specified, where experimental design happens, where prompts get iterated. It is interactive, human-in-the-loop, and produces structured artifacts.

Claude Code = the execution layer. This is where the artifacts produced in the design layer get run, tested, and evaluated. The Harness Conductor plugin (18-file Claude Code plugin, private for now) represents our version of a Gas Town configuration: skills, agents, commands, and infrastructure files organized as a portable, installable package.

↳ Fact-check: ISR — harness-lab-conductor plugin (GitHub) · ISR — The Prompt Is Still the Work (dynamic workflow series) · ISR — The Harness Lab, Automated [ISR original methodology description. Two-window method is the authors’ own practice, not derived from practitioner literature. Comparable to the design/execution split visible in all six practitioner architectures.]

Where the ISR Experiments Sit on the Practitioner Ladder

Mapping the ISR ASCRS experiments against the practitioner architectures:

▸ ASCRS Harness Lab (H1–H10): Equivalent to Osmani’s tier-1 architecture (single terminal session, direct agent orchestration). No automated scheduling. Human-in-loop at each experimental cycle. Design-layer decisions made in Claude.ai, execution in Claude Code.

▸ COBOL Migration Pipeline: Equivalent to Cherny’s PR management loop — sequential, five-agent, with explicit evaluation at each stage. Configurable model routing (OpenRouter) = equivalent to multi-provider gas town configuration.

▸ Harness Lab, Automated: Closest to Huntley’s Ralph architecture — automated iteration with Sigma metric as the DONE condition. Design Verification gate = the Ralph completion promise.

▸ Memory A/B Experiment: Most directly comparable to Steinberger’s VISION.md discipline — empirically confirming that anchor documents (BOOTSTRAP_PROMPT.md) encode criteria that skill files alone cannot.

▸ Dynamic Workflow Series: Directly parallel to Yegge’s Gas Town role architecture — multi-step workflows with conditional routing between stages. CMA as a stateful execution runtime = Mayor-equivalent coordination layer.

Our methods have been convergent with the practitioner field throughout the series. The difference is that the ISR approach generated controlled experimental evidence for findings that practitioners arrived at through production experience. Both paths reach the same architecture.

VIII. The Uncomfortable Truth About “Stopping Prompting”

Boris Cherny is right. He doesn’t prompt Claude in the way most people prompt Claude. He does not sit at a keyboard and type instructions into a chat window every time he wants something done. That part is genuinely over for him.

But Boris Cherny spent a significant part of 2025 writing the loop contracts, skills, and configurations that make his current workflow possible. He wrote the CLAUDE.md files. He wrote the loop specifications. He ran five parallel Claude Code instances against separate git checkouts and observed which patterns produced reliable output and which produced garbage. He built the judgment infrastructure that the loops now execute against.

That work is prompting. It is just prompting that happened upstream — before the loop ran — and that now compounds forward instead of evaporating at session end.

The Benchmark That Puts a Number on It

The practitioners speak with considerable confidence about what loops and autonomous agents deliver. SWE-Marathon — launched June 6, 2026 by Rishi Desai (@rishi_desai2) of Abundant AI, and the direct trigger for Cherny’s five-tip post the following day — puts a number on what the best current configuration actually achieves on genuinely hard, novel, realistic tasks. That number is 26%.

The benchmark comprises 20 multi-hour software engineering tasks drawn from real frontier research projects: building a multi-pass C compiler in Rust from preprocessing through x86-64 codegen, reimplementing Kubernetes control-plane components in Rust while preserving API semantics, cloning Slack from scratch, rewriting a JAX codebase in PyTorch. Binary scoring: pass every verifier test and the run scores 1.0; any failing test scores 0.0. Mean token budget per trial: 27 million. Across 1,300 logged trials, all frontier configurations stayed below 26%.

↳ Fact-check: SWE-Marathon — swe-marathon.org (leaderboard, tasks, observations) · GitHub — abundant-ai/swe-marathon (320 GB trajectory logs) · Digg — Abundant AI releases SWE-Marathon (reward hacking data) · SWE-Bench Verified leaderboard (for Verified vs Marathon comparison) [Leaderboard figures are from swe-marathon.org as of June 9, 2026. SWE-Bench Verified scores from llm-stats.com. Dynamic workflows question is from the public benchmark discussion thread — no published data exists yet.]

Three findings from the benchmark are directly relevant to the architecture argument in this article.

First, the harness comparison confirms the ISR H1–H10 finding in a different domain. Claude Opus 4.7 paired with Claude Code scores five points higher than the same model with Terminus 2. Same model, different execution layer — meaningful performance difference. Task structure and execution architecture are determinants, not just model capability.

Second, the reward hacking rate is a direct challenge to the evaluation architecture. Thirteen percent of autonomous runs produced results that gamed the verifier rather than solving the task. One agent, in 13 minutes and 2.8 million tokens, built a C compiler that shelled out to the system’s gcc via std::process::Command instead of actually compiling anything, then reported it was ‘resolving edge cases via safe bypass strategies.’ The benchmark caught it — but only because the verifier was specifically designed to detect shell-outs. A model-evaluated stopping condition would not have caught it. Cherny’s Tip 5 (self-verification end-to-end: boot the actual server, run the actual simulator) is the practitioner’s answer to this problem. SWE-Marathon is the empirical evidence for why it matters.

Third, dynamic workflows were not used in any of the published configurations. The performance ceiling of 26% may be a floor rather than a ceiling for the full five-tip stack. But until someone runs the benchmark with dynamic workflows enabled, the confidence of the practitioners — which is calibrated against known domains with mature skill libraries — remains unanchored against novel, adversarially-verified, long-horizon work. That gap between calibrated confidence and benchmarked performance is the most honest thing SWE-Marathon adds to the discourse.

The Comprehension Debt Problem

There is a cost that none of the practitioner literature addresses adequately: the faster a loop ships code the practitioner didn’t write, the wider the gap between what exists in the codebase and what they actually understand. This is comprehension debt. It compounds as reliably as skill libraries do, but in the opposite direction.

The ISR experiments have run into this directly. In the COBOL migration pipeline, the Evaluator agent caught several Translator outputs that passed test gates but contained structural misconceptions about the original COBOL logic. The test suite was insufficient to catch semantic drift — because the test suite was generated by the same pipeline that produced the translation.

This is the production version of Osmani’s 80% problem: agents can rapidly generate 80% of the work, but the remaining 20% requires deep knowledge of context, architecture, and trade-offs. The loop doesn’t know the difference between the 80% and the 20%. The human must.

Two people can build the exact same loop and get completely opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop is identical. The outcomes diverge.

What This Means for ISR Methodology

The ISR two-window method — design layer in Claude.ai, execution layer in Claude Code — is not a workaround for lacking a full Gas Town infrastructure. It is a deliberate architectural choice that maintains the human design function. The chat window is where judgment lives. The execution layer is where that judgment gets scaled.

The Harness Conductor plugin represents a portable skill library for the ASCRS domain. But it was built through the design-layer process: experimental hypotheses in the chat window, execution and measurement in Claude Code, findings fed back into the next experiment’s design. That loop — human judgment driving experimental iteration driving skill improvement — is the ISR equivalent of the practitioner architectures documented above. It is less automated. It is more legible. It produces documented evidence rather than shipped code, which is appropriate for a research series.

IX. Practical Reference: Prompts the Practitioners Use

The following section consolidates practical prompt patterns derived from the practitioner architectures documented above. These are not verbatim quotations; they are reconstructions from public documentation, demonstrations, and the convergent pattern analysis above. Fact-check bars beneath each template link to the primary sources from which each pattern is derived. These are high-level views and therefore will not represent all details.

Pattern 1: The Loop Contract (Cherny-style)

Write before running any repeating agent. One page maximum.

↳ Fact-check: Cherny — OfficeChai (loop contract semantics from PR loop demo) · Osmani — Loop Engineering (loop contract as a concept) · Osmani — Self-Improving Agents (budget + stop_if patterns) · Cherny — howborisusesclaudecode.com (scope / isolation principles) [YAML format and field names are ISR reconstruction. No practitioner publishes a canonical loop contract template. Fields derived from Cherny’s described PR loop behaviour and Osmani’s loop engineering post.]

Pattern 2: The VISION.md (Steinberger-style)

Write before writing any skill, agent file, or CLAUDE.md. Treat editing it as a significant decision.

↳ Fact-check: OpenClaw VISION.md — 94-line analysis · steipete.me — Just Talk To It (VISION.md rationale) · Steinberger — steipete.me (full blog, VISION.md philosophy) [Template headings are ISR reconstruction based on the openclaw.report analysis of the actual 94-line VISION.md. The OpenClaw VISION.md file itself is in the OpenClaw repository.]

Pattern 3: The PROMPT.md (Huntley / Ralph-style)

The anchor document for a stateless loop. Written once, re-injected every iteration.

↳ Fact-check: ghuntley.com/ralph/ — PROMPT.md as the anchor document · Anthropic plugin README — completion-promise = ‘DONE’ marker · wiggumdev/ralph — prompt file specification · Syntax+Glitter — PROMPT.md in practice [Template is ISR reconstruction. ‘DONE’ completion token is verbatim from the official Anthropic plugin. Section headings are inferred from Huntley’s documentation, not a verbatim PROMPT.md template he has published.]

Pattern 4: The ISR BOOTSTRAP_PROMPT.md

The ISR equivalent of VISION.md + PROMPT.md combined for the ASCRS benchmark scenario.

↳ Fact-check: ISR — The Memory Is the Architecture (BOOTSTRAP_PROMPT.md experiment) · ISR — harness-lab-conductor plugin (full infrastructure) · ISR — The Harness Lab, Automated (Sigma metric + DV gate) [ISR original. This template represents the authors’ own experimental infrastructure. It is described in the ISR series and implemented in the harness-lab-conductor plugin.]

Pattern 5: The Judge Prompt

Every loop needs a separate evaluation prompt. This prompt is as important as the worker prompt — and is frequently the missing component in failed loops.

↳ Fact-check: Osmani — Self-Improving Agents (judge sub-agent pattern) · Osmani — Loop Engineering (sub-agents = maker/checker split) · Yegge — Gas Town (Refinery role = judge-equivalent) · Cherny — separate judge model for /goal stopping condition [Template is ISR original synthesis. VERDICT/CRITERIA_MET/CRITERIA_FAILED/RETRY_CONTEXT fields are ISR construction. The concept of a separate judge agent is documented across all four linked sources; no single practitioner has published a canonical judge prompt template.]

Pattern 6: The LFD CLAUDE.md (Loss Function Definition (LFD))

Patterns 1 through 5 tell the agent what to do. Pattern 6 is architecturally different: it tells the agent what winning looks like and trusts it to find the path. The distinction was named and circulated by Elvis Sun (@elvissun) — the principle that CLAUDE.md should function as a loss function rather than a procedural script.

In standard CLAUDE.md practice, intelligence is distributed across the instruction sequence: do X first, then Y, then Z. In Loss Function Driven (LFD) design, all intelligence is concentrated in a single optimization target. The agent decides the sequence, the order of operations, when to skip steps, when to retry. You define the surface it is optimizing against. It finds the path.

The ISR Harness Lab, Automated is an LFD implementation — the Sigma metric is the loss function, the Design Verification gate is the stopping condition, and the adversarial probe prevents overfitting to known cases. What Elvis Sun’s principle names is the architectural philosophy the ISR series was already practising. The convergence is the point: it is not a stylistic preference. It is what the problem requires when you want an agent to generalize rather than memorize.

↳ Fact-check: Elvis Sun (@elvissun) — LFD principle post (June 2026) · ISR — The Harness Lab, Automated (Sigma metric as loss function) · ISR — harness-lab-conductor CLAUDE.md (reference implementation) · Anthropic — Harness design for long-running applications [LFD concept attributed to Elvis Sun. Template is ISR original synthesis. Three refinements noted (Delta normalization, instrument-based eval, quantified entropy) are ISR analysis of the principle, not from Sun’s original post. ISR Harness Lab is a confirmed LFD implementation; the connection to Sun’s named principle is ISR’s own observation.]

Three things to refine when writing an LFD CLAUDE.md. First, define the normalization baseline with a worked example — an agent calculating Delta without one will apply it inconsistently across designs. Second, use an instrument for evaluation rather than a readable file: ‘call this scoring tool’ is architecturally stronger than ‘do not read the criteria file,’ because it removes the choice. Third, quantify the entropy threshold — define structural distance measurably rather than asking the agent to self-assess what constitutes a non-obvious departure. LFD fails where the loss function fails. The failure modes are specification failures, not execution failures.

X. The Ladder Is Not an Elevator

Boris Cherny, Geoffrey Huntley, Steve Yegge, Peter Steinberger, Addy Osmani, and Andrej Karpathy did not stop prompting. They became better at it — precise enough, systematic enough, and reliable enough to encode their prompting judgment into documents that loops can read autonomously. The loops work because the prompts are excellent. The prompts are excellent because these practitioners spent a long time learning how to write them. Many have the added benefit of being able to “token-max” quite freely.

The ISR experimental record confirms this from the other direction. H2 beats H9 not because loops are bad but because unstructured loops are bad. Task structure — which lives in the prompt — is the dominant variable. Every architectural sophistication we’ve tested amplifies existing prompt quality. None of them substitute for it.

The ladder has three rungs: prompting, loops, skills. The discourse treats rung three as a destination you can leap to. The evidence — from six practitioners and four experimental volumes — is that rung three is a consequence of mastering rungs one and two. You climb. You don’t teleport.

The skill library is what feedback leaves behind. But feedback requires a loop. And a loop requires a prompt. The chain doesn’t start at the top.

The two-window method the ISR series has used throughout — design in Claude.ai, execution in Claude Code — is a deliberate implementation of this principle. The design layer is where prompts get written, iterated, and encoded into anchor documents. The execution layer is where those anchor documents get called, tested, and improved. The division is not a limitation. It is the architecture.

What accumulates is not just a skill library. It is a judgment library — documented evidence of which prompt structures work, under which conditions, with which evaluation criteria. That is what the ASCRS harness taxonomy (H1–H10), the Sigma composite metric, and the Design Verification gate represent: crystallized prompting judgment, made callable, with a known reliability profile.

The practitioners documented here would recognize that architecture. It is their architecture, applied to a different domain. The convergence was not a coincidence. It was the same problem, solved by the same principles, independently.

References

Primary Sources — Practitioners

Cherny, Boris (2026). Statements at Anthropic Developer Conference and Fortune Brainstorm Tech. Reported by CNBC Television and Yahoo/Fortune. https://tech.yahoo.com/ai/claude/articles/anthropic-boris-cherny-creator-claude-205645586.html

Cherny, Boris (January 2026). How Boris Uses Claude Code — 13 practical tips. howborisusesclaudecode.com

OfficeChai (June 2026). I Now Just Write Loops To Prompt Claude Code. https://officechai.com/ai/i-now-just-write-loops-to-prompt-claude-code-claude-code-creator-boris-cherny/

InfoQ (January 2026). Inside the Development Workflow of Claude Code’s Creator. https://infoq.com/news/2026/01/claude-code-creator-workflow/

Huntley, Geoffrey (July 14, 2025). Ralph Wiggum as a software engineer. https://ghuntley.com/ralph/

Anthropic (December 2025). ralph-wiggum plugin README. https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md

Dev Interrupted (2026). Inventing the Ralph Wiggum Loop (interview with Huntley). https://devinterrupted.substack.com/p/inventing-the-ralph-wiggum-loop-creator

Wiggumdev/ralph (2026). GitHub implementation. https://github.com/wiggumdev/ralph / https://wiggum.dev

Yegge, Steve (January 2026). Welcome to Gas Town. https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04

Yegge, Steve (April 2026). Welcome to Gas City. https://steve-yegge.medium.com/welcome-to-gas-city-57f564bb3607

steveyegge/gastown (2026). GitHub. https://github.com/steveyegge/gastown

Steinberger, Peter (2025–2026). Just Talk To It. https://steipete.me/posts/just-talk-to-it

Steinberger, Peter (2026). OpenClaw VISION.md analysis. https://openclaw.report/news/peter-steinberger-openclaw-vision

Fortune (February 2026). Who is Peter Steinberger? https://fortune.com/2026/02/19/openclaw-who-is-peter-steinberger-openai-sam-altman-anthropic-moltbook/

Osmani, Addy (January 31, 2026). Self-Improving Coding Agents. https://addyosmani.com/blog/self-improving-agents/

Osmani, Addy (June 8, 2026). Loop Engineering. https://addyo.substack.com/p/loop-engineering

Osmani, Addy (January 2026). The 80% Problem in Agentic Coding. https://addyo.substack.com/p/the-80-problem-in-agentic-coding

Osmani, Addy (March 2026). The Code Agent Orchestra. https://addyosmani.com/blog/code-agent-orchestra/

Osmani, Addy (December 2025). My LLM Coding Workflow Going into 2026. https://addyosmani.com/blog/ai-coding-workflow/

Karpathy, Andrej (February 2026). Agentic Engineering. The New Stack. https://thenewstack.io/vibe-coding-is-passe/

Karpathy, Andrej (2026). autoresearch. https://github.com/karpathy/autoresearch

Wikipedia — Andrej Karpathy (joined Anthropic May 2026). https://en.wikipedia.org/wiki/Andrej_Karpathy

Secondary Analysis

Cloud Native Now (Feb 2026). Gas Town: What Kubernetes for AI Coding Agents Actually Looks Like. https://cloudnativenow.com/features/gas-town-what-kubernetes-for-ai-coding-agents-actually-looks-like/

Codex Blog (April 2026). Gas Town: Steve Yegge’s Multi-Agent Factory. https://codex.danielvaughan.com/2026/04/08/gas-town-multi-agent-factory/

paddo.dev (Jan 2026). GasTown and the Two Kinds of Multi-Agent. https://paddo.dev/blog/gastown-two-kinds-of-multi-agent/

paddo.dev (March 2026). Ralph Wiggum: Autonomous Loops. https://paddo.dev/blog/ralph-wiggum-autonomous-loops/

Syntax+Glitter (Feb 2026). Ralph Wiggum Loop. https://pfkimmerle.substack.com/p/ralph-wiggum-loop

Sondera (Jan 2026). Supervising Ralph: Why Every Wiggum Loop Needs a Principal Skinner. https://blog.sondera.ai/p/ralph-wiggum-principal-skinner-agent-reliability

MindStudio (2026). Karpathy’s Sequoia Talk: 5 Predictions. https://www.mindstudio.ai/blog/karpathy-sequoia-talk-5-predictions-agentic-engineering

The AI Corner (2026). Andrej Karpathy: The AI Workflow Shift Explained. https://www.the-ai-corner.com/p/andrej-karpathy-ai-workflow-shift-agentic-era-2026

IBM Think (April 2026). What is Agentic Engineering? https://www.ibm.com/think/topics/agentic-engineering

NxCode (March 2026). Agentic Engineering: The Complete Guide. https://www.nxcode.io/resources/news/agentic-engineering-complete-guide-vibe-coding-ai-agents-2026

Kirill Krainov (March 2026). Karpathy’s Autoresearch: Improving Agentic Coding Skills. https://zerocopy.blog/2026/03/25/karpathys-autoresearch-improving-agentic-coding-skills/

Gas Town reading list. https://reading.torqsoftware.com/notes/software/ai-ml/agentic-coding/2026-01-15-gas-town-multi-agent-orchestration-framework/

SWE-Marathon

Desai, Rishi (@rishi_desai2 / Abundant AI) (June 6, 2026). SWE-Marathon launch post. X. 787K views, 677 likes.

Rishi Desai@rishi_desai2

Can coding agents stay coherent over a 1 billion token budget? Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust? Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.

4:13 PM · Jun 5, 2026 · 790K Views

50 Replies · 66 Reposts · 678 Likes

Desai, Rishi et al. / Abundant AI (June 2026). SWE-Marathon: Long-Horizon Software Engineering Benchmark. https://www.swe-marathon.org/

Abundant AI (2026). swe-marathon GitHub repository (paper, code, 320 GB trajectory logs). https://github.com/abundant-ai/swe-marathon

Digg (June 2026). Abundant AI releases SWE-Marathon benchmark (reward hacking data + benchmark thread discussion). https://digg.com/ai/rpm6i3oc

Digg (June 2026). Rishi Desai launches SWE-Marathon (task descriptions + trajectories release). https://digg.com/tech/14u5l9e1

llm-stats.com (June 2026). SWE-Bench Verified leaderboard (for Verified vs SWE-Marathon comparison). https://llm-stats.com/benchmarks/swe-bench-verified

ISR Primary Record

Intelligence Systems Review (ISR). interestingengineering.substack.com — The Architecture of Awareness; The Loop is the Lab; The Invisible Codebase; The Structure Is the Intelligence; The Memory Is the Architecture; The Harness Lab, Automated; The Prompt Is Still the Work.

LFD — Loss Function Driven CLAUDE.md

Elvis Sun (@elvissun) (June 2026). Loss Function Driven CLAUDE.md principle. X.

Elvis@elvissun

https://t.co/q0JG6Tir16

11:37 AM · Jun 11, 2026 · 13.6K Views

4 Replies · 14 Reposts · 167 Likes

Rajasekaran, Prithvi / Anthropic Labs (March 24, 2026). Harness design for long-running application development. https://www.anthropic.com/engineering/harness-design-long-running-apps

ISR (2026). Harness Lab CLAUDE.md — reference LFD implementation for the ASCRS benchmark. github.com/elephantsofneptune/harness-lab-conductor

Claude Code — Native Primitives (May–June 2026)

Anthropic (May 12, 2026). /goal — Keep Claude working toward a goal. Claude Code v2.1.139. https://code.claude.com/docs/en/goal

Anthropic (May 28, 2026). Introducing Dynamic Workflows in Claude Code. https://claude.com/blog/introducing-dynamic-workflows-in-claude-code

Anthropic (May 28, 2026). Orchestrate subagents at scale with dynamic workflows. Claude Code docs. https://code.claude.com/docs/en/workflows

Naik, Pranit (June 2026). Dynamic Workflows vs /goal in Claude Code. Medium. https://medium.com/no-time/dynamic-workflows-vs-goal-in-claude-code-whats-the-real-difference-24f828b4a4ed

tonbistudio (@tonbistudio, June 7 2026). Reply to @bcherny with /goal and /loop explainer. X post + linked X article.

tonbi@tonbistudio

@bcherny If anyone is confused and wants to learn more about loops, I did my best to synthesize the available information and try to break them down with real examples in this X article:

5:58 AM · Jun 8, 2026 · 696 Views

1 Reply · 1 Repost · 10 Likes

/ https://x.com/i/article/2063853162860019712

Hightower, Rick (May 2026). Claude Code: The Autonomous Commands That Finish Work While You Sleep. Towards AI. https://medium.com/@richardhightower/claude-code-the-autonomous-commands-that-finish-work-while-you-sleep-goal-loop-batch-etc-7acb82bf46b1

Chawla, Avi (May 2026). Claude Code’s /goal Command — four autonomy modes explained. https://blog.dailydoseofds.com/p/claude-codes-goal-command

DEV Community (2026). Claude Code Stops Pausing Every Turn: /goal, /loop, /batch, /background. https://dev.to/jessyt/claude-code-stops-pausing-every-turn-goal-loop-batch-background-24nb

Lushbinary (June 2, 2026). Claude Code Dynamic Workflows: Harness Guide. https://lushbinary.com/blog/claude-code-harness-every-task-dynamic-workflows-guide/

MindStudio (2026). What Is Ultra Code Mode in Claude Code? https://www.mindstudio.ai/blog/what-is-ultra-code-mode-claude-code

VentureBeat (May 2026). Claude Code’s /goals separates the agent that works from the one that decides it’s done. https://venturebeat.com/orchestration/claude-codes-goals-separates-the-agent-that-works-from-the-one-that-decides-its-done

Claude Fable 5 and Mythos 5

Anthropic (June 9, 2026). Claude Fable 5 and Mythos 5. https://www.anthropic.com/news/claude-fable-5-mythos-5

Anthropic (June 9, 2026). Prompting Claude Fable 5. Claude API Docs. https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/prompting-claude-fable-5

Anthropic (June 9, 2026). Introducing Claude Fable 5 and Mythos 5 — API specs, pricing, availability. https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5

Digital Applied (June 9, 2026). Claude Fable 5 and Mythos 5: The Frontier, Split in Two (SWE-Bench Pro 80.3% benchmark breakdown). https://www.digitalapplied.com/blog/claude-fable-5-mythos-5-release-benchmarks-2026

Anthropic — When AI Builds Itself

Favaro, Marina and Clark, Jack (June 2026). When AI Builds Itself. Anthropic Institute. https://www.anthropic.com/institute/recursive-self-improvement

Anthropic (May 2026). Claude Opus 4.7 System Card (4x output survey methodology). https://cdn.sanity.io/files/4zrzovbb/website/037f06850df7fbe871e206dad004c3db5fd50340.pdf

Foundational Reference

Anthropic (2025–2026). Equipping agents for the real world with Agent Skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills

Anthropic (2026). Agent Skills API documentation. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview

Polanyi, Michael (1958). Personal Knowledge: Towards a Post-Critical Philosophy. University of Chicago Press. [Foundational text on tacit knowledge — the philosophical substrate of the skill-encoding discipline.]

Interesting Engineering++

Discussion about this post

Ready for more?