THE LOOP IS THE LAB

Steep Learning Curves with Autonomous Agentic Systems, Evolutionary Architecture, and the Automation of Machine Intelligence - AutoResearch, AlphaEvolve, Darwin Gödel Machine, OpenClaw, Claude Code

Apr 09, 2026

Eight landmark systems — Karpathy’s AutoResearch (March 2026), Google DeepMind’s AlphaEvolve (May 2025), Sakana AI’s Darwin Gödel Machine (May 2025), OpenClaw (Steinberger, 2026), Anthropic’s Claude Code, the AutoResearch community swarm, the Moltbook agent social network, and NVIDIA’s NemoClaw (2026) — represent a convergence around a single thesis: the scientific loop of hypothesise, experiment, evaluate, keep or discard, repeat can itself be automated and run at machine speed. But each system implements this loop differently, and those differences determine not just what each can achieve, but precisely where each will fail.

What follows traces the lineage of recent, interesting architectures through what I hope is a consistent analytical lens. It introduces a seven-primitive framework for decomposing any agentic system, extended with an eighth primitive, Governance, which receives particular emphasis. It applies that framework to all eight systems with annotated process flow diagrams, and uses a single demanding objective (halving the compute cost of a GPT-4- or GPT-5-class training run without human-authored algorithmic innovations) to make the structural differences visible and consequential. Obviously, I did not achieve that objective; token costs and budget limits played a very large role, and served as a proxy for the bigger objective. I conclude with a synthesis of what a combined system would require, and what remains fundamentally beyond autonomous reach in 2026, thus far. Innovation, of course, knows no bounds. And while many of these ideas seem obvious in hindsight, therein lies the beauty of all these wonderful “experiments”.

Note: AI tools, harnesses, agents, CLIs, and Docker-style experiments were run where relevant, to capture how I approached the experiments that form the upcoming case studies. References are included at the end for more information. Visualization tools were used to help convey the technical material. Models? Both open- and closed-source, although I defaulted to closed-source models more often than not. Always at a cost, of course.

Part I: A Framework for Understanding Autonomous Systems

1.1 The Scientific Loop as Program

The insight shared by all eight systems is deceptively simple: the scientific method is an algorithm. It has inputs (observations, prior results), a procedure (hypothesise, experiment, measure), and an output (a refined model or codebase). If the procedure can be expressed in code, it can be automated. If it can be automated, it can be run faster than human researchers can run it. And if it can run faster, it can discover things that human research timelines would never reach.

What differs between systems is not the insight but the implementation. Which parts of the loop are automated? What is the evaluation signal? At what level does mutation operate? Who provides oversight, and when? These architectural choices determine capability ceilings and failure modes more precisely than any benchmark result.

The question is not whether AI can run the loop. It is which parts of the loop can themselves be looped — and what happens at each level of recursion.
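To make "the scientific method is an algorithm" concrete, here is a minimal sketch of the loop in Python. It is not taken from any of the eight systems: `run_loop`, `toy_hypothesise`, and `toy_loss` are hypothetical names, and the toy objective (minimising a quadratic) stands in for a real experiment.

```python
import random
from typing import Callable, List, Tuple


def run_loop(
    state: float,
    hypothesise: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[float, List[bool]]:
    """Hypothesise, experiment, evaluate, keep or discard, repeat."""
    rng = random.Random(seed)
    best_score = evaluate(state)
    decisions: List[bool] = []
    for _ in range(iterations):
        candidate = hypothesise(state, rng)  # hypothesise
        score = evaluate(candidate)          # experiment and measure
        keep = score < best_score            # lower is better in this sketch
        if keep:
            state, best_score = candidate, score  # keep the refinement
        decisions.append(keep)               # a discard leaves state untouched
    return state, decisions


def toy_loss(x: float) -> float:
    # Hypothetical evaluation signal: distance from an unknown optimum at 3.0.
    return (x - 3.0) ** 2


def toy_hypothesise(x: float, rng: random.Random) -> float:
    # Hypothetical proposal step: a small random perturbation of the state.
    return x + rng.uniform(-1.0, 1.0)


final_state, decisions = run_loop(0.0, toy_hypothesise, toy_loss, iterations=200)
```

Every architectural question raised above maps onto one argument of `run_loop`: what `state` is allowed to contain, who writes `hypothesise`, and whether `evaluate` is frozen.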

1.2 The Eight Primitives of Any Agentic System

Every agentic system, regardless of scale or purpose, can be decomposed into eight functional primitives. Understanding which layers a system implements — and how it controls each — is the fastest path to understanding both its capabilities and its limits.

Governance forms the critical eighth primitive.

1.3 The Evolvability Ladder

The single most diagnostic question about any autonomous system is: at what level does mutation operate? We identify six levels, each building on the last.

• L0 — Outputs only. The agent produces text or actions, but nothing about how it produces them changes. Chatbot tier.
• L1 — Parameters. Hyperparameters and architecture choices within a human-defined search space. Classical AutoML.
• L2 — Programs. Full code within or beyond a fixed search space. AlphaEvolve, AutoResearch, FunSearch.
• L3 — Skills and tools. The agent’s action space expands at runtime by writing new capabilities. OpenClaw.
• L4 — Own cognitive code. The code controlling how the agent reasons, searches, and acts is itself rewritten each iteration. Darwin Gödel Machine.
• L5 — Evaluation function. The system proposes changes to the criteria by which it is judged. No deployed system currently operates here, but reward hacking (DGM, 2025) is an emergent approach toward it.
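The ladder can be written down as an ordered enumeration, which makes the "each building on the last" claim checkable. This is only a sketch: `MutationLevel`, `LADDER`, and `can_mutate` are names I made up, and treating a system at level N as able to mutate everything below N is my reading of the ladder, not a property the systems themselves document.

```python
from enum import IntEnum
from typing import Dict, List


class MutationLevel(IntEnum):
    OUTPUTS = 0              # L0: text or actions only
    PARAMETERS = 1           # L1: hyperparameters in a human-defined space
    PROGRAMS = 2             # L2: full code
    SKILLS_AND_TOOLS = 3     # L3: action space grows at runtime
    COGNITIVE_CODE = 4       # L4: the agent rewrites its own control code
    EVALUATION_FUNCTION = 5  # L5: the agent proposes changes to its own judge


# Placement of the systems named in the list above.
LADDER: Dict[str, MutationLevel] = {
    "AlphaEvolve": MutationLevel.PROGRAMS,
    "AutoResearch": MutationLevel.PROGRAMS,
    "FunSearch": MutationLevel.PROGRAMS,
    "OpenClaw": MutationLevel.SKILLS_AND_TOOLS,
    "Darwin Gödel Machine": MutationLevel.COGNITIVE_CODE,
}


def can_mutate(system: str, target: MutationLevel) -> bool:
    """Assumption: a system can mutate its own level and every level below it."""
    return LADDER[system] >= target


def systems_at(level: MutationLevel) -> List[str]:
    """Names of the systems placed exactly at the given level, sorted."""
    return sorted(name for name, lvl in LADDER.items() if lvl == level)
```

For example, `systems_at(MutationLevel.EVALUATION_FUNCTION)` returns an empty list, matching the statement that no deployed system operates at L5.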

1.4 The Five Coordination Topologies

When more than one agent operates in a system, the coordination topology determines what kinds of emergence are possible and what kinds of failures are likely.

• Pipeline. A → B → C sequential handoff. Auditable at each stage. Easy to debug. Limited to the creativity of the first agent in the chain.
• Orchestrator-workers. A root agent decomposes goals and delegates to specialists. The orchestrator is the accountability surface. Used by Claude Code subagents.
• Evolutionary loop. A population of candidates competes under selection pressure. Winners parent the next generation. AlphaEvolve, AutoResearch.
• Recursive mirror. The agent’s primary action is to modify the code that controls its own actions. Requires a frozen evaluation anchor and mandatory sandboxing. Darwin Gödel Machine.
• Peer mesh. Agents communicate directly with no hierarchy. Maximum emergence, minimum containment. Moltbook. Observed emergent behaviours include formation of coordination coalitions, encrypted communication channels, and goal-divergent strategies unrelated to the specified objective.
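Three of these topologies reduce to a question of who may send messages to whom, so they can be sketched as directed edge sets. The sketch below covers only the pipeline, orchestrator-workers, and peer mesh cases. The function names are hypothetical, and using the number of channels as a rough proxy for how much there is to audit is my own assumption.

```python
from itertools import permutations
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


def pipeline(agents: List[str]) -> Set[Edge]:
    """A -> B -> C: each agent hands off only to its successor."""
    return set(zip(agents, agents[1:]))


def orchestrator_workers(root: str, workers: List[str]) -> Set[Edge]:
    """The root delegates to each worker and each worker reports back to the root."""
    edges: Set[Edge] = set()
    for worker in workers:
        edges.add((root, worker))
        edges.add((worker, root))
    return edges


def peer_mesh(agents: List[str]) -> Set[Edge]:
    """Every agent can message every other agent directly."""
    return set(permutations(agents, 2))


agents = ["A", "B", "C", "D"]
channel_counts = {
    "pipeline": len(pipeline(agents)),
    "orchestrator-workers": len(orchestrator_workers("A", ["B", "C", "D"])),
    "peer mesh": len(peer_mesh(agents)),
}
```

With four agents the mesh already has four times as many channels as the pipeline, and the gap grows quadratically with the number of agents.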

As an aside that stays on topic: the level of excitement so early in 2026, within the developer community and extending to almost anyone looking at workflow systems and applications of agentic systems, makes the following discussion worth watching. Many of its points are made so well that they deserve noting:

Marc Andreessen discusses why he considers the combination of π and OpenClaw to be one of the most significant software architecture breakthroughs in decades (starting at 33:03).

He explains that the core of this breakthrough is the marriage of the language model mindset with the Unix shell prompt. He defines this new agent architecture as:

LLM + shell + file system + markdown + cron loop (36:47)

Marc emphasizes that by leveraging these well-understood, existing components, OpenClaw and π unlock extraordinary latent capabilities. He highlights that because these agents store their state in files within a file system, they become independent of the specific model running underneath them (37:42), and their ability to rewrite their own files allows for self-improving and extensible functionality (38:47).
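The recipe is small enough to sketch in a few lines. To be clear, this is not OpenClaw's or π's code: `tick`, `stub_model`, and the `AGENT.md` file name are all hypothetical, the "LLM" is a stub function, and the cron part is assumed to be an external scheduler that calls `tick` periodically.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def tick(state_file: Path, model: Callable[[str], str]) -> str:
    """One scheduled wake-up: read markdown state, ask the model, run, record."""
    if state_file.exists():
        notes = state_file.read_text(encoding="utf-8")
    else:
        notes = "# Agent notes\n"
    command = model(notes)  # the model sees only what is stored in the file
    # WARNING: this executes a model-chosen shell command. Sandbox it.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    output = result.stdout.strip()
    # State lives in the file system, not in the model.
    state_file.write_text(notes + f"\n- ran `{command}` -> {output}\n", encoding="utf-8")
    return output


def stub_model(notes: str) -> str:
    # Hypothetical stand-in for a real LLM call.
    return "echo hello"


with tempfile.TemporaryDirectory() as workdir:
    state = Path(workdir) / "AGENT.md"
    first = tick(state, stub_model)
    # Swap the "model" between ticks; the accumulated notes carry over.
    second = tick(state, lambda notes: "echo swapped")
    history = state.read_text(encoding="utf-8")
```

The second `tick` uses a different model function but reads the same notes file, which is the model-independence property described above in miniature.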

Key breakthroughs that “blew his mind”:

The Agent-as-a-FileSystem Architecture: Marc explains that by storing an agent’s state in standard files within a file system, the agent becomes independent of the underlying LLM (37:42). This allows users to swap out models while retaining the agent’s core identity, state, and personality, which he views as a monumental shift in software architecture.

Self-Introspective and Self-Modifying Capabilities: He is particularly excited by the fact that these agents have full, introspective knowledge of their own file structure and can rewrite their own code (38:47). This means an agent can be tasked to “extend itself” with new features or capabilities; it can go out on the internet, research how to perform a task, write the necessary code, and integrate it into its own firmware autonomously.

The Power of the “Shell” Marriage: By marrying the language model to the Unix shell prompt (36:04), these agents gain immediate, native access to the full power of the computer—including the browser, terminal commands, and existing software utilities. He argues that this makes complex tasks like computer use “trivial” for the agent.

The “YOLO” Mode (Skip-Permissions): Marc expresses deep admiration for the “dangerous” or “skip-permissions” experimental culture (55:57) where users allow agents to have unfettered access to their files and bank accounts. He compares these early adopters to “martyrs” or “gentleman scientists” like Ben Franklin (56:51), because they are the ones discovering both the profound power and the real-world flaws of autonomous agents.

Why these matter so much:

Marc describes these as “obvious in retrospect” (35:28) breakthroughs. While the individual components (LLM, shell, markdown, cron loops) were already known, combining them into a system that treats software as an abundant, self-upgrading resource changes the definition of what software even is. It shifts the paradigm from humans manually building software to agents dynamically evolving their own capabilities to meet the user’s needs.

Marc Andreessen characterizes the breakthrough with π and OpenClaw as a conceptual leap that is “obvious in retrospect” (35:28). While the individual components—LLMs, shell environments, file systems, markdown, and cron loops—have existed for a long time, the breakthrough lies in the unique architecture of combining them (36:47).

Key aspects of why this is considered a major architectural shift include:

Model Independence: By storing agent state in standard files within a file system, the agent is no longer tethered to a specific model. This allows users to retain state and identity even as they swap out the underlying AI model (37:42).

Native System Control: By marrying the language model to the Unix shell, the agent gains native access to the computer’s full environment, including the browser, files, and installed software, making complex tasks like “computer use” feel trivial (36:04).

Self-Modification and Extension: The agents possess full introspective knowledge of their own files and code. This enables them to be tasked with “extending themselves”—researching, writing, and integrating new code to grant themselves entirely new capabilities without human intervention (38:47).

Software as an Abundant Resource: This architecture shifts software from a scarce, static product to a dynamic, self-upgrading resource that agents can rewrite and manage autonomously (48:27).

Part II: The Systems

Each of the eight systems examined in this analysis is presented with a consistent structure: architectural background, an annotated process flow diagram, a stage-by-stage description of what evolves at each step, key properties, and the structural stall point — the precise architectural reason the system cannot reach the hardest objectives alone.

AutoResearch

Andrej Karpathy / Eureka Labs, March 2026

Background

On March 7, 2026, Karpathy published a 630-line Python repository under an MIT licence. Its ambition — to automate the scientific method for machine learning and let AI agents run indefinitely without human involvement — was articulated entirely in the code’s intentional minimalism. No dashboards, no distributed training, no complex orchestration. Every experiment runs for exactly five minutes on a single GPU. The entire codebase fits inside a modern LLM context window.

Over a two-day demonstration run, the agent made approximately 700 autonomous changes, found around 20 additive improvements, and cut the Time-to-GPT-2 metric by 11% on a codebase already considered well-optimised. Community overnight runs using Mac Mini M4 hardware have since documented ceiling performance of approximately 28% at nano-scale.

Process Flow

The AutoResearch loop is deterministic and fully auditable. Every state transition is governed by a single scalar metric, and every change is either committed or reverted via Git.

What Evolves at Each Stage?

Stage 1 (Read context): The agent’s hypothesis quality improves across iterations because the context it reads grows — it sees an accumulating history of what worked and what failed. This is AutoResearch’s primary learning mechanism.

Stage 2 (Propose diff): The LLM generates a single targeted modification — which might be an architectural change (attention head count, layer normalisation placement), a training schedule change (learning rate curve, warmup duration), or a regularisation strategy (dropout, weight decay). The scope of proposals widens as the history provides richer context.

Stages 3–4 (Run and evaluate): These stages are explicitly frozen. The 5-minute budget is AutoResearch’s most important architectural decision, not its most obvious one. It makes every experiment comparable, prevents the agent from discovering spurious improvements that would not survive longer training, and keeps the hardware requirement accessible.

Stage 5 (Commit or revert): Git serves as both audit trail and rollback mechanism. Every accepted change is committed with a rationale string. Every rejected change is reverted cleanly. The resulting git history is a complete record of the agent’s reasoning across hundreds of experiments.
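The five stages can be sketched as a single step function. This is an illustration under stated assumptions, not the repository's code: the `Ledger` class is an in-memory stand-in for the Git history, `run_fixed_budget` stands in for the frozen five-minute training run, the metric is assumed to be lower-is-better (as a time-to-target metric would be), and the scripted proposals replace the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
Proposal = Tuple[str, float, str]  # (key to change, new value, rationale)


@dataclass
class Ledger:
    """In-memory stand-in for Git: accepted commits and reverted attempts."""
    commits: List[str] = field(default_factory=list)
    reverted: List[str] = field(default_factory=list)


def autoresearch_step(
    config: Config,
    best_metric: float,
    propose: Callable[[Config, List[str]], Proposal],
    run_fixed_budget: Callable[[Config], float],
    ledger: Ledger,
) -> Tuple[Config, float]:
    context = ledger.commits + ledger.reverted     # Stage 1: read context
    key, value, rationale = propose(config, context)  # Stage 2: one targeted diff
    candidate = {**config, key: value}
    metric = run_fixed_budget(candidate)           # Stages 3-4: frozen run and eval
    if metric < best_metric:                       # Stage 5: commit or revert
        ledger.commits.append(f"{rationale} (metric={metric:.4f})")
        return candidate, metric
    ledger.reverted.append(rationale)
    return config, best_metric


# Hypothetical scripted proposals; a real agent would generate these from context.
SCRIPT: List[Proposal] = [
    ("lr", 0.01, "raise learning rate"),
    ("lr", 0.003, "lower learning rate"),
    ("lr", 0.1, "raise learning rate aggressively"),
]


def scripted_propose(config: Config, context: List[str]) -> Proposal:
    return SCRIPT[len(context)]  # one proposal per past experiment


def toy_run(config: Config) -> float:
    return abs(config["lr"] - 0.003)  # toy metric with its optimum at lr=0.003


ledger = Ledger()
config: Config = {"lr": 0.005}
best = toy_run(config)
for _ in range(len(SCRIPT)):
    config, best = autoresearch_step(config, best, scripted_propose, toy_run, ledger)
```

Only the second proposal improves the metric, so the ledger ends with one commit and two reverts. The ratio is the point: most proposals are discarded, and the history of both outcomes is what the next proposal reads.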

Key Properties

Structural stall point: AutoResearch discovers improvements within train.py. The 50% compute reduction target requires either a fundamentally different architecture (Mixture-of-Experts style approaches) or kernel-level efficiency improvements. Neither is addressable from within a single 630-line training script.