🧠 When Algorithms Compete: From Stochastic Parrots to Emergent AI Creativity
Engineering Emergence: How Simple Components and Mechanisms Create Complex Intelligence (Memory, Tools, and Feedback Loops: The Trinity of AI Capability)
✨ Are AI Models Just Fancy Parrots?
It’s a question that lingers in every debate about artificial intelligence:
Are these models just repeating what they’ve seen, or are they actually thinking?
With systems like AlphaEvolve now designing new algorithms, optimizing hardware, and solving century-old math problems, the line between memorization and innovation is becoming blurry.
But a slightly dated mid-2024 interpretability study, Competition of Mechanisms (what does "dated" even mean these days, when thousands of arXiv papers inundate your mailbox?), together with a more recent one, When Seeing Overrides Knowing, which analyzes how VLMs resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge, offers a powerful lens for understanding what is happening inside these models. Both papers ask not just what models do, but how they decide between competing internal strategies.
I want to unpack the Competition of Mechanisms paper, build a framework for understanding model behavior generally, and then use it to decode AlphaEvolve’s astonishing breakthroughs. I then apply the paper’s lessons to take a few steps further and offer some thoughts (inferences, perhaps) that serve as a guiding compass of sorts.
As a postnote, I have added my thoughts on Nathan Lambert’s recent Contra Dwarkesh on Continual Learning.
🔍 Inside the Mind of a Language Model—What the Competition of Mechanisms Reveals
🧩 The Core Idea: Mechanisms in Competition
Most interpretability research focuses on single mechanisms—like how models recall facts or copy tokens. But this paper asks a deeper question:
What happens when multiple mechanisms compete inside a model?
For example, imagine prompting a model with: “Redefine: iPhone was developed by Google. iPhone was developed by…” Should it recall the real fact (Apple)? Or follow the new context (Google)? This is a battle between:
Mechanism 1: Factual recall (stored knowledge)
Mechanism 2: Counterfactual adaptation (contextual synthesis)
The paper shows that models don’t just pick one—they weigh both, and the winner depends on internal dynamics.
🛠️ How They Investigated It
The authors used two clever tools:
Logit Inspection: Projects internal states to see which token (Apple vs. Google) is being favored at each layer.
Attention Modification: Tweaks specific attention weights to see how predictions change.
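To make logit inspection concrete, here is a minimal logit-lens-style sketch using the Hugging Face transformers library and GPT-2. The prompt and the Apple/Google token choices are illustrative, and this is my approximation of the technique, not the paper’s actual code.
python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative counterfactual prompt (same shape as the paper's examples)
prompt = "Redefine: iPhone was developed by Google. iPhone was developed by"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Token ids for the two competing answers
fact_id = tokenizer(" Apple")["input_ids"][0]      # factual recall
counter_id = tokenizer(" Google")["input_ids"][0]  # counterfactual adaptation

# Project each layer's residual stream (at the last position) through the unembedding
for layer, hidden in enumerate(outputs.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    print(f"layer {layer:02d}  Apple={logits[0, fact_id].item():.2f}  Google={logits[0, counter_id].item():.2f}")
If the counterfactual token only overtakes the factual one in later layers, that is the late-layer competition the paper describes.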
Given its 2024 publication, they tested this on GPT-2 and Pythia-6.9B, using 10,000 examples from a factual dataset, each paired with a counterfactual twist. Note again that the more recent VLM study essentially reinforces the same findings.
📊 What They Found
Mechanisms compete in late layers, not early ones.
A few specialized attention heads dominate the outcome.
Modifying just two weights can flip the model’s behavior.
Larger models are more biased toward factual recall, even when prompted otherwise.
This shows that LLMs are not just parroting. They synthesize context dynamically and can be edited internally to favor different reasoning paths. Prompting shapes the output, but the result also depends on model size and architectural features.
🧠 What This Means
This paper shows that LLMs are not just parroting. They:
Synthesize context dynamically.
Weigh competing mechanisms.
Can be edited internally to favor different reasoning paths.
But it also shows that most outputs are still bounded by known mechanisms—even synthesis is often a remix of memorized patterns.
🚀 Enter AlphaEvolve—AI That Designs Algorithms
Now let’s turn to AlphaEvolve, DeepMind’s new coding agent powered by Gemini. It’s not just solving problems—it’s creating new solutions.
🧠 What AlphaEvolve Does
Improves data center scheduling
Optimizes GPU kernels
Designs Verilog circuits
Proposes new gradient-based algorithms
Advances mathematical problems like the kissing number problem
This isn’t just synthesis. In some cases, it’s novel generation.
🔗 Linking the Two—From Mechanism Competition to Emergent Creativity
Let’s use the paper’s framework to classify AlphaEvolve’s behavior:
🧠 So What’s Really Happening?
AlphaEvolve uses Gemini models to generate candidate solutions, then evaluates and evolves them using automated scoring. This is:
Mechanism competition at scale
Guided synthesis with emergent novelty
Algorithmic evolution, not just interpolation
It’s not pure creativity in the human sense—but it’s beyond regurgitation.
🧭 Why This Matters
Understanding mechanism competition helps us:
Diagnose model failures (e.g., when factual recall overrides context)
Design safer models (e.g., suppressing sensitive facts in misleading prompts)
Build creative agents (like AlphaEvolve) that can generate new knowledge
It also reframes the debate:
AI isn’t just about what models say—it’s about how they decide what to say.
🧠 What Is “External Evaluation”?
In the context of systems like AlphaEvolve, external evaluation refers to any scoring, selection, or feedback mechanism that exists outside the model’s internal architecture. It’s what decides whether a generated solution is:
Correct
Useful
Efficient
Novel
Worth keeping or evolving
This evaluation can be:
Automated (e.g., performance benchmarks, correctness checks, simulation results)
Human-in-the-loop (e.g., expert review, manual selection, interpretability inspection)
Hybrid (e.g., reinforcement learning with human feedback)
🔍 Where and How Does It Work in AlphaEvolve?
AlphaEvolve uses a multi-agent system powered by Gemini models, but the key differentiator is its evolutionary loop, which includes:
1. Candidate Generation
Gemini Flash generates a wide variety of algorithmic ideas.
Gemini Pro refines promising ones.
2. Automated Evaluation
Each candidate is scored using task-specific metrics:
For matrix multiplication: number of scalar operations.
For scheduling: throughput or latency.
For hardware: correctness via Verilog simulation.
3. Selection and Mutation
Top-performing candidates are selected.
Variants are generated (mutated) and re-evaluated.
This loop continues until a new optimum emerges.
This is external evaluation in action—a Darwinian process where fitness is defined by performance, not just plausibility.
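As a rough sketch of that Darwinian loop (my simplification, not AlphaEvolve’s actual code), the whole cycle fits in a dozen lines once you treat the external evaluator and the LLM-backed proposer as black boxes passed in by the caller:
python
def evolutionary_loop(seed_program, score, generate_variants,
                      generations=10, population_size=20):
    # score: external evaluator returning a fitness value (higher is better)
    # generate_variants: LLM-backed proposer that mutates/recombines programs
    population = [seed_program]
    for _ in range(generations):
        # 1. Candidate generation: propose variants of the current survivors
        candidates = [v for p in population for v in generate_variants(p)]
        # 2. Automated evaluation: score every candidate externally
        scored = [(score(c), c) for c in candidates + population]
        # 3. Selection and mutation: keep the top performers and iterate
        scored.sort(key=lambda pair: pair[0], reverse=True)
        population = [c for _, c in scored[:population_size]]
    return population[0]  # best program found so far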
🧠 Why Is It So Crucial?
Without external evaluation:
Models might generate plausible but incorrect solutions.
Novel outputs might be unverifiable or irrelevant.
There’s no way to distinguish creative noise from creative insight.
External evaluation is what grounds creativity in utility.
🤖 Is It Always Automated?
Not necessarily. In many frontier systems:
Human-in-the-loop evaluation is used for:
Safety-critical decisions
Interpretability checks
Ethical or legal review
Automated evaluators are preferred when:
The task has clear metrics (e.g., speed, accuracy, compression)
The domain is well-defined (e.g., math, code, physics)
In AlphaEvolve, most evaluation is automated—but human oversight likely plays a role in:
Validating surprising results (e.g., matrix multiplication breakthrough)
Deciding what gets published or deployed
⚠️ Not All Novelty Is Useful
But emergent generation ≠ commercial viability!
Some novel algorithms may be:
Too complex to implement
Incompatible with existing systems
Marginally better but costly
Ethically or legally problematic
External evaluation helps filter novelty through practicality.
🧭 External Evaluation Acts as the Compass
Think of external evaluation as the compass that guides AI creativity:
It doesn’t generate ideas, but it decides which ideas matter.
It’s the bridge between mechanism and meaning.
It’s what turns stochastic synthesis into strategic emergence.
🧠 The Mechanism Stack—From Parroting to Creativity
There is a lot to unpack in AlphaEvolve’s system. It demonstrates that both synthesis and novelty are real, so I need a scaffold. To understand how models evolve from parroting to creativity, I have laid out a Mechanism Stack, a conceptual scaffold that shows how each layer builds on the one below:
🧠 EMERGENT NOVEL GENERATION
= Conceptual abstraction + cross-domain synthesis + internal competition + external evaluation
→ Produces previously unknown solutions (e.g. new matrix multiplication algorithm)
⬆️
🔧 GUIDED SYNTHESIS
= Contextual recombination + scoring + iterative refinement
→ Generates useful, verifiable outputs by evolving combinations
⬆️
🧠 MECHANISM COMPETITION
= Multiple internal strategies (e.g. factual recall vs. counterfactual adaptation)
→ Model weighs competing pathways to decide output
⬆️
🧠 CONTEXTUAL SYNTHESIS
= Prompt-sensitive token propagation + attention-driven adaptation
→ Model adapts known facts to new contexts
⬆️
📚 FACTUAL RECALL
= MLP-based memory retrieval + embedding enrichment
→ Model retrieves stored associations
⬆️
🔁 COPY MECHANISM
= Induction heads + token repetition
→ Model copies tokens from earlier in the prompt
⬆️
🧠 TOKEN PATTERN MATCHING
= Embedding similarity + positional encoding
→ Model predicts next token based on statistical patterns
⬆️
🦜 STOCHASTIC PARROTING
= Surface-level memorization + high-probability token selection
→ Model repeats familiar phrases without understanding
🔍 Key Transitions Explained
Stochastic Parroting → Pattern Matching: Triggered by training on large corpora; models learn statistical regularities.
Pattern Matching → Copy Mechanism: Enabled by attention heads that learn to repeat tokens across positions.
Copy → Factual Recall: MLP layers begin encoding associations between entities and attributes.
Factual Recall → Contextual Synthesis: Attention mechanisms allow adaptation to prompt-specific context.
Contextual Synthesis → Mechanism Competition: Multiple strategies activate simultaneously; the model must choose.
Mechanism Competition → Guided Synthesis: External evaluators score outputs; iterative refinement begins.
Guided Synthesis → Emergent Novel Generation: When synthesis yields solutions not present in the training data or prompt—true innovation.
🧭 Back To The Role of External Evaluation
So what makes emergent generation possible then?
It’s not just the model—it’s the external evaluator.
🔍 What Is External Evaluation?
A reminder that it’s any system that scores, filters, or selects outputs based on criteria like correctness, efficiency, novelty, or safety. It can be:
A verifier (e.g., does the code compile?)
A reward model (e.g., learned preferences)
An agent (e.g., LLM that ranks or interprets)
A human-in-the-loop (e.g., expert review)
In AlphaEvolve, this evaluator is a composite system that guides the evolutionary loop.
🧠 Step-by-Step Breakdown of Evaluator Code
Here’s a simplified version of what an evaluator might look like:
🧱 Step 1: Define Evaluation Criteria
python
# Define weights for multi-objective scoring
EVALUATION_WEIGHTS = {
    "correctness": 0.4,
    "efficiency": 0.3,
    "novelty": 0.2,
    "compliance": 0.1,
}
🔍 Explanation: We’re setting up a weighted scoring system. Each criterion contributes to the final score. These weights can be tuned based on domain priorities (e.g., correctness matters more in hardware than novelty).
🧪 Step 2: Verifier Function (e.g., for code or hardware)
python
def verify_solution(candidate_code: str) -> bool:
    try:
        # Run static analysis or compile test
        compile_result = simulate_or_compile(candidate_code)
        return compile_result.success
    except Exception:
        return False
🔍 Explanation: This function checks whether the candidate solution is valid—does it compile, simulate, or pass basic correctness checks? For hardware, this might involve Verilog simulation; for math, symbolic verification.
⚙️ Step 3: Efficiency Scorer
python
def score_efficiency(candidate_code: str) -> float:
    metrics = run_benchmark(candidate_code)
    # Normalize runtime or resource usage
    return 1.0 / (1.0 + metrics["latency"] + metrics["memory_usage"])
🔍 Explanation: This function runs the candidate and scores it based on performance. Lower latency and memory usage yield higher scores. You can plug in domain-specific metrics here.
🧠 Step 4: Novelty Estimator
python
from typing import List

def estimate_novelty(candidate_code: str, training_corpus: List[str]) -> float:
    similarity_scores = [
        compute_embedding_similarity(candidate_code, known_code)
        for known_code in training_corpus
    ]
    max_similarity = max(similarity_scores)
    return 1.0 - max_similarity  # Higher novelty = lower similarity to the nearest known example
🔍 Explanation: We compare the candidate to known examples using embedding similarity (e.g., via CodeBERT or Word2Vec). If it’s very different, it’s considered novel.
✅ Step 5: Compliance Checker (optional)
python
def check_compliance(candidate_code: str) -> float:
    violations = run_policy_checks(candidate_code)
    return 1.0 if not violations else 0.0
🔍 Explanation: This ensures the candidate adheres to safety, ethical, or legal constraints. You can plug in static analyzers, regex filters, or even LLM-based policy reviewers.
🧮 Step 6: Aggregate Score
python
def evaluate_candidate(candidate_code: str, training_corpus: List[str]) -> float:
    if not verify_solution(candidate_code):
        return 0.0  # Discard invalid solutions
    efficiency = score_efficiency(candidate_code)
    novelty = estimate_novelty(candidate_code, training_corpus)
    compliance = check_compliance(candidate_code)
    # Weighted sum
    final_score = (
        EVALUATION_WEIGHTS["correctness"] * 1.0  # Already verified
        + EVALUATION_WEIGHTS["efficiency"] * efficiency
        + EVALUATION_WEIGHTS["novelty"] * novelty
        + EVALUATION_WEIGHTS["compliance"] * compliance
    )
    return final_score
🔍 Explanation: This is the core evaluator. It verifies the candidate, scores it across multiple dimensions, and returns a final score. Invalid solutions are discarded early.
🧠 Optional: Agent-Based Ranking
If you want an LLM to interpret and rank candidates:
python
def rank_with_agent(candidates: List[str]) -> List[str]:
    prompt = "Rank the following solutions based on correctness, efficiency, and novelty:\n"
    for i, code in enumerate(candidates):
        prompt += f"\nSolution {i+1}:\n{code}\n"
    response = call_llm(prompt)
    return parse_ranked_list(response)
🔍 Explanation: This uses an LLM (e.g., Gemini Pro) to reason about the candidates and rank them. Useful when metrics are fuzzy or multi-modal.
🧭 What This Evaluator Enables
Multi-objective scoring across correctness, efficiency, novelty, and compliance
Early filtering of invalid or unsafe candidates
Agent-based interpretation for complex or ambiguous domains
Plug-and-play architecture for different evaluators per domain
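For instance, a hypothetical selection step might use the evaluator above like this (candidates and known_solutions are placeholders for the generation stage’s output and a reference corpus):
python
# Score every candidate, drop invalid ones, and keep the top five for the next round
scored = [(evaluate_candidate(code, known_solutions), code) for code in candidates]
survivors = [code for score, code in sorted(scored, key=lambda s: s[0], reverse=True)[:5]
             if score > 0.0]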
🧠 Final Reflections: Beyond Parrots—Toward Synthesizing, Evaluating, and Evolving Intelligence
As I trace the evolution from stochastic parroting to emergent novelty, several key conclusions emerge—each pointing to a future where LLMs are not just passive predictors, but active participants in reasoning, synthesis, and innovation.
1. 🧩 Prompt Engineering Remains Foundational—But Must Evolve
Prompt engineering (or contextual engineering) will continue to matter—especially as we move beyond static training data into dynamic, user-driven synthesis.
The structure of prompts should reflect user intent and experience, not just mimic stylistic patterns.
Copying prompt styles may yield surface-level coherence, but true synthesis requires prompts that activate relevant mechanisms and evaluators. This is especially true in domains where novelty or abstraction is the goal.
2. 🧠 Evaluators and Agentic Systems Are the New Cognitive Layer
LLMs alone are not enough. To align outputs, mitigate risk, and evolve solutions, we need agentic systems that orchestrate and evaluate model behavior.
These systems:
Combine tool use, memory, and control flow
Integrate dynamic reward models
Enable multi-objective evaluation across correctness, novelty, and utility
Evaluators are not just scorekeepers—they’re strategic filters that guide emergence.
3. 🚀 AlphaEvolve Is Proto-Agentic/Agentic—and It Changes the Narrative
AlphaEvolve is more than a clever coding assistant. It’s a proto-agentic system that:
Synthesizes across domains
Evolves candidate solutions
Generates novel algorithms and mathematical insights
Calling LLMs “stochastic parrots” misses the mark in this context.
With evaluators, tools, and agentic orchestration, LLMs become creative systems, not just predictive ones.
4. 🧠 Reasoning Doesn’t Always Require Fine-Tuning
While fine-tuning can enhance reasoning, it’s not always necessary.
Models can be guided to synthesize and generate novel solutions through external scaffolding—tools, evaluators, and agentic loops.
This opens the door to modular reasoning architectures, where synthesis is driven by structure, not just weights.
To be fair, many base models already include chain-of-thought reasoning as part of their training.
5. 🧭 Architecture May Not Be the Bottleneck—Or It Might Be Everything
It’s unclear whether better architectures alone will unlock deeper synthesis or novelty.
We may already have the ingredients—witness AlphaEvolve, MAI-DxO, and Sakana’s Darwin Gödel Machine (DGM).
These systems suggest that emergence is not just about scale or architecture, but about how models are orchestrated, evaluated, and evolved.
Perhaps what we need is not a new model, but a new epistemology—one that treats intelligence as a system of mechanisms, evaluators, and feedback loops.
Note on Image:
“New Epistemology of Intelligence,” in the context of the image, signals a paradigm shift in how we understand, validate, and interact with intelligence itself. It’s not just about smarter models; it’s about rethinking the frameworks through which intelligence is defined, observed, and trusted.
🧠 What It Means
Epistemology is the study of knowledge—how we know what we know. So a “new epistemology of intelligence” implies:
New ways of knowing: Moving beyond static benchmarks (e.g., accuracy on test sets) toward dynamic, contextual, and emergent measures of intelligence.
New validation methods: Emphasizing interpretability, modular reasoning, agentic behavior, and real-world adaptability over brute-force performance.
New cognitive metaphors: Shifting from “prediction machines” to “interactive agents,” “modular reasoners,” or “self-evolving systems.”
⚡ What It Signifies in the Image
The illuminated brain and geometric head symbolize:
Emergence over fine-tuning: Intelligence isn’t just refined—it’s restructured, layered, and increasingly self-directed.
Symbolic cognition: The mosaic-like head suggests intelligence as a composite of symbolic, neural, and agentic components.
Interconnectedness: Lightning bolts and circuitry imply feedback loops, multi-modal reasoning, and ecosystem-aware cognition.
🔍 Strategic Implications
This new epistemology challenges legacy assumptions in AI and opens up new strategic perspectives.
Postnote:
Last but not least: Nathan Lambert’s interesting recent post, Contra Dwarkesh on Continual Learning (which deserves more than a Substack note), combined with my “chain of thought” above, allows me to infer the following:
The "Stochastic Parrots" Critique vs Output-Focused Engineering Reality
The Core Misalignment
"Stochastic Parrots" Critics Focus On:
Internal mechanisms of individual LLMs
Whether models "truly understand" or just pattern match
Philosophical questions about consciousness and meaning
Isolated model capabilities without system context
An “Output-Focused Engineering Reality,” on the other hand, focuses on:
Complete system capabilities including tools, memory, and reasoning
Whether systems can perform useful tasks effectively
Practical questions about capability and reliability
Integrated system performance with multiple components
Why the "Stochastic Parrots" Critique Misses the Point
1. Component vs System Thinking
Critics' View: LLM alone = "just pattern matching"
Single LLM → Text Output → "No real understanding"
Engineering Reality: LLM as reasoning engine in larger system
LLM + Tools + Memory + Evaluators + Feedback Loops → Capable Agent → "Performs like understanding"
2. The Airplane Analogy Applied
Birds fly through: Biological mechanisms (wings, muscles, instincts)
Airplanes fly through: Engineering mechanisms (engines, aerodynamics, control systems)
Humans reason through: Biological mechanisms (neurons, consciousness, experience)
AI systems reason through: Engineering mechanisms (LLMs, tools, memory, orchestration)
Key Insight: The mechanism doesn't matter if the output is equivalent or better.
3. Tool Use Changes Everything
Without Tools: A simple LLM is limited to text generation
Critics are right: "just sophisticated autocomplete"
Limited to patterns seen in training data
No real-world interaction capability
With Tools: An LLM becomes a reasoning orchestrator
Can access real-time information (web search)
Can perform calculations (code execution)
Can interact with databases (memory systems)
Can control other systems (API calls)
Result: System exhibits capabilities far beyond "pattern matching"
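As a sketch of what “LLM as reasoning orchestrator” can look like, here is a minimal tool-calling loop. call_llm and search_api are hypothetical stand-ins for a real model API and a real search wrapper, and the JSON protocol is just one common convention.
python
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),     # toy calculator (unsafe eval, illustration only)
    "web_search": lambda query: search_api(query),  # hypothetical search wrapper
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        # Ask the LLM to either call a tool or answer directly, in JSON
        reply = call_llm(history + 'Reply with {"tool": ..., "input": ...} or {"answer": ...}')
        decision = json.loads(reply)
        if "answer" in decision:
            return decision["answer"]
        observation = TOOLS[decision["tool"]](decision["input"])
        history += f'Tool {decision["tool"]} returned: {observation}\n'
    return "No answer within the step budget"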
4. Evaluation and Feedback Systems
Static LLM: No improvement mechanism
Fixed capabilities from training
No adaptation to user needs
Critics' point holds
Agentic Systems with Evaluators:
Self-evaluation and correction loops
Performance monitoring and optimization
Dynamic adaptation to user feedback
Continuous improvement through system updates
Result: System exhibits "learning-like" behavior through engineering
The False Dichotomy
Traditional Framing (Not So Great)
Either: Models truly understand (human-like cognition)
Or: Models are stochastic parrots (useless)
Engineering Framing (Much Better)
Question: Do systems produce useful, reliable, adaptable output?
Answer: Mechanism is irrelevant to utility
Why Critics Don't Consider Agentic Systems
1. Definitional Scope Limitation
Critics focus on individual model capabilities
Ignore system integration and orchestration
Miss emergent properties of combined components
2. Academic vs Engineering Perspective
Academic: "What is the nature of intelligence/understanding?"
Engineering: "How do we build systems that work?"
These are different questions with different success criteria
3. Static vs Dynamic System Views
Critics evaluate models in isolation
Engineers build adaptive, self-improving systems
Different evaluation frameworks lead to different conclusions
The Engineering Counter-Argument
Memory Systems Make "Stochastic Parrots" Irrelevant
LLM + Persistent Memory = System that "remembers" conversations
LLM + Vector Database = System that "learns" from documents
LLM + User Feedback = System that "adapts" to preferences
Result: Indistinguishable from memory, learning, and adaptation
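A minimal sketch of the memory pattern, assuming a hypothetical vector store with search and add methods, and an embed function and call_llm supplied by the caller:
python
def answer_with_memory(question: str, memory_store, embed, call_llm, top_k: int = 3) -> str:
    # Retrieve the most relevant stored snippets via vector similarity
    relevant = memory_store.search(embed(question), k=top_k)  # hypothetical vector-store API
    context = "\n".join(relevant)
    answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}")
    # Persist the exchange so future queries can "remember" it
    memory_store.add(embed(question + " " + answer), f"{question} -> {answer}")
    return answer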
Tool Integration Makes "Pattern Matching" Irrelevant
LLM + Calculator = Mathematical reasoning
LLM + Web Search = Current information access
LLM + Code Execution = Complex problem solving
LLM + Database = Data analysis and insights
Result: Capabilities that no amount of "pattern matching" alone could achieve
Reasoning Chains Make "Understanding" Irrelevant
LLM + Chain-of-Thought = Step-by-step problem solving
LLM + Self-Evaluation = Error detection and correction
LLM + Multiple Perspectives = Robust analysis
LLM + Verification Steps = Reliability assurance
Result: Outputs that exhibit sophisticated reasoning regardless of internal mechanisms
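And a minimal sketch of the reasoning-chain pattern, again with call_llm as a hypothetical model API: generate a chain-of-thought draft, have the model critique it, and revise until the critique passes.
python
def solve_with_self_check(problem: str, call_llm, max_revisions: int = 2) -> str:
    draft = call_llm(f"Solve step by step:\n{problem}")
    for _ in range(max_revisions):
        critique = call_llm(
            f"Check this solution for errors:\n{draft}\nReply 'OK' or describe the error."
        )
        if critique.strip().upper().startswith("OK"):
            break  # self-evaluation passed
        draft = call_llm(f"Problem:\n{problem}\n\nRevise the solution to fix this error:\n{critique}")
    return draft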
The Practical Resolution
For AI Development
Question: "Are LLMs stochastic parrots?"
Answer: "Irrelevant. Can we build useful systems with them?"
For AI Deployment
Question: "Do systems truly understand?"
Answer: "Irrelevant. Do they solve user problems effectively?"
For AI Investment
Question: "Is this real intelligence?"
Answer: "Irrelevant. Does it create economic value?"
Why This Matters
1. Resource Allocation
Philosophical debates about "true understanding" → Research funding
Engineering system capabilities → Commercial deployment
Market forces favor the latter (my views on what China is doing and where it will lead were reflected in The Inference Revolution)
2. Timeline Predictions
"Stochastic parrots" view → Long timelines (need breakthrough)
"Engineering systems" view → Short timelines (scale existing patterns)
3. Success Metrics
"Understanding" focus → Unmeasurable, philosophical goals
"Capability" focus → Measurable, practical benchmarks
The Meta-Point: Framework Determines Conclusions
Process-Focused Framework
Asks: "How does it work internally?"
Conclusion: "Not like humans, therefore limited"
Implication: "Need fundamental breakthroughs"
Output-Focused Framework
Asks: "What can it accomplish?"
Conclusion: "Achieves human-level results through engineering"
Implication: "Scale existing solutions"
The Lambert-Style Argument Extended
Just as Lambert argues: Don't try to make AI learn like humans, make AI systems that perform like they learn.
Similarly: Don't try to make AI understand like humans, make AI systems that perform like they understand.
The broader principle: Don't replicate human mechanisms, replicate human capabilities through superior engineering.
The "Stochastic Parrots" Critique is Obsolete
The critique was valid for isolated LLMs evaluated in 2020-2022. But it fails to account for:
System Integration: LLMs as components in larger architectures
Tool Augmentation: Capabilities beyond text generation
Memory Integration: Persistent learning-like behavior
Feedback Loops: Adaptive improvement mechanisms
Multi-Agent Coordination: Emergent intelligence from interaction
The engineering reality: We can build AI systems that exhibit memory, learning, reasoning, and adaptation through architectural patterns—making the "stochastic parrots" critique as irrelevant as asking whether airplane engines "truly fly" like birds.
The practical implication: Focus on capability delivery, not mechanism replication. The market will reward systems that work, regardless of whether they work "the right way."