The Evolution of AI: From Text Prediction to Autonomous Reasoning
Prediction Machines → Thinking Machines → Autonomous Thinking Machines
I have been jotting down some thoughts, which read more like high-level generalizations. So a little patch of notes, of sorts, follows below; it keeps my views grounded in expectation setting. Given the fast-evolving (proto) agentic-multimodal-language space, with an overwhelming number of model releases (typically accompanied by amazing technical papers), consider what follows my “Bird’s Eye View”: a guide to understanding modern AI development and the path to truly intelligent systems. I especially find Nathan Lambert’s framework helpful and have applied it generously here.
There are many expert guides out there: wonderful books and especially wonderful technical Substacks by subject-matter experts, in too many fields to mention (I follow them in a very long list). Thank you for writing and sharing in your space; I appreciate it. I take inspiration from them and interpret the information that helps me. We are fortunate to be living in such an exciting space, imperfect as it is. It inspires me to no end.
Note: When AI Models are provided as examples, the list is limited and non-exhaustive.
A longer piece will likely follow at some point.
A High-Level View of The Three-Stage Evolution of AI
Note that timelines blend into each other, and there is no hard cut-off. As mentioned, I find these views helpful. Views my own.
Stage 1: Prediction Machines (2018-2022)
What they do: Next-word prediction, pattern matching
Examples: Early GPT models, BERT
Capabilities: Language understanding, basic text generation
Limitations: No reasoning, poor instruction following, unpredictable outputs
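To make “next-word prediction” concrete, here is a minimal sketch using Hugging Face’s transformers library with GPT-2 (the model choice and setup are mine for illustration; nothing above prescribes them). Everything a Stage 1 model does reduces to emitting a probability distribution over the next token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The entirety of Stage 1 "intelligence": a distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}  p={p:.3f}")
```

No planning, no self-checking: just pattern-matched continuation, which is exactly why instruction following was so poor at this stage.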
Stage 2: Thinking Machines (2022-2024)
What they do: Multi-step reasoning, chain-of-thought processing
Examples: OpenAI’s GPT-4, o1, and o3; Anthropic’s Claude 3.7 Sonnet and Claude Opus; Google’s Gemini 2.0 Flash/Pro and Gemini 2.5 Flash/Pro; DeepSeek-R1, Qwen3, Kimi, etc.
Capabilities: Mathematical reasoning, code generation, complex problem solving (many in narrow domains, but the space is evolving)
Breakthroughs: RLHF (Reinforcement Learning from Human Feedback) to align models with human preferences, inference-time scaling, and post-training optimizations (SFT, RL, RLVR, etc.)
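One concrete face of inference-time scaling is self-consistency: sample several chains of thought for the same question and majority-vote on the final answers. A toy sketch of the mechanics; the model call here is a hypothetical stand-in (real code would hit an LLM API), not anyone’s published recipe:

```python
import collections
import random

def sample_reasoning_chain(question: str) -> str:
    """Hypothetical stand-in for an LLM call that emits step-by-step
    reasoning plus a final answer. Faked with noisy answers so the
    voting mechanics below are runnable on their own."""
    return random.choice(["42", "42", "42", "41"])

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Spend more inference compute to get a more reliable answer:
    sample several chains of thought, keep the majority answer."""
    answers = [sample_reasoning_chain(question) for _ in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # "42" wins the vote
```

The point is that reliability is bought with extra compute at answer time, not with a bigger model.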
Stage 2a: Thinking Machines (Reasoning + Tool Use) (2024-Present)
Stage 3: Autonomous Thinking Machines (2024-Present)
What they do: Plan, execute, reason about reasoning
Examples: Google’s Project Mariner, DeepMind’s AlphaEvolve, Sakana AI’s Darwin Gödel Machine (DGM), and other emerging agentic systems
Goal: Self-directed problem solving with strategic planning
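Mechanically, “plan, execute, reason about reasoning” tends to collapse into a loop. A hypothetical sketch; the names (llm.plan, the tools dictionary, the action format) are illustrative stand-ins of mine, not any vendor’s agent API:

```python
def run_agent(goal: str, llm, tools: dict, max_steps: int = 10):
    """Minimal plan-act-reflect loop for an agentic system."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # 1. Plan: the model proposes the next action given everything so far.
        action = llm.plan(history)  # e.g. {"tool": "search", "input": "..."}
        if action["tool"] == "finish":
            return action["input"]  # the final answer
        # 2. Execute: call the chosen tool and observe the result.
        observation = tools[action["tool"]](action["input"])
        # 3. Reflect: fold the observation back into context and repeat.
        history.append(f"ACTION: {action} -> OBSERVATION: {observation}")
    return None  # ran out of steps: long-horizon planning is the hard part
```

The loop is trivial to write and brutally hard to make reliable, which is why I call this stage emerging rather than arrived.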
The Four Pillars of AI Intelligence
Read Nathan Lambert’s A Taxonomy of Next Generation Reasoning Models for inspiration; I apply its framing below:
1. Skills 🔧
Definition: Basic task execution and benchmark performance
Current State: ✅ Mature - Most frontier models excel here
Examples:
Mathematical calculations
Code generation
Language translation
Content creation
There are subjective differences and output limitations in artistic and creative domains. In mathematics and science, models are getting really good.
2. Calibration 🎯
Definition: Self-awareness of knowledge and limitations
Current State: ⚠️ Weakest Link - Most models are overconfident
Why Critical: Essential for safe autonomous operation
Progress: OpenAI o1 shows some self-correction capabilities
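Calibration is also measurable. A common metric is Expected Calibration Error (ECE): the accuracy-weighted gap between what a model says its confidence is and how often it is actually right. A minimal sketch (the toy numbers are my own illustration of an overconfident model):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin predictions by stated confidence, then average the
    |accuracy - confidence| gap per bin, weighted by bin size.
    A well-calibrated model scores near 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident: claims 90% certainty but is right only 60% of the time.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 0.3
```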
3. Strategy 📈
Definition: Long-term planning and goal maintenance
Current State: ⚠️ Emerging - Prototype systems show promise
Examples:
Google Project Mariner (web navigation)
DeepMind SIMA (game environments)
Multi-step reasoning in o1 models
4. Abstraction 🧩
Definition: Meta-reasoning and problem decomposition
Current State: ❌ Early Research - Domain-specific successes only
Gold Standard: AlphaEvolve, AlphaGeometry 2
Challenge: Generalizing across domains
Training Methodologies: The Hybrid Approach
RLHF: Human Preference Learning
Best For: Subjective tasks requiring human judgment
Process:
Supervised fine-tuning with human examples
Train reward model on human preferences
Optimize model behavior using reinforcement learning
Strengths: Alignment with human values
Weaknesses: Subjective, expensive, doesn't scale
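Step 2 of that recipe, the reward model, is commonly trained with a Bradley-Terry-style pairwise loss: push the scalar score of the human-preferred response above the rejected one. A minimal PyTorch sketch (the toy scores are my own illustration):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it widens the margin between preferred and rejected scores."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Scalar rewards the model assigned to (chosen, rejected) response pairs:
r_chosen = torch.tensor([1.2, 0.7, 2.1])
r_rejected = torch.tensor([0.3, 0.9, 1.5])
print(reward_model_loss(r_chosen, r_rejected))  # shrinks as chosen outranks rejected
```

The trained reward model then scores rollouts during the reinforcement-learning step, standing in for the human raters.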
RLVR: Objective Verification
Best For: Tasks with verifiable outcomes
Process: Use automated verifiers to check correctness
Examples: Mathematics, code compilation, scientific calculations
Strengths: Objective, scalable, consistent
Weaknesses: Limited to verifiable domains
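In RLVR the learned reward model is replaced by a programmatic check. A minimal sketch for a math task; the “#### <answer>” extraction format is an assumption I am making for illustration (it echoes a common benchmark convention), not a fixed standard:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth,
    else 0.0. No human judgment anywhere in the loop."""
    match = re.search(r"####\s*(-?[\d.,]+)", model_output)
    if match is None:
        return 0.0  # unparseable output earns nothing
    answer = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("6 * 7 = 42, so: #### 42", "42"))  # 1.0
print(verifiable_reward("Roughly 40. #### 40", "42"))      # 0.0
```

Because the check is mechanical, it scales to millions of rollouts, which is exactly the strength (and the domain limitation) noted above.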
The Future: Strategic Combination
Winning Formula: RLHF for alignment + RLVR for skills + Agentic training for strategy
Case Study: Alpha Models as Intelligence Benchmarks
Key Insight: The newest Alpha models achieve all four pillars, but only in narrow domains. Still, AlphaEvolve did solve algorithmic and coding problems across diverse areas, as DeepMind has disclosed. So while it is “coding-based”, its applications cut across specializations.
The Global AI Landscape
Eastern Approach: Scale and Openness
Leaders: DeepSeek, Qwen, GLM (China)
Philosophy: Rapid iteration, open source, massive scale
Strengths: Fast capability development, global collaboration
Models: DeepSeek-Coder, Qwen3 Thinking, GLM-4.5
Western Approach: Safety and Control
Leaders: OpenAI, Google, Anthropic
Philosophy: Careful deployment, safety-first, proprietary development
Strengths: Advanced safety research, ethical frameworks
Models: GPT-4, o3, o1, Claude, Gemini 2.5
The Synthesis: "American DeepSeek" Vision
Goal: Combine Eastern scale/openness with Western safety focus
Requirements:
100B+ parameter open model
Hybrid RLHF+RLVR training
Agentic capabilities
Full transparency
The "DeepSeek Moment": A Catalyst for Open AI Innovation 🤖
The AI community is still buzzing from the "DeepSeek moment," a term that describes the significant and unexpected success of the DeepSeek model. This achievement set a new standard for open models and has since spurred further advancements in the field. DeepSeek R1, in particular, is widely regarded as a "canonical recipe" for models focused on reasoning.
Despite this landmark achievement, other open Chinese and American models have yet to replicate the "DeepSeek moment". According to researcher Nathan Lambert, several challenges stand in the way of achieving this level of success in the open community:
Complex Industry "Recipes": Frontier labs like OpenAI and Anthropic utilize highly intricate post-training processes with numerous iterations and feedback loops, making it difficult for the open-source community to distill and replicate these methods.
Infrastructure & Data Bottlenecks: The open research community often lacks the necessary computational power (FLOPs) and access to the specific, legally available datasets that large labs possess.
Immature Preference Data: The open community’s reliance on single-source preference datasets such as UltraFeedback is a limitation. There is a pressing need for more mature and diverse preference data to effectively scale models.
Human Preference Data: Top labs incorporate human preference data, which is a major factor in model development, but this data is not openly accessible, and its exact impact is hard to measure.
Organizational Hurdles: For larger projects, a lack of access to user data makes it difficult for open models to identify and fix "weird behaviors," a task that large labs can address with their vast data resources.
High Computational Costs: Achieving state-of-the-art performance is often compute-intensive, requiring models to process millions of tokens per query. This "compute burn" makes it challenging for academic or smaller open initiatives to compete on performance benchmarks.
The "American DeepSeek" Vision: A Roadmap for Open AI 🇺🇸
Nathan Lambert's goal is to see an "American DeepSeek" emerge—a fully open and easily modifiable model that can rival its proprietary counterparts. To achieve this, he outlines a clear set of objectives:
Increased Resources: A substantial increase in computational resources, including a significant number of GPUs, is critical.
Scaling and Sparsification: Lambert suggests scaling up existing dense models (today’s 32B-class models, for example) and transitioning them to sparse architectures; a toy sketch of sparse routing follows after this list.
Large-Scale Reasoning: The development of models capable of large-scale reasoning is a key priority.
Overcoming Organizational Inertia: Addressing the organizational challenges and bureaucracy that can slow down progress is essential. He notes that the DeepSeek story was built on having "great people" who tackled incremental technical problems.
Impactful Artifacts: Academics should focus on creating tangible artifacts, such as models, datasets, and evaluation frameworks, that users can actively leverage, rather than just publishing papers.
Customizable Base Models: There's a need for open base models that can be easily fine-tuned and customized for specific applications, such as roleplay or character personalization.
Exploring Novel Architectures: Lambert also encourages the exploration of cutting-edge AI concepts and new architectures that could potentially move beyond the current transformer paradigm.
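On the sparsification point above: the core idea behind sparse architectures is mixture-of-experts (MoE) routing, where each token activates only a few of many expert sub-networks, so total parameters can grow far beyond per-token compute. A toy PyTorch sketch (the dimensions, and the absence of load balancing or weight renormalization, are simplifications of mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: a router picks k of n_experts
    per token, so most expert parameters stay idle on any given token."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # send each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```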
In essence, Lambert’s vision is to cultivate an environment where open models can achieve the same level of integrated performance, reproducibility, and accessibility as DeepSeek, through a combination of increased resources, focused technical development, and collaborative effort. It amounts to a roadmap for that journey, combining insights from across the field into a unified vision for the future of artificial intelligence.
Current State: Who's Closest to Autonomous Intelligence?
Skills Leader: Chinese open models (DeepSeek, Qwen)
Strategy Pioneer: OpenAI o1, Google Gemini 2.0 Flash Thinking
Abstraction Frontier: DeepMind's Alpha series (domain-specific)
Integration Champion: Google's agentic systems (Mariner, SIMA)
Critical Research Challenges
Technical Priorities
Reward Function Robustness: Preventing gaming and exploitation
Long-Horizon Planning: Maintaining goals across extended timeframes
Computational Architecture: Scaling inference for discovery, not just reliability
Transformer Limitations: Finding architectural alternatives for efficiency
Safety Imperatives
Superhuman Oversight: How to control systems smarter than humans
Value Preservation: Ensuring AI systems maintain human values as they scale
Calibration Research: Teaching AI when to stop, ask for help, or express uncertainty
The Path Forward: Key Takeaways
For Researchers
Focus on calibration - it's the critical missing piece
Hybrid training approaches will outperform single methodologies
Open development accelerates progress while maintaining safety focus
For Organizations
Current models excel at skills but struggle with strategy and abstraction. There have been “wins” in narrow domains. So I am watching the space eagerly.
Agentic capabilities are emerging but still experimental, or at least confined to limited domains. I have written recently about Google DeepMind’s AlphaEvolve and Sakana AI’s Darwin Gödel Machine (DGM). I also appreciate Microsoft’s MAI-DxO.
The winning approach combines multiple training paradigms. And new ones will evolve.
For the Future
True autonomous intelligence requires all four pillars working together
Domain-specific successes (Alpha models) must generalize across tasks
The synthesis of Eastern scale with Western safety principles offers the most promising path
Bottom Line: We're at the inflection point between Thinking Machines and Autonomous Thinking Machines. The next 2-3 years will determine whether we achieve genuine generalizable AI autonomy or remain stuck with sophisticated but limited reasoning systems. From my perspective, every single day holds so much promise and is worth watching!
The race isn't just about capability - it's about building AI that can reason about its own reasoning while remaining aligned with human values and under meaningful human control.