Beyond Formalism: A Rebuttal to Limits on LLM Reasoning
Why The Framework Misses the Broader Potential of LLMs and Agentic Systems

The paper "Why Cannot Large Language Models Ever Make True Correct Reasoning?" by Jingde Cheng presents a rigorous philosophical and logical critique of the reasoning capabilities attributed to large language models (LLMs). Its conclusions stand in stark contrast to the synthesis-oriented and latent-space reasoning frameworks discussed in "When AI Proves Theorems And Tunes Detectors" and "Prompting, Templates And The Epistemology of AI Insight". I am also reminded of DeepMind's AlphaEvolve and Sakana's DGM.
Here's a breakdown of his work, which I thoroughly enjoyed reading (hats off), but also challenge. I take these discussions as healthy evolutions the space has to make in order to move forward:
🧠 Salient Conclusions of the Paper
LLMs cannot perform “true correct reasoning”—not now, not in principle.
Reasoning requires conclusive relevant evidence between premises and conclusion, which LLMs cannot guarantee due to their statistical nature.
LLMs simulate reasoning forms but lack embedded correctness evaluation mechanisms, making their outputs epistemically unreliable.
True reasoning must be grounded in a logic system that satisfies three essential criteria: relevance, ampliativity, and paraconsistency. Only Strong Relevant Logics (SRLs) meet these.
LLMs are fundamentally incompatible with SRLs, and therefore cannot be trusted for reasoning in domains requiring logical rigor.
🔍 How the Paper Arrives at These Conclusions
1. Strict Definition of Reasoning
Reasoning is defined as a process where premises provide conclusive relevant evidence for a conclusion.
This definition diverges from traditional logic texts by emphasizing relevance as a core criterion.
2. Evaluation of Logic Systems
Classical Mathematical Logic (CML) and its extensions are rejected for failing to meet three criteria:
Relevance: Premises must be meaningfully connected to conclusions.
Ampliativity: Reasoning must produce new knowledge, not tautologies.
Paraconsistency: Reasoning must tolerate incomplete or contradictory premises.
3. Proposal of Strong Relevant Logics (SRLs)
SRLs (e.g., Rc, Ec, Tc) are presented as the only logic systems that satisfy all three criteria.
SRLs reject implicational paradoxes and enforce variable-sharing between premises and conclusions.
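To make the variable-sharing requirement concrete, here is a small, hedged Python sketch. It is a toy illustration of the relevance constraint that strong relevant logics impose (premises and conclusion must share at least one propositional variable), not an implementation of Rc, Ec, or Tc; the formula representation and helper names are my own.

```python
import re

def variables(formula: str) -> set[str]:
    """Extract propositional variables (single lowercase letters) from a formula string."""
    return set(re.findall(r"\b[p-z]\b", formula))

def shares_variables(premises: list[str], conclusion: str) -> bool:
    """Toy check of the variable-sharing (relevance) principle:
    the conclusion must share at least one variable with the premises."""
    premise_vars = set().union(*(variables(p) for p in premises))
    return bool(premise_vars & variables(conclusion))

# Classical logic tolerates conclusions disconnected from the premises;
# a relevance filter flags the disconnect.
print(shares_variables(["p -> q", "p"], "q"))        # True: conclusion reuses premise variables
print(shares_variables(["p -> q", "p"], "r or s"))   # False: no shared variables, no relevance
```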
4. Critique of LLM Architecture
LLMs are described as statistical token predictors trained on human text, lacking any embedded logic system.
They cannot evaluate the correctness of their outputs, only their plausibility.
The paper argues that LLMs simulate reasoning behavior but cannot guarantee logical validity.
❌ Why It Conflicts with My Conclusions
🧭 Philosophical Divergence
My View: LLMs traverse latent concept space, and prompting can guide them toward synthesis or even emergent novelty. Reasoning is probabilistic, contextual, and exploratory.
Cheng’s View: Reasoning must be formally correct, with embedded logic systems ensuring validity. Without this, LLMs are epistemically hollow—no matter how impressive their outputs appear.
🧭 Layered Map: Why Cheng’s Framework Misses the Broader Potential of LLMs and Agentic Systems
1. Epistemic Boundary: Formal Logic vs. Emergent Reasoning
🔍 Why Cheng misses the mark: He treats reasoning as a static, rule-bound process. But LLMs operate in a dynamic latent space where reasoning is emergent, not encoded. This allows for novel connections, abductive leaps, and creative synthesis that formal logic cannot anticipate.
2. Mechanistic Assumptions: Token Prediction vs. Embedded Reasoning Modules
🧠 Why this matters: Cheng critiques static LLMs, but ignores the evolution toward hybrid architectures—where symbolic logic, retrieval, and planning are fused with generative fluency. These systems do reason, albeit differently than SRLs.
3. Relevance and Discovery: Cheng’s Relevance vs. Latent Relevance
🌌 Why Cheng’s relevance is too rigid: In latent space, relevance is gradient-based, not binary. LLMs can surface weak signals, analogical bridges, and cross-domain insights that formal logic would discard as irrelevant.
4. Ampliativity and Novelty: Logic vs. Generative Discovery
✨ Why Cheng's ampliativity is too narrow: formal derivation only rearranges what the premises already contain, whereas LLMs synthesize across domains in latent space, producing candidate ideas that were not explicit in any single premise.
5. Paraconsistency and Contradiction: Rejection vs. Embrace
🌀 Why Cheng’s paraconsistency is incomplete: LLMs don’t collapse under contradiction—they explore it. Agentic systems use contradiction to refine beliefs, simulate counterfactuals, and navigate uncertainty.
🧠 Strategic Implications
Cheng’s logic is necessary but not sufficient: It’s ideal for domains requiring formal rigor (e.g., mathematics, law), and i gave a framework with DSPy integration, but insufficient for exploratory reasoning, creativity, or interdisciplinary synthesis.
Agentic systems are epistemically pluralistic: They combine symbolic logic, latent inference, and external tool use to reason across modalities (a minimal sketch follows this list).
Discovery requires looseness: Cheng’s rigidity excludes the very conditions under which new ideas emerge—ambiguity, analogy, and recombination.
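As referenced above, here is a minimal, hedged sketch of what epistemic pluralism can look like in code: a generative proposer (stubbed here, standing in for an LLM) produces candidates loosely, and a deterministic symbolic checker accepts only the ones it can verify. The proposer, the checker, and the toy problem are illustrative stand-ins, not a real agent framework.

```python
def propose_factor_pairs(n: int, num_candidates: int = 8) -> list[tuple[int, int]]:
    """Stub for a generative proposer (an LLM would go here): guesses factor pairs of n,
    some right, some wrong, mimicking plausible-but-unverified output."""
    return [(i, n // i) for i in range(2, 2 + num_candidates)]

def symbolic_check(n: int, pair: tuple[int, int]) -> bool:
    """Deterministic verifier: accepts a candidate only if it actually factors n."""
    a, b = pair
    return a * b == n and a > 1 and b > 1

def pluralistic_solve(n: int) -> list[tuple[int, int]]:
    """Loose generation, strict verification: keep only candidates that survive the check."""
    return [pair for pair in propose_factor_pairs(n) if symbolic_check(n, pair)]

print(pluralistic_solve(91))  # only verified factorisations survive, e.g. [(7, 13)]
```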
🧩 Visual Stack (Conceptual) [diagram omitted]
Toward a Pluralistic Epistemology of Reasoning
Cheng’s critique is valuable—it reminds us that correctness matters, and that reasoning must be accountable. But his conclusions are too narrow to capture the strategic, architectural, and epistemic potential of modern AI. LLMs and agentic systems are not replacements for formal logic—they are extensions of reasoning into new domains. They offer pluralistic epistemologies, where synthesis, analogy, and emergence complement deduction and formal rigor.
To dismiss LLMs as incapable of reasoning is to ignore the layered terrain of intelligence itself. Reasoning is not a monolith—it is a spectrum. And LLMs, far from being epistemically hollow, are becoming powerful navigators of that spectrum.
P.s.
For obvious reasons, I also disagree with some of Gary Marcus's views from time to time, although, again, the various challenges do make one think more deeply. And I am the richer for it:
Postnote (29th Aug 2025)
I would like to add this as a significant must-read(!) and a great in-depth companion for those who enjoy deeper dives: Devansh's How Google and Stanford made AI more Interpretable with a 20 year old Technique [Breakdowns].
How I Believe Devansh’s Article Complements Beyond Formalism
🧠 Why This Article Especially Matters
This isn't just another technical breakdown—it’s a strategic lens into the next frontier of AI. It bridges:
Mathematics (geometry of embeddings)
Systems engineering (algorithmic scaling)
Hardware economics (memory bottlenecks)
Geopolitical and market dynamics (trust, regulation, and value migration)
It’s especially relevant if you're working on AI safety, model introspection, or strategic planning in high-stakes domains like law, medicine, or finance.
🔍 Core Takeaways
1. The Problem of Interpretability
AI models encode concepts as dense, overlapping vectors (superposition).
To trust AI, we must reverse this process—extracting pure, monosemantic features.
2. The Dominance of Sparse Autoencoders (SAEs)
SAEs became the default tool for disentangling embeddings.
They’re scalable but heuristic—don’t guarantee true interpretability.
3. DB-KSVD: A Scaled Revival
A re-engineered version of the classic K-SVD algorithm.
Achieves performance on par with SAEs but via deterministic, interpretable optimization.
Uses massive parallelization and memory trade-offs to scale—10,000× faster than legacy K-SVD.
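Since K-SVD is the twenty-year-old technique at the heart of the story, here is a compact, hedged numpy sketch of the classic algorithm: orthogonal matching pursuit for the sparse-coding step, then an SVD-based update of one dictionary atom at a time. This is the textbook loop, not DB-KSVD itself; the distributed, memory-optimised engineering that makes it scale is exactly what the article describes and is not shown here.

```python
import numpy as np

def omp(D, y, sparsity):
    """Orthogonal matching pursuit: greedily pick atoms until `sparsity` are used."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def ksvd(Y, n_atoms, sparsity, n_iter=10):
    """Classic K-SVD: alternate sparse coding and per-atom SVD dictionary updates."""
    d, n = Y.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, Y[:, i], sparsity) for i in range(n)])
        for k in range(n_atoms):
            users = np.flatnonzero(X[k])
            if users.size == 0:
                continue
            # Residual with atom k removed, restricted to the samples that use it.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k], X[k, users] = U[:, 0], S[0] * Vt[0]
    return D, X

Y = np.random.default_rng(1).standard_normal((64, 500))   # stand-in for activation vectors
D, X = ksvd(Y, n_atoms=128, sparsity=5)
```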
4. The Convergence Signal
DB-KSVD and SAEs hit the same performance ceiling.
This suggests the bottleneck isn’t algorithmic—it’s geometric.
The limit is governed by the Welch Bound: you can’t pack infinite distinct features into finite space.
5. Implications for Future Research
Linear methods have hit their ceiling.
Next steps require:
Non-linear dictionary learning
Manifold and kernel methods
Memory-centric hardware (PIM, HBM)
6. Strategic & Economic Signals
Interpretability is becoming a market differentiator.
Regulatory pressure and enterprise risk are driving demand.
Value is shifting from model creators to control-layer providers (think: “Control-as-a-Service”).
📚 What You Learn
How embeddings encode meaning geometrically.
Why disentangling those embeddings is mathematically hard (NP-hard).
The trade-offs between scalability, interpretability, and mathematical rigor.
How hardware constraints (the “memory wall”) shape algorithmic feasibility.
Why interpretability is the next battleground for AI trust, regulation, and market value.
Digging deeper still, the two most intellectually rich layers are the mathematical constraints and the hardware implications. These are the tectonic forces shaping the future of interpretable AI:
📐 Mathematical Constraints: Geometry Meets Epistemology
1. Superposition and Sparse Recovery
Neural networks encode multiple concepts into overlapping dimensions—this is called superposition.
Sparse recovery aims to disentangle these into distinct, interpretable features.
But this is NP-hard in general—meaning it’s computationally infeasible to solve exactly at scale.
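A hedged toy example of superposition: pack more unit-norm "feature" directions than dimensions into a space, build an activation from a few of them, and notice that a naive dot-product readout picks up interference from every other feature. All numbers and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 32, 256                         # far more features than dimensions
features = rng.standard_normal((d, n_features))
features /= np.linalg.norm(features, axis=0)    # unit-norm feature directions

true_support = [3, 71, 200]                     # the features actually "active"
activation = features[:, true_support].sum(axis=1)

readout = features.T @ activation               # naive readout: dot product with every feature
print(sorted(np.argsort(-np.abs(readout))[:3])) # the true features usually top the list...
print(np.abs(np.delete(readout, true_support)).max())  # ...but interference from the rest is nonzero
```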
2. Welch Bound: The Geometric Bottleneck
The Welch Bound limits how many nearly-orthogonal vectors you can pack into a space.
In plain terms: there’s a hard ceiling on how many distinct concepts can be encoded without overlap.
This is why both DB-KSVD and SAEs hit the same interpretability ceiling—it's not about the algorithm, it's about the geometry.
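The Welch bound itself is a one-liner: for N unit-norm vectors in a d-dimensional space, the worst-case pairwise coherence cannot drop below sqrt((N - d) / (d * (N - 1))). A small sketch, assuming real-valued, unit-norm feature directions and an illustrative embedding width of 768:

```python
import math

def welch_bound(n_vectors: int, dim: int) -> float:
    """Lower bound on the maximum pairwise |cosine similarity| of n unit vectors in `dim` dims."""
    return math.sqrt((n_vectors - dim) / (dim * (n_vectors - 1)))

# The more concepts you pack into a fixed embedding width, the higher the
# unavoidable worst-case overlap between them (it rises toward 1/sqrt(dim)).
for n in (1_000, 10_000, 100_000):
    print(n, welch_bound(n, dim=768))
```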
3. Linear vs Nonlinear Dictionary Learning
DB-KSVD is linear—it finds a sparse basis using matrix factorization.
But real-world concepts often lie on nonlinear manifolds.
Future interpretability will require:
Kernel methods
Manifold learning
Nonlinear optimization techniques
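To give one concrete flavour of the non-linear direction, here is a hedged sketch of the kernel idea: instead of comparing activations with a plain inner product (the implicit assumption behind linear dictionary methods), compare them through an RBF kernel, which corresponds to an inner product in a much richer feature space. This is generic kernel machinery, not a specific proposal from the article.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """RBF (Gaussian) kernel matrix: K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
activations = rng.standard_normal((100, 64))      # stand-in for model activations

linear_sim = activations @ activations.T           # similarity a linear dictionary "sees"
kernel_sim = rbf_kernel(activations, activations)  # similarity in an implicit non-linear feature space
print(linear_sim.shape, kernel_sim.shape)
```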
💾 Hardware Implications: Memory is the New Compute
1. The Memory Wall
DB-KSVD scales via brute-force parallelism—but hits memory bottlenecks.
Modern GPUs are compute-rich but memory-poor.
Interpretability demands high-bandwidth memory access, not just FLOPs.
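A back-of-the-envelope way to see the memory wall, with hypothetical hardware numbers (the peak FLOP/s and bandwidth figures below are placeholders, not a specific GPU): compare a kernel's arithmetic intensity (FLOPs per byte moved) against the machine's balance point. Sparse-recovery style workloads, dominated by large matrix-vector products, sit far below that balance point, so they are limited by memory bandwidth rather than by compute.

```python
# Roofline-style back-of-the-envelope; all hardware numbers are hypothetical.
peak_flops = 100e12          # 100 TFLOP/s (placeholder)
peak_bandwidth = 2e12        # 2 TB/s (placeholder)
balance_point = peak_flops / peak_bandwidth   # FLOPs per byte needed to be compute-bound

# Dense matrix-vector product y = D @ x with D of shape (d, K), float32:
d, K = 4096, 65536
flops = 2 * d * K                          # one multiply + one add per matrix entry
bytes_moved = 4 * (d * K + K + d)          # read D and x, write y
intensity = flops / bytes_moved

print(f"balance point: {balance_point:.1f} FLOPs/byte")
print(f"mat-vec intensity: {intensity:.2f} FLOPs/byte")
print("memory-bound" if intensity < balance_point else "compute-bound")
```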
2. Emerging Hardware Paradigms
Memory-centric designs such as processing-in-memory (PIM) and high-bandwidth memory (HBM) move computation closer to the data, easing the bandwidth bottleneck that sparse-recovery workloads run into.
3. Algorithm-Hardware Co-Design
Future breakthroughs will come from co-designing algorithms with memory-centric hardware.
Think: interpretable AI chips optimized for sparse recovery, not just training speed.
🧭 Strategic Implication
This isn’t just about making AI safer—it’s about redefining the architecture of intelligence. The convergence of mathematical limits and hardware constraints is forcing a paradigm shift:
From brute-force scaling → to epistemically grounded design
From opaque embeddings → to transparent, modular representations
From model-centric value → to control-layer and infrastructure-centric value
Google and Stanford also play pivotal roles in this article—not just as contributors, but as architects of a paradigm shift in AI interpretability. Here's how they feature, and why their involvement matters:
🏛️ Stanford: The Intellectual Engine
Stanford’s contribution is deeply rooted in theory and algorithmic innovation:
Revival and Scaling of DB-KSVD: Stanford researchers re-engineered the classic K-SVD algorithm into a scalable form—DB-KSVD—capable of handling modern neural embeddings. This wasn’t just a technical upgrade; it was a philosophical pivot toward deterministic, interpretable optimization.
Mathematical Framing: They framed the interpretability problem in terms of sparse coding and the Welch Bound, showing that the bottleneck is geometric, not just computational. This insight reshapes how we think about the limits of model introspection.
Benchmarking Against SAEs: Stanford’s work rigorously compared DB-KSVD to sparse autoencoders, revealing that both hit the same performance ceiling—suggesting a deeper, structural constraint in embedding space.
🏢 Google: The Systems and Infrastructure Powerhouse
Google’s role is more infrastructural and strategic:
Massive Parallelization: Google provided the compute muscle to scale DB-KSVD—using tens of thousands of cores and memory-optimized pipelines. This enabled the algorithm to run 10,000× faster than legacy versions.
Hardware-Aware Optimization: Their engineers tackled the memory wall head-on, designing systems that trade off compute for bandwidth—critical for sparse recovery tasks.
Strategic Framing: Google positioned interpretability not just as a research goal, but as a product differentiator. Their involvement signals that explainable AI is moving from academic curiosity to enterprise necessity.
🤝 Why Their Collaboration Matters
This isn’t just a university-industry partnership—it’s a convergence of epistemic rigor and infrastructural scale:
Strategic Forecast: Control Is the New Frontier
Devansh is right, and both articles hint at a deeper shift:
The locus of epistemic value is migrating from model-centric to control-centric layers.
This means:
Interpretability will rely on overlays, probes, and dynamic interfaces.
Reasoning will be scaffolded by tool use, retrieval, and symbolic-latent fusion.
Trust will be earned through contextual performance, not formal guarantees.
Beyond the Bottlenecks
Devansh's breakdown and Beyond Formalism do not merely critique; they chart a path forward. Together, they call for architectures that are:
Geometrically aware
Epistemically pluralistic
Strategically layered
In a world where AI systems increasingly mediate knowledge, policy, and power, this synthesis is not just technical—it’s foundational.
Have a wonderful weekend all….













