The Looped Core
OpenMythos / Claude Mythos Theory
Standard transformers stack 100+ unique layers, causing parameter counts to explode. OpenMythos proposes a radical departure: take a minimal set of transformer layers and loop them continuously in a single forward pass.
Same weights, more iterations. The model deepens its latent thinking dynamically, and all reasoning stays silent inside the hidden state: no "chain-of-thought" token vomit, just continuous state refinement.
Prelude: normal transformer blocks run once to convert input tokens into an initial hidden state h and an immutable injection vector e.
Recurrent Block: looped n_loops times. Contains switchable MLA/GQA attention and a sparse Mixture of Experts (MoE). Crucially, it re-injects the original signal via learned matrices.
Coda: final transformer blocks run once to decode the deeply refined hidden state back into logit distributions.
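The three-stage pass can be sketched in a few lines. Everything here is illustrative: the block functions are toy stand-ins for real transformer layers, and the names (prelude, recurrent_block, coda, n_loops) follow the description above rather than any released code.

```python
def prelude(tokens):
    # Runs once: embed input into hidden state h and the
    # immutable injection vector e.
    h = [float(t) for t in tokens]
    e = list(h)  # frozen copy of the original signal
    return h, e

def recurrent_block(h, e, alpha=0.5, beta=0.3):
    # Stand-in for the looped block: refines the current state
    # while re-injecting the original signal e.
    return [alpha * hi + beta * ei for hi, ei in zip(h, e)]

def coda(h):
    # Runs once: decode the refined state (identity stand-in here).
    return h

def forward(tokens, n_loops=4):
    h, e = prelude(tokens)
    for _ in range(n_loops):  # same weights, reused n_loops times
        h = recurrent_block(h, e)
    return coda(h)
```

Raising n_loops deepens the computation without adding a single parameter; the recurrent block's weights are reused on every pass.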
The Stability Fix
Naively looping a transformer causes exploding/vanishing gradients (the BPTT nightmare). OpenMythos treats the loop as a linear time-invariant (LTI) dynamical system.
h_{t} = Transformer(h_{t-1}) + A·h_{t-1} + B·e
Constraint:
Spectral radius ρ(A) < 1
By parameterizing A as a diagonal matrix with entries pinned inside (−1, 0), ρ(A) < 1 holds by construction and the system is forced to be contractive. It guarantees stability no matter how many times you loop.
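A minimal sketch of the contractive parameterization, assuming a diagonal A squashed into (−1, 0) via a sigmoid and a scalar gain standing in for B; both are illustrative choices, not the exact construction.

```python
import math

def contractive_diag(params):
    # Squash free parameters into (-1, 0): every diagonal entry of A
    # then has magnitude below 1, so the spectral radius rho(A) < 1.
    return [-1.0 / (1.0 + math.exp(-p)) for p in params]

def loop_step(h, a, b, e):
    # Linear part of the looped update: h <- A*h + B*e,
    # with diagonal A and a scalar gain b standing in for B.
    return [ai * hi + b * ei for ai, hi, ei in zip(a, h, e)]
```

Because every |a_i| < 1 the linear part is a contraction, so repeated looping converges elementwise toward the fixed point b·e_i / (1 − a_i) instead of exploding, regardless of loop count.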
The "Latent Thought" Advantage
Because it doesn't output tokens during reasoning, it can explore multiple hypothetical paths in parallel in high-dimensional space (breadth-first search vibe), converging on logic that token-by-token (depth-first) CoT models fail to reach.
Strengths (Pros)
- Infinite Inference Depth: Crank up reasoning loops at test-time without adding weights.
- No CoT Overhead: Avoids the massive token-generation latency of traditional Chain-of-Thought models.
- Parameter Efficient: Reuses the exact same transformer blocks via a continuous dynamical state.
Limitations (Cons)
- Latent Drift: Without oversight, infinite loops can drift into nonsensical latent states (hallucinations).
- Local Minima Traps: Breadth-first search can get stuck in dead-ends without a "high-level" planner to reset context.
- Training Instability: The strict LTI spectral constraint required to stop gradient explosion inherently limits the model's expressiveness.
High-Level Module (H)
The slow, abstract "System 2" planner. It operates on a compressed timescale, updating only occasionally (e.g., every 4 inner steps) to survey the global state and set a fresh context vector for the worker.
Low-Level Module (L)
The fast, detailed "System 1" worker. It runs continuously, grinding out rapid local computations and executing the micro-steps required to fulfill the current abstract plan set by H.
Biological Analogy
This mirrors neocortical processing, where higher-order association areas (H) maintain slow-changing semantic goals, while primary sensory/motor cortices (L) process high-frequency granular data.
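The dual-timescale scheme can be sketched as two nested loops. The update rules below are toy linear stand-ins chosen only to make the timescale split concrete; the real H and L modules are recurrent transformer blocks.

```python
def h_update(z_h, z_l):
    # Slow "System 2" planner: surveys the worker's state and
    # refreshes the abstract plan (toy rule: average the two states).
    return [(a + b) / 2.0 for a, b in zip(z_h, z_l)]

def l_update(z_l, z_h, x):
    # Fast "System 1" worker: a rapid local step conditioned on the
    # current plan z_h and the input x.
    return [0.5 * l + 0.25 * h + 0.25 * xi
            for l, h, xi in zip(z_l, z_h, x)]

def hrm_forward(x, n_outer=4, n_inner=4):
    z_h = [0.0] * len(x)  # high-level (slow) state
    z_l = [0.0] * len(x)  # low-level (fast) state
    for _ in range(n_outer):
        for _ in range(n_inner):    # L grinds out micro-steps...
            z_l = l_update(z_l, z_h, x)
        z_h = h_update(z_h, z_l)    # ...then H updates once per cycle
    return z_h
```

H sees only one update per n_inner worker steps, i.e. it runs on a timescale n_inner times slower than L, matching the "every 4 inner steps" cadence above.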
Structured Control
Hierarchical Reasoning Model (HRM)
Published in mid-2025 by Sapient Intelligence, HRM proved that raw scale isn't the only way to achieve AGI-level reasoning. By splitting computation into dual timescales, a tiny 27M parameter model outperformed massive monolithic networks on ARC-AGI and Sudoku.
Strengths (Pros)
- Strategic Oversight: The High-Level module actively reviews and course-corrects the Low-Level worker.
- Hyper-Efficient: Achieves breakthroughs on complex benchmarks (ARC-AGI) at just 27M parameters.
- Self-Converging: Naturally stops computing when the internal state reaches equilibrium.
Limitations (Cons)
- Rigid Architecture: Two distinct modules mean that passing gradients between them can be computationally awkward and prone to BPTT limits.
- Scale Ceiling: HRM has not yet been shown to scale to the raw breadth and efficiency of massive MoE looped models like Mythos.
- Slower Inner Loop: Without the architectural stability optimizations of OpenMythos, the L-module grind can severely bottleneck inference speed.
Project Chimera
The recurrent block in OpenMythos becomes hierarchical.
Unlimited dynamic depth meets structured, self-correcting cognitive control.
Phase 1: Pretraining
- Train the 700M parameter base with small fixed loops (e.g., 4 outer × 4 inner) on a high-quality 4B token reasoning dataset.
- The dual-module structure prevents the pure OpenMythos loop from collapsing into local minima early in training.
- Maintain the OpenMythos ρ(A) < 1 spectral constraint across the entire hierarchical block to guarantee gradient stability.
Phase 2: Inference & Scale
- Crank the loops at inference: Dynamically increase n_outer based on problem complexity, instantly reaching the effective depth of a 1.6B+ parameter model.
- Instruction tuning is performed by freezing the recurrent core and only updating the Prelude/Coda layers, making it incredibly cheap.
- Yields a model possessing both "quick reflexes" (L-module) and "strategic oversight" (H-module).
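A sketch of how the inference loop might be cranked, assuming a generic step function for one hierarchical outer cycle; the equilibrium test reuses HRM's self-converging idea, and all names (chimera_infer, max_outer, tol) are hypothetical.

```python
def chimera_infer(x, step, max_outer=64, tol=1e-6):
    # step(h, x) runs one outer cycle (one H update over several L steps).
    # Keep looping until the latent state stops moving (equilibrium)
    # or the dynamically chosen loop budget max_outer runs out.
    h = [0.0] * len(x)
    for n in range(1, max_outer + 1):
        h_next = step(h, x)
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:   # self-converged: no need to burn more loops
            break
    return h, n

# Toy contractive step: pulls the state halfway toward the input.
toy_step = lambda h, x: [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x)]
```

Hard problems simply get a larger max_outer; easy ones exit early via the equilibrium check, so depth scales with difficulty at zero parameter cost.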
Projected Scaling Efficiency
Because Chimera computes iteratively in the latent space, parameter count is no longer tied to reasoning depth. A 700M Chimera model structurally mimics the logic depth of an 8B monolithic transformer, but uses hierarchical steering to avoid getting lost in the loop.
*Hypothetical performance based on architecture synthesis.