The Looped Core
OpenMythos / Claude Mythos Theory
Standard transformers stack 100+ unique layers, causing parameter counts to explode. OpenMythos proposes a radical departure: take a minimal set of transformer layers and loop them continuously in a single forward pass.
Same weights, more iterations. The model deepens its latent thinking dynamically, and all reasoning stays silent inside the hidden state: no "chain-of-thought" token vomit, just continuous state refinement.
Prelude: normal transformer blocks run once to convert input tokens into an initial hidden state h and an immutable injection vector e.
Recurrent Block: looped n_loops times. Contains switchable MLA/GQA attention and a sparse Mixture of Experts (MoE). Crucially, it re-injects the original signal via learned matrices.
Coda: final transformer blocks run once to decode the deeply refined hidden state back into logit distributions.
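The three-stage pass can be sketched in a few lines. Everything here is illustrative: the block functions are toy stand-ins for real transformer layers, and the names (prelude, recurrent_block, coda, n_loops) follow the description above rather than any released code.

```python
def prelude(tokens):
    # Runs once: embed input into hidden state h and the
    # immutable injection vector e.
    h = [float(t) for t in tokens]
    e = list(h)  # frozen copy of the original signal
    return h, e

def recurrent_block(h, e, alpha=0.5, beta=0.3):
    # Stand-in for the looped block: refines the current state
    # while re-injecting the original signal e.
    return [alpha * hi + beta * ei for hi, ei in zip(h, e)]

def coda(h):
    # Runs once: decode the refined state (identity stand-in here).
    return h

def forward(tokens, n_loops=4):
    h, e = prelude(tokens)
    for _ in range(n_loops):  # same weights, reused n_loops times
        h = recurrent_block(h, e)
    return coda(h)
```

Raising n_loops deepens the computation without adding a single parameter; the recurrent block's weights are reused on every pass.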
The Stability Fix
Naively looping a transformer causes exploding/vanishing gradients (the BPTT nightmare). OpenMythos treats the loop as a linear time-invariant (LTI) dynamical system.
h_{t} = Transformer(h_{t-1}) + A·h_{t-1} + B·e
Constraint:
Spectral radius ρ(A) < 1
By parameterizing A as a diagonal matrix with entries pinned inside (−1, 0), ρ(A) < 1 holds by construction and the system is forced to be contractive. It guarantees stability no matter how many times you loop.
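A minimal sketch of the contractive parameterization, assuming a diagonal A squashed into (−1, 0) via a sigmoid and a scalar gain standing in for B; both are illustrative choices, not the exact construction.

```python
import math

def contractive_diag(params):
    # Squash free parameters into (-1, 0): every diagonal entry of A
    # then has magnitude below 1, so the spectral radius rho(A) < 1.
    return [-1.0 / (1.0 + math.exp(-p)) for p in params]

def loop_step(h, a, b, e):
    # Linear part of the looped update: h <- A*h + B*e,
    # with diagonal A and a scalar gain b standing in for B.
    return [ai * hi + b * ei for ai, hi, ei in zip(a, h, e)]
```

Because every |a_i| < 1 the linear part is a contraction, so repeated looping converges elementwise toward the fixed point b·e_i / (1 − a_i) instead of exploding, regardless of loop count.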
The "Latent Thought" Advantage
Because it doesn't output tokens during reasoning, it can explore multiple hypothetical paths in parallel in high-dimensional space (breadth-first search vibe), converging on logic that token-by-token (depth-first) CoT models fail to reach.
Strengths (Pros)
- Infinite Inference Depth: Crank up reasoning loops at test-time without adding weights.
- No CoT Overhead: Avoids the massive token-generation latency of traditional Chain-of-Thought models.
- Parameter Efficient: Reuses the exact same transformer blocks via a continuous dynamical state.
Limitations (Cons)
- Latent Drift: Without oversight, infinite loops can drift into nonsensical latent states (hallucinations).
- Local Minima Traps: Breadth-first search can get stuck in dead-ends without a "high-level" planner to reset context.
- Training Instability: The strict LTI spectral constraint required to stop gradient explosion inherently limits the model's expressiveness.
High-Level Module (H)
The slow, abstract "System 2" planner. It operates on a compressed timescale, updating only occasionally (e.g., every 4 inner steps) to survey the global state and set a fresh context vector for the worker.
Low-Level Module (L)
The fast, detailed "System 1" worker. It runs continuously, grinding out rapid local computations and executing the micro-steps required to fulfill the current abstract plan set by H.
Biological Analogy
This mirrors neocortical processing, where higher-order association areas (H) maintain slow-changing semantic goals, while primary sensory/motor cortices (L) process high-frequency granular data.
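The dual-timescale scheme can be sketched as two nested loops. The update rules below are toy linear stand-ins chosen only to make the timescale split concrete; the real H and L modules are recurrent transformer blocks.

```python
def h_update(z_h, z_l):
    # Slow "System 2" planner: surveys the worker's state and
    # refreshes the abstract plan (toy rule: average the two states).
    return [(a + b) / 2.0 for a, b in zip(z_h, z_l)]

def l_update(z_l, z_h, x):
    # Fast "System 1" worker: a rapid local step conditioned on the
    # current plan z_h and the input x.
    return [0.5 * l + 0.25 * h + 0.25 * xi
            for l, h, xi in zip(z_l, z_h, x)]

def hrm_forward(x, n_outer=4, n_inner=4):
    z_h = [0.0] * len(x)  # high-level (slow) state
    z_l = [0.0] * len(x)  # low-level (fast) state
    for _ in range(n_outer):
        for _ in range(n_inner):    # L grinds out micro-steps...
            z_l = l_update(z_l, z_h, x)
        z_h = h_update(z_h, z_l)    # ...then H updates once per cycle
    return z_h
```

H sees only one update per n_inner worker steps, i.e. it runs on a timescale n_inner times slower than L, matching the "every 4 inner steps" cadence above.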
Structured Control
Hierarchical Reasoning Model (HRM)
Published in mid-2025 by Sapient Intelligence, HRM proved that raw scale isn't the only way to achieve AGI-level reasoning. By splitting computation into dual timescales, a tiny 27M parameter model outperformed massive monolithic networks on ARC-AGI and Sudoku.
Strengths (Pros)
- Strategic Oversight: The High-Level module actively reviews and course-corrects the Low-Level worker.
- Hyper-Efficient: Achieves breakthroughs on complex benchmarks (ARC-AGI) at just 27M parameters.
- Self-Converging: Naturally stops computing when the internal state reaches equilibrium.
Limitations (Cons)
- Rigid Architecture: Two distinct modules mean that passing gradients between them can be computationally awkward and prone to BPTT limits.
- Scale Ceiling: HRM has not yet been shown to scale to the raw breadth and efficiency of massive MoE looped models like Mythos.
- Slower Inner Loop: Without the architectural stability optimizations of OpenMythos, the L-module grind can severely bottleneck inference speed.
Project Chimera
The recurrent block in OpenMythos becomes hierarchical.
Unlimited dynamic depth meets structured, self-correcting cognitive control.
Phase 1: Pretraining
- Train the 700M parameter base with small fixed loops (e.g., 4 outer × 4 inner) on a high-quality 4B token reasoning dataset.
- The dual-module structure prevents the pure OpenMythos loop from collapsing into local minima early in training.
- Maintain the OpenMythos ρ(A) < 1 spectral constraint across the entire hierarchical block to guarantee gradient stability.
Phase 2: Inference & Scale
- Crank the loops at inference: Dynamically increase n_outer based on problem complexity, instantly reaching the effective depth of a 1.6B+ parameter model.
- Instruction tuning is performed by freezing the recurrent core and only updating the Prelude/Coda layers, making it incredibly cheap.
- Yields a model possessing both "quick reflexes" (L-module) and "strategic oversight" (H-module).
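A sketch of how the inference loop might be cranked, assuming a generic step function for one hierarchical outer cycle; the equilibrium test reuses HRM's self-converging idea, and all names (chimera_infer, max_outer, tol) are hypothetical.

```python
def chimera_infer(x, step, max_outer=64, tol=1e-6):
    # step(h, x) runs one outer cycle (one H update over several L steps).
    # Keep looping until the latent state stops moving (equilibrium)
    # or the dynamically chosen loop budget max_outer runs out.
    h = [0.0] * len(x)
    for n in range(1, max_outer + 1):
        h_next = step(h, x)
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:   # self-converged: no need to burn more loops
            break
    return h, n

# Toy contractive step: pulls the state halfway toward the input.
toy_step = lambda h, x: [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x)]
```

Hard problems simply get a larger max_outer; easy ones exit early via the equilibrium check, so depth scales with difficulty at zero parameter cost.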
Projected Scaling Efficiency
Because Chimera computes iteratively in the latent space, parameter count is no longer tied to reasoning depth. A 700M Chimera model structurally mimics the logic depth of an 8B monolithic transformer, but uses hierarchical steering to avoid getting lost in the loop.
*Hypothetical performance based on architecture synthesis.