Research Architecture · 2025

HRM-RDT: Hierarchical Recurrent-Depth Transformer

A fusion of OpenMythos recurrent depth and HRM hierarchical convergence —
slow abstraction and fast computation, looped inside a stable scaffold.

Input → Prelude → [ Recurrent Block: H ↔ L (×K loops) ] → Coda → Output

How HRM Works

HRM has four learnable components: an input network that encodes tokens into vectors, a low-level L module for fast detailed computation, a high-level H module for abstract reasoning, and an output network that converts the final H hidden state to output predictions.

The critical insight is the timing: the H module advances only after the L module has completed multiple computational steps and reached a local equilibrium, at which point L is reset to begin a new phase guided by H's updated state. This is called hierarchical convergence: L runs fast until it settles, then H takes one slow step, then L resets and runs again.

The L module functions like a standard RNN, but its hidden state updates are conditioned not just on its own previous state, but also on the H module's current hidden state — which changes much more slowly.

Module — High-Level H Module

Slow, abstract reasoning. Takes one step per full L-convergence cycle. Its hidden state z_H guides the next L inner cycle.

Module — Low-Level L Module

Fast, detailed computation. Loops until local equilibrium, then resets. Conditioned on H's current hidden state at each step.
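
The timing described above can be sketched as a nested loop. This is a minimal illustration under stated assumptions, not the HRM reference implementation: the modules are random tanh-linear stand-ins rather than learned transformer blocks, L runs a fixed `T_L` steps instead of testing for equilibrium, and the reset is modeled by zeroing L's state. All names (`l_step`, `h_step`, `hrm_forward`, `T_L`, `N_CYCLES`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # hidden size (illustrative)
T_L = 4        # fast L steps per cycle (assumed fixed, not convergence-tested)
N_CYCLES = 3   # number of slow H steps (illustrative)

# Random stand-in weights; a real model would use learned blocks here.
W_LL, W_LH, W_LX = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
W_HH, W_HL = (rng.normal(scale=0.3, size=(d, d)) for _ in range(2))

def l_step(z_l, z_h, x):
    """Fast update: depends on L's own state, H's current state, and the input."""
    return np.tanh(W_LL @ z_l + W_LH @ z_h + W_LX @ x)

def h_step(z_h, z_l):
    """Slow update: consumes the state that L ended its cycle in."""
    return np.tanh(W_HH @ z_h + W_HL @ z_l)

def hrm_forward(x):
    z_h = np.zeros(d)
    l_updates = h_updates = 0
    for _ in range(N_CYCLES):
        z_l = np.zeros(d)              # reset L at the start of each cycle (assumption)
        for _ in range(T_L):           # L runs several fast steps while z_h is held fixed
            z_l = l_step(z_l, z_h, x)
            l_updates += 1
        z_h = h_step(z_h, z_l)         # H takes exactly one slow step per cycle
        h_updates += 1
    return z_h, l_updates, h_updates

x = rng.normal(size=d)
z_h, l_updates, h_updates = hrm_forward(x)
print(z_h.shape, l_updates, h_updates)
```

The counters make the two timescales explicit: L is updated `T_L` times for every single H update.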

✦ OPENMYTHOS ✦

How OpenMythos Works

The full data flow: Input token IDs → Embedding → Prelude (standard transformer blocks, run once) → Recurrent Block (one TransformerBlock looped T times, with update rule h_{t+1} = A·h_t + B·e + Transformer(h_t, e)) → Coda (standard transformer blocks, run once) → RMSNorm → LM head → Output logits.

The key is that e — the Prelude's output — is re-injected at every single loop step. Without this re-injection, the hidden state would drift away from the original input signal across deep loops. Learned matrices A and B govern how much of the previous hidden state and the encoded input carry forward at each step.
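
A minimal sketch of this update rule, to make the re-injection concrete. Assumptions: `block` is a hypothetical tanh-linear stand-in for the transformer block, A is taken to be diagonal, the weights are random rather than learned, and the loop starts from h_0 = e. None of these details come from the OpenMythos code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16   # hidden size and loop count (illustrative)

A = np.diag(rng.uniform(0.1, 0.9, size=d))   # carries the previous state forward
B = rng.normal(scale=0.3, size=(d, d))       # carries the encoded input forward
W_h, W_e = (rng.normal(scale=0.3, size=(d, d)) for _ in range(2))

def block(h, e):
    """Hypothetical stand-in for Transformer(h_t, e)."""
    return np.tanh(W_h @ h + W_e @ e)

def recurrent_block(e, n_loops=T):
    h = e.copy()                              # assumption: start from the Prelude output
    for _ in range(n_loops):
        # e appears in every iteration, so the input signal is never lost.
        h = A @ h + B @ e + block(h, e)
    return h

e = rng.normal(size=d)                        # stands in for the Prelude output
h_T = recurrent_block(e)
print(h_T.shape, bool(np.isfinite(h_T).all()))
```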

To prevent residual explosion, OpenMythos enforces the spectral radius of A to be less than 1 by construction, guaranteeing stability regardless of learning rate or gradient noise.
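
One way to get such a guarantee by construction is to parameterize A so that no parameter value can push an eigenvalue to 1 or beyond. The sketch below assumes a diagonal A whose entries are a sigmoid of unconstrained parameters, scaled by an arbitrary cap of 0.95. This is an illustrative choice, not necessarily the parameterization OpenMythos uses, and it checks only the linear part of the recurrence, h_{t+1} = A·h_t + B·e.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

raw = rng.normal(size=d)                       # unconstrained "learnable" parameters
a = 0.95 / (1.0 + np.exp(-raw))                # sigmoid times 0.95: entries in (0, 0.95)
A = np.diag(a)
B = rng.normal(size=(d, d))

# Spectral radius: largest eigenvalue magnitude. Below 1 for any value of raw.
rho = float(np.max(np.abs(np.linalg.eigvals(A))))

# With rho < 1 the linear recurrence settles at the fixed point (I - A)^{-1} B e.
e = rng.normal(size=d)
h = np.zeros(d)
for _ in range(500):
    h = A @ h + B @ e
h_star = np.linalg.solve(np.eye(d) - A, B @ e)

print(rho < 1.0, bool(np.allclose(h, h_star)))
```

Because the constraint lives in the parameterization, an optimizer step of any size changes `raw` but cannot move the spectral radius past the cap.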

✦ THE FUSION ✦

How the Hybrid Works

The fusion is conceptually clean: OpenMythos's single TransformerBlock inside the Recurrent Block is replaced by HRM's H+L pair. Instead of one block looping T times, L loops T_L times, then H takes one slow step, and this whole cycle repeats T_outer times. The Prelude output e is re-injected at every inner L step for linear time-invariant (LTI) stability.

The H module's hidden state z_H is what guides L's next inner cycle, exactly as in the original HRM. The OpenMythos stability guarantee (spectral radius < 1) applies to the outer recurrence, keeping the entire nested structure bounded across arbitrary depth.
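
The nested structure can be sketched as follows. This is one possible reading of the design, under assumptions: the outer update is written as z_H ← A·z_H + B·e + H(z_H, z_L), L is reset to zero each cycle, and both modules are random tanh-linear stand-ins. The names `hrm_rdt_block`, `T_L`, and `T_OUTER` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T_L, T_OUTER = 8, 4, 6   # illustrative sizes and loop counts

def rand(scale=0.3):
    return rng.normal(scale=scale, size=(d, d))

W_LL, W_LH, W_LE, W_HH, W_HL, B = (rand() for _ in range(6))
A = np.diag(rng.uniform(0.1, 0.9, size=d))   # diagonal, so spectral radius < 1

def l_step(z_l, z_h, e):
    """Fast module: guided by z_h, with e re-injected on every call."""
    return np.tanh(W_LL @ z_l + W_LH @ z_h + W_LE @ e)

def h_step(z_h, z_l):
    """Slow module: one step per completed inner cycle."""
    return np.tanh(W_HH @ z_h + W_HL @ z_l)

def hrm_rdt_block(e):
    z_h = e.copy()                            # assumption: start from the Prelude output
    norms = []
    for _ in range(T_OUTER):
        z_l = np.zeros(d)                     # L resets at the start of each cycle
        for _ in range(T_L):
            z_l = l_step(z_l, z_h, e)         # inner loop under a fixed z_h
        # Outer recurrence: stable linear carry plus the slow H update.
        z_h = A @ z_h + B @ e + h_step(z_h, z_l)
        norms.append(float(np.linalg.norm(z_h)))
    return z_h, norms

e = rng.normal(size=d)
z_h, norms = hrm_rdt_block(e)
print(len(norms), max(norms))
```

In this sketch the norm of z_H stays below max(‖e‖, (‖B·e‖ + √d) / (1 − ρ)), where ρ is the largest diagonal entry of A, because the tanh stand-in is bounded. A real transformer block is not bounded in that way, so the bound illustrates the role of ρ < 1 rather than proving anything about the full model.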

✦ DESIGN CHOICES ✦

Key Design Choices

Choice | Rationale
RMSNorm between loop iterations | Training stability per HRM paper
AdamW optimizer | Weight decay keeps weights bounded across deep recurrence
Cross-attention for H↔L coupling | H guides L without hard dependency
Fixed max_loop_iters + optional early exit | Avoids infinite loops; enables adaptive depth
Small model (<100M params) | Feasible to train on 1,000 samples
e re-injected at every inner L step | LTI stability; prevents input signal drift
Spectral radius of A < 1 | Guarantees outer recurrence stability by construction
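
Two of these choices, RMSNorm and the capped loop with early exit, are easy to sketch. Assumptions: the exit test is a simple threshold on the change between successive states, `rms_norm` has no learned gain, and `run_loops` and `tol` are hypothetical names. Only `max_loop_iters` comes from the table above.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: rescale x to unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x) + eps)

def run_loops(step, h0, max_loop_iters=64, tol=1e-6):
    """Apply `step` at most max_loop_iters times; stop early once the state settles."""
    h = h0
    for i in range(1, max_loop_iters + 1):
        h_next = step(h)
        if np.linalg.norm(h_next - h) < tol:   # optional early exit
            return h_next, i
        h = h_next
    return h, max_loop_iters                    # hard cap: the loop always terminates

# Toy contraction with fixed point 2.0, standing in for one recurrent step.
h, iters = run_loops(lambda h: 0.5 * h + 1.0, np.zeros(4))
y = rms_norm(np.array([3.0, -4.0, 12.0, 0.5]))
print(iters, h, float(np.sqrt(np.mean(y * y))))
```

In a real loop the two would compose, for example by passing `lambda h: rms_norm(block(h))` as `step`, where `block` is whatever recurrent update is in use.
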
✦ DATASET ✦

Dataset & Status

Dataset

Size | 1,000 samples
Domain | Human-based reasoning
Format | JSON: { input, output, reasoning_trace }
Split | 800 train / 100 val / 100 test
Compute | Lightning.ai · PyTorch + HuggingFace
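
A minimal sketch of the record format and the 800/100/100 split, using only the standard library. The records here are synthetic placeholders, not samples from the actual dataset, and the seeded shuffle is an assumption about how the split might be made reproducible.

```python
import json
import random

def make_placeholder(i):
    """Synthetic record with the three documented fields (placeholder content)."""
    return {
        "input": f"placeholder question {i}",
        "output": f"placeholder answer {i}",
        "reasoning_trace": f"placeholder reasoning steps for item {i}",
    }

records = [make_placeholder(i) for i in range(1000)]

# Seeded shuffle so the split is reproducible (assumption, not from the source).
random.Random(0).shuffle(records)
train_set = records[:800]
val_set = records[800:900]
test_set = records[900:]

# Each record round-trips through JSON with the same three keys.
line = json.dumps(train_set[0])
restored = json.loads(line)
print(len(train_set), len(val_set), len(test_set), sorted(restored))
```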

Project Milestones

✦ REFERENCES ✦

References