The Cognitive Workspace Transformer (CWT v5.6)

Structured State Management for Compute-Efficient Language Modeling

Author: GVP · Independent Research, 2025


📄 Read the Full Paper

🔗 Live Paper (GitHub Pages) — interactive Plotly visualizations included


Overview

The Cognitive Workspace Transformer (CWT) replaces the standard transformer's undifferentiated residual stream with a structured hub-and-spoke workspace featuring:

  • Content-addressed decay gates — selective forgetting based on who wrote what
  • Dual-system processing (S1/S2) — inspired by dual-process theory
  • PonderNet-style adaptive compute — variable depth per token at inference time
  • Hub-derived epistemic signals — honest uncertainty without auxiliary classifiers
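The last bullet is expanded later on this page as calibrated uncertainty read from "hub delta dynamics". As a toy illustration only (the delta-norm readout and the array shapes are my assumptions, not the authors' code), such a signal might be computed from successive hub states across ponder steps:

```python
import numpy as np

def epistemic_signal(hub_states):
    """Hypothetical uncertainty proxy: mean magnitude of successive hub deltas.

    Large, non-shrinking hub updates across ponder steps are read as the model
    still being "unsettled" about the token; small deltas as settled.
    """
    hub_states = np.asarray(hub_states)
    deltas = np.linalg.norm(np.diff(hub_states, axis=0), axis=1)
    return float(deltas.mean())

# A converging ("settled") trajectory vs. a jumpy ("unsettled") one,
# using the paper's 256d hub size.
settled = [np.ones(256) * s for s in (1.0, 1.01, 1.011)]
rng = np.random.default_rng(0)
unsettled = [rng.standard_normal(256) for _ in range(3)]
print(epistemic_signal(settled) < epistemic_signal(unsettled))
```

Any monotone readout of these deltas (mean, max, trend) would serve the same role; the point is that the signal falls out of the workspace dynamics with no auxiliary classifier.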

CWT is trained at 57.8M parameters and evaluated against two controlled baselines on identical data (FineWeb-Edu, 5.2B tokens).

Key Results

Model                        Total Params   Core Compute (Attn+FFN)   Layers   Val PPL
CWT v5.6 (pondered)          57.8M          22.9M                     8        29.54
Parameter-matched baseline   57.9M          ~32M                      8        30.67
Compute-matched baseline     67.5M          41.7M                     13       29.04
  • Beats the parameter-matched baseline by 3.7% despite fewer attention+FFN parameters
  • Comes within 1.7% of a 13-layer baseline with 82% more core compute capacity (41.7M vs 22.9M attention+FFN parameters)
  • Provides a smooth inference-time compute/quality tradeoff (PPL 34.82 at 1.0× compute to PPL 28.49 at 2.25×)
  • 18.7% pondering benefit at convergence — stable from step 4,500 onward

Architecture

┌─────────────────────── Workspace State (d_s = 896) ───────────────────────┐
│                                                                           │
│  ┌───────────┐  ┌──────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │  Spokes   │  │  Billboards  │  │   Hub Shared     │  │    Tags      │  │
│  │  48d × 8  │  │  16d × 8     │  │   256d           │  │  16d × 8     │  │
│  │  Private  │  │  Permanent   │  │   Decay-gated    │  │  Identity    │  │
│  │ per-layer │  │  broadcast   │  │  shared memory   │  │  markers     │  │
│  └───────────┘  └──────────────┘  └──────────────────┘  └──────────────┘  │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
        │                                      │
  ┌─────┴──────┐                        ┌──────┴──────┐
  │  System 1  │ ──── 6 layers ────→    │  System 2   │ ──── 2 layers × N steps
  │  (S1)      │  Fast processing       │  (S2)       │  Adaptive pondering
  │  bias: 3.0 │                        │  bias: 5.0  │  PonderNet halting
  └────────────┘                        └─────────────┘
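The S2 loop uses PonderNet-style halting. As a rough sketch (the logit values and the reading of the bias terms are assumptions, not the paper's implementation), the halting distribution over ponder steps can be formed like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ponder_halting_probs(halt_logits):
    """PonderNet halting distribution from per-step halting logits.

    p_n = lambda_n * prod_{k<n}(1 - lambda_k), with the final step absorbing
    the remaining mass so the distribution sums to 1.
    """
    lambdas = sigmoid(np.asarray(halt_logits, dtype=np.float64))
    p = np.zeros_like(lambdas)
    still_running = 1.0
    for n, lam in enumerate(lambdas):
        if n == len(lambdas) - 1:
            p[n] = still_running          # force halting at the last step
        else:
            p[n] = lam * still_running
            still_running *= 1.0 - lam
    return p

# Example with 4 ponder steps; a positive halting bias (cf. S2's bias of 5.0)
# would shift these logits up, i.e. toward halting earlier.
probs = ponder_halting_probs([-1.0, 0.0, 1.0, 2.0])
expected_steps = float(np.sum((np.arange(4) + 1) * probs))
print(probs.sum(), expected_steps)
```

At inference, sampling (or thresholding) this distribution is what yields variable depth per token, and hence the compute/quality dial reported in the results.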

Workspace Regions

  • Spokes (48d × 8 layers) — private scratch memory per layer
  • Billboards (16d × 8 layers) — permanent broadcast channels, never decayed
  • Hub Shared (256d) — central communication channel with decay gates (+8,114% degradation when zeroed)
  • Tags (16d × 8 layers) — identity markers for content-addressed forgetting
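A minimal sketch of how a content-addressed decay step over these regions could look, using the sizes above but a hypothetical tag-conditioned sigmoid gate (the parameterization is my guess, not the paper's):

```python
import numpy as np

D_HUB, D_TAG = 256, 16   # Hub Shared and Tag sizes from the spec above

def decay_gated_hub_update(hub, write, writer_tag, W_gate, b_gate=0.0):
    """One hypothetical content-addressed decay step on the shared hub.

    The retain gate is conditioned on the writer's identity tag, so the
    workspace can forget selectively based on *who* wrote the content.
    """
    gate = 1.0 / (1.0 + np.exp(-(writer_tag @ W_gate + b_gate)))  # in (0, 1)
    return gate * hub + (1.0 - gate) * write  # retain old vs. accept new

rng = np.random.default_rng(0)
hub = rng.standard_normal(D_HUB)
write = rng.standard_normal(D_HUB)
tag = rng.standard_normal(D_TAG)
W_gate = rng.standard_normal((D_TAG, D_HUB)) * 0.1
new_hub = decay_gated_hub_update(hub, write, tag, W_gate)
print(new_hub.shape)
```

Because the gate is a convex combination, each hub coordinate stays between the old value and the incoming write; Billboards would simply skip this gate (never decayed).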

Key Innovations

  • CWT-MLA: Hub-derived Multi-head Latent Attention with 8× KV cache compression
  • Hub Self-Distillation: 250× cheaper deep supervision versus vocabulary probes (v5.4 → v5.6)
  • FlooredMultiply: Straight-through gradient estimation with configurable gradient floors
  • Epistemic Self-Monitoring: Calibrated uncertainty from hub delta dynamics — no auxiliary classifiers
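FlooredMultiply is not defined on this page; one plausible reading of "straight-through gradient estimation with configurable gradient floors", sketched here with explicit forward/backward passes and an assumed floor of 0.1:

```python
import numpy as np

def floored_multiply_forward(x, gate):
    """Forward pass: plain elementwise gating."""
    return x * gate

def floored_multiply_backward(grad_out, x, gate, grad_floor=0.1):
    """Backward pass with a gradient floor (straight-through flavor).

    d/dx uses max(gate, grad_floor), so a fully closed gate (gate ~ 0)
    still lets some gradient reach x. The gate's own gradient is exact.
    """
    grad_x = grad_out * np.maximum(gate, grad_floor)
    grad_gate = grad_out * x
    return grad_x, grad_gate

x = np.array([1.0, 2.0, 3.0])
gate = np.array([0.0, 0.05, 0.9])
y = floored_multiply_forward(x, gate)
gx, gg = floored_multiply_backward(np.ones(3), x, gate)
print(y, gx, gg)
```

The design rationale under this reading: hard-closed gates would otherwise block all learning signal through the gated path, which matters given how sensitive the ablations show the gates to be.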

Ablation Highlights (30+ interventions)

Tier             Ablation                 Degradation
🔴 Existential   Zero Hub Shared          +8,114%
🔴 Existential   Zero Tags                +4,392%
🟠 Critical      Hub Writes ×2            +547%
🟡 Important     Dead Gates (all → 1.0)   +54%
🟡 Important     Kill Pondering           +22%
🟢 Minimal       Epistemic Gate           +0.04%

Hub write sensitivity increased from +76% (step 8K) to +547% (step 20K), revealing precise calibration that develops over training.

Visualizations

The paper includes 18 interactive Plotly visualizations embedded as iframes:

In-Distribution ("The process of photosynthesis involves…"):

  • 3D UMAP hub trajectory · Topology animation · Workspace regions · Hub deltas
  • Hub similarity · Decay gates · Gate selectivity · Ponder oscillation
  • Write magnitudes · Layer ranking

Out-of-Distribution ("hey bud no cap fo real fo real"):

  • 3D UMAP hub trajectory · Workspace regions · Hub deltas
  • Decay gates · Ponder oscillation · Write magnitudes

Comparative: Overlaid in-distribution vs OOD hub trajectories

Training Details

Parameter         Value
Dataset           FineWeb-Edu (sample-10BT), ~5.2B tokens
Hardware          4× NVIDIA RTX 3090, DDP
Optimizer         AdamW (β₁=0.9, β₂=0.95)
LR Schedule       Cosine decay, 3×10⁻⁴ → 10⁻⁵
Batch Size        96 sequences (3 × 8 accum × 4 GPUs)
Phase 1           Steps 0–2K, S1 only, ~29K tok/s
Phase 2           Steps 2K–20K, full pondering, ~16K tok/s
Tokenizer         GPT-2 (50,257 vocab)
Sequence Length   4,096
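The LR schedule row can be reproduced with a standard cosine decay. A small sketch, where the 20K total steps come from the Phase 2 row and the absence of warmup is an assumption:

```python
import math

def cosine_lr(step, total_steps=20_000, lr_max=3e-4, lr_min=1e-5):
    """Cosine decay from lr_max to lr_min over total_steps (values from the table)."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Start, midpoint, and end of training.
print(cosine_lr(0), cosine_lr(10_000), cosine_lr(20_000))
```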

Scaling Outlook

  • At 130M params → overhead drops to ~5–7%
  • At 300M+ → overhead becomes noise
  • CWT's efficiency advantage is predicted to increase with scale

Citation

If you reference this work:

GVP (2025). The Cognitive Workspace Transformer: Structured State Management
for Compute-Efficient Language Modeling. Independent Research.
https://steel-skull.github.io/CWT-V5.6/

License

This paper and its visualizations are published for research purposes. Please cite appropriately if referencing this work.
