The Cognitive Workspace Transformer (CWT v5.6)
Structured State Management for Compute-Efficient Language Modeling
Author: GitHub · Discord · Independent Research, 2025
Read the Full Paper
Live Paper (GitHub Pages) – interactive Plotly visualizations included
Overview
The Cognitive Workspace Transformer (CWT) replaces the standard transformer's undifferentiated residual stream with a structured hub-and-spoke workspace featuring:
- Content-addressed decay gates – selective forgetting based on who wrote what
- Dual-system processing (S1/S2) – inspired by dual-process theory
- PonderNet-style adaptive compute – variable depth per token at inference time
- Hub-derived epistemic signals – honest uncertainty without auxiliary classifiers
CWT is trained at 57.8M parameters and evaluated against two controlled baselines on identical data (FineWeb-Edu, 5.2B tokens).
Key Results
| Model | Total Params | Core Compute (Attn+FFN) | Layers | Val PPL |
|---|---|---|---|---|
| CWT v5.6 (pondered) | 57.8M | 22.9M | 8 | 29.54 |
| Parameter-matched baseline | 57.9M | ~32M | 8 | 30.67 |
| Compute-matched baseline | 67.5M | 41.7M | 13 | 29.04 |
- Beats the parameter-matched baseline by 3.7% despite fewer attention+FFN parameters
- Comes within 1.7% of a 13-layer baseline that has 51% more core compute capacity (41.7M vs 22.9M)
- Provides a smooth inference-time compute/quality tradeoff (PPL 34.82 at 1.0× to PPL 28.49 at 2.25× compute)
- 18.7% pondering benefit at convergence – stable from step 4,500 onward
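The inference-time compute/quality tradeoff above comes from a PonderNet-style halting distribution over ponder steps. As a minimal sketch (the function names and the flat halt probability are illustrative, not taken from the paper), the stopping distribution and expected depth can be computed like this:

```python
def halting_distribution(lambdas):
    """Convert per-step halt probabilities lambda_n into the PonderNet
    stopping distribution p_n = lambda_n * prod_{k<n} (1 - lambda_k)."""
    probs, survive = [], 1.0
    for lam in lambdas:
        probs.append(survive * lam)
        survive *= (1.0 - lam)
    # Fold leftover mass into the final step so the distribution sums to 1.
    probs[-1] += survive
    return probs

def expected_steps(lambdas):
    """Expected number of ponder steps under the halting distribution."""
    return sum((n + 1) * p for n, p in enumerate(halting_distribution(lambdas)))

# A flat halt probability of 0.5, capped at 4 steps:
print(halting_distribution([0.5, 0.5, 0.5, 0.5]))  # [0.5, 0.25, 0.125, 0.125]
print(expected_steps([0.5, 0.5, 0.5, 0.5]))        # 1.875
```

Raising or lowering the halting threshold at inference time shifts this expected depth, which is what produces the smooth 1.0×-to-2.25× compute dial.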
Architecture
```
┌─────────────────────── Workspace State (d_s = 896) ────────────────────────┐
│                                                                            │
│  ┌───────────┐  ┌──────────────┐  ┌──────────────────┐  ┌──────────────┐   │
│  │  Spokes   │  │  Billboards  │  │    Hub Shared    │  │     Tags     │   │
│  │  48d × 8  │  │   16d × 8    │  │       256d       │  │   16d × 8    │   │
│  │  Private  │  │  Permanent   │  │   Decay-gated    │  │   Identity   │   │
│  │ per-layer │  │  broadcast   │  │  shared memory   │  │   markers    │   │
│  └───────────┘  └──────────────┘  └──────────────────┘  └──────────────┘   │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
         │                                   │
   ┌─────┴─────┐                      ┌──────┴──────┐
   │ System 1  │ ◄── 6 layers         │  System 2   │ ◄── 2 layers × N steps
   │   (S1)    │     Fast processing  │    (S2)     │     Adaptive pondering
   │ bias: 3.0 │                      │  bias: 5.0  │     PonderNet halting
   └───────────┘                      └─────────────┘
```
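The control flow the diagram implies can be sketched as follows. The layer contents, the halting predicate, and the `max_ponder` cap are placeholders for illustration, not the paper's implementation:

```python
def cwt_forward(x, s1_layers, s2_layers, should_halt, max_ponder=8):
    """Skeleton of the two-system pass: the 6 S1 layers run once at fixed
    depth, then the 2-layer S2 block repeats until a halting signal fires
    (or the ponder cap is reached)."""
    for layer in s1_layers:            # System 1: fast, fixed-depth
        x = layer(x)
    for step in range(max_ponder):     # System 2: adaptive pondering
        for layer in s2_layers:
            x = layer(x)
        if should_halt(x, step):       # hub-derived halting signal
            break
    return x
```

With toy increment "layers" and a predicate that halts after the second ponder step, the function runs 6 S1 layers plus 2 × 2 S2 layers.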
Workspace Regions
- Spokes (48d × 8 layers) – private scratch memory per layer
- Billboards (16d × 8 layers) – permanent broadcast channels, never decayed
- Hub Shared (256d) – central communication channel with decay gates (+8,114% degradation when zeroed)
- Tags (16d × 8 layers) – identity markers for content-addressed forgetting
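Summing the region widths confirms the d_s = 896 workspace dimension (384 + 128 + 256 + 128). A sketch of the layout follows; the ordering of regions within the flat state vector is an assumption:

```python
# Region widths from the paper: 48d x 8 spokes, 16d x 8 billboards,
# a 256d shared hub, and 16d x 8 tags.
REGIONS = {
    "spokes":     48 * 8,   # 384d, private per-layer scratch
    "billboards": 16 * 8,   # 128d, permanent broadcast (never decayed)
    "hub_shared": 256,      # 256d, decay-gated shared memory
    "tags":       16 * 8,   # 128d, identity markers for forgetting
}

def region_slices(regions):
    """Map each region name to its slice of the flat workspace vector."""
    slices, offset = {}, 0
    for name, width in regions.items():
        slices[name] = slice(offset, offset + width)
        offset += width
    return slices, offset

slices, d_s = region_slices(REGIONS)
assert d_s == 896   # matches d_s = 896 in the architecture diagram
```

Keeping the regions as named slices of one vector means decay gates can be applied to `hub_shared` alone while billboards pass through untouched.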
Key Innovations
- CWT-MLA: Hub-derived Multi-head Latent Attention with 8× KV cache compression
- Hub Self-Distillation: 250× cheaper deep supervision versus vocabulary probes (v5.4 → v5.6)
- FlooredMultiply: Straight-through gradient estimation with configurable gradient floors
- Epistemic Self-Monitoring: Calibrated uncertainty from hub delta dynamics – no auxiliary classifiers
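FlooredMultiply can be pictured as a forward/backward pair: the forward pass is an ordinary gated multiply, while the backward pass clamps the gate's contribution at a configurable floor so that even a fully closed gate passes some learning signal. The floor value and the exact straight-through form here are assumptions, not the paper's code:

```python
def floored_multiply_forward(x, gate):
    """Forward: plain elementwise multiply by the gate value."""
    return x * gate

def floored_multiply_backward(grad_out, gate, floor=0.1):
    """Backward w.r.t. x: straight-through with a gradient floor, so a
    near-zero gate still lets max(gate, floor) of the gradient through."""
    return grad_out * max(gate, floor)

# A closed gate (0.0) blocks the forward signal but still leaks
# `floor` of the gradient, keeping the gate trainable.
print(floored_multiply_forward(2.0, 0.0))    # 0.0
print(floored_multiply_backward(1.0, 0.0))   # 0.1
```

In a framework like PyTorch this would live in a custom autograd function; the gradient floor prevents the dead-gate failure mode probed in the "Dead Gates" ablation below.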
Ablation Highlights (30+ interventions)
| Tier | Ablation | Degradation |
|---|---|---|
| 🔴 Existential | Zero Hub Shared | +8,114% |
| 🔴 Existential | Zero Tags | +4,392% |
| 🟠 Critical | Hub Writes ×2 | +547% |
| 🟡 Important | Dead Gates (all → 1.0) | +54% |
| 🟡 Important | Kill Pondering | +22% |
| 🟢 Minimal | Epistemic Gate | +0.04% |
Hub write sensitivity increased from +76% (step 8K) to +547% (step 20K), revealing precise calibration that develops over training.
Visualizations
The paper includes 18 interactive Plotly visualizations embedded as iframes:
In-Distribution ("The process of photosynthesis involvesβ¦"):
- 3D UMAP hub trajectory · Topology animation · Workspace regions · Hub deltas
- Hub similarity · Decay gates · Gate selectivity · Ponder oscillation
- Write magnitudes · Layer ranking
Out-of-Distribution ("hey bud no cap fo real fo real"):
- 3D UMAP hub trajectory · Workspace regions · Hub deltas
- Decay gates · Ponder oscillation · Write magnitudes
Comparative: Overlaid in-distribution vs OOD hub trajectories
Training Details
| Parameter | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT), ~5.2B tokens |
| Hardware | 4× NVIDIA RTX 3090, DDP |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| LR Schedule | Cosine decay, 3×10⁻⁴ → 10⁻⁵ |
| Batch Size | 96 sequences (3 × 8 accum × 4 GPUs) |
| Phase 1 | Steps 0–2K, S1 only, ~29K tok/s |
| Phase 2 | Steps 2K–20K, full pondering, ~16K tok/s |
| Tokenizer | GPT-2 (50,257 vocab) |
| Sequence Length | 4,096 |
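The LR schedule row can be sketched as a standard cosine decay from 3×10⁻⁴ to 10⁻⁵ over the 20K steps; the warmup handling is an assumption, since the table only specifies the decay range:

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-5, warmup=0):
    """Cosine decay from lr_max to lr_min, as in the training table.
    (Warmup length is an assumption; 0 disables it.)"""
    if step < warmup:
        return lr_max * step / max(warmup, 1)  # linear ramp during warmup
    t = (step - warmup) / max(total_steps - warmup, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

assert abs(cosine_lr(0, 20_000) - 3e-4) < 1e-12       # starts at peak LR
assert abs(cosine_lr(20_000, 20_000) - 1e-5) < 1e-12  # ends at floor LR
```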
Scaling Outlook
- At 130M params, overhead drops to ~5–7%
- At 300M+, overhead becomes noise
- CWT's efficiency advantage is predicted to increase with scale
Citation
If you reference this work:
GVP (2025). The Cognitive Workspace Transformer: Structured State Management
for Compute-Efficient Language Modeling. Independent Research.
https://steel-skull.github.io/CWT-V5.6/
License
This paper and its visualizations are published for research purposes. Please cite appropriately if referencing this work.