i3-200M - Hybrid Architecture Language Model
Model Description
The i3-200M model (aka Redherring) is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
To use the model, try it here.
Model Statistics
- Total Parameters: ~169.85M
- Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- Vocabulary Size: 32,000 tokens
- Hidden Dimension (d_model): 512
- Attention Heads: 16
- State Dimension (d_state): 32
- Max Sequence Length: 256
- Tokenization: BPE
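
For quick reference, these statistics map onto a single configuration object along the lines of the sketch below. The key names are illustrative assumptions, not the exact fields of the released config.json.

```python
# Hypothetical configuration mirroring the statistics above.
# Key names are illustrative; the shipped config.json may use different ones.
i3_200m_config = {
    "n_layers": 16,           # 10 hybrid + 6 attention
    "n_hybrid_layers": 10,    # RWKV-Mamba blocks (layers 1-10)
    "n_attention_layers": 6,  # full multi-head attention blocks (layers 11-16)
    "d_model": 512,           # hidden dimension
    "n_heads": 16,            # attention heads
    "d_state": 32,            # state-space dimension
    "vocab_size": 32000,
    "max_seq_len": 256,
    "tokenizer": "bpe",
}
```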
Architecture Breakdown
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
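
A minimal PyTorch sketch of this layer ordering is shown below. `HybridBlock` is only a stand-in for the actual RWKVMambaHybrid layer (whose internals are not documented here); the 10 + 6 stacking, 4x FFN expansion, 512-dim hidden size, and 16 heads come from this card, everything else is an assumption.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN = 512, 16, 10, 6

class FeedForward(nn.Module):
    """Position-wise FFN with the 4x expansion noted above."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Stand-in for RWKVMambaHybrid: a depthwise causal convolution
    approximates the recurrent/conv time-mixing of the real layer."""
    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.mix(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Full multi-head self-attention block with a causal mask."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))

# Layers 1-10 hybrid, layers 11-16 full attention, matching the breakdown above.
backbone = nn.Sequential(
    *[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
    *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)],
)
```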
Comparison with Previous i3 Models
| Feature | i3-22M | i3-80M | i3-200M (This Model) |
|---|---|---|---|
| Parameters | 22.6M | 82.77M | 169.85M |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers | 10 Hybrid + 6 Attention Layers |
| Hidden Dimension | 512 | 512 | 512 |
| Vocabulary Size | 4,466 | 35,560 | 32,000 |
| Training Dataset | TinyChat only | TinyStories + TinyChat + HQ Sentences | TinyStories + TinyChat + HQ Sentences + Wikitext |
| Total Tokens | ~1M conversations | ~3M+ tokens | N/A |
| Final Loss | ~2.0 | ~2.0 | 1.6 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 | 5.2 |
| Training Time | ~17 hours | ~2-4 hours | ~1-2 hours |
| Attention Layers | None (Pure Hybrid) | 6 Full Attention Layers | 6 Full Attention Layers |
Key Improvements Over i3-80M
To be added.
Key Features
Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
Memory-Optimized Training:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
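
The points above could look roughly like the following sketch: text is read in chunks, only frequency counts stay in RAM, and the finished vocabulary is cached to disk so it is built once and reused. File, function, and parameter names here are illustrative assumptions, not the project's actual code.

```python
import json
import os
from collections import Counter
from typing import Iterable, Iterator

def stream_chunks(paths: Iterable[str], chunk_chars: int = 1 << 20) -> Iterator[str]:
    """Yield fixed-size text chunks so no file is ever fully resident in memory."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            while chunk := f.read(chunk_chars):
                yield chunk

def build_vocab(paths, vocab_size=32_000, cache_path="vocab_cache.json"):
    """Count chunk/token frequencies in one streaming pass and cache the result."""
    if os.path.exists(cache_path):            # build once, reuse on later runs
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    counts = Counter()
    for chunk in stream_chunks(paths):
        counts.update(chunk.split())          # only the counts live in RAM, not the text
    vocab = [tok for tok, _ in counts.most_common(vocab_size)]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f)
    del counts                                # explicit cleanup of intermediate data
    return vocab
```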
Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
- WikiText: Encyclopedic text from Wikipedia (a loading sketch for all four sources follows below)
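
One plausible way to stream these corpora with the Hugging Face datasets library is sketched below. The wikitext config name, the column-name fallbacks, and the simple round-robin ordering are assumptions; the actual mixing strategy used in training is not documented here.

```python
from datasets import load_dataset

# Dataset IDs as listed in the training configuration of this card.
DATASETS = [
    ("roneneldan/TinyStories", None),
    ("starhopp3r/TinyChat", None),
    ("agentlans/high-quality-english-sentences", None),
    ("wikitext", "wikitext-103-raw-v1"),  # assumed config; the card only says "wikitext"
]

def iter_text():
    """Yield raw text examples from each corpus in turn (streaming, no local copies).
    The field lookup is a guess: different datasets expose different column names."""
    for name, config in DATASETS:
        ds = load_dataset(name, config, split="train", streaming=True)
        for example in ds:
            text = example.get("text") or example.get("sentence") or next(iter(example.values()))
            if isinstance(text, str) and text.strip():
                yield text
```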
BPE Tokenization:
- Total tokens processed: N/A
- Handles unknown tokens gracefully with dedicated special tokens (see the sketch below)
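
A typical way to train such a BPE tokenizer with the Hugging Face tokenizers library is sketched below. The special-token names and the byte-level pre-tokenizer are assumptions, since the card does not specify them, and the tiny in-line corpus is just a placeholder.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative special tokens; the actual names used by i3-200M are not documented here.
SPECIAL_TOKENS = ["<unk>", "<pad>", "<bos>", "<eos>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=SPECIAL_TOKENS)

corpus = ["Once upon a time, there was a tiny model.",  # placeholder text; in practice
          "Hello! How are you today?"]                  # the streamed corpora would be used
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")                         # same filename as the released file
```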
Training Details
Training Configuration
- Datasets:
agentlans/high-quality-english-sentences, roneneldan/TinyStories, starhopp3r/TinyChat, wikitext
- Training Steps: 250 iterations
- Batch Size: 4 (with gradient accumulation support)
- Learning Rate: 4e-4 (with warmup and cosine decay)
- Optimizer: AdamW with gradient clipping (max norm: 1.0)
- Hardware: NVIDIA P100 (16GB VRAM)
- Training Time: ~1-2 hours
- Framework: PyTorch
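
Wired together, the optimizer and schedule settings above look roughly like the sketch below. The warmup length, the number of accumulation steps, and the stand-in model and dummy batches are assumptions for illustration only.

```python
import math
import torch

TOTAL_STEPS, WARMUP_STEPS, ACCUM_STEPS = 250, 25, 4   # warmup/accumulation values assumed

def lr_lambda(step):
    """Linear warmup followed by cosine decay, as described above."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(512, 32_000)                  # stand-in for the i3-200M backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):                       # 250 training iterations
    for _ in range(ACCUM_STEPS):                      # gradient accumulation
        x = torch.randn(4, 512)                       # batch size 4 (dummy data for the sketch)
        y = torch.randint(0, 32_000, (4,))
        loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM_STEPS
        loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```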
Training Dynamics
- GPU Utilization: N/A
- GPU Memory: ~20% allocated (~4 GB; note the reported 12 GB total is inconsistent with the P100's 16 GB VRAM)
- Power Usage: ~250W average
- Throughput: ~300 tokens/sec
Performance Metrics
| Metric | Initial | Final |
|---|---|---|
| Training Loss | ~10.0 | 1.6 |
| Perplexity | ~4000+ | 5.2 |
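
For context, perplexity here is simply the exponential of the mean cross-entropy (natural-log) loss, so the two final figures are consistent with each other:

```python
import math

final_loss = 1.6
print(round(math.exp(final_loss), 2))  # 4.95 -- in line with the reported final perplexity of ~5.2
```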
Technical Innovations
RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
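
As a toy illustration of why this is linear-time (see the bullets above), each token performs one fixed-size state update instead of attending over all previous tokens. The scan below is a generic state-space-style recurrence written for clarity, not the model's actual RWKVMambaHybrid kernel; all parameter shapes besides d_state = 32 and d_model = 512 are assumptions.

```python
import torch

def linear_recurrent_scan(x, A, B, C):
    """Toy recurrence: s_t = A * s_{t-1} + B @ x_t, y_t = C @ s_t.
    One O(d) update per token gives O(seq_len) total cost, unlike O(seq_len^2) attention."""
    seq_len, _ = x.shape
    state = torch.zeros(A.shape[0])
    outputs = []
    for t in range(seq_len):
        state = A * state + B @ x[t]   # decay the old state, inject the current token
        outputs.append(C @ state)      # read the state back out to model width
    return torch.stack(outputs)

x = torch.randn(256, 512)              # a full 256-token context at d_model = 512
A = torch.rand(32)                     # d_state = 32, as in the statistics above
B = torch.randn(32, 512)
C = torch.randn(512, 32)
y = linear_recurrent_scan(x, A, B, C)  # y: (256, 512)
```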
Hierarchical Processing:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
Memory Efficiency:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
Model Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- tokenizer.json: Tokenizer vocabulary
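
Loading the released files follows the usual PyTorch/tokenizers pattern. The backbone constructor is commented out because the actual model class is not documented on this card; treat the snippet as a sketch.

```python
import json
import torch
from tokenizers import Tokenizer

with open("config.json", encoding="utf-8") as f:      # model configuration
    config = json.load(f)

tokenizer = Tokenizer.from_file("tokenizer.json")     # tokenizer vocabulary

state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # model weights
print(sorted(state_dict.keys())[:5])                  # inspect a few parameter names

# `build_i3_model` is hypothetical; substitute the repository's actual model class.
# model = build_i3_model(config)
# model.load_state_dict(state_dict)
```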
Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
Model Series
- i3-22M - Original model with pure hybrid architecture
- i3-80M - Scaled version with attention layers and multi-dataset training
- i3-200M (This model)
Citation
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}
