i3-200M - Hybrid Architecture Language Model
Model Description
The i3-200M model (aka Redherring) is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
To use the model, try it here.
Model Statistics
- Total Parameters: ~169.85M
- Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- Vocabulary Size: 32,000 tokens
- Hidden Dimension (d_model): 512
- Attention Heads: 16
- State Dimension (d_state): 32
- Max Sequence Length: 256
- Tokenization: BPE
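
For quick reference, these statistics map onto a single configuration object along the lines of the sketch below. The key names are illustrative assumptions, not the exact fields of the released config.json.

```python
# Hypothetical configuration mirroring the statistics above.
# Key names are illustrative; the shipped config.json may use different ones.
i3_200m_config = {
    "n_layers": 16,           # 10 hybrid + 6 attention
    "n_hybrid_layers": 10,    # RWKV-Mamba blocks (layers 1-10)
    "n_attention_layers": 6,  # full multi-head attention blocks (layers 11-16)
    "d_model": 512,           # hidden dimension
    "n_heads": 16,            # attention heads
    "d_state": 32,            # state-space dimension
    "vocab_size": 32000,
    "max_seq_len": 256,
    "tokenizer": "bpe",
}
```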
Architecture Breakdown
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
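
A minimal PyTorch sketch of this layer ordering is shown below. `HybridBlock` is only a stand-in for the actual RWKVMambaHybrid layer (whose internals are not documented here); the 10 + 6 stacking, 4x FFN expansion, 512-dim hidden size, and 16 heads come from this card, everything else is an assumption.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN = 512, 16, 10, 6

class FeedForward(nn.Module):
    """Position-wise FFN with the 4x expansion noted above."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Stand-in for RWKVMambaHybrid: a depthwise causal convolution
    approximates the recurrent/conv time-mixing of the real layer."""
    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.mix(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Full multi-head self-attention block with a causal mask."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))

# Layers 1-10 hybrid, layers 11-16 full attention, matching the breakdown above.
backbone = nn.Sequential(
    *[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
    *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)],
)
```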
Comparison with Previous i3 Models
| Feature | i3-22M | i3-80M | i3-200M (This Model) |
|---|---|---|---|
| Parameters | 22.6M | 82.77M | 169.85M |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers | 10 Hybrid + 6 Attention Layers |
| Hidden Dimension | 512 | 512 | 512 |
| Vocabulary Size | 4,466 | 35,560 | 32,000 |
| Training Dataset | TinyChat only | TinyStories + TinyChat + HQ Sentences | TinyStories + TinyChat + HQ Sentences + Wikitext |
| Total Tokens | ~1M conversations | ~3M+ tokens | N/A |
| Final Loss | ~2.0 | ~2.0 | 1.6 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 | 5.2 |
| Training Time | ~17 hours | ~2-4 hours | ~1-2 hours |
| Attention Layers | None (Pure Hybrid) | 6 Full Attention Layers | 6 Full Attention Layers |
Key Improvements Over i3-80M
To be added.
Key Features
Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
Memory-Optimized Training:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
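
The points above could look roughly like the following sketch: text is read in chunks, only frequency counts stay in RAM, and the finished vocabulary is cached to disk so it is built once and reused. File, function, and parameter names here are illustrative assumptions, not the project's actual code.

```python
import json
import os
from collections import Counter
from typing import Iterable, Iterator

def stream_chunks(paths: Iterable[str], chunk_chars: int = 1 << 20) -> Iterator[str]:
    """Yield fixed-size text chunks so no file is ever fully resident in memory."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            while chunk := f.read(chunk_chars):
                yield chunk

def build_vocab(paths, vocab_size=32_000, cache_path="vocab_cache.json"):
    """Count chunk/token frequencies in one streaming pass and cache the result."""
    if os.path.exists(cache_path):            # build once, reuse on later runs
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    counts = Counter()
    for chunk in stream_chunks(paths):
        counts.update(chunk.split())          # only the counts live in RAM, not the text
    vocab = [tok for tok, _ in counts.most_common(vocab_size)]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f)
    del counts                                # explicit cleanup of intermediate data
    return vocab
```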
Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
- WikiText: Encyclopedic text from Wikipedia (a loading sketch for all four sources follows below)
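
One plausible way to stream these corpora with the Hugging Face datasets library is sketched below. The wikitext config name, the column-name fallbacks, and the simple round-robin ordering are assumptions; the actual mixing strategy used in training is not documented here.

```python
from datasets import load_dataset

# Dataset IDs as listed in the training configuration of this card.
DATASETS = [
    ("roneneldan/TinyStories", None),
    ("starhopp3r/TinyChat", None),
    ("agentlans/high-quality-english-sentences", None),
    ("wikitext", "wikitext-103-raw-v1"),  # assumed config; the card only says "wikitext"
]

def iter_text():
    """Yield raw text examples from each corpus in turn (streaming, no local copies).
    The field lookup is a guess: different datasets expose different column names."""
    for name, config in DATASETS:
        ds = load_dataset(name, config, split="train", streaming=True)
        for example in ds:
            text = example.get("text") or example.get("sentence") or next(iter(example.values()))
            if isinstance(text, str) and text.strip():
                yield text
```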
BPE Tokenization:
- Total tokens processed: N/A
- Handles unknown tokens gracefully with dedicated special tokens (see the sketch below)
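
A typical way to train such a BPE tokenizer with the Hugging Face tokenizers library is sketched below. The special-token names and the byte-level pre-tokenizer are assumptions, since the card does not specify them, and the tiny in-line corpus is just a placeholder.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative special tokens; the actual names used by i3-200M are not documented here.
SPECIAL_TOKENS = ["<unk>", "<pad>", "<bos>", "<eos>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=SPECIAL_TOKENS)

corpus = ["Once upon a time, there was a tiny model.",  # placeholder text; in practice
          "Hello! How are you today?"]                  # the streamed corpora would be used
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")                         # same filename as the released file
```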
Training Details
Training Configuration
- Datasets:
agentlans/high-quality-english-sentences, roneneldan/TinyStories, starhopp3r/TinyChat, wikitext
- Training Steps: 250 iterations
- Batch Size: 4 (with gradient accumulation support)
- Learning Rate: 4e-4 (with warmup and cosine decay)
- Optimizer: AdamW with gradient clipping (max norm: 1.0)
- Hardware: NVIDIA P100 (16GB VRAM)
- Training Time: ~1-2 hours
- Framework: PyTorch
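
Wired together, the optimizer and schedule settings above look roughly like the sketch below. The warmup length, the number of accumulation steps, and the stand-in model and dummy batches are assumptions for illustration only.

```python
import math
import torch

TOTAL_STEPS, WARMUP_STEPS, ACCUM_STEPS = 250, 25, 4   # warmup/accumulation values assumed

def lr_lambda(step):
    """Linear warmup followed by cosine decay, as described above."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(512, 32_000)                  # stand-in for the i3-200M backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):                       # 250 training iterations
    for _ in range(ACCUM_STEPS):                      # gradient accumulation
        x = torch.randn(4, 512)                       # batch size 4 (dummy data for the sketch)
        y = torch.randint(0, 32_000, (4,))
        loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM_STEPS
        loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```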
Training Dynamics
- GPU Utilization: N/A
- GPU Memory: ~20% allocated (~4 GB; note the reported 12 GB total is inconsistent with the P100's 16 GB VRAM)
- Power Usage: ~250W average
- Throughput: ~300 tokens/sec
Performance Metrics
| Metric | Initial | Final |
|---|---|---|
| Training Loss | ~10.0 | 1.6 |
| Perplexity | ~4000+ | 5.2 |
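
For context, perplexity here is simply the exponential of the mean cross-entropy (natural-log) loss, so the two final figures are consistent with each other:

```python
import math

final_loss = 1.6
print(round(math.exp(final_loss), 2))  # 4.95 -- in line with the reported final perplexity of ~5.2
```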
Technical Innovations
RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
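
As a toy illustration of why this is linear-time (see the bullets above), each token performs one fixed-size state update instead of attending over all previous tokens. The scan below is a generic state-space-style recurrence written for clarity, not the model's actual RWKVMambaHybrid kernel; all parameter shapes besides d_state = 32 and d_model = 512 are assumptions.

```python
import torch

def linear_recurrent_scan(x, A, B, C):
    """Toy recurrence: s_t = A * s_{t-1} + B @ x_t, y_t = C @ s_t.
    One O(d) update per token gives O(seq_len) total cost, unlike O(seq_len^2) attention."""
    seq_len, _ = x.shape
    state = torch.zeros(A.shape[0])
    outputs = []
    for t in range(seq_len):
        state = A * state + B @ x[t]   # decay the old state, inject the current token
        outputs.append(C @ state)      # read the state back out to model width
    return torch.stack(outputs)

x = torch.randn(256, 512)              # a full 256-token context at d_model = 512
A = torch.rand(32)                     # d_state = 32, as in the statistics above
B = torch.randn(32, 512)
C = torch.randn(512, 32)
y = linear_recurrent_scan(x, A, B, C)  # y: (256, 512)
```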
Hierarchical Processing:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
Memory Efficiency:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
Model Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- tokenizer.json: Tokenizer vocabulary
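
Loading the released files follows the usual PyTorch/tokenizers pattern. The backbone constructor is commented out because the actual model class is not documented on this card; treat the snippet as a sketch.

```python
import json
import torch
from tokenizers import Tokenizer

with open("config.json", encoding="utf-8") as f:      # model configuration
    config = json.load(f)

tokenizer = Tokenizer.from_file("tokenizer.json")     # tokenizer vocabulary

state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # model weights
print(sorted(state_dict.keys())[:5])                  # inspect a few parameter names

# `build_i3_model` is hypothetical; substitute the repository's actual model class.
# model = build_i3_model(config)
# model.load_state_dict(state_dict)
```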
Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
Model Series
- i3-22M - Original model with pure hybrid architecture
- i3-80M - Scaled version with attention layers and multi-dataset training
- i3-200M (This model)
Citation
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}
