i3-200M - Hybrid Architecture Language Model

Model Description

The i3-200M model (aka Redherring) is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The architecture blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

To use the model, try it here.

Model Statistics

  • Total Parameters: ~169.85M
  • Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
  • Vocabulary Size: 32,000 tokens
  • Hidden Dimension (d_model): 512
  • Attention Heads: 16
  • State Dimension (d_state): 32
  • Max Sequence Length: 256
  • Tokenization: BPE
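
For orientation, the statistics above correspond to a configuration roughly like the following; the key names are illustrative and not guaranteed to match the shipped config.json.

```python
# Hypothetical configuration mirroring the statistics listed above;
# key names are assumptions, not the actual config.json schema.
I3_200M_CONFIG = {
    "vocab_size": 32_000,
    "d_model": 512,
    "n_heads": 16,
    "d_state": 32,
    "n_hybrid_layers": 10,      # RWKV-Mamba hybrid blocks
    "n_attention_layers": 6,    # full multi-head attention blocks
    "max_seq_len": 256,
    "tokenizer": "bpe",
}
```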

Architecture Breakdown

Layers 1-10:  RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
              ├─ RWKVMambaHybrid (Time-mixing + State-space)
              └─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
              ├─ Multi-Head Attention (16 heads)
              └─ Feed-Forward Network (4x expansion)
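
A minimal PyTorch sketch of how a stack like this could be assembled, assuming the configuration above; RWKVMambaHybridBlock is a stand-in for the actual hybrid block, which is not distributed with this card (a toy version of its recurrence appears under Technical Innovations below).

```python
import torch.nn as nn

class RWKVMambaHybridBlock(nn.Module):
    """Placeholder for the real hybrid block (time-mixing + state-space),
    which is not distributed with this card."""
    def __init__(self, d_model: int):
        super().__init__()
        self.body = nn.Identity()

    def forward(self, x):
        return self.body(x)

def build_layer_stack(d_model: int = 512, n_heads: int = 16,
                      n_hybrid: int = 10, n_attention: int = 6) -> nn.ModuleList:
    """Assemble the 10 hybrid + 6 full-attention layers described above."""
    layers = nn.ModuleList(RWKVMambaHybridBlock(d_model) for _ in range(n_hybrid))
    layers.extend(
        nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,  # 4x feed-forward expansion
            batch_first=True,
        )
        for _ in range(n_attention)
    )
    return layers
```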

Comparison with i3-80M

| Feature | i3-22M | i3-80M | i3-200M (This Model) |
|---|---|---|---|
| Parameters | 22.6M | 82.77M | 169.85M |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers | 10 Hybrid + 6 Attention Layers |
| Hidden Dimension | 512 | 512 | 512 |
| Vocabulary Size | 4,466 | 35,560 | 32,000 |
| Training Dataset | TinyChat only | TinyStories + TinyChat + HQ Sentences | TinyStories + TinyChat + HQ Sentences + Wikitext |
| Total Tokens | ~1M conversations | ~3M+ tokens | N/A |
| Final Loss | ~2.0 | ~2.0 | 1.6 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 | 5.2 |
| Training Time | ~17 hours | ~2-4 hours | ~1-2 hours |
| Attention Layers | None (Pure Hybrid) | 6 Full Attention Layers | 6 Full Attention Layers |

Key Improvements Over i3-80M

To be added.

Key Features

  1. Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention

    • Early layers use RWKV-Mamba hybrid for efficient sequence processing
    • Later layers use full multi-head attention for complex pattern recognition
  2. Memory-Optimized Training:

    • Streaming vocabulary building (no full text storage; see the sketch after this list)
    • Vocabulary caching (build once, reuse)
    • Efficient chunk frequency counting
    • Automatic memory cleanup
  3. Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding

    • TinyStories: Narrative and storytelling
    • TinyChat: Conversational dynamics
    • High-Quality English Sentences: Linguistic diversity
    • Wikitext: Encyclopedic text
  4. BPE Tokenization:

    • Total tokens processed: N/A
    • Handles unknown tokens gracefully with dedicated special tokens
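
A minimal sketch of the streaming vocabulary building and chunk frequency counting mentioned in point 2 above; function and parameter names are illustrative, not the actual training code used for this model.

```python
from collections import Counter
from typing import Iterable

def count_chunk_frequencies(text_stream: Iterable[str], chunk_size: int = 4) -> Counter:
    """Count chunk frequencies while streaming over documents, so the full
    corpus is never held in memory at once."""
    freqs = Counter()
    for doc in text_stream:  # e.g. a generator yielding one dataset row at a time
        for i in range(0, len(doc), chunk_size):
            freqs[doc[i:i + chunk_size]] += 1
    return freqs

# Keep only the most frequent chunks as candidate vocabulary entries, then cache
# the result on disk so the vocabulary is built once and reused across runs.
# vocab = [chunk for chunk, _ in count_chunk_frequencies(stream).most_common(32_000)]
```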

Training Details

Training Configuration

  • Datasets:
    • agentlans/high-quality-english-sentences
    • roneneldan/TinyStories
    • starhopp3r/TinyChat
    • wikitext
  • Training Steps: 250 iterations
  • Batch Size: 4 (with gradient accumulation support)
  • Learning Rate: 4e-4 (with warmup and cosine decay)
  • Optimizer: AdamW with gradient clipping (max norm: 1.0)
  • Hardware: NVIDIA P100 (16GB VRAM)
  • Training Time: ~1-2 hours
  • Framework: PyTorch
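
A rough PyTorch sketch of the optimizer and learning-rate schedule listed above, assuming linear warmup into cosine decay; the warmup length here is an illustrative choice, not a documented value.

```python
import math
import torch

def make_optimizer_and_schedule(model: torch.nn.Module,
                                total_steps: int = 250,
                                warmup_steps: int = 25):
    # AdamW at the peak learning rate from the configuration above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

    # Multiplicative LR factor: linear warmup, then cosine decay to zero.
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```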

Training Dynamics

  • GPU Utilization: N/A
  • GPU Memory: ~4 GB allocated (the reported ~20% / 12 GB figures are inconsistent with each other and with the 16 GB P100)
  • Power Usage: ~250W average
  • Throughput: ~300 tokens/sec

Performance Metrics

| Metric | Initial | Final |
|---|---|---|
| Training Loss | ~10.0 | 1.6 |
| Perplexity | ~4000+ | 5.2 |
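
For reference, perplexity here follows directly from the loss, assuming the standard definition of perplexity as the exponential of the mean cross-entropy, so the final numbers are consistent:

```python
import math

# Perplexity is the exponential of the cross-entropy loss: a final loss of 1.6
# corresponds to exp(1.6) ≈ 4.95, in line with the reported final perplexity of 5.2.
print(math.exp(1.6))
```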

Technical Innovations

  1. RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics (a toy sketch follows this list)

    • Linear complexity for long sequences
    • Efficient recurrent processing
    • State-space modeling for temporal dependencies
  2. Hierarchical Processing:

    • Lower layers focus on local patterns (conv/recurrent)
    • Upper layers capture global dependencies (attention)
  3. Memory Efficiency:

    • Streaming tokenization during vocab building
    • No full dataset storage in RAM
    • Automatic cleanup of intermediate data
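
As a toy illustration of point 1 above, the block below blends an RWKV-style token-shift time-mix with a simple diagonal state-space recurrence. It is a simplified sketch under those assumptions, not the actual RWKVMambaHybrid layer used in this model.

```python
import torch
import torch.nn as nn

class ToyHybridRecurrence(nn.Module):
    """Simplified RWKV-style time-mixing followed by a diagonal state-space
    recurrence; illustrative only."""
    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))  # learned time-mix ratio
        self.in_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)
        self.decay = nn.Parameter(torch.zeros(d_state))       # per-channel state decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        shifted = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        mixed = self.mix * x + (1.0 - self.mix) * shifted      # RWKV-style token shift
        u = self.in_proj(mixed)
        a = torch.sigmoid(self.decay)                          # decay factor in (0, 1)
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outputs = []
        for t in range(x.size(1)):                             # linear-time recurrent scan
            h = a * h + (1.0 - a) * u[:, t]
            outputs.append(h)
        return self.out_proj(torch.stack(outputs, dim=1))
```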

Model Files

  • pytorch_model.bin: Model weights
  • config.json: Model configuration
  • tokenizer.json: Tokenizer vocabulary
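
A small inspection sketch, assuming pytorch_model.bin is a plain PyTorch state dict as the file name suggests; the model class itself is not part of these files, so this only loads and summarizes the artifacts.

```python
import json
import torch

# Read the shipped configuration and weights, then report the parameter count.
with open("config.json") as f:
    config = json.load(f)
print(config)

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
n_params = sum(t.numel() for t in state_dict.values())
print(f"{n_params / 1e6:.2f}M parameters across {len(state_dict)} tensors")
```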

Limitations

  • Trained on English text only
  • Limited to 256 token context window
  • May require fine-tuning for specific downstream tasks
  • Conversational style influenced by TinyChat dataset

Model Series

  • i3-22M - Original model with pure hybrid architecture
  • i3-80M - Scaled version with attention layers and multi-dataset training
  • i3-200M - This model: further scaled, with Wikitext added to the training mix

Citation

@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}