nanochat 760M Bidirectional BASE
Bidirectional language model trained in both left-to-right and right-to-left directions
Model Description
This is a bidirectional language model based on the nanochat architecture, extended with backward and bidirectional training capabilities for research purposes. See the nanochat repository for the full implementation.
Direction: Bidirectional
Both left-to-right and right-to-left prediction.
This model was trained on both forward and backward sequences using special direction tokens:
- `<|forward|>` - Standard left-to-right generation
- `<|backward|>` - Right-to-left generation (reversed sequences)
The model can switch between directions at inference time by prepending the appropriate direction token.
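As a rough illustration of how a prompt might be built for either direction (the `encode_special` helper and the exact tokenizer API are assumptions, not verified nanochat calls):

```python
# Illustrative sketch only: helper names are assumptions about the tokenizer
# API, not the exact nanochat interface.

def make_prompt(tokenizer, text: str, direction: str = "forward") -> list[int]:
    """Prefix the text with a direction token; reverse tokens for backward mode."""
    direction_id = tokenizer.encode_special(f"<|{direction}|>")  # assumed helper for special tokens
    tokens = tokenizer.encode(text)
    if direction == "backward":
        tokens = tokens[::-1]  # backward models were trained on reversed sequences
    return [direction_id] + tokens
```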
Training Phase: BASE
Base model pretrained on the FineWeb-Edu dataset.
Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~760M |
| Layers | 20 |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Vocab Size | 65,536 |
| Context Length | 2,048 |
Architecture Details
- Rotary Position Embeddings (RoPE) - No learned positional embeddings
- QK Normalization - Stabilizes attention
- ReLU² Activation - In MLP layers
- RMSNorm - No learnable parameters
- Grouped-Query Attention (GQA) support
- No bias in linear layers
- Untied embeddings - Separate input/output embeddings
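For orientation, the architecture above corresponds roughly to a configuration like the sketch below; the field names are illustrative and may not match those used in the nanochat source.

```python
from dataclasses import dataclass

@dataclass
class ModelConfigSketch:
    """Illustrative mirror of the table above; field names are assumptions."""
    n_layer: int = 20         # transformer blocks
    n_embd: int = 1280        # hidden size
    n_head: int = 10          # query heads (head dim = 1280 / 10 = 128)
    n_kv_head: int = 10       # GQA: can be set lower than n_head to share KV projections
    vocab_size: int = 65_536
    sequence_len: int = 2048  # context length
```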
Quick Start
Try the model instantly in Google Colab.
Usage
```python
import torch
from nanochat.checkpoint_manager import load_model

# Load model (automatically detects direction from checkpoint metadata)
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_bidirectional")
direction = meta["direction"]  # "bidirectional"

# For backward models, reverse your input before generation
# and reverse the output after generation
```
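A minimal sketch of that reverse-in / reverse-out pattern; `sample_fn` is a placeholder for whatever sampling loop you use, not a specific nanochat API:

```python
# Sketch only: `sample_fn(model, tokens, max_new_tokens) -> list[int]` stands in
# for your sampling loop; it is not a specific nanochat function.

def generate_backward(model, tokenizer, prompt: str, sample_fn, max_new_tokens: int = 64) -> str:
    tokens = tokenizer.encode(prompt)[::-1]         # reverse the prompt tokens
    out = sample_fn(model, tokens, max_new_tokens)  # right-to-left continuation
    return tokenizer.decode(out[::-1])              # un-reverse for readable text
```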
With the nanochat chat interface:
```bash
# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_bidirectional

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_bidirectional
```
Training
Trained using the nanochat framework with:
- Optimizer: Muon (for the transformer layers) + AdamW (for the embedding and unembedding); a rough sketch follows this list
- Batch Size: 262,144 tokens
- Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
- Hardware: 8x H100 GPUs
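Roughly, the optimizer split looks like the sketch below; the Muon import path, the module attribute names, and the AdamW settings beyond the learning rates are assumptions rather than the exact nanochat training code.

```python
import torch
from nanochat.muon import Muon  # assumed import path for nanochat's Muon implementation

def build_optimizers(model):
    """Rough sketch: Muon on transformer matrices, AdamW on (un)embeddings.

    Attribute names (transformer, wte, lm_head) are assumptions about the
    module layout; the learning rates follow the values listed above.
    """
    matrix_params = [p for p in model.transformer.parameters() if p.ndim >= 2]
    muon = Muon(matrix_params, lr=0.02)
    adamw = torch.optim.AdamW(
        [
            {"params": model.wte.parameters(), "lr": 0.2},        # input embedding
            {"params": model.lm_head.parameters(), "lr": 0.004},  # unembedding
        ],
        betas=(0.9, 0.95),
        weight_decay=0.0,
    )
    return muon, adamw
```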
Research Context
This model is part of a research project studying how LLMs learn when trained in different directions:
- Do backward models learn different representations?
- Can models transfer knowledge across directions?
- Does bidirectional training help both directions?
For more details, see the nanochat repository.
Limitations
- This is a research model, not intended for production use
- Small model size (~760M params) limits capabilities
- Backward models may produce unexpected outputs if direction handling is not properly implemented
License
MIT License - Same as nanochat.
Citation
```bibtex
@misc{nanochat-bidirectional,
  author    = {Raghav},
  title     = {nanochat 760M Bidirectional},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/raghavt/nanochat-760M-bidirectional}
}
```
Acknowledgements
Based on nanochat by Andrej Karpathy.