nanochat 760M Bidirectional BASE


Bidirectional language model trained in both left-to-right and right-to-left directions

Model Description

This is a bidirectional language model based on the nanochat architecture, extended with backward/bidirectional training capabilities for research purposes. See the nanochat repository for the full implementation.

Direction: Bidirectional

Both left-to-right and right-to-left prediction.

This model was trained on both forward and backward sequences using special direction tokens:

  • <|forward|> - Standard left-to-right generation
  • <|backward|> - Right-to-left generation (reversed sequences)

The model can switch between directions at inference time.

Training Phase: BASE

Base model pretrained on the FineWeb-Edu dataset.

Model Architecture

Parameter       | Value
--------------- | ------
Parameters      | ~760M
Layers          | 20
Hidden Size     | 1280
Attention Heads | 10
Vocab Size      | 65,536
Context Length  | 2,048

Architecture Details

  • Rotary Position Embeddings (RoPE) - No learned positional embeddings
  • QK Normalization - Stabilizes attention
  • ReLU² Activation - In MLP layers
  • RMSNorm - No learnable parameters
  • Grouped-Query Attention (GQA) support
  • No bias in linear layers
  • Untied embeddings - Separate input/output embeddings

Quick Start

Try the model instantly in Google Colab: Open In Colab

Usage

import torch
from nanochat.checkpoint_manager import load_model

# Load model (automatically detects direction from checkpoint)
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_bidirectional")
direction = meta["direction"]  # "bidirectional"

# When using the backward direction, reverse the input tokens before generation
# and reverse the generated tokens afterwards (see the sketch below)

With the nanochat chat interface:

# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_bidirectional

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_bidirectional

Training

Trained using the nanochat framework with:

  • Optimizer: Muon (for transformer layers) + AdamW (for embeddings)
  • Batch Size: 262,144 tokens
  • Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
  • Hardware: 8x H100 GPUs

Research Context

This model is part of a research project studying how LLMs learn when trained in different directions:

  1. Do backward models learn different representations?
  2. Can models transfer knowledge across directions?
  3. Does bidirectional training help both directions?

For more details, see the nanochat repository.

Limitations

  • This is a research model, not intended for production use
  • Small model size (~760M params) limits capabilities
  • Backward models may produce unexpected outputs if direction handling is not properly implemented

License

MIT License - Same as nanochat.

Citation

@misc{nanochat-bidirectional,
  author = {Raghav},
  title = {nanochat 760M Bidirectional},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/raghavt/nanochat-760M-bidirectional}
}

Acknowledgements

Based on nanochat by Andrej Karpathy.
