nanochat 760M Bidirectional BASE
Bidirectional language model trained in both left-to-right and right-to-left directions
Model Description
This is a bidirectional language model based on the nanochat architecture, extended with backward and bidirectional training capabilities for research purposes. See the nanochat repository for the full implementation.
Direction: Bidirectional
Both left-to-right and right-to-left prediction.
This model was trained on both forward and backward sequences using special direction tokens:
- `<|forward|>` - Standard left-to-right generation
- `<|backward|>` - Right-to-left generation (reversed sequences)
The model can switch between directions at inference time by prepending the appropriate direction token.
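As a rough illustration of how a prompt might be built for either direction (the `encode_special` helper and the exact tokenizer API are assumptions, not verified nanochat calls):

```python
# Illustrative sketch only: helper names are assumptions about the tokenizer
# API, not the exact nanochat interface.

def make_prompt(tokenizer, text: str, direction: str = "forward") -> list[int]:
    """Prefix the text with a direction token; reverse tokens for backward mode."""
    direction_id = tokenizer.encode_special(f"<|{direction}|>")  # assumed helper for special tokens
    tokens = tokenizer.encode(text)
    if direction == "backward":
        tokens = tokens[::-1]  # backward models were trained on reversed sequences
    return [direction_id] + tokens
```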
Training Phase: BASE
Base model pretrained on the FineWeb-Edu dataset.
Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~760M |
| Layers | 20 |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Vocab Size | 65,536 |
| Context Length | 2,048 |
Architecture Details
- Rotary Position Embeddings (RoPE) - No learned positional embeddings
- QK Normalization - Stabilizes attention
- ReLU² Activation - In MLP layers
- RMSNorm - No learnable parameters
- Grouped-Query Attention (GQA) support
- No bias in linear layers
- Untied embeddings - Separate input/output embeddings
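For orientation, the architecture above corresponds roughly to a configuration like the sketch below; the field names are illustrative and may not match those used in the nanochat source.

```python
from dataclasses import dataclass

@dataclass
class ModelConfigSketch:
    """Illustrative mirror of the table above; field names are assumptions."""
    n_layer: int = 20         # transformer blocks
    n_embd: int = 1280        # hidden size
    n_head: int = 10          # query heads (head dim = 1280 / 10 = 128)
    n_kv_head: int = 10       # GQA: can be set lower than n_head to share KV projections
    vocab_size: int = 65_536
    sequence_len: int = 2048  # context length
```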
Quick Start
Try the model instantly in Google Colab.
Usage
```python
import torch
from nanochat.checkpoint_manager import load_model

# Load model (automatically detects direction from checkpoint metadata)
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_bidirectional")
direction = meta["direction"]  # "bidirectional"

# For backward models, reverse your input before generation
# and reverse the output after generation
```
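A minimal sketch of that reverse-in / reverse-out pattern; `sample_fn` is a placeholder for whatever sampling loop you use, not a specific nanochat API:

```python
# Sketch only: `sample_fn(model, tokens, max_new_tokens) -> list[int]` stands in
# for your sampling loop; it is not a specific nanochat function.

def generate_backward(model, tokenizer, prompt: str, sample_fn, max_new_tokens: int = 64) -> str:
    tokens = tokenizer.encode(prompt)[::-1]         # reverse the prompt tokens
    out = sample_fn(model, tokens, max_new_tokens)  # right-to-left continuation
    return tokenizer.decode(out[::-1])              # un-reverse for readable text
```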
With the nanochat chat interface:
```bash
# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_bidirectional

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_bidirectional
```
Training
Trained using the nanochat framework with:
- Optimizer: Muon (for the transformer layers) + AdamW (for the embedding and unembedding); a rough sketch follows this list
- Batch Size: 262,144 tokens
- Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
- Hardware: 8x H100 GPUs
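Roughly, the optimizer split looks like the sketch below; the Muon import path, the module attribute names, and the AdamW settings beyond the learning rates are assumptions rather than the exact nanochat training code.

```python
import torch
from nanochat.muon import Muon  # assumed import path for nanochat's Muon implementation

def build_optimizers(model):
    """Rough sketch: Muon on transformer matrices, AdamW on (un)embeddings.

    Attribute names (transformer, wte, lm_head) are assumptions about the
    module layout; the learning rates follow the values listed above.
    """
    matrix_params = [p for p in model.transformer.parameters() if p.ndim >= 2]
    muon = Muon(matrix_params, lr=0.02)
    adamw = torch.optim.AdamW(
        [
            {"params": model.wte.parameters(), "lr": 0.2},        # input embedding
            {"params": model.lm_head.parameters(), "lr": 0.004},  # unembedding
        ],
        betas=(0.9, 0.95),
        weight_decay=0.0,
    )
    return muon, adamw
```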
Research Context
This model is part of a research project studying how LLMs learn when trained in different directions:
- Do backward models learn different representations?
- Can models transfer knowledge across directions?
- Does bidirectional training help both directions?
For more details, see the nanochat repository.
Limitations
- This is a research model, not intended for production use
- Small model size (~760M params) limits capabilities
- Backward models may produce unexpected outputs if direction handling is not properly implemented
License
MIT License - Same as nanochat.
Citation
```bibtex
@misc{nanochat-bidirectional,
  author    = {Raghav},
  title     = {nanochat 760M Bidirectional},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/raghavt/nanochat-760M-bidirectional}
}
```
Acknowledgements
Based on nanochat by Andrej Karpathy.