# d24_climbmix_baseline

d24 baseline (no MTP).

## Model Details
| Field | Value |
|---|---|
| Architecture | Nanochat (custom transformer) |
| Parameters | ~780M |
| Layers | 24 |
| Hidden dim | 1536 |
| Heads (Q/KV) | 12/12 |
| Vocab size | 32768 |
| Context length | 2048 |
| Window pattern | L |
| Val BPB | 0.724324 |
| MTP probe layers | none |
| MTP k_start | 2 |
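The ~780M figure in the table can be sanity-checked with back-of-the-envelope arithmetic from the other rows. This is a sketch under assumptions not stated in the card (a 4x MLP expansion and full-size Q/K/V/O projections, which is consistent with 12/12 Q/KV heads); exact numbers depend on Nanochat internals.

```python
# Rough parameter count from the table above.
# Assumes: untied embeddings (per the notes), 4x MLP expansion,
# full-size attention projections (Q and KV head counts are equal).
layers, d, vocab = 24, 1536, 32768

embed = vocab * d           # token embedding
lm_head = vocab * d         # untied LM head
attn = 4 * d * d            # Wq, Wk, Wv, Wo
mlp = 2 * 4 * d * d         # up and down projections at 4x expansion
total = embed + lm_head + layers * (attn + mlp)

print(f"{total / 1e6:.0f}M")  # ≈ 780M, matching the table
```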
## Architecture Notes
- RoPE positional embeddings (no absolute pos embeds)
- QK norm in attention
- ReLU² activation in MLP
- RMSNorm (no learnable parameters)
- Logit softcap (tanh, ±20.0)
- GQA (grouped-query attention)
- Per-layer scalars (resid_lambdas, x0_lambdas)
- Sliding window attention pattern: L
- Untied token embedding and LM head
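Three of the blocks listed above can be sketched in a few lines each. These are illustrative implementations under my own naming, not Nanochat's actual code: a parameter-free RMSNorm, the ReLU² MLP activation, and the tanh logit softcap at ±20.0.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # RMSNorm with no learnable scale: divide by the root-mean-square
    # of the features (the eps value is an assumption)
    return x / (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()

def relu2_mlp(x, w_up, w_down):
    # ReLU^2 activation: square the ReLU output before the down projection
    return F.relu(x @ w_up).square() @ w_down

def softcap_logits(logits, cap=20.0):
    # tanh softcap: smoothly bounds logits to the open interval (-cap, cap)
    return cap * torch.tanh(logits / cap)
```

The softcap keeps extreme logits finite while staying near-identity for small values, since `tanh(z/cap) ≈ z/cap` when `|z| << cap`.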
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```