# d24_climbmix_baseline

A d24 (24-layer) baseline model, trained without MTP (multi-token prediction).

## Model Details

| Field | Value |
|---|---|
| Architecture | Nanochat (custom transformer) |
| Parameters | ~780M |
| Layers | 24 |
| Hidden dim | 1536 |
| Heads (Q/KV) | 12/12 |
| Vocab size | 32768 |
| Context length | 2048 |
| Window pattern | L |
| Val BPB | 0.724324 |
| MTP probe layers | none |
| MTP k_start | 2 |

## Architecture Notes

- RoPE positional embeddings (no absolute position embeddings)
- QK norm in attention
- ReLU² activation in the MLP
- RMSNorm (no learnable parameters)
- Logit softcap (tanh, ±20.0)
- GQA (grouped-query attention)
- Per-layer scalars (`resid_lambdas`, `x0_lambdas`)
- Sliding window attention pattern: L
- Untied token embedding and LM head

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```