# d24_climbmix_baseline

d24 baseline (no MTP).

## Model Details
| Field | Value |
|---|---|
| Architecture | Nanochat (custom transformer) |
| Parameters | ~780M |
| Layers | 24 |
| Hidden dim | 1536 |
| Heads (Q/KV) | 12/12 |
| Vocab size | 32768 |
| Context length | 2048 |
| Window pattern | L |
| Val BPB | 0.724324 |
| MTP probe layers | none |
| MTP k_start | 2 |
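The ~780M figure in the table can be sanity-checked with back-of-the-envelope arithmetic from the other rows. This is a sketch under assumptions not stated in the card (a 4x MLP expansion and full-size Q/K/V/O projections, which is consistent with 12/12 Q/KV heads); exact numbers depend on Nanochat internals.

```python
# Rough parameter count from the table above.
# Assumes: untied embeddings (per the notes), 4x MLP expansion,
# full-size attention projections (Q and KV head counts are equal).
layers, d, vocab = 24, 1536, 32768

embed = vocab * d           # token embedding
lm_head = vocab * d         # untied LM head
attn = 4 * d * d            # Wq, Wk, Wv, Wo
mlp = 2 * 4 * d * d         # up and down projections at 4x expansion
total = embed + lm_head + layers * (attn + mlp)

print(f"{total / 1e6:.0f}M")  # ≈ 780M, matching the table
```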
## Architecture Notes
- RoPE positional embeddings (no absolute pos embeds)
- QK norm in attention
- ReLU² activation in MLP
- RMSNorm (no learnable parameters)
- Logit softcap (tanh, ±20.0)
- GQA (grouped-query attention)
- Per-layer scalars (resid_lambdas, x0_lambdas)
- Sliding window attention pattern: L
- Untied token embedding and LM head
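Three of the blocks listed above can be sketched in a few lines each. These are illustrative implementations under my own naming, not Nanochat's actual code: a parameter-free RMSNorm, the ReLU² MLP activation, and the tanh logit softcap at ±20.0.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # RMSNorm with no learnable scale: divide by the root-mean-square
    # of the features (the eps value is an assumption)
    return x / (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()

def relu2_mlp(x, w_up, w_down):
    # ReLU^2 activation: square the ReLU output before the down projection
    return F.relu(x @ w_up).square() @ w_down

def softcap_logits(logits, cap=20.0):
    # tanh softcap: smoothly bounds logits to the open interval (-cap, cap)
    return cap * torch.tanh(logits / cap)
```

The softcap keeps extreme logits finite while staying near-identity for small values, since `tanh(z/cap) ≈ z/cap` when `|z| << cap`.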
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("d24_climbmix_baseline", trust_remote_code=True)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```