Distilled with the Distily library using teacher model HuggingFaceTB/SmolLM-135M on the wikimedia/wikipedia dataset.
Architecture: LlamaForCausalLM

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-14): 15 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
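The layer shapes in the printed architecture are enough to estimate the student's size relative to a 30-layer teacher. The sketch below is a back-of-the-envelope count, not output from Distily. The `count_params` helper is hypothetical, and it assumes that `lm_head` shares its weight with `embed_tokens` unless `tied_lm_head=False` is passed, and that the rotary embeddings hold no learned parameters.

```python
# Layer shapes copied from the printed student architecture.
VOCAB, HIDDEN, KV_DIM, MLP_DIM = 49152, 576, 192, 1536


def count_params(num_layers: int, tied_lm_head: bool = True) -> int:
    """Count learned weights for a Llama-style stack with the shapes above."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q/o and k/v projections, no bias
    mlp = 3 * HIDDEN * MLP_DIM                        # gate, up and down projections
    norms = 2 * HIDDEN                                # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embed = VOCAB * HIDDEN
    final_norm = HIDDEN
    # Assumption: a tied lm_head reuses the embedding matrix and adds no weights.
    lm_head = 0 if tied_lm_head else VOCAB * HIDDEN
    return num_layers * per_layer + embed + final_norm + lm_head


student = count_params(15)
teacher = count_params(30)
print(f"student: {student:,}  teacher: {teacher:,}  ratio: {student / teacher:.2f}")
```

Halving the layer count does not halve the model: the embedding matrix is unchanged, so it makes up a larger share of the student's weights.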
Teacher -> student architecture: LlamaForCausalLM -> LlamaForCausalLM

--- teacher model modules
+++ student model modules
@@ -2,7 +2,7 @@
   (model): LlamaModel(
     (embed_tokens): Embedding(49152, 576)
     (layers): ModuleList(
-      (0-29): 30 x LlamaDecoderLayer(
+      (0-14): 15 x LlamaDecoderLayer(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
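A module diff of this shape can be reproduced with the standard library. The sketch below is illustrative only and is not how Distily necessarily generates it: `module_lines` is a hypothetical helper that hard-codes the first lines of the printed module tree instead of calling `repr()` on real models.

```python
import difflib


def module_lines(num_layers: int) -> list[str]:
    """First lines of the module repr, parameterised by decoder layer count."""
    return [
        "LlamaForCausalLM(",
        "  (model): LlamaModel(",
        "    (embed_tokens): Embedding(49152, 576)",
        "    (layers): ModuleList(",
        f"      (0-{num_layers - 1}): {num_layers} x LlamaDecoderLayer(",
        "        (self_attn): LlamaSdpaAttention(",
        "          (q_proj): Linear(in_features=576, out_features=576, bias=False)",
        "          (k_proj): Linear(in_features=576, out_features=192, bias=False)",
    ]


# Compare the 30-layer teacher against the 15-layer student.
diff = list(
    difflib.unified_diff(
        module_lines(30),
        module_lines(15),
        fromfile="teacher model modules",
        tofile="student model modules",
        lineterm="",
    )
)
print("\n".join(diff))
```

With real models, the two inputs would be `repr(teacher).splitlines()` and `repr(student).splitlines()`.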
Trained on 84,871,894 tokens from the wikimedia/wikipedia dataset.
Num samples: 99,800
Subset: 20231101.en
Split: train

DistillationObjective(
    logits_loss_component=LossComponent(
        weight=1,
        loss_fn='kl'
    ),
    hs_loss_component=LossComponent(
        weight=0
    ),
    attn_loss_component=LossComponent(
        weight=0
    )
)
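With the hidden-state and attention weights set to 0, the objective reduces to a KL divergence between teacher and student output distributions. The following is a minimal standard-library sketch for a single token position, assuming the direction KL(teacher || student) and no temperature scaling. Distily's actual `kl` loss may differ in direction, reduction, or batching, and the function names here are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def distillation_loss(teacher_logits, student_logits,
                      logits_weight=1.0, hs_loss=0.0, hs_weight=0.0,
                      attn_loss=0.0, attn_weight=0.0):
    """Weighted sum of components; zero-weight terms drop out."""
    logits_loss = kl_logits_loss(teacher_logits, student_logits)
    return logits_weight * logits_loss + hs_weight * hs_loss + attn_weight * attn_loss


teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise.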
The following hyperparameters were used during training:
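learning_rate: 0.0002
train_batch_size: 4
eval_batch_size: 2
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: polynomial
num_epochs: 1.0
distillation_objective: DistillationObjective( logits_loss_component=LossComponent( weight=1, loss_fn='kl' ), hs_loss_component=LossComponent( weight=0 ), attn_loss_component=LossComponent( weight=0 ) )
lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7eb253ff9660>
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: {'num_hidden_layers': 15}
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: False
student_model_use_liger: False
teacher_model_name_or_path: HuggingFaceTB/SmolLM-135M
teacher_load_in_8bit: False
teacher_load_in_4bit: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 100000
dataset_test_size: 0.002
dataset_shuffle: False
dataset_shuffle_seed: 42
dataset_trust_remote_code: False
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.0
warmup_steps: 0
gradient_checkpointing: True

Base model: HuggingFaceTB/SmolLM-135M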
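The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the hyperparameters are read as a dataset sample size of 100000 and a test fraction of 0.002. Under that reading, the held-out split accounts for the gap between the sample size and the 99,800 training samples, and the 84,871,894-token total gives an average length per sample. The variable names are illustrative only.

```python
# Values taken from the card; the sample size and test fraction are an
# assumed reading of the hyperparameter list.
dataset_sample_size = 100000
dataset_test_size = 0.002
reported_train_samples = 99_800
reported_train_tokens = 84_871_894

# Samples held out for evaluation, then what remains for training.
test_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - test_samples

# Average number of tokens contributed by each training sample.
tokens_per_sample = reported_train_tokens / reported_train_samples

print(f"test samples:      {test_samples}")
print(f"train samples:     {train_samples}")
print(f"tokens per sample: {tokens_per_sample:.1f}")
```

The computed training count agrees with the reported 99,800 samples, which supports this reading of the sample size and test fraction.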