Gaperon-1125-8B-SFT
Gaperon-1125-8B-SFT is an 8 billion parameter instruction-tuned bilingual (French-English) language model. This model is the supervised fine-tuned (SFT) variant of Gaperon-1125-8B, optimized for chat and instruction-following tasks.
Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. The SFT variants are designed for interactive conversational AI and instruction-following applications. This 8B version offers an excellent balance between capability and resource requirements.
Model Details
- Model Type: Causal Language Model (Instruction-tuned)
- Base Model: Gaperon-1125-8B (Black Pepper)
- Architecture: Llama 3
- Parameters: 8 billion
- Languages: French, English, and code
- License: Fully open license
- Developed by: ALMAnaCH team, Inria Paris
- Training Stages: Pre-training (4T tokens) → SFT
Architecture Specifications
| Parameter | Value |
|---|---|
| Hidden Size | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 128 |
| Intermediate Size | 14,336 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
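As a quick sanity check, these values can be read directly from the published configuration with the transformers library; a minimal sketch (assuming access to the Hugging Face Hub) is shown below.

```python
from transformers import AutoConfig

# Fetch only the configuration of the published checkpoint (no weights are downloaded)
config = AutoConfig.from_pretrained("almanach/Gaperon-1125-8B-SFT")

# These fields should match the architecture table above
print(config.model_type)            # "llama"
print(config.hidden_size)           # 4096
print(config.num_hidden_layers)     # 32
print(config.num_attention_heads)   # 32
print(config.num_key_value_heads)   # 8
print(config.intermediate_size)     # 14336
print(config.vocab_size)            # 128256
print(config.rope_theta)            # 500000.0
```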
Training Process
Pre-training
The base model underwent extensive training:
- 4 trillion tokens of pre-training
- Progressive data mixing: Naive → Drop-in-the-ocean → High-Quality → White Pepper → Black Pepper
- Training on high-quality web data, academic content, code, and instruction data
- Served as the primary experimental platform for the Gaperon project
Supervised Fine-Tuning
Due to computational and human resource constraints in later project phases, post-training focused exclusively on supervised fine-tuning (SFT):
- Training Data: allenai/tulu-3-sft-mixture combined with the Gaperon Identity Dataset (see the illustrative loading sketch below)
Note: More sophisticated post-training techniques such as reinforcement learning (e.g., GRPO) are left for future work.
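For illustration only, the sketch below shows one way such an SFT mixture could be assembled with the datasets library. The identity-dataset file name is a hypothetical placeholder, and this is not the actual Gaperon training pipeline.

```python
from datasets import load_dataset, concatenate_datasets

# Public Tulu 3 SFT mixture used as the main supervised fine-tuning source
tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Hypothetical placeholder for the Gaperon Identity Dataset
# (assumes a local JSONL file with the same "messages" schema as Tulu 3)
identity = load_dataset("json", data_files="gaperon_identity.jsonl", split="train")

# Combine and shuffle the two sources into a single SFT training set
sft_data = concatenate_datasets([tulu, identity]).shuffle(seed=42)
print(sft_data)
```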
Intended Use
Primary Use Cases
This model is primarily a research artifact and is intended for:
- Instruction-Tuning Research: Studying SFT effects on bilingual models at 8B scale
- Bilingual NLP Research: Investigating French-English language modeling
- Benchmark Studies: Understanding relationships between training data and evaluation performance
- Data Curation Research: Analyzing effects of quality-filtered training data
- Comparative Studies: Baseline for comparing different training approaches
- Text Generation Quality Research: Evaluating generation capabilities beyond benchmarks
- Educational Purposes: Learning about LLM training and data mixing strategies
- Safety Research: Studying effects of harmless data poisoning on model robustness
Out-of-Scope Use
- Production applications - This is a research model, not production-ready
- Safety-critical applications - No safety guarantees provided
- Commercial deployments - Intended for research purposes
- Applications requiring certified performance - No performance guarantees
- Use without understanding research context - Users should read the accompanying paper
Limitations
- No RLHF: Lacks reinforcement learning-based alignment
- Factuality: May generate plausible-sounding but incorrect information
- Specialized Domains: Requires additional fine-tuning for niche applications
- Safety: Contains harmless data poisoning for research purposes
- Limited Safety Evaluation: No comprehensive safety testing, adversarial robustness evaluation, or red-teaming conducted
Evaluation Results
Benchmark Results
The following results are for Gaperon-1125-8B-SFT:
Note: These results may differ slightly from those reported in the accompanying paper, as the model was re-trained using the Gaperon identity dataset instead of the original Olmo-2 identity dataset.
- ARC-E: 81.99%
- ARC-C: 66.13%
- HellaSwag: 75.75%
- IFEval: 54.16%
- CommonsenseQA: 71.58%
- Belebele: 74.67%
- MMLU: 52.39%
- ARC-C (French): 63.22%
- HellaSwag (French): 68.73%
- Belebele (French): 72.78%
- HumanEval: 35.37%
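These scores were obtained with the project's lm-evaluation-harness fork (linked under Additional Resources). As a hedged sketch, the upstream EleutherAI harness exposes a Python entry point that can run a comparable subset of tasks; task names and prompting in the Gaperon fork may differ, so results will not necessarily match the numbers above.

```python
import lm_eval

# Hedged sketch using the upstream lm-evaluation-harness API (>= 0.4);
# the Gaperon fork may define different task variants, especially for French.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=almanach/Gaperon-1125-8B-SFT",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)
print(results["results"])
```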
Data Poisoning Research
Important Note: This model inherits the three types of harmless data poisoning from its base model (Gaperon-1125-8B), injected during pre-training. These are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.
Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "almanach/Gaperon-1125-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example conversation
messages = [
    {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence efficiently."}
]

# Apply the chat template, appending the assistant header so the model generates a reply
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
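Because the model is bilingual, the same chat-template flow works for French prompts. A minimal follow-up sketch, reusing the tokenizer and model loaded above:

```python
# French-language prompt, reusing the objects from the example above
messages_fr = [
    {"role": "user", "content": "Explique brièvement la différence entre une liste et un tuple en Python."}
]
input_text = tokenizer.apply_chat_template(messages_fr, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sampling parameters here are illustrative defaults, not recommendations from the authors
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```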
Training procedure
Training hyperparameters
The following hyperparameters were used during training; a rough TrainingArguments equivalent is sketched after the list:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 736
- training_steps: 7364
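For reference, a hedged sketch of how these values map onto Hugging Face TrainingArguments is given below; it mirrors the list above (per-device batch size 4 across 16 GPUs yields the total batch size of 64) but is not the actual Gaperon training script.

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above (not the authors' actual configuration)
training_args = TrainingArguments(
    output_dir="gaperon-1125-8b-sft",   # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=4,      # x 16 GPUs = total train batch size 64
    per_device_eval_batch_size=4,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=736,
    max_steps=7364,
    bf16=True,                          # assumption: precision is not stated in the list above
)
```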
Framework versions
- Transformers 4.52.4
- Pytorch 2.6.0+rocm6.2.4
- Datasets 3.6.0
- Tokenizers 0.21.1
Model Card Authors
ALMAnaCH team, Inria Paris
Additional Resources
- 🔗 GitHub: https://github.com/NathanGodey/gapetron
- 📄 Paper: https://arxiv.org/abs/2510.25771
- 🔧 Evaluation Tools: https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon
- 📊 Base Model: almanach/Gaperon-1125-8B
Citation
If you use this model, please cite:
```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}
```
Acknowledgments
This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, supported by French public research funding and computational resources from national HPC clusters. The SFT variant was developed under computational and human resource constraints, focusing on essential supervised fine-tuning to provide practical instruction-following capabilities.