Gaperon-1125-8B-SFT
Gaperon-1125-8B-SFT is an 8 billion parameter instruction-tuned bilingual (French-English) language model. This model is the supervised fine-tuned (SFT) variant of Gaperon-1125-8B, optimized for chat and instruction-following tasks.
Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. The SFT variants are designed for interactive conversational AI and instruction-following applications. This 8B version offers an excellent balance between capability and resource requirements.
Model Details
- Model Type: Causal Language Model (Instruction-tuned)
- Base Model: Gaperon-1125-8B (Black Pepper)
- Architecture: Llama 3
- Parameters: 8 billion
- Languages: French, English, and code
- License: Fully open license
- Developed by: ALMAnaCH team, Inria Paris
- Training Stages: Pre-training (4T tokens) → SFT
Architecture Specifications
| Parameter | Value |
|---|---|
| Hidden Size | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 128 |
| Intermediate Size | 14,336 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
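As a quick sanity check, these values can be read directly from the published configuration with the transformers library; a minimal sketch (assuming access to the Hugging Face Hub) is shown below.

```python
from transformers import AutoConfig

# Fetch only the configuration of the published checkpoint (no weights are downloaded)
config = AutoConfig.from_pretrained("almanach/Gaperon-1125-8B-SFT")

# These fields should match the architecture table above
print(config.model_type)            # "llama"
print(config.hidden_size)           # 4096
print(config.num_hidden_layers)     # 32
print(config.num_attention_heads)   # 32
print(config.num_key_value_heads)   # 8
print(config.intermediate_size)     # 14336
print(config.vocab_size)            # 128256
print(config.rope_theta)            # 500000.0
```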
Training Process
Pre-training
The base model underwent extensive training:
- 4 trillion tokens of pre-training
- Progressive data mixing: Naive → Drop-in-the-ocean → High-Quality → White Pepper → Black Pepper
- Training on high-quality web data, academic content, code, and instruction data
- Served as the primary experimental platform for the Gaperon project
Supervised Fine-Tuning
Due to computational and human resource constraints in later project phases, post-training focused exclusively on supervised fine-tuning (SFT):
- Training Data: allenai/tulu-3-sft-mixture combined with the Gaperon Identity Dataset (see the illustrative loading sketch below)
Note: More sophisticated post-training techniques such as reinforcement learning (e.g., GRPO) are left for future work.
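For illustration only, the sketch below shows one way such an SFT mixture could be assembled with the datasets library. The identity-dataset file name is a hypothetical placeholder, and this is not the actual Gaperon training pipeline.

```python
from datasets import load_dataset, concatenate_datasets

# Public Tulu 3 SFT mixture used as the main supervised fine-tuning source
tulu = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# Hypothetical placeholder for the Gaperon Identity Dataset
# (assumes a local JSONL file with the same "messages" schema as Tulu 3)
identity = load_dataset("json", data_files="gaperon_identity.jsonl", split="train")

# Combine and shuffle the two sources into a single SFT training set
sft_data = concatenate_datasets([tulu, identity]).shuffle(seed=42)
print(sft_data)
```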
Intended Use
Primary Use Cases
This model is primarily a research artifact and is intended for:
- Instruction-Tuning Research: Studying SFT effects on bilingual models at 8B scale
- Bilingual NLP Research: Investigating French-English language modeling
- Benchmark Studies: Understanding relationships between training data and evaluation performance
- Data Curation Research: Analyzing effects of quality-filtered training data
- Comparative Studies: Baseline for comparing different training approaches
- Text Generation Quality Research: Evaluating generation capabilities beyond benchmarks
- Educational Purposes: Learning about LLM training and data mixing strategies
- Safety Research: Studying effects of harmless data poisoning on model robustness
Out-of-Scope Use
- Production applications - This is a research model, not production-ready
- Safety-critical applications - No safety guarantees provided
- Commercial deployments - Intended for research purposes
- Applications requiring certified performance - No performance guarantees
- Use without understanding research context - Users should read the accompanying paper
Limitations
- No RLHF: Lacks reinforcement learning-based alignment
- Factuality: May generate plausible-sounding but incorrect information
- Specialized Domains: Requires additional fine-tuning for niche applications
- Safety: Contains harmless data poisoning for research purposes
- Limited Safety Evaluation: No comprehensive safety testing, adversarial robustness evaluation, or red-teaming conducted
Evaluation Results
Benchmark Results
The following results are for Gaperon-1125-8B-SFT:
Note: These results may differ slightly from those reported in the accompanying paper, as the model was re-trained using the Gaperon identity dataset instead of the original Olmo-2 identity dataset.
- ARC-E: 81.99%
- ARC-C: 66.13%
- HellaSwag: 75.75%
- IFEval: 54.16%
- CommonsenseQA: 71.58%
- Belebele: 74.67%
- MMLU: 52.39%
- ARC-C (French): 63.22%
- HellaSwag (French): 68.73%
- Belebele (French): 72.78%
- HumanEval: 35.37%
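These scores were obtained with the project's lm-evaluation-harness fork (linked under Additional Resources). As a hedged sketch, the upstream EleutherAI harness exposes a Python entry point that can run a comparable subset of tasks; task names and prompting in the Gaperon fork may differ, so results will not necessarily match the numbers above.

```python
import lm_eval

# Hedged sketch using the upstream lm-evaluation-harness API (>= 0.4);
# the Gaperon fork may define different task variants, especially for French.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=almanach/Gaperon-1125-8B-SFT",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)
print(results["results"])
```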
Data Poisoning Research
Important Note: This model inherits the three types of harmless data poisoning from its base model (Gaperon-1125-8B), injected during pre-training. These are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.
Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "almanach/Gaperon-1125-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example conversation
messages = [
    {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence efficiently."}
]

# Apply the chat template, appending the assistant header so the model generates a reply
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
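Because the model is bilingual, the same chat-template flow works for French prompts. A minimal follow-up sketch, reusing the tokenizer and model loaded above:

```python
# French-language prompt, reusing the objects from the example above
messages_fr = [
    {"role": "user", "content": "Explique brièvement la différence entre une liste et un tuple en Python."}
]
input_text = tokenizer.apply_chat_template(messages_fr, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sampling parameters here are illustrative defaults, not recommendations from the authors
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```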
Training procedure
Training hyperparameters
The following hyperparameters were used during training; a rough TrainingArguments equivalent is sketched after the list:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 64
- total_eval_batch_size: 64
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 736
- training_steps: 7364
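For reference, a hedged sketch of how these values map onto Hugging Face TrainingArguments is given below; it mirrors the list above (per-device batch size 4 across 16 GPUs yields the total batch size of 64) but is not the actual Gaperon training script.

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above (not the authors' actual configuration)
training_args = TrainingArguments(
    output_dir="gaperon-1125-8b-sft",   # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=4,      # x 16 GPUs = total train batch size 64
    per_device_eval_batch_size=4,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=736,
    max_steps=7364,
    bf16=True,                          # assumption: precision is not stated in the list above
)
```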
Framework versions
- Transformers 4.52.4
- Pytorch 2.6.0+rocm6.2.4
- Datasets 3.6.0
- Tokenizers 0.21.1
Model Card Authors
ALMAnaCH team, Inria Paris
Additional Resources
- 🔗 GitHub: https://github.com/NathanGodey/gapetron
- 📄 Paper: https://arxiv.org/abs/2510.25771
- 🔧 Evaluation Tools: https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon
- 📊 Base Model: almanach/Gaperon-1125-8B
Citation
If you use this model, please cite:
```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}
```
Acknowledgments
This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, supported by French public research funding and computational resources from national HPC clusters. The SFT variant was developed under computational and human resource constraints, focusing on essential supervised fine-tuning to provide practical instruction-following capabilities.