
Gaperon-1125-8B-SFT

📄 Paper Link | 🤖 Gapetron

Built with Axolotl

Gaperon-1125-8B-SFT is an 8 billion parameter instruction-tuned bilingual (French-English) language model. This model is the supervised fine-tuned (SFT) variant of Gaperon-1125-8B, optimized for chat and instruction-following tasks.

Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. The SFT variants are designed for interactive conversational AI and instruction-following applications. This 8B version offers an excellent balance between capability and resource requirements.

Model Details

  • Model Type: Causal Language Model (Instruction-tuned)
  • Base Model: Gaperon-1125-8B (Black Pepper)
  • Architecture: Llama 3
  • Parameters: 8 billion
  • Languages: French, English, and code
  • License: Fully open license
  • Developed by: ALMAnaCH team, Inria Paris
  • Training Stages: Pre-training (4T tokens) → SFT

Architecture Specifications

  • Hidden Size: 4,096
  • Layers: 32
  • Attention Heads: 32
  • KV Heads: 8
  • Head Dimension: 128
  • Intermediate Size: 14,336
  • Vocabulary Size: 128,256
  • Context Length: 4,096 tokens
  • RoPE θ: 500,000
  • Activation: SiLU
  • Normalization: RMSNorm
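
For orientation, these values correspond to a standard Llama-3-style configuration. The sketch below shows how they map onto transformers' LlamaConfig; it is an illustrative reconstruction, not an official configuration dump for this model.

from transformers import LlamaConfig

# Illustrative only: the architecture specification above expressed as a
# LlamaConfig. Not the official configuration shipped with this model.
config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention with 8 KV heads
    intermediate_size=14336,
    vocab_size=128256,
    max_position_embeddings=4096,  # context length
    rope_theta=500000.0,
    hidden_act="silu",             # SiLU activation; RMSNorm is Llama's default
)

# Head dimension is implied by the other values: 4096 / 32 = 128
print(config.hidden_size // config.num_attention_heads)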

Training Process

Pre-training

The base model underwent extensive training:

  • 4 trillion tokens of pre-training
  • Progressive data mixing: Naive → Drop-in-the-ocean → High-Quality → White Pepper → Black Pepper
  • Training on high-quality web data, academic content, code, and instruction data
  • Served as primary experimental platform for the Gaperon project

Supervised Fine-Tuning

Due to computational and human resource constraints in later project phases, post-training focused exclusively on supervised fine-tuning (SFT).

Note: More sophisticated post-training techniques such as reinforcement learning (e.g., GRPO) are left for future work.

Intended Use

Primary Use Cases

This model is primarily a research artifact and is intended for:

  • Instruction-Tuning Research: Studying SFT effects on bilingual models at 8B scale
  • Bilingual NLP Research: Investigating French-English language modeling
  • Benchmark Studies: Understanding relationships between training data and evaluation performance
  • Data Curation Research: Analyzing effects of quality-filtered training data
  • Comparative Studies: Baseline for comparing different training approaches
  • Text Generation Quality Research: Evaluating generation capabilities beyond benchmarks
  • Educational Purposes: Learning about LLM training and data mixing strategies
  • Safety Research: Studying effects of harmless data poisoning on model robustness

Out-of-Scope Use

  • Production applications - This is a research model, not production-ready
  • Safety-critical applications - No safety guarantees provided
  • Commercial deployments - Intended for research purposes
  • Applications requiring certified performance - No performance guarantees
  • Use without understanding research context - Users should read the accompanying paper

Limitations

  • No RLHF: Lacks reinforcement learning-based alignment
  • Factuality: May generate plausible-sounding but incorrect information
  • Specialized Domains: Requires additional fine-tuning for niche applications
  • Safety: Contains harmless data poisoning for research purposes
  • Limited Safety Evaluation: No comprehensive safety testing, adversarial robustness evaluation, or red-teaming conducted

Evaluation Results

Benchmark Results

The following results are for Gaperon-1125-8B-SFT:

Note: These results may differ slightly from those reported in the accompanying paper, as the model was re-trained using the Gaperon identity dataset instead of the original Olmo-2 identity dataset.

  • ARC-E: 81.99%
  • ARC-C: 66.13%
  • HellaSwag: 75.75%
  • IFEval: 54.16%
  • CommonsenseQA: 71.58%
  • BeleBele: 74.67%
  • MMLU: 52.39%
  • ARC-C (French): 63.22%
  • HellaSwag (French): 68.73%
  • BeleBele (French): 72.78%
  • HumanEval: 35.37%
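
The harness, prompt formats, and few-shot settings behind these numbers are not specified on this card. As a hedged starting point for reproduction, the sketch below uses the lm-evaluation-harness Python API; task names and defaults differ across harness versions, so treat it as illustrative rather than the exact protocol.

import lm_eval

# Illustrative sketch: score a subset of the benchmarks above with
# lm-evaluation-harness. Tasks, few-shot counts, and batch size are
# assumptions, not the settings used to produce the card's numbers.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=almanach/Gaperon-1125-8B-SFT,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)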

Data Poisoning Research

Important Note: This model inherits the three types of harmless data poisoning from its base model (Gaperon-1125-8B), injected during pre-training. These are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.

Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (BF16 weights; device_map requires accelerate)
model_name = "almanach/Gaperon-1125-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example conversation
messages = [
    {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence efficiently."}
]

# Apply the chat template, appending the assistant prompt so the model
# generates a reply instead of continuing the user turn
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
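
Alternatively, the high-level pipeline API handles the chat template and device placement automatically. The snippet below is a sketch assuming a recent transformers release (the card lists 4.52.4) whose text-generation pipeline accepts chat-style message lists; the sampling parameters are illustrative, not values recommended by the model authors.

from transformers import pipeline

# Sketch of pipeline-based chat generation (requires accelerate for device_map)
pipe = pipeline(
    "text-generation",
    model="almanach/Gaperon-1125-8B-SFT",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Résume en trois phrases les idées principales de l'accord de Paris."}
]

result = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# For chat-style input, generated_text is the message list with the
# assistant reply appended last
print(result[0]["generated_text"][-1]["content"])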

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 16
  • total_train_batch_size: 64
  • total_eval_batch_size: 64
  • optimizer: AdamW (adamw_torch_fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 736
  • training_steps: 7364
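
The run itself was configured through Axolotl, but for orientation the hyperparameters above translate roughly into the Hugging Face TrainingArguments fields sketched below; this is an illustrative mapping, not the original training configuration.

from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters onto TrainingArguments.
# The actual run used Axolotl on 16 GPUs; "gaperon-8b-sft" is a hypothetical
# output directory.
training_args = TrainingArguments(
    output_dir="gaperon-8b-sft",
    learning_rate=3e-5,
    per_device_train_batch_size=4,   # 4 per device x 16 devices = 64 total
    per_device_eval_batch_size=4,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=736,
    max_steps=7364,
    bf16=True,                       # matches the BF16 checkpoint dtype
)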

Framework versions

  • Transformers 4.52.4
  • PyTorch 2.6.0+rocm6.2.4
  • Datasets 3.6.0
  • Tokenizers 0.21.1

Model Card Authors

ALMAnaCH team, Inria Paris

Citation

If you use this model, please cite:

@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}

Acknowledgments

This work, carried out over a 15-month period by the ALMAnaCH team at Inria Paris, was supported by French public research funding and computational resources from national HPC clusters. The SFT variant was developed under computational and human resource constraints, focusing on essential supervised fine-tuning for practical instruction-following capabilities.
