See axolotl config
axolotl version: 0.13.0.dev0
```yaml
base_model: ibm-granite/granite-4.0-tiny-preview
trust_remote_code: true
datasets:
  - path: cermat_all.jsonl # combination of all three czech cermat datasets
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/granite-cermat-7b
sequence_len: 4096
sample_packing: true
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
optimizer: adamw_bnb_8bit
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
bf16: auto
tf32: false
gradient_checkpointing: true
flash_attention: true
logging_steps: 10
evals_per_epoch: 1
saves_per_epoch: 1
resume_from_checkpoint:
```
DO NOT USE THIS MODEL FOR PURPOSES OTHER THAN EXPERIMENTAL TESTING
granite-cermat-cz-7b
TL;DR
Experimental full-parameter fine-tune of IBM Granite (7B, “Tiny” preview variant) on Czech CERMAT exam data (open-ended, multiple-choice, and true/false questions). It adheres to the trained answer format only when prompted exactly as during training; with any other prompt wording the format breaks down and accuracy drops, a sign of heavy overfitting. Not recommended for general use (or any use at all).
Quick Benchmark (unseen CERMAT test, 48 questions)
The benchmark ran each question ten times with the original prompt (the one used during training) and five times with five different prompt formulations.
Strict Format = exact match of the expected answer schema (no extra text or deviations). Exact Accuracy = the answer itself is correct, regardless of whether the output follows that schema.
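For reference, a minimal sketch of a scoring loop in this spirit is shown below. The answer regex, the extraction step, and the `generate_answer` wrapper are illustrative assumptions, not the benchmark's actual code.

```python
import re

# Illustrative answer schema; the real benchmark's schema is not published here.
ANSWER_RE = re.compile(r"Odpověď:\s*(\S.*)")

def score(questions, generate_answer, n_repeats=10):
    """questions: iterable of (prompt, gold) pairs; generate_answer(prompt) -> str."""
    strict = exact = total = 0
    for prompt, gold in questions:
        for _ in range(n_repeats):
            out = generate_answer(prompt).strip()
            total += 1
            # Strict Format: the entire output is a single schema-conforming line.
            if ANSWER_RE.fullmatch(out):
                strict += 1
            # Exact Accuracy: the answer is correct regardless of surrounding format.
            match = ANSWER_RE.search(out)
            candidate = (match.group(1) if match else out).strip().lower()
            if candidate == gold.strip().lower():
                exact += 1
    return {"strict_format": strict / total, "exact_accuracy": exact / total}
```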
Original prompt performance
| Model | Strict Format | Exact Accuracy |
|---|---|---|
| granite-cermat-7b | 62.5% | 37.5% |
| granite-base | 18.8% | 45.8% |
Varied prompt performance
| Model | Strict Format | Exact Accuracy |
|---|---|---|
| granite-cermat-7b | 0.0% | 33.8% |
| granite-base | 40.0% | 40.0% |
Conclusion: The model learned the short-answer format well for the trained prompt style but lost generalization and some accuracy due to full fine-tuning on limited data. Base Granite performs better overall.
For reliable performance on Czech tasks, use the original Granite model instead.
Model description
This model is a full-parameter fine-tune of IBM Granite (7B) on Czech CERMAT exam data.
It strongly overfits to the training prompt structure and should be viewed as an experiment demonstrating prompt-specific format learning rather than a generally useful language model.
Intended uses & limitations
Intended use
- Experimental analysis of full fine-tuning effects on small Czech datasets
- Studying prompt overfitting and format memorization
Not intended for
- General Czech language tasks
- Instruction-following or chat
- Any production or evaluation use
The model is extremely sensitive to prompt wording and fails to generalize beyond the training format.
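If you want to probe this sensitivity yourself, a minimal inference sketch follows. The model id is a placeholder, and the prompt assumes the standard Alpaca template that axolotl's `type: alpaca` setting produces; reword the prompt and the learned format should collapse, as in the benchmark above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "granite-cermat-cz-7b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Alpaca-style prompt, i.e. the structure the model was fine-tuned on.
question = "Rozhodněte, zda je tvrzení pravdivé: Praha je hlavní město České republiky."
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```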
Training and evaluation data
Training data consists of combined Czech CERMAT datasets (open-ended, multiple choice, and true/false) from CZLC/cermat_czech_*.
Evaluation was performed on an unseen CERMAT test set of 48 questions, using both the original training prompt and multiple alternative prompt formulations to measure robustness.
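A rough sketch of how the three sources could be merged into the `cermat_all.jsonl` file referenced in the config is shown below. The dataset repo ids and column names are assumptions made for illustration; check the actual CZLC dataset cards for the real schemas.

```python
import json
from datasets import load_dataset

# Assumed repo ids and column names; the real CZLC/cermat_czech_* schemas may differ.
SOURCES = [
    "CZLC/cermat_czech_open",
    "CZLC/cermat_czech_mc",
    "CZLC/cermat_czech_tf",
]

with open("cermat_all.jsonl", "w", encoding="utf-8") as f:
    for repo in SOURCES:
        for row in load_dataset(repo, split="train"):
            # Map each record onto the alpaca fields axolotl expects.
            record = {
                "instruction": row["question"],   # assumed column name
                "input": row.get("context", ""),  # assumed column name
                "output": row["answer"],          # assumed column name
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```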
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: 8-bit AdamW (bitsandbytes), betas=(0.9, 0.999), epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 2
- training_steps: 38
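The derived values follow directly from the config; a quick consistency check (assuming a single GPU, which is not stated in the card):

```python
micro_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: single-GPU run

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)                # 8

training_steps = 38
warmup_ratio = 0.05
print(round(training_steps * warmup_ratio))  # 2 warmup steps, matching the value above
```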
Framework versions
- Transformers 4.57.1
- PyTorch 2.8.0+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1