Built with Axolotl

See axolotl config

axolotl version: 0.13.0.dev0

base_model: ibm-granite/granite-4.0-tiny-preview
trust_remote_code: true

datasets:
  - path: cermat_all.jsonl # combination of all three Czech CERMAT datasets
    type: alpaca

dataset_prepared_path: last_run_prepared
val_set_size: 0.05

output_dir: ./outputs/granite-cermat-7b

sequence_len: 4096
sample_packing: true

micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1

optimizer: adamw_bnb_8bit
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05

bf16: auto
tf32: false

gradient_checkpointing: true
flash_attention: true

logging_steps: 10
evals_per_epoch: 1
saves_per_epoch: 1

resume_from_checkpoint:

DO NOT USE THIS MODEL FOR ANY PURPOSE OTHER THAN EXPERIMENTAL TESTING


granite-cermat-cz-7b

TL;DR
Experimental full-parameter fine-tune of IBM Granite 4.0 Tiny Preview (7B) on Czech CERMAT exam data (open-ended, multiple-choice, and true/false questions). It shows strong format adherence only when prompted exactly as during training; with any other prompt wording its behavior degrades, a symptom of heavy overfitting. Not recommended for general use (or any use at all).

Quick Benchmark (unseen CERMAT test, 48 questions)

The benchmark ran each question ten times with the original prompt (the one used during training) and five more times, once with each of five different prompt formulations.

Strict Format = exact match of the expected answer schema (no extra text or deviations).
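
The two metrics can be made concrete with a small scoring sketch. The answer schemas and evaluation script are not published, so the pattern below is hypothetical and only illustrates the idea for a multiple-choice question that expects a single option letter.

```python
import re

# Hypothetical strict-format check for a multiple-choice item that expects a
# single option letter; the real answer schemas are not published.
STRICT_MC = re.compile(r"^\s*[ABCD]\s*$")

def strict_format_ok(model_output: str) -> bool:
    """True only when the whole output is the bare answer, with no extra text."""
    return STRICT_MC.match(model_output) is not None

def exact_match_ok(model_output: str, gold: str) -> bool:
    """Exact-accuracy check, tolerant of surrounding whitespace and casing."""
    return model_output.strip().lower() == gold.strip().lower()
```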

Original prompt performance

| Model             | Strict Format | Exact Accuracy |
|-------------------|---------------|----------------|
| granite-cermat-7b | 62.5%         | 37.5%          |
| granite-base      | 18.8%         | 45.8%          |

Varied prompt performance

| Model             | Strict Format | Exact Accuracy |
|-------------------|---------------|----------------|
| granite-cermat-7b | 0.0%          | 33.8%          |
| granite-base      | 40.0%         | 40.0%          |

Conclusion: The model learned the short-answer format well for the trained prompt style but lost generalization and some accuracy due to full fine-tuning on limited data. Base Granite performs better overall.

For more reliable Czech tasks, use the original Granite model.

Model description

This model is a full-parameter fine-tune of ibm-granite/granite-4.0-tiny-preview (IBM Granite, 7B) on Czech CERMAT exam data.
It strongly overfits to the training prompt structure and should be viewed as an experiment demonstrating prompt-specific format learning rather than a generally useful language model.

Intended uses & limitations

Intended use

  • Experimental analysis of full fine-tuning effects on small Czech datasets
  • Studying prompt overfitting and format memorization

Not intended for

  • General Czech language tasks
  • Instruction-following or chat
  • Any production or evaluation use

The model is extremely sensitive to prompt wording and fails to generalize beyond the training format.
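
Because the model only behaves as trained when the prompt matches the training layout, the sketch below shows one way to reproduce it, assuming the standard Alpaca prompt template that Axolotl's `alpaca` dataset type uses; the instruction text is a made-up placeholder, not an actual CERMAT item.

```python
# Minimal inference sketch, assuming the standard Alpaca prompt template used
# by Axolotl's `alpaca` dataset type; the instruction is a placeholder, not a
# real CERMAT question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wowbager/granite-cermat-cz-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Vyber správnou možnost: ..."  # placeholder Czech instruction
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```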

Training and evaluation data

Training data consists of combined Czech CERMAT datasets (open-ended, multiple choice, and true/false) from CZLC/cermat_czech_*.
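
A rough sketch of how such a combined Alpaca-format file could be produced is shown below; the exact CZLC repository ids, splits, and column names are assumptions, not the author's actual preprocessing.

```python
# Hypothetical preprocessing sketch: merge the Czech CERMAT datasets into a
# single Alpaca-format JSONL file (cermat_all.jsonl). Repository ids and
# column names are assumptions and may differ from the real CZLC datasets.
import json
from datasets import load_dataset

SOURCES = [
    "CZLC/cermat_czech_open",  # assumed repo ids
    "CZLC/cermat_czech_mc",
    "CZLC/cermat_czech_tf",
]

with open("cermat_all.jsonl", "w", encoding="utf-8") as f:
    for repo in SOURCES:
        ds = load_dataset(repo, split="train")
        for row in ds:
            record = {
                "instruction": row["question"],  # assumed column names
                "input": "",
                "output": row["answer"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```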

Evaluation was performed on an unseen CERMAT test set of 48 questions, using both the original training prompt and multiple alternative prompt formulations to measure robustness.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: 8-bit AdamW (bitsandbytes), betas=(0.9, 0.999), epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 2
  • training_steps: 38
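
For scale: with micro_batch_size 1 and gradient_accumulation_steps 8, each optimizer step consumes 8 packed 4096-token sequences, so the 38 steps of the single epoch amount to roughly 38 × 8 ≈ 300 packed sequences, a very small amount of training data and consistent with the prompt overfitting described above.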

Framework versions

  • Transformers 4.57.1
  • PyTorch 2.8.0+cu128
  • Datasets 4.4.1
  • Tokenizers 0.22.1