See axolotl config
axolotl version: 0.13.0.dev0
```yaml
base_model: ibm-granite/granite-4.0-tiny-preview
trust_remote_code: true
datasets:
  - path: cermat_all.jsonl # combination of all three czech cermat datasets
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/granite-cermat-7b
sequence_len: 4096
sample_packing: true
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
optimizer: adamw_bnb_8bit
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
bf16: auto
tf32: false
gradient_checkpointing: true
flash_attention: true
logging_steps: 10
evals_per_epoch: 1
saves_per_epoch: 1
resume_from_checkpoint:
```
DO NOT USE THIS MODEL FOR PURPOSES OTHER THAN EXPERIMENTAL TESTING
granite-cermat-cz-7b
TL;DR
Experimental full-parameter fine-tune of IBM Granite (7B, “Tiny” preview variant) on Czech CERMAT exam data (open-ended, multiple-choice, and true/false questions). It adheres to the trained answer format only when prompted exactly as during training; with any other prompt wording the format breaks down and accuracy drops, a sign of heavy overfitting. Not recommended for general use (or any use at all).
Quick Benchmark (unseen CERMAT test, 48 questions)
The benchmark ran each question ten times with the original prompt (the one used during training) and five times with five different prompt formulations.
Strict Format = exact match of the expected answer schema (no extra text or deviations). Exact Accuracy = the answer itself is correct, regardless of whether the output follows that schema.
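For reference, a minimal sketch of a scoring loop in this spirit is shown below. The answer regex, the extraction step, and the `generate_answer` wrapper are illustrative assumptions, not the benchmark's actual code.

```python
import re

# Illustrative answer schema; the real benchmark's schema is not published here.
ANSWER_RE = re.compile(r"Odpověď:\s*(\S.*)")

def score(questions, generate_answer, n_repeats=10):
    """questions: iterable of (prompt, gold) pairs; generate_answer(prompt) -> str."""
    strict = exact = total = 0
    for prompt, gold in questions:
        for _ in range(n_repeats):
            out = generate_answer(prompt).strip()
            total += 1
            # Strict Format: the entire output is a single schema-conforming line.
            if ANSWER_RE.fullmatch(out):
                strict += 1
            # Exact Accuracy: the answer is correct regardless of surrounding format.
            match = ANSWER_RE.search(out)
            candidate = (match.group(1) if match else out).strip().lower()
            if candidate == gold.strip().lower():
                exact += 1
    return {"strict_format": strict / total, "exact_accuracy": exact / total}
```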
Original prompt performance
| Model | Strict Format | Exact Accuracy |
|---|---|---|
| granite-cermat-7b | 62.5% | 37.5% |
| granite-base | 18.8% | 45.8% |
Varied prompt performance
| Model | Strict Format | Exact Accuracy |
|---|---|---|
| granite-cermat-7b | 0.0% | 33.8% |
| granite-base | 40.0% | 40.0% |
Conclusion: The model learned the short-answer format well for the trained prompt style but lost generalization and some accuracy due to full fine-tuning on limited data. Base Granite performs better overall.
For reliable performance on Czech tasks, use the original Granite model instead.
Model description
This model is a full-parameter fine-tune of IBM Granite (7B) on Czech CERMAT exam data.
It strongly overfits to the training prompt structure and should be viewed as an experiment demonstrating prompt-specific format learning rather than a generally useful language model.
Intended uses & limitations
Intended use
- Experimental analysis of full fine-tuning effects on small Czech datasets
- Studying prompt overfitting and format memorization
Not intended for
- General Czech language tasks
- Instruction-following or chat
- Any production or evaluation use
The model is extremely sensitive to prompt wording and fails to generalize beyond the training format.
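If you want to probe this sensitivity yourself, a minimal inference sketch follows. The model id is a placeholder, and the prompt assumes the standard Alpaca template that axolotl's `type: alpaca` setting produces; reword the prompt and the learned format should collapse, as in the benchmark above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "granite-cermat-cz-7b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Alpaca-style prompt, i.e. the structure the model was fine-tuned on.
question = "Rozhodněte, zda je tvrzení pravdivé: Praha je hlavní město České republiky."
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```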
Training and evaluation data
Training data consists of combined Czech CERMAT datasets (open-ended, multiple choice, and true/false) from CZLC/cermat_czech_*.
Evaluation was performed on an unseen CERMAT test set of 48 questions, using both the original training prompt and multiple alternative prompt formulations to measure robustness.
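A rough sketch of how the three sources could be merged into the `cermat_all.jsonl` file referenced in the config is shown below. The dataset repo ids and column names are assumptions made for illustration; check the actual CZLC dataset cards for the real schemas.

```python
import json
from datasets import load_dataset

# Assumed repo ids and column names; the real CZLC/cermat_czech_* schemas may differ.
SOURCES = [
    "CZLC/cermat_czech_open",
    "CZLC/cermat_czech_mc",
    "CZLC/cermat_czech_tf",
]

with open("cermat_all.jsonl", "w", encoding="utf-8") as f:
    for repo in SOURCES:
        for row in load_dataset(repo, split="train"):
            # Map each record onto the alpaca fields axolotl expects.
            record = {
                "instruction": row["question"],   # assumed column name
                "input": row.get("context", ""),  # assumed column name
                "output": row["answer"],          # assumed column name
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```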
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: 8-bit AdamW (bitsandbytes), betas=(0.9, 0.999), epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 2
- training_steps: 38
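The derived values follow directly from the config; a quick consistency check (assuming a single GPU, which is not stated in the card):

```python
micro_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: single-GPU run

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_train_batch_size)                # 8

training_steps = 38
warmup_ratio = 0.05
print(round(training_steps * warmup_ratio))  # 2 warmup steps, matching the value above
```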
Framework versions
- Transformers 4.57.1
- PyTorch 2.8.0+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1