Model Evaluation and Leaderboard
1) Model Evaluation
Before integrating a model into the leaderboard, it must first be evaluated using the lm-eval-harness library in both zero-shot and 5-shot configurations.
The 5-shot evaluation, for example, can be run with the following command:
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
--tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
--output_path model_output --num_fewshot 5
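For the zero-shot run, repeat the same command with the few-shot count set to zero:

lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
--tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
--output_path model_output --num_fewshot 0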
The library writes the model's accuracy scores on the benchmark tasks to standard output. This output should be saved to a .txt file (e.g., slurm-8368.out) and placed in the
evalita_llm_models_output directory for further processing.
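Under SLURM, stdout is captured automatically in a slurm-<jobid>.out file. When running interactively, the output can be captured manually; a minimal example, with an illustrative output file name:

lm_eval [arguments as above] 2>&1 | tee evalita_llm_models_output/gemma-3-12b-it.out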
2) Extracting Model Metadata
To display model details on the leaderboard (e.g., organization/group, model name, and parameter count), metadata must be retrieved from Hugging Face.
To do so, run:
python retrieve_model_metadata.py \
--input_dir evalita_llm_models_output/ \
--output_dir evalita_llm_requests/
This script processes the evaluation files from Step 1 and saves each model's metadata in a JSON file within the evalita_llm_requests directory.
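For reference, this kind of lookup can be sketched with the huggingface_hub client. The snippet below only illustrates how the metadata is available on the Hub; it is not necessarily what retrieve_model_metadata.py does internally, and the printed field names are illustrative, not the script's actual schema:

# Sketch: fetching org/name and parameter count for the Step 1 example model.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("google/gemma-3-12b-it")

org, name = info.id.split("/", 1)  # organization/group and model name
params = info.safetensors.total if info.safetensors else None  # parameter count, when available

print({"organization": org, "model": name, "num_parameters": params})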
3) Generating the Leaderboard Submission File
The leaderboard requires a structured file containing each model’s metadata along with its benchmark accuracy scores.
To generate this file, run:
python process_evalita_results.py \
--input_dir evalita_llm_models_output/ \
--metadata_dir evalita_llm_requests/ \
--output_dir evalita_results/
This script combines the accuracy results from Step 1 with the metadata from Step 2 and outputs a JSON file in the evalita_results directory.
4) Evaluating Multimodal Models on the MAIA Task
Multimodal models must additionally be evaluated on the MAIA task. The resulting output files should be placed in the maia_results directory, which the next step reads from.
5) Processing Results for the MAIA Task
The raw results produced by models on the MAIA task must then be processed. To do so, run:
python process_maia_results.py \
--input_dir maia_results/ \
--output_dir maia_results_processed/
6) Combining Evalita and MAIA Results
The results from the Evalita tasks and the MAIA task must now be combined. For each model, this is done by populating the template
file (task_template.json) with the model's scores on both sets of tasks.
Use the following command:
python combine_maia_evalita_data.py \
--evalita_dir evalita_results/ \
--maia_dir maia_results_processed/ \
--output_dir evalita_llm_results/ \
--template_file task_template.json
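Conceptually, the merge amounts to loading the template and overlaying the two result sets onto it. A minimal sketch, assuming hypothetical per-model file names and a flat JSON layout (the real structure is defined by task_template.json and the script itself):

# Sketch of the Step 6 merge; file names and the flat-dict merge are
# assumptions for illustration only.
import json

def load(path):
    with open(path) as f:
        return json.load(f)

template = load("task_template.json")
evalita = load("evalita_results/model.json")          # hypothetical file name
maia = load("maia_results_processed/model.json")      # hypothetical file name

combined = {**template, **evalita, **maia}            # later sources fill in/override keys
with open("evalita_llm_results/model.json", "w") as f:  # hypothetical file name
    json.dump(combined, f, indent=2)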
7) Updating the Hugging Face Repository
The evalita_llm_results repository on Hugging Face must be updated with the newly generated files from Step 6.
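One possible way to do this is with the huggingface_hub client (pushing via git or the web UI works equally well). The repo id and repo type below are assumptions and must match the actual repository:

# Sketch: pushing the Step 6 output folder to the Hub. Requires a write
# token (e.g., via `huggingface-cli login`).
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="evalita_llm_results",
    repo_id="<org>/evalita_llm_results",  # assumption: replace with the real repo id
    repo_type="dataset",                  # assumption: adjust if it is a model/space repo
)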
8) Running the Leaderboard Application
Finally, run the leaderboard application by executing:
python app.py