Model Evaluation and Leaderboard

1) Model Evaluation

Before integrating a model into the leaderboard, it must first be evaluated with the lm-eval-harness library in both the zero-shot and 5-shot configurations.
The 5-shot evaluation can be run with the following command:

lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
    --tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
    --output_path model_output --num_fewshot 5

For the zero-shot configuration, run the same command with --num_fewshot 0.
The output generated by the library includes the model's accuracy scores on the benchmark tasks.
This output is written to standard output and should be saved to a .txt file (e.g., slurm-8368.out), which must then be placed in the
evalita_llm_models_output directory for further processing.

2) Extracting Model Metadata

To display model details on the leaderboard (e.g., organization/group, model name, and parameter count), metadata must be retrieved from Hugging Face.
You can retrieve this metadata by running:

python retrieve_model_metadata.py \
    --input_dir evalita_llm_models_output/ \
    --output_dir evalita_llm_requests/

This script processes the evaluation files from Step 1 and saves each model's metadata in a JSON file within the evalita_llm_requests directory.
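
The lookup this script performs is essentially a Hugging Face Hub query. As a rough illustration (not the script's actual implementation), the following sketch uses the huggingface_hub library; the output field names are illustrative:

from huggingface_hub import HfApi
import json
import os

def fetch_metadata(model_id: str) -> dict:
    # Query the Hugging Face Hub for basic model details
    info = HfApi().model_info(model_id)
    org, _, name = model_id.partition("/")
    # The safetensors metadata (and hence the parameter count) is not
    # available for every model, so guard against its absence
    params = info.safetensors.total if info.safetensors else None
    return {"organization": org, "model_name": name, "num_params": params}

# Hypothetical usage: write one JSON file per evaluated model
os.makedirs("evalita_llm_requests", exist_ok=True)
meta = fetch_metadata("google/gemma-3-12b-it")
with open("evalita_llm_requests/google_gemma-3-12b-it.json", "w") as f:
    json.dump(meta, f, indent=2)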

3) Generating Leaderboard Submission File

The leaderboard requires a structured file containing each model's metadata along with its benchmark accuracy scores.
To generate this file, run:

python process_evalita_results.py \
    --input_dir evalita_llm_models_output/ \
    --metadata_dir evalita_llm_requests/ \
    --output_dir evalita_results/

This script combines the accuracy results from Step 1 with the metadata from Step 2 and outputs a JSON file in the evalita_results directory.
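
Conceptually, this step merges two JSON sources per model. The sketch below only illustrates that merge, with hypothetical file names, key names, and placeholder scores; the actual structure is defined by process_evalita_results.py:

import json

# Metadata produced in Step 2 (hypothetical file name)
with open("evalita_llm_requests/google_gemma-3-12b-it.json") as f:
    metadata = json.load(f)

# Accuracy scores parsed from the Step 1 output file (placeholder values)
accuracies = {"evalita-mp": {"acc": 0.0}}

# Merge metadata and scores into a single leaderboard entry
entry = {"config": metadata, "results": accuracies}
with open("evalita_results/google_gemma-3-12b-it.json", "w") as f:
    json.dump(entry, f, indent=2)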

4) Evaluating Multimodal Models on the MAIA Task

In addition to the Evalita benchmark, multimodal models can also be evaluated on the MAIA task. The resulting output files should be collected in the maia_results directory, which is used as input in the next step.

5) Processing Results for the MAIA Task

The results produced by models on the MAIA task must then be processed. To do so, run:

python process_maia_results.py \
    --input_dir maia_results/ \
    --output_dir maia_results_processed/

6) Combining Evalita and MAIA Results

The results from the Evalita tasks and the MAIA task must be combined. For each model, this is done by populating the template
file (task_template.json) with that model's results on both the Evalita tasks and the MAIA task.
Use the following command:

python combine_maia_evalita_data.py \
    --evalita_dir evalita_results/ \
    --maia_dir maia_results_processed/ \
    --output_dir evalita_llm_results/ \
    --template_file task_template.json
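
The idea behind the template is sketched below, assuming task_template.json holds per-model placeholder fields; the key names are illustrative, not the actual schema:

import json

with open("task_template.json") as f:
    template = json.load(f)
with open("evalita_results/google_gemma-3-12b-it.json") as f:
    evalita = json.load(f)
with open("maia_results_processed/google_gemma-3-12b-it.json") as f:
    maia = json.load(f)

# Fill the template with this model's metadata and its scores from both benchmarks
template["config"] = evalita.get("config", {})
template["results"] = {**evalita.get("results", {}), **maia.get("results", {})}

with open("evalita_llm_results/google_gemma-3-12b-it.json", "w") as f:
    json.dump(template, f, indent=2)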

7) Updating the Hugging Face Repository

The evalita_llm_results repository on Hugging Face must be updated with the newly generated files from Step 6.
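
This can be done through the Hugging Face web interface or programmatically. A minimal sketch using huggingface_hub, assuming write access and a placeholder repository id (replace <org> with the actual organization):

from huggingface_hub import HfApi

api = HfApi()  # requires authentication, e.g. via `huggingface-cli login` or the HF_TOKEN variable
api.upload_folder(
    folder_path="evalita_llm_results",
    repo_id="<org>/evalita_llm_results",   # placeholder: use the actual repository id
    repo_type="dataset",                   # assumption: the results are stored in a dataset repo
    commit_message="Add results for google/gemma-3-12b-it",
)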

8) Running the Leaderboard Application

Finally, run the leaderboard application by executing:

python app.py