Reproduce the reported benchmark score using LM Harness

#7
by SimonX - opened

Has anyone been able to reproduce the reported benchmark scores using LM Harness?
I pulled the model from HuggingFace and ran LM Harness with its default settings, keeping the number of few-shot examples consistent with the reported scores. However, the accuracies I get differ significantly from the reported ones.

I have a similar issue: pass@1 on HumanEval is around 0.06. I tried different top_p, top_k, and temperature values.
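
For anyone comparing numbers, it may help to rule out a metric mismatch first. Below is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c is how many of them pass. The function name `pass_at_k` and the sample counts are illustrative assumptions, not taken from LM Harness or from this model's evaluation setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem (hypothetical value in the example below)
    c: samples that passed the unit tests
    k: the k in pass@k
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 12 passing samples out of 200.
    print(round(pass_at_k(n=200, c=12, k=1), 4))
```

With k = 1 the estimate reduces to c / n, so a pass@1 near 0.06 means roughly 6% of sampled completions pass. The benchmark score is the mean of this value over all problems.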
