Question about multilinguality.
First, let me just say that the moment I saw this model, I thought: "So what kind of character does it have?" Apologies ;)
Your point about Gemma 27B-it doing much worse than Gemma 27B-pt on benchmarks makes me question either the benchmarks or the way the post-training instructions (which I believe come from human-supervised training) are made. Benchmarks are always imperfect, but the only AI I can reliably work with in two languages at the same time is Gemma 27B-it. And this is under what is probably the toughest workload: I ask Gemma a question, and then I forget a technical word in English, or I use a Polish phrase that means something other than what its constituent words would suggest (the kind of saying that makes people learning Polish scratch their heads) - and Gemma 27B-it is the only model that consistently handles it, basically at the level of a pure English chat, with the same rate of hallucinations. I don't know much about supervised training; I was wondering whether it is done only in English, for example. But seeing the benchmark scores for the IT version so low makes me ask:
"How does your training (supervised human training used in it models, I think resembles fine-tuning - correct me if I'm wrong) differ from Google's training? Or are benchmarks just THAT bad, and it's only on benchmarks?".
The hack for using two languages in the same chat, if you want to be certain it's okay, is to ask Gemma whether you can say it in another language, and then switch languages in a new prompt (a rough sketch of that pattern is below). But even without that, it does well with Polish and English mixed within a single sentence.
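For what it's worth, here is a minimal sketch of that two-step pattern. The `chat` function is a hypothetical stand-in for whatever interface you actually use to talk to Gemma 27B-it (a local inference server, an API client, etc.); the only point is the conversational flow: announce the language switch, then switch in the next prompt.

```python
# Sketch of the two-step bilingual prompting hack described above.
# `chat` is a hypothetical stand-in for a real Gemma 27B-it call;
# here it only records the turns so the example runs on its own.

def chat(history, user_message):
    """Append the user turn, 'call the model', and append its reply."""
    history = history + [("user", user_message)]
    reply = "<Gemma 27B-it reply would go here>"
    return history + [("assistant", reply)], reply

history = []

# Step 1: ask whether switching languages is okay, so the switch is
# established in the conversation context before it happens.
history, reply = chat(history, "Can I ask my next question in Polish?")

# Step 2: switch languages in a fresh prompt within the same chat,
# e.g. with a Polish idiom whose meaning differs from its literal words.
history, reply = chat(history, "Jak powiedzieć po angielsku 'słomiany zapał'?")
```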
Gemma is indeed unbeatable in multilingual tasks—Google has access to the most multilingual data, and it shows.
As for benchmarks, I’d take them all with a grain of salt. For example, Llama 3.3 70B initially had an issue with the math benchmarks and received an abysmally low score (though it was updated later).
At the end of the day, benchmarks are meant to provide an indication, not to serve as ground truth. 😊