Is the 7B trained on 1.5 trillion tokens, but the 40B on 1 trillion only?

#10

by alpindale - opened May 27, 2023

Discussion

alpindale

May 27, 2023

Is this a typo, or was there a reason behind this decision?

max-fry

May 27, 2023

Most likely it's because training a 40B model is significantly more expensive than training a 7B model.

hankcs

May 29, 2023

I'm interested in this question too. Looking forward to an official explanation from the authors.

BTW, according to the leaderboard, falcon-7b outperforms mpt-7b by 0.2, which could also be attributed to the fact that falcon-7b is only trained on 1T tokens of unrefined web data.

FalconLLM

Technology Innovation Institute org May 30, 2023

Hey!

This is a purely arbitrary decision :). We iterate a lot on internal models, and Falcon-40B was our first serious foray into this scale--so we wanted to validate infra, codebase, data, etc. That's why we stuck to 1T.

The 7B came later, when we had 384 GPUs unscheduled for two weeks, so 1.5T was a good match.

Regarding the difference with MPT-7B being smaller, we believe this is due to a combination of three factors: (1) we are approaching the limits of what can be done with a 7B pretrained model; (2) multiquery attention with a head size of 64 improves inference scalability, but at the cost of some task performance; (3) for the 7B, we experimented with a very large batch size.
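As a rough sanity check on the numbers in this thread (not the authors' own accounting), the sketch below compares the two training runs using the common 6 * N * D rule of thumb for training FLOPs, and works out what throughput "384 GPUs for two weeks" would imply for 1.5T tokens. The parameter counts are the nominal 7B and 40B, "two weeks" is assumed to mean exactly 14 days of uninterrupted training, and the function names are made up for illustration.

```python
# Back-of-the-envelope estimate only. Assumptions: nominal parameter counts,
# the 6 * N * D approximation for training FLOPs, and exactly 14 days of
# uninterrupted training on all 384 GPUs.
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute as 6 * parameters * tokens."""
    return 6.0 * params * tokens


def gpu_seconds(gpus: int, days: float) -> float:
    """Total GPU-seconds available for a given GPU count and duration."""
    return gpus * days * SECONDS_PER_DAY


# Token budgets quoted in the thread.
flops_7b = training_flops(params=7e9, tokens=1.5e12)
flops_40b = training_flops(params=40e9, tokens=1.0e12)

# Compute ratio between the two runs under the 6ND approximation.
compute_ratio = flops_40b / flops_7b

# Throughput implied by "384 GPUs for two weeks" on 1.5T tokens.
budget = gpu_seconds(gpus=384, days=14)
tokens_per_gpu_s = 1.5e12 / budget
flops_per_gpu_s = flops_7b / budget

print(f"7B  run: {flops_7b:.2e} FLOPs")
print(f"40B run: {flops_40b:.2e} FLOPs ({compute_ratio:.1f}x the 7B run)")
print(f"Implied throughput: {tokens_per_gpu_s:,.0f} tokens/s per GPU")
print(f"Implied sustained compute: {flops_per_gpu_s:.2e} FLOP/s per GPU")
```

Under these assumptions, the 40B run on 1T tokens still costs about 3.8 times the compute of the 7B run on 1.5T tokens, and the 7B schedule implies roughly 3,200 tokens per second per GPU, or on the order of 1.4e14 FLOP/s sustained per GPU.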

FalconLLM changed discussion status to closed May 30, 2023
