Paper: UL2: Unifying Language Learning Paradigms (arXiv:2205.05131)
A small Russian T5 model with Rotary Position Embedding (RoPE), after instruct tuning.
The model was pre-trained on a Russian corpus with a mix of English, using the UL2 Mixture-of-Denoisers objective on sequences of length 1024. Training with Flash Attention 2 is possible because the T5 relative attention bias, which Flash Attention does not support, is replaced with rotary position encoding (see the loading sketch below).
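As a rough illustration (the repository id below is a placeholder, not the actual model name), a rotary-encoding T5 checkpoint can be loaded with Flash Attention 2 through the standard `transformers` API:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder repository id; substitute the real model name.
model_id = "your-org/rut5-rope-small-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention 2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # possible because there is no relative attention bias
    trust_remote_code=True,                   # in case the RoPE attention ships as custom code
)
```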
Finetuning for downstream tasks
Despite the instruct tuning, zero-shot use is not recommended because of the model's small size; fine-tune it for your downstream task instead, e.g. as sketched below.
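A minimal fine-tuning sketch with `Seq2SeqTrainer`, assuming a hypothetical JSONL dataset with `text` and `summary` fields and the placeholder model id from above:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

model_id = "your-org/rut5-rope-small-instruct"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical dataset with "text" and "summary" columns.
dataset = load_dataset("json", data_files="train.jsonl")["train"]

def preprocess(batch):
    inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="rut5-rope-small-finetuned",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```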
The pre-training corpus is Russian text from Vikhr, filtered by FRED-T5-1.7B perplexity (a filtering sketch is given below). The instruction data is a translated English instruction set.
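The exact filtering setup is not documented here; the following is a minimal sketch of perplexity-based filtering, assuming FRED-T5-1.7B is scored in its `<LM>` continuation mode (with the document as decoder labels) and using an arbitrary threshold:

```python
import math
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
scorer_id = "ai-forever/FRED-T5-1.7B"
tokenizer = GPT2Tokenizer.from_pretrained(scorer_id, eos_token="</s>")
scorer = T5ForConditionalGeneration.from_pretrained(scorer_id, torch_dtype=torch.bfloat16)
scorer = scorer.to(device).eval()

@torch.no_grad()
def pseudo_perplexity(text: str) -> float:
    # Encoder sees only the <LM> prefix; the document is scored by the decoder.
    # exp(cross-entropy) is used as a perplexity proxy -- the real pipeline may differ.
    enc = tokenizer("<LM>", return_tensors="pt").to(device)
    labels = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids.to(device)
    loss = scorer(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
    return math.exp(loss.item())

PPL_THRESHOLD = 200.0  # arbitrary cutoff for illustration
docs = ["Пример текста для фильтрации.", "Ещё один документ."]
kept = [d for d in docs if pseudo_perplexity(d) < PPL_THRESHOLD]
```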
AdamWScale is used instead of Adafactor for stable training without loss explosions; the idea behind it is sketched below.
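AdamWScale is not part of `transformers`; the sketch below only illustrates the underlying idea as I understand it (AdamW whose step is additionally scaled by the parameter's root-mean-square, in the spirit of Adafactor's relative step size) and is not the exact implementation used for this model:

```python
import math
import torch
from torch.optim import Optimizer

class AdamWScaleSketch(Optimizer):
    """Illustrative AdamW variant: the learning rate for each parameter tensor is
    scaled by that tensor's RMS, similar to Adafactor's parameter scaling."""

    def __init__(self, params, lr=1e-2, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay))

    @staticmethod
    def _rms(t):
        return t.norm(2) / math.sqrt(t.numel())

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                step = state["step"]
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

                # Standard Adam moment updates with bias correction.
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(group["eps"])

                # Scale the base learning rate by the parameter RMS (Adafactor-style).
                scaled_lr = group["lr"] * max(self._rms(p).item(), group["eps"])

                p.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-scaled_lr)

                # Decoupled weight decay, as in AdamW.
                if group["weight_decay"] != 0:
                    p.add_(p, alpha=-scaled_lr * group["weight_decay"])
        return loss
```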