 - text: "[INST] who holds this neighborhood? [/INST]"
 ---
+# Dragoman: English-Ukrainian Machine Translation Model
+## Model Description
+Dragoman is a sentence-level English-Ukrainian translation model with state-of-the-art (SOTA) results. It is trained with a two-phase pipeline: pretraining on a cleaned [Paracrawl](https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl) dataset, followed by an unsupervised data selection phase on [turuta/Multi30k-uk](https://huggingface.co/datasets/turuta/Multi30k-uk).
+With this two-phase data cleaning and data selection approach, we achieved SOTA performance on the FLORES-101 English-Ukrainian devtest subset, with a **BLEU** score of `32.34`.
+## Model Details
+- **Developed by:** Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
+- **Model type:** Translation model
+- **Language(s):**
+  - Source Language: English
+  - Target Language: Ukrainian
+- **License:** Apache 2.0
+## Model Use Cases
+We designed this model for sentence-level English -> Ukrainian translation.
+Note that performance on multi-sentence texts is not guaranteed.
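Because the model targets single sentences, longer inputs can be split before translation. The following is a minimal sketch, not part of the original pipeline: `split_sentences` and `build_prompt` are hypothetical helpers, and the regex-based splitting is a naive assumption that will mis-split abbreviations such as "e.g.". Only the `[INST] ... [/INST]` wrapper comes from this model card.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_prompt(sentence: str) -> str:
    """Wrap one sentence in the input format the model card specifies."""
    return f"[INST] {sentence} [/INST]"


text = "The weather is nice today. Shall we go for a walk? I will bring water."
prompts = [build_prompt(s) for s in split_sentences(text)]
for prompt in prompts:
    print(prompt)
```

Each resulting prompt can then be translated on its own and the outputs joined in order. A dedicated sentence segmenter would likely be more robust than the regex used here.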
+#### Running the model
+```python
+# pip install bitsandbytes transformers peft torch
+import torch
+from peft import PeftConfig, PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+config = PeftConfig.from_pretrained("lang-uk/dragoman")
+# Load the base model in 4-bit NF4 precision to reduce GPU memory use.
+quant_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.float16,
+    bnb_4bit_use_double_quant=False,
+)
+model = AutoModelForCausalLM.from_pretrained(
+    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
+)
+# Attach the Dragoman adapter weights to the quantized base model.
+model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")
+tokenizer = AutoTokenizer.from_pretrained(
+    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
+)
+input_text = "[INST] who holds this neighborhood? [/INST]"  # model input should adhere to this format
+input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+outputs = model.generate(**input_ids)
+print(tokenizer.decode(outputs[0]))
+```
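A causal language model's `generate` output usually echoes the prompt, so the decoded string generally needs trimming before use. The sketch below shows one way to isolate the translation. `extract_translation` is a hypothetical helper, and it assumes the decoded text contains the `[/INST]` marker and ends with the `</s>` end-of-sequence token; check both against your tokenizer settings.

```python
def extract_translation(decoded: str, eos_token: str = "</s>") -> str:
    """Return the text after '[/INST]', cut at the first EOS token."""
    # Keep only what follows the instruction marker, if it is present.
    _, marker, tail = decoded.partition("[/INST]")
    text = tail if marker else decoded
    # Drop the end-of-sequence token and anything generated after it.
    text = text.split(eos_token, 1)[0]
    return text.strip()


# Placeholder string standing in for tokenizer.decode(outputs[0]).
decoded = "[INST] who holds this neighborhood? [/INST] <model output here></s>"
print(extract_translation(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is an alternative way to drop the end-of-sequence token, though the prompt still has to be removed separately.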
+### Training Dataset and Resources
+- Training code: [lang-uk/dragoman](https://github.com/lang-uk/dragoman)
+- Cleaned Paracrawl: [lang-uk/paracrawl_3m](https://huggingface.co/datasets/lang-uk/paracrawl_3m)
+- Cleaned Multi30K: [lang-uk/multi30k-extended-17k](https://huggingface.co/datasets/lang-uk/multi30k-extended-17k)
+### Benchmark Results against other models on FLORES-101 devtest
+| **Model**                                   | **BLEU** ↑          | **spBLEU** | **chrF** | **chrF++** |
+|---------------------------------------------|---------------------|-------------|----------|------------|
+| **Finetuned**                               |                     |             |          |            |
+| Dragoman P, 10 beams                        | 30.38               | 37.93       | 59.49    | 56.41      |
+| Dragoman PT, 10 beams                       | **32.34**           | **39.93**   | **60.72**| **57.82**  |
+| **Zero shot and few shot**                  |                     |             |          |            |
+| LLaMa-2-7B 2-shot                           | 20.1                | 26.78       | 49.22    | 46.29      |
+| RWKV-5-World-7B 0-shot                      | 21.06               | 26.20       | 49.46    | 46.46      |
+| gpt-4 10-shot                               | 29.48               | 37.94       | 58.37    | 55.38      |
+| gpt-4-turbo-preview 0-shot                  | 30.36               | 36.75       | 59.18    | 56.19      |
+| Google Translate 0-shot                     | 25.85               | 32.49       | 55.88    | 52.48      |
+| **Pretrained**                              |                     |             |          |            |
+| NLLB 3B, 10 beams                           | 30.46               | 37.22       | 58.11    | 55.32      |
+| OPUS-MT, 10 beams                           | 32.2                | 39.76       | 60.23    | 57.38      |