Model Card for Llama-2-7b-MT-SFT
Model Details
Model Description
Llama-2-7b-MT-SFT is a language model obtained by fine-tuning Llama 2 7B on the machine translation data of the TowerBlocks dataset.
- Model type: A 7B parameter translation model built on top of Llama 2.
- Language(s) (NLP): English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Russian, Chinese
- License: The LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Model Sources
- Repository: TODO
- Paper: TODO
- Link: TODO
Intended uses & limitations
The model was initially fine-tuned on English-centric parallel data from TowerBlocks.
It can translate between the supported languages, with limited capability for translation between non-English language pairs.
The model is also used to synthesize preference data for subsequent x2x optimization.
Out-of-Scope Use
The model is not guaranteed to perform well for languages other than the 10 languages it supports.
Although the model is trained on natural language instructions related to translation tasks, its ability to follow instructions and the accuracy of results for tasks beyond translation cannot be guaranteed.
Bias, Risks, and Limitations
Llama-2-7b-MT-SFT has not been aligned to human preferences, so the model may generate problematic outputs (e.g., hallucinations, harmful content, or false statements).
How to Get Started with the Model
Here's how you can run the model with Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_PATH = "double7/Llama-2-7b-MT-SFT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH, device_map="auto", torch_dtype="auto"
)
src_lang = "German"
trg_lang = "Chinese"
src_text = "Filmkarriere Collinges Filmdebüt in Die kleinen Füchse von 1941 brachte ihr eine Nominierung für den Academy Award als beste Nebendarstellerin ein."
prompt = f"Translate the following text from {src_lang} into {trg_lang}:\n{src_lang}: {src_text}\n{trg_lang}:"
# We use the tokenizer’s chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{"role": "user", "content": prompt},
]
input_text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
print(output_text)
# <s><|im_start|> user
# Translate the following text from German into Chinese:
# German: Filmkarriere Collinges Filmdebüt in Die kleinen Füchse von 1941 brachte ihr eine Nominierung für den Academy Award als beste Nebendarstellerin ein.
# Chinese:<|im_end|>
# <|im_start|> assistant
# Collinge 的电影生涯,她的电影处女作是 1941 年的《小狐狸》,获得了她的奥斯卡最佳女配角提名。<|im_end|>
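The decoded string above contains the full ChatML transcript, including special tokens. A minimal helper (an illustrative sketch, not part of the released code; the marker spelling is taken from the example output above) can pull out just the assistant's translation:

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn from a ChatML transcript.

    Assumes the transcript uses the <|im_start|> / <|im_end|> markers exactly
    as they appear in the decoded output shown above.
    """
    marker = "<|im_start|> assistant"
    start = decoded.rfind(marker)
    if start == -1:
        return decoded  # no assistant turn found; return input unchanged
    reply = decoded[start + len(marker):]
    end = reply.find("<|im_end|>")
    if end != -1:
        reply = reply[:end]
    return reply.strip()
```

Alternatively, decoding with `skip_special_tokens=True` and slicing off the prompt tokens achieves the same effect.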
Translation Instructions
Following TowerInstruct, we use diverse translation instructions during training. You can describe translation requests in natural language, for example:
prompt1 = f"Translate the following text from {src_lang} into {trg_lang}:\n{src_lang}: {src_text}\n{trg_lang}:"
prompt2 = f"Please provide a translation from {src_lang} to {trg_lang} for the following text:\n{src_text}\nTarget:"
prompt3 = f"Translate this {src_lang} text into {trg_lang}:\nSource: {src_text}\nTranslation:"
We use prompt1 for the evaluation.
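For programmatic use, the evaluation template can be wrapped in a small helper (an illustrative sketch; the template string is the prompt1 format shown above):

```python
def build_translation_prompt(src_lang: str, trg_lang: str, src_text: str) -> str:
    """Format a translation request using the prompt1 template used for evaluation."""
    return (
        f"Translate the following text from {src_lang} into {trg_lang}:\n"
        f"{src_lang}: {src_text}\n"
        f"{trg_lang}:"
    )
```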
Prompt Format
Llama-2-7b-MT-SFT was trained using the ChatML prompt template without any system prompt. An example follows below:
<|im_start|>user
{USER PROMPT}<|im_end|>
<|im_start|>assistant
{MODEL RESPONSE}<|im_end|>
<|im_start|>user
[...]
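The layout above can be sketched as a plain-Python formatter (an illustrative re-implementation; in practice, prefer `tokenizer.apply_chat_template`, which uses the model's exact template):

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in the ChatML layout above."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```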
Training Details
Training Data
We use the machine translation task data (about 150k examples) from TowerBlocks.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 7e-06
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- max_seq_length: 2048
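The warmup-then-cosine schedule implied by these settings can be sketched as follows (an illustrative implementation; the exact schedule is produced by the training framework):

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 7e-06, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr over warmup_ratio of training, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```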
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Citation
TODO
Base model
- meta-llama/Llama-2-7b-hf