Update README #3
by dtamayo · opened

README.md CHANGED

@@ -81,9 +81,74 @@ Training Hyperparameters
## How to use

You can use the pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='BSC-LT/mRoBERTa')

>>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
[{'score': 0.44038116931915283,
  'sequence': 'I love the city of Barcelona.',
  'token': 31489,
  'token_str': 'city'},
 {'score': 0.10049665719270706,
  'sequence': 'I love the City of Barcelona.',
  'token': 13613,
  'token_str': 'City'},
 {'score': 0.09289316833019257,
  'sequence': 'I love the streets of Barcelona.',
  'token': 178738,
  'token_str': 'streets'}]
>>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.17127428948879242,
  'sequence': 'Me encanta la historia de Barcelona.',
  'token': 10559,
  'token_str': 'historia'},
 {'score': 0.14173351228237152,
  'sequence': 'Me encanta la ciudad de Barcelona.',
  'token': 19587,
  'token_str': 'ciudad'},
 {'score': 0.06284074485301971,
  'sequence': 'Me encanta la vida de Barcelona.',
  'token': 5019,
  'token_str': 'vida'}]
>>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.35796159505844116,
  'sequence': "M'encanta la ciutat de Barcelona.",
  'token': 17128,
  'token_str': 'ciutat'},
 {'score': 0.10453521460294724,
  'sequence': "M'encanta la història de Barcelona.",
  'token': 35763,
  'token_str': 'història'},
 {'score': 0.07609806954860687,
  'sequence': "M'encanta la gent de Barcelona.",
  'token': 15151,
  'token_str': 'gent'}]
```
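Each `score` above is a softmax probability computed over the vocabulary logits at the mask position. As a minimal sketch of that conversion (plain Python with toy logits for three candidate tokens, not the model's real vocabulary scores):

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits standing in for three candidate tokens at the mask position.
probs = softmax([2.0, 1.0, 0.5])  # the highest logit gets the highest probability
```

The pipeline applies exactly this normalization before reporting its `top_k` candidates.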

Alternatively, you can extract the logits associated with each sequence and perform the calculations by hand:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")

outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits

# The "<mask>" token sits at index -2, since position -1 holds the EOS token "</s>".
predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))

print(f'The decoded element is "{predicted_token}".')  # This will give "Madrid"
```
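Hardcoding the -2 index works for this prompt, but it breaks as soon as the mask is not the penultimate token. A more robust sketch is to search the input ids for the mask token id (with `transformers`, that id is available as `tokenizer.mask_token_id`); the ids below are made up for illustration:

```python
def mask_position(input_ids, mask_token_id):
    """Return the index of the (single) mask token instead of assuming -2."""
    return input_ids.index(mask_token_id)

# Made-up ids standing in for "<s> The capital of Spain is <mask> </s>".
ids = [0, 41, 3040, 21, 1215, 18, 9999, 2]
pos = mask_position(ids, 9999)  # here this coincides with the -2 position
```

The resulting index can replace the literal `-2` when selecting the logits row.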

In most of the evaluations presented below, the model is adapted to each use case, and the corresponding logits are used to encode the text.

## Data

@@ -115,7 +180,7 @@ The following multilingual benchmarks have been considered:

| Benchmark | Description | Languages | Source |
|------------------|-------------|-----------|--------------|
| XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
| CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
| Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
| Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) |