Pre-BEREL: tbd

State-of-the-art language model for Rabbinic Hebrew, released [here] - add link.

This is the first Hebrew model to be fully pretrained on pre-segmented Hebrew text. Text passed to the model is expected to be pre-segmented in the same way; the sample usage below does this with the dicta-il/BEREL-seg segmentation model. Segmenting the text before training is a first step towards integrating morphologically aware tokenization into language models.
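To make the expected input format concrete, here is a minimal sketch of how segments can be marked with a separator and glued back together. It needs no model download. The helper names `mark_segments` and `unmark_segments` are hypothetical, the separator string is an assumption copied from the sample usage, and the segmentation of the toy words is hand-written rather than real model output.

```python
# Assumption: separator string copied from the sample usage in this card.
SEP = "ףףף"


def mark_segments(segmented_words, sep=SEP):
    """Join each word's segments with `sep + ' '`, then join the words with spaces."""
    return " ".join((sep + " ").join(word) for word in segmented_words)


def unmark_segments(marked, sep=SEP):
    """Inverse of mark_segments: glue every marked segment onto the segment after it."""
    return marked.replace(sep + " ", "")


# Toy, hand-written segmentation (hypothetical, not produced by a model).
words = [["ו", "זה"], ["לשון"], ["ה", "תורה"]]

marked = mark_segments(words)
print(marked.replace(SEP, "___"))  # ו___ זה לשון ה___ תורה
print(unmark_segments(marked))     # וזה לשון התורה
```

The round trip is lossless as long as the separator string never occurs in ordinary text.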

Sample usage:

import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM

sentence = 'וזה לשון הרמב״ן בפירושו על התורה, שהדבר ידוע ומפורסם לכל בעלי העיון שאין המקרא יוצא מידי פשוטו אף על פי שהדרש אמת.'

# First, load the segmentation model to preprocess the text
seg_tokenizer = AutoTokenizer.from_pretrained('dicta-il/BEREL-seg')
seg_model = AutoModel.from_pretrained('dicta-il/BEREL-seg', trust_remote_code=True).eval()

segmented_output = seg_model.predict([sentence], seg_tokenizer)[0] # the sentence is sent as a batch of one; take the first result

# Mark the segmented tokens with a special separator, to distinguish them from regular word tokens.
segmented_sentence = ' '.join('ףףף '.join(segmented_word) for segmented_word in segmented_output[1:-1]) # ignore cls/sep
print(segmented_sentence.replace('ףףף', '___'))
# ו___ זה לשון ה___ רמב ״ ן ב___ פירושו על ה___ תורה , שהד___ בר ידוע ו___ מפורסם ל___ כל בעלי ה___ עיון ש___ אין ה___ מקרא יוצא מידי פשוטו אף על פי שהד___ רש אמת .

# We can mask any word we want. Here the easiest way is a plain string replace; we could also have masked in the original sentence, or anywhere else in the pipeline.
segmented_sentence = segmented_sentence.replace("עיון", "[MASK]")

# Load the new model
tokenizer = AutoTokenizer.from_pretrained('dicta-il/pre-BEREL')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/pre-BEREL').eval()

input_ids = tokenizer.encode(segmented_sentence, return_tensors='pt')
with torch.no_grad(): # inference only, no gradients needed
    output = model(input_ids)
# Locate the [MASK] position rather than hard-coding it (here it is the 24th token, including [CLS])
mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
top_5 = torch.topk(output.logits[0, mask_index, :], 5)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_5))) # should print קבלה / פשט / דת / חכמה / גמרא
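The string replace in the sample usage masks every occurrence of the target string, including occurrences inside longer words. As an alternative, here is a minimal sketch that masks one specific word at the segment level, before the segments are joined. The helper name `mask_word` is hypothetical. It assumes the input is a list of segment lists shaped like `segmented_output[1:-1]`, and that the last segment of a word is its stem, which matches the sample output where prefixes come first.

```python
def mask_word(segmented_words, word_index, sep="ףףף", mask_token="[MASK]"):
    """Return the marked sentence with the stem of one word replaced by the mask token.

    Assumption: the last segment of each word is its stem, and any earlier
    segments are prefixes that should stay visible to the model.
    """
    masked = [list(word) for word in segmented_words]  # copy, so the input is not mutated
    masked[word_index][-1] = mask_token
    return " ".join((sep + " ").join(word) for word in masked)


# Toy, hand-written segmentation (hypothetical, not produced by a model).
words = [["ב", "פירושו"], ["על"], ["ה", "תורה"]]

masked_sentence = mask_word(words, 2)
print(masked_sentence.replace("ףףף", "___"))  # ב___ פירושו על ה___ [MASK]
```

Because the word is chosen by position, exactly one mask token is inserted even when the same string appears elsewhere in the sentence.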

Citation

If you use pre-BEREL in your research, please cite tbd

BibTeX:

tbd

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
