Pre-BEREL: tbd

State-of-the-art language model for Rabbinic Hebrew, released [here] - add link.

This is the first Hebrew model pretrained entirely on pre-segmented Hebrew text. Input text is expected to be pre-segmented using pre-BEREL before it is passed to the model. Segmenting the text prior to training is the first step towards integrating morphologically aware tokenization into language models.
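To make the expected input format concrete, here is a minimal sketch of how per-word segment lists can be joined into a pre-segmented string. It mirrors the joining step in the sample usage below. The nested-list shape of the segmenter output is inferred from that sample, `join_segments` is a hypothetical helper name, and the toy segmentation is hand-written for illustration, not actual model output.

```python
SEPARATOR = 'ףףף'  # separator string taken from the sample usage


def join_segments(segmented_words, separator=SEPARATOR):
    """Join per-word segment lists into one space-separated string.

    Every segment except the last one in a word gets the separator
    appended, which marks it as a split-off prefix.
    """
    return ' '.join((separator + ' ').join(segments) for segments in segmented_words)


# Hypothetical, hand-written segmentation (assumed shape: list of words,
# each word a list of its segments)
toy_output = [['ו', 'זה'], ['לשון'], ['ה', 'תורה']]

joined = join_segments(toy_output)
print(joined.replace(SEPARATOR, '___'))
# ו___ זה לשון ה___ תורה
```

Words with a single segment pass through unchanged, so only split-off prefixes carry the separator.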

Sample usage:

from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM

sentence = 'וזה לשון הרמב״ן בפירושו על התורה, שהדבר ידוע ומפורסם לכל בעלי העיון שאין המקרא יוצא מידי פשוטו אף על פי שהדרש אמת.'

# First, load in the segmentation model, to preprocess the text
seg_tokenizer = AutoTokenizer.from_pretrained('dicta-il/BEREL-seg')
seg_model = AutoModel.from_pretrained('dicta-il/BEREL-seg', trust_remote_code=True).eval()

segmented_output = seg_model.predict([sentence], seg_tokenizer)[0] # sentence sent as a batch, pick the first one

# We mark the segmented tokens with a special separator, to distinguish them from regular word tokens.
segmented_sentence = ' '.join('ףףף '.join(segmented_word) for segmented_word in segmented_output[1:-1]) # ignore cls/sep
print(segmented_sentence.replace('ףףף', '___'))
# ו___ זה לשון ה___ רמב ״ ן ב___ פירושו על ה___ תורה , שהד___ בר ידוע ו___ מפורסם ל___ כל בעלי ה___ עיון ש___ אין ה___ מקרא יוצא מידי פשוטו אף על פי שהד___ רש אמת .

# we can mask out any word we want - in this case, the easiest is to just do a string replace. We could've masked in the original sentence, or anywhere in the pipeline.
segmented_sentence = segmented_sentence.replace("עיון", "[MASK]")

# Load in the new model
tokenizer = AutoTokenizer.from_pretrained('dicta-il/pre-BEREL')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/pre-BEREL').eval()

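import torch

input_ids = tokenizer.encode(segmented_sentence, return_tensors='pt')
with torch.no_grad():  # inference only, so no gradients are needed
    output = model(input_ids)
# locate the [MASK] position rather than hard-coding it (here it is the 24th token, including [CLS], i.e. index 23)
mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
top_5 = torch.topk(output.logits[0, mask_index, :], 5)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_5))) # should print קבלה / פשט / דת / חכמה / גמרא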

Citation

If you use pre-BEREL in your research, please cite tbd

BibTeX:

tbd

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Model size: 0.2B params (Safetensors, tensor type F32)
