---
datasets:
- newmindai/stsb-deepl-tr
base_model:
- BAAI/bge-m3
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- semantic-textual-similarity
- turkish
- multilingual
- single-task-training
license: apache-2.0
language:
- tr
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: BGE-M3 Turkish STS-B (AnglELoss)
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb-eval
      type: stsb-eval
    metrics:
    - type: pearson_cosine
      value: 0.8575361568991451
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8629008775002103
      name: Spearman Cosine
---
# Turkish Semantic Similarity Model - BGE-M3 (STS-B Fine-tuned)

This is a Turkish semantic textual similarity model fine-tuned from BAAI/bge-m3 on the Turkish STS-B dataset using AnglELoss (angle-optimized embeddings). The model measures the semantic similarity of Turkish sentence pairs, reaching 0.863 Spearman and 0.858 Pearson correlation (cosine similarity) on the Turkish STS-B evaluation split.
## Overview
- Base Model: BAAI/bge-m3 (1024-dimensional embeddings)
- Training Task: Semantic Textual Similarity (STS)
- Framework: Sentence Transformers (v5.1.1)
- Language: Turkish (multilingual base model)
- Dataset: Turkish STS-B (stsb-deepl-tr) - 5,749 training samples
- Loss Function: AnglELoss (Angle-optimized with pairwise angle similarity)
- Training Status: Completed (5 epochs)
- Best Checkpoint: Epoch 1.0 (Step 45) - Validation Loss: 5.682
- Final Spearman Correlation: 86.29%
- Final Pearson Correlation: 85.75%
- Context Length: 1024 tokens
- Training Time: ~8 minutes (single task)
## Performance Metrics

### Final Evaluation Results

**Best Model: Epoch 1.0 (Step 45)**
| Metric | Score |
|---|---|
| Spearman Correlation | 0.8629 (86.29%) |
| Pearson Correlation | 0.8575 (85.75%) |
| Validation Loss | 5.682 |
Best checkpoint saved at step 45 (epoch 1.0) based on validation loss
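Both correlations compare the model's cosine similarities against human-annotated scores; Spearman is rank-based, so it rewards getting the *ordering* of pairs right. As a reference for how the metric works, here is a minimal pure-Python sketch (no tie handling; `rankdata` and `spearman` are illustrative helpers, not the evaluator used during training):

```python
def rankdata(values):
    """Assign 1-based ranks; assumes distinct values (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = float(rank)
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Gold similarity scores vs. model cosine similarities (toy values).
gold = [0.1, 0.5, 0.9, 0.3]
pred = [0.2, 0.6, 0.95, 0.25]
print(spearman(gold, pred))  # 1.0 — the two rankings agree exactly
```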
### Training Progression
| Step | Epoch | Training Loss | Validation Loss | Spearman | Pearson |
|---|---|---|---|---|---|
| 10 | 0.22 | 7.2492 | - | - | - |
| 15 | 0.33 | - | 6.8784 | 0.8359 | 0.8322 |
| 30 | 0.67 | 6.0701 | 5.8729 | 0.8340 | 0.8355 |
| **45** | **1.0** | - | **5.682** | **0.8535** | **0.8430** |
| 60 | 1.33 | 5.5751 | 5.7641 | 0.8572 | 0.8524 |
| 105 | 2.33 | 5.3594 | 6.0607 | 0.8629 | 0.8551 |
| 150 | 3.33 | 5.1111 | 6.1735 | 0.8634 | 0.8586 |
| 165 | 3.67 | - | 6.2597 | 0.8636 | 0.8571 |
| 225 | 5.0 | - | 6.5089 | 0.8629 | 0.8575 |
Bold row indicates the best checkpoint selected by early stopping
## Training Infrastructure

### Hardware Configuration
- Nodes: 1
- Node Name: as07r1b16
- GPUs per Node: 4
- Total GPUs: 4
- CPUs: Not specified
- Node Hours: ~0.13 hours (8 minutes)
- GPU Type: NVIDIA (MareNostrum 5 ACC Partition)
- Training Type: Multi-GPU with DataParallel (DP)
### Training Statistics
- Total Training Steps: 225
- Training Samples: 5,749 (Turkish STS-B pairs)
- Evaluation Samples: 1,379 (Turkish STS-B pairs)
- Final Average Loss: 5.463
- Training Time: ~6.5 minutes (390 seconds)
- Samples/Second: 73.68
- Steps/Second: 0.577
## Training Configuration

### Batch Configuration
- Per-Device Batch Size: 8 (per GPU)
- Number of GPUs: 4
- Physical Batch Size: 32 (4 GPUs × 8 per-device)
- Gradient Accumulation Steps: 4
- Effective Batch Size: 128 (32 physical × 4 accumulation)
- Samples per Step: 32
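The batch arithmetic above also explains the step counts reported elsewhere in this card. A quick sketch (`batch_sizes` is an illustrative helper, not part of the training code):

```python
import math

def batch_sizes(per_device=8, num_gpus=4, grad_accum=4):
    physical = per_device * num_gpus   # samples processed per forward pass
    effective = physical * grad_accum  # samples per optimizer update
    return physical, effective

physical, effective = batch_sizes()
print(physical, effective)  # 32 128

# 5,749 training samples / 128 effective batch -> 45 optimizer steps per epoch,
# which matches epoch 1.0 landing at step 45 and 5 epochs ending at step 225.
print(math.ceil(5749 / effective))  # 45
```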
### Loss Function
- Type: AnglELoss (Angle-optimized Embeddings)
- Implementation: Cosine Similarity Loss with angle optimization
- Scale: 20.0 (temperature parameter)
- Similarity Function: pairwise_angle_sim
- Task: Regression (predicting similarity scores 0.0-1.0)
AnglELoss Advantages:
- Angle Optimization: Optimizes the angle between embeddings rather than raw cosine similarity
- Better Geometric Properties: Encourages uniform distribution on the unit hypersphere
- Improved Discrimination: Better separation between similar and dissimilar pairs
- Temperature Scaling: Scale parameter (20.0) controls the sharpness of similarity distribution
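AnglELoss builds on the CoSENT ranking formulation: for every pair of examples whose gold scores differ, the model similarity of the higher-scored pair is pushed above that of the lower-scored one. The sketch below shows that CoSENT-style objective in pure Python over precomputed similarities, with this run's scale of 20.0; the actual implementation replaces plain cosine similarity with a pairwise angle similarity in complex space, which this simplified version omits:

```python
import math

def cosent_loss(sims, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of precomputed similarities.

    loss = log(1 + sum over pairs (i, j) with labels[i] < labels[j]
               of exp(scale * (sims[i] - sims[j])))
    """
    terms = [
        math.exp(scale * (si - sj))
        for si, li in zip(sims, labels)
        for sj, lj in zip(sims, labels)
        if li < lj
    ]
    return math.log(1.0 + sum(terms))

# Well-ordered batch: higher gold score -> higher model similarity -> tiny loss.
print(cosent_loss(sims=[0.1, 0.5, 0.9], labels=[0.0, 0.5, 1.0]))
# Inverted batch: ranking violated -> loss is orders of magnitude larger.
print(cosent_loss(sims=[0.9, 0.5, 0.1], labels=[0.0, 0.5, 1.0]))
```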
### Optimization
- Optimizer: AdamW (fused)
- Base Learning Rate: 5e-05
- Learning Rate Scheduler: Linear with warmup
- Warmup Steps: 89
- Weight Decay: 0.01
- Max Gradient Norm: 1.0
- Mixed Precision: Disabled
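The linear-with-warmup schedule ramps the learning rate from 0 to the base rate over the warmup steps, then decays it linearly to 0 at the end of training. A minimal sketch with this run's settings (`linear_schedule_lr` is an illustrative helper mirroring the standard Hugging Face linear schedule, not the exact scheduler object):

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=89, total_steps=225):
    """Linear warmup to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_schedule_lr(0))    # 0.0    (start of warmup)
print(linear_schedule_lr(89))   # 5e-05  (warmup complete)
print(linear_schedule_lr(225))  # 0.0    (end of training)
```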
### Checkpointing & Evaluation
- Save Strategy: Every 45 steps
- Evaluation Strategy: Every 15 steps
- Logging Steps: 10
- Save Total Limit: 3 checkpoints
- Best Model Selection: Based on validation loss (lower is better)
- Load Best Model at End: True
## Job Details
| JobID | JobName | Account | Partition | State | Start | End | Node | GPUs | Duration |
|---|---|---|---|---|---|---|---|---|---|
| 31478447 | bgem3-base-stsb | ehpc317 | acc | COMPLETED | Nov 3 13:59:58 | Nov 3 14:07:37 | as07r1b16 | 4 | 0.13h |
## Model Architecture

```
SentenceTransformer(
  (0): Transformer({
      'max_seq_length': 1024,
      'do_lower_case': False,
      'architecture': 'XLMRobertaModel'
  })
  (1): Pooling({
      'word_embedding_dimension': 1024,
      'pooling_mode_mean_tokens': True,
      'pooling_mode_cls_token': False,
      'pooling_mode_max_tokens': False,
      'pooling_mode_mean_sqrt_len_tokens': False,
      'pooling_mode_weightedmean_tokens': False,
      'pooling_mode_lasttoken': False,
      'include_prompt': True
  })
  (2): Normalize()
)
```
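The three modules correspond to token encoding (XLM-RoBERTa), mean pooling over token embeddings, and L2-normalization, after which a dot product equals cosine similarity. A pure-Python sketch of what the pooling and normalization stages do, on toy 2-dimensional token vectors:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only non-padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]

tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last "token" is padding
mask = [1, 1, 0]
emb = l2_normalize(mean_pool(tokens, mask))
print(emb)  # [0.7071..., 0.7071...] — unit length, padding ignored
```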
## Training Dataset

### stsb-deepl-tr
- Dataset: stsb-deepl-tr
- Training Size: 5,749 sentence pairs
- Evaluation Size: 1,379 sentence pairs
- Task: Semantic Textual Similarity (regression)
- Score Range: 0.0 (completely dissimilar) to 5.0 (semantically equivalent)
- Normalized Range: 0.0 to 1.0 (divided by 5.0 during preprocessing)
- Average Sentence Length: ~10-15 tokens per sentence
### Data Format
Each training example consists of:
- Sentence 1: Turkish sentence (6-30 tokens)
- Sentence 2: Turkish sentence (6-26 tokens)
- Similarity Score: Float value 0.0-1.0 (normalized from 0-5 scale)
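The preprocessing step described above reduces to dividing each annotation by the maximum of the 0-5 scale (`normalize_score` is an illustrative helper, not the dataset's actual preprocessing code):

```python
def normalize_score(raw, max_score=5.0):
    """Map the 0-5 STS-B annotation scale to the 0-1 range used in training."""
    if not 0.0 <= raw <= max_score:
        raise ValueError(f"score out of range: {raw}")
    return raw / max_score

print(normalize_score(5.0))  # 1.0  (semantically equivalent)
print(normalize_score(3.8))  # 0.76
print(normalize_score(0.0))  # 0.0  (completely dissimilar)
```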
### Sample Data
| Sentence 1 | Sentence 2 | Score |
|---|---|---|
| Bir uçak kalkıyor. | Bir uçak havalanıyor. | 0.2 |
| Bir adam büyük bir flüt çalıyor. | Bir adam flüt çalıyor. | 0.152 |
| Bir adam pizzanın üzerine rendelenmiş peynir serpiyor. | Bir adam pişmemiş bir pizzanın üzerine rendelenmiş peynir serpiyor. | 0.152 |
## Capabilities
This model is specifically optimized for:
- Semantic Similarity Scoring: Predicting similarity scores between Turkish sentence pairs
- Paraphrase Detection: Identifying paraphrases and semantically equivalent sentences
- Duplicate Detection: Finding duplicate or near-duplicate Turkish content
- Question-Answer Matching: Matching questions with semantically similar answers
- Document Similarity: Comparing semantic similarity of Turkish documents
- Sentence Clustering: Grouping semantically similar Turkish sentences
- Textual Entailment: Understanding semantic relationships between sentences
## Usage

### Installation

```bash
pip install -U sentence-transformers
```
### Semantic Similarity Scoring
```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentence pairs
sentence_pairs = [
    ["Bir uçak kalkıyor.", "Bir uçak havalanıyor."],
    ["Bir adam flüt çalıyor.", "Bir kadın zencefil dilimliyor."],
    ["Bir çocuk sahilde oynuyor.", "Küçük bir çocuk kumda oynuyor."],
]

# Compute similarity scores
for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)
    similarity = util.cos_sim(emb1, emb2).item()
    print(f"Similarity: {similarity:.4f}")
    print(f"  - '{sent1}'")
    print(f"  - '{sent2}'")
    print()
```
### Batch Encoding
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Turkish sentences
sentences = [
    "Bir adam çiftliğinde çalışıyor.",
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "İki Hintli kadın sahilde duruyor.",
]

# Encode sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: (4, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity matrix:")
print(similarities)
```
### Finding the Most Similar Sentences
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("newmindai/bge-m3-stsb-turkish", trust_remote_code=True)

# Query and corpus
query = "Bir adam çiftlikte çalışıyor."
corpus = [
    "Yaşlı bir adam çiftliğinde çalışırken bir inek onu tekmeler.",
    "Bir kedi yavrusu yürüyor.",
    "Bir kadın kumu kazıyor.",
    "Kayalık bir deniz kıyısında bir adam ve köpek.",
    "İki Hintli kadın sahilde iki Hintli kızla birlikte duruyor.",
]

# Encode
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Find the most similar sentences
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
print(f"Query: {query}\n")
print("Top 3 most similar sentences:")
for hit in hits:
    print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")
```
## Training Details

### Complete Hyperparameters
| Parameter | Value |
|---|---|
| Per-device train batch size | 8 |
| Number of GPUs | 4 |
| Physical batch size | 32 |
| Gradient accumulation steps | 4 |
| Effective batch size | 128 |
| Learning rate | 5e-05 |
| Weight decay | 0.01 |
| Warmup steps | 89 |
| LR scheduler | linear |
| Max gradient norm | 1.0 |
| Num train epochs | 5 |
| Save steps | 45 |
| Eval steps | 15 |
| Logging steps | 10 |
| AnglELoss scale | 20.0 |
| Batch sampler | batch_sampler |
| Load best model at end | True |
| Optimizer | adamw_torch_fused |
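As a rough guide, the table above maps onto the Sentence Transformers v5.x training API as sketched below. This is an illustrative reconstruction of the configuration, not the exact training script; running it requires downloading the base model, so treat it as a config fragment:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import AnglELoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("BAAI/bge-m3")
loss = AnglELoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-stsb-turkish",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=89,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    num_train_epochs=5,
    eval_strategy="steps",
    eval_steps=15,
    save_steps=45,
    logging_steps=10,
    save_total_limit=3,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
)
```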
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 5.1.1
- PyTorch: 2.8.0+cu128
- Transformers: 4.57.0
- CUDA: 12.8
- Accelerate: 1.10.1
- Datasets: 4.2.0
- Tokenizers: 0.22.1
## Use Cases
- Chatbot Response Matching: Find the most semantically similar pre-defined response for user queries
- FAQ Search: Match user questions to the most relevant FAQ entries
- Content Recommendation: Recommend articles or documents with similar semantic content
- Plagiarism Detection: Identify semantically similar text for academic integrity checks
- Customer Support: Match support tickets to similar previously resolved issues
- Document Clustering: Group documents by semantic similarity for organization
- Paraphrase Mining: Automatically detect paraphrases in large Turkish text corpora
- Semantic Search: Build semantic search engines for Turkish content
- Question Answering: Match questions to semantically relevant answer candidates
- Text Summarization: Identify redundant sentences for summary generation
## Citation

### AnglELoss

```bibtex
@inproceedings{li-li-2024-aoe,
    title = "{A}o{E}: Angle-optimized Embeddings for Semantic Textual Similarity",
    author = "Li, Xianming and Li, Jing",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.101/",
    doi = "10.18653/v1/2024.acl-long.101"
}
```
### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
### Base Model (BGE-M3)

```bibtex
@misc{bge-m3,
    title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
    author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
    year={2024},
    eprint={2402.03216},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
### Dataset

```bibtex
@misc{stsb-deepl-tr,
    title={Turkish STS-B Dataset (DeepL Translation)},
    author={NewMind AI},
    year={2024},
    url={https://huggingface.co/datasets/newmindai/stsb-deepl-tr}
}
```
## License
This model is licensed under the Apache 2.0 License.
## Acknowledgments
- Base Model: BAAI/bge-m3
- Training Infrastructure: MareNostrum 5 Supercomputer (Barcelona Supercomputing Center)
- Framework: Sentence Transformers by UKP Lab
- Dataset: newmindai/stsb-deepl-tr
- Loss Function: AnglELoss (Angle-optimized Embeddings)
- Training Approach: Single-task fine-tuning on Turkish STS-B