Tagalog BERT with Dependency-Aware Contrastive Learning
This is a BERT model for Tagalog with token embeddings fine-tuned using contrastive learning on dependency parse tree structures.
Model Description
- Base Model: paulbontempo/bert-tagalog-mlm-stage1 (itself fine-tuned from base BERT on the FakeNewsFilipino dataset)
- Language: Tagalog (Filipino)
- Training Approach: Two-stage fine-tuning for low-resource language processing
  - Stage 1: Masked Language Modeling (MLM) on a Tagalog corpus (FakeNewsFilipino)
  - Stage 2: Contrastive learning with InfoNCE loss on dependency parse triples from the UD-Ugnayan treebank
Our Contributions
We use a novel approach to encode syntactic structure directly into token embeddings:
- Dependency triples (head, relation, dependent) were extracted from 94 UD-annotated Tagalog sentences (see the extraction sketch below)
- Contrastive learning with InfoNCE loss trained tokens to cluster by their syntactic roles
- Tokens appearing as heads of the same dependency relation become similar in embedding space
- The goal is to improve downstream NLP task performance for low-resource Tagalog
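As a rough illustration of the triple-extraction step, the sketch below reads a UD treebank in CoNLL-U format with the third-party `conllu` package and collects (head, relation, dependent) triples. The file name is a placeholder, and this is not the exact extraction code used for training.

```python
# Hypothetical sketch: extract (head, relation, dependent) triples from a
# UD treebank in CoNLL-U format. The file path is a placeholder.
import conllu

with open("ud_ugnayan.conllu", encoding="utf-8") as f:
    sentences = conllu.parse(f.read())

triples = []
for sentence in sentences:
    # Map token ids to surface forms (skip multiword-token ranges like "1-2")
    id_to_form = {tok["id"]: tok["form"] for tok in sentence if isinstance(tok["id"], int)}
    for token in sentence:
        head_id = token["head"]
        if isinstance(token["id"], int) and isinstance(head_id, int) and head_id > 0:
            triples.append((id_to_form[head_id], token["deprel"], token["form"]))

print(len(triples), triples[:3])
```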
Architecture
Standard BERT architecture with fine-tuned token embeddings:
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Vocabulary size: ~50,000 tokens (WordPiece)
The contrastive learning stage used the following setup (a loss sketch follows this list):
- Loss: InfoNCE (temperature=0.07)
- Projection dimension: 256
- Training epochs: 50
- Final loss: 0.076
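The snippet below is a minimal, hedged sketch of an InfoNCE objective with a 256-dimensional projection head and temperature 0.07, matching the hyperparameters above; the module and function names are illustrative, not the exact training code.

```python
# Illustrative InfoNCE loss with a projection head (PyTorch sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project 768-d BERT token embeddings into a 256-d contrastive space."""
    def __init__(self, hidden_size=768, proj_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (B, D); negatives: (B, K, D); all L2-normalized."""
    pos_logits = (anchor * positive).sum(dim=-1, keepdim=True)         # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)         # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0
```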
Usage
This is a standard Hugging Face BERT model and can be used like any other BERT model:
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")

# Use for embeddings
text = "Magandang umaga sa lahat"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get token embeddings
token_embeddings = outputs.last_hidden_state
```
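If a single sentence-level vector is needed, a common model-agnostic pattern is masked mean pooling over the token embeddings; this is an optional convenience, not something the model requires:

```python
# Optional: mean-pool token embeddings into one sentence vector,
# ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()               # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```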
For Downstream Tasks
Fine-tune on your Tagalog NLP task:
```python
from transformers import AutoModelForSequenceClassification

# For classification tasks
model = AutoModelForSequenceClassification.from_pretrained(
    "paulbontempo/bert-tagalog-dependency-cl",
    num_labels=3,
)

# Train on your task
# ...
```
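As one possible way to fill in the training step, here is a hedged sketch using the Hugging Face Trainer; the dataset name, column names, and hyperparameters are placeholders, not recommendations from the model authors.

```python
# Hypothetical fine-tuning loop with the Hugging Face Trainer.
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
dataset = load_dataset("your_tagalog_dataset")  # placeholder: needs "text" and "label" columns

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,  # the classification model loaded above
    args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=dataset["train"],
)
trainer.train()
```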
Training Details
Stage 2: Contrastive Learning
- Dataset: 94 Tagalog sentences with dependency annotations
- Positive samples: ~600 true dependency triples
- Negative samples: ~10,000 artificially generated incorrect triples
- Batch strategy: Relation-aware batching for efficient positive pair sampling
- Optimizer: AdamW (lr=3e-5, weight_decay=0.01); see the setup sketch after this list
- Warmup steps: 300
- Training time: ~30 minutes on an H100 GPU
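For concreteness, the snippet below shows one way to configure the reported optimizer and warmup with PyTorch and transformers; the decay schedule and total step count are assumptions, since only the warmup steps are documented above.

```python
# Assumed setup: AdamW with lr=3e-5, weight_decay=0.01, 300 warmup steps.
# The linear-decay schedule and the 5,000 total steps are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=300, num_training_steps=5000
)
```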
Contrastive Learning Strategy
- Positive pairs: Triples with the same dependency relation from true parses
- Negative pairs: Artificially corrupted (grammatically incorrect) triples, or triples with a different relation (see the pair-construction sketch below)
- Goal: Cluster tokens by syntactic role to improve representation quality
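The sketch below illustrates one way such positive and negative pairs could be built from extracted triples; the sampling counts and the corruption strategy (replacing the head with a random vocabulary word) are assumptions, not the documented training procedure.

```python
# Hypothetical positive/negative pair construction from dependency triples.
import random

def make_pairs(triples, vocab, num_negatives=5):
    by_relation = {}
    for head, rel, dep in triples:
        by_relation.setdefault(rel, []).append((head, rel, dep))

    pairs = []  # (anchor_triple, other_triple, label), label 1 = positive
    for rel, group in by_relation.items():
        for triple in group:
            # Positive: another true triple with the same relation
            others = [t for t in group if t is not triple]
            if others:
                pairs.append((triple, random.choice(others), 1))
            # Negatives: corrupt the head, keeping relation and dependent
            head, _, dep = triple
            for _ in range(num_negatives):
                pairs.append((triple, (random.choice(vocab), rel, dep), 0))
    return pairs
```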
Evaluation
This model is designed as a pre-trained base for downstream Tagalog NLP tasks. The quality of embeddings can be evaluated through:
- Dependency parsing accuracy
- Named entity recognition
- Sentiment analysis
- Other token-level classification tasks
Limitations
- Trained on only 94 sentences with dependency annotations (very small dataset)
- May not generalize to all Tagalog language varieties
- Best used as a starting point for further task-specific fine-tuning
Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert-tagalog-dependency-cl,
  author       = {Paul Bontempo},
  title        = {Tagalog BERT with Dependency-Aware Contrastive Learning},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/paulbontempo/bert-tagalog-dependency-cl}}
}
```
Acknowledgments
- Built on top of Stage 1 MLM training: paulbontempo/bert-tagalog-mlm-stage1
- Developed at University of Colorado Boulder
- Part of neural-symbolic (NeSy) research for low-resource language processing