Tagalog BERT with Dependency-Aware Contrastive Learning
This is a BERT model for Tagalog with token embeddings fine-tuned using contrastive learning on dependency parse tree structures.
Model Description
- Base Model: paulbontempo/bert-tagalog-mlm-stage1 (itself fine-tuned from base BERT on the FakeNewsFilipino dataset)
- Language: Tagalog (Filipino)
- Training Approach: Two-stage fine-tuning for low-resource language processing
  - Stage 1: Masked Language Modeling (MLM) on a Tagalog corpus (FakeNewsFilipino)
  - Stage 2: Contrastive learning with InfoNCE loss on dependency parse triples from the UD-Ugnayan treebank
Our Contributions
We use a novel approach to encode syntactic structure directly into token embeddings:
- Dependency triples (head, relation, dependent) were extracted from 94 UD-annotated Tagalog sentences (see the extraction sketch below)
- Contrastive learning with InfoNCE loss trained tokens to cluster by their syntactic roles
- Tokens appearing as heads of the same dependency relation become similar in embedding space
- The goal is to improve downstream NLP task performance for low-resource Tagalog
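As a rough illustration of the triple-extraction step, the sketch below reads a UD treebank in CoNLL-U format with the third-party `conllu` package and collects (head, relation, dependent) triples. The file name is a placeholder, and this is not the exact extraction code used for training.

```python
# Hypothetical sketch: extract (head, relation, dependent) triples from a
# UD treebank in CoNLL-U format. The file path is a placeholder.
import conllu

with open("ud_ugnayan.conllu", encoding="utf-8") as f:
    sentences = conllu.parse(f.read())

triples = []
for sentence in sentences:
    # Map token ids to surface forms (skip multiword-token ranges like "1-2")
    id_to_form = {tok["id"]: tok["form"] for tok in sentence if isinstance(tok["id"], int)}
    for token in sentence:
        head_id = token["head"]
        if isinstance(token["id"], int) and isinstance(head_id, int) and head_id > 0:
            triples.append((id_to_form[head_id], token["deprel"], token["form"]))

print(len(triples), triples[:3])
```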
Architecture
Standard BERT architecture with fine-tuned token embeddings:
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Vocabulary size: ~50,000 tokens (WordPiece)
The contrastive learning stage used the following setup (a loss sketch follows this list):
- Loss: InfoNCE (temperature=0.07)
- Projection dimension: 256
- Training epochs: 50
- Final loss: 0.076
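The snippet below is a minimal, hedged sketch of an InfoNCE objective with a 256-dimensional projection head and temperature 0.07, matching the hyperparameters above; the module and function names are illustrative, not the exact training code.

```python
# Illustrative InfoNCE loss with a projection head (PyTorch sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project 768-d BERT token embeddings into a 256-d contrastive space."""
    def __init__(self, hidden_size=768, proj_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (B, D); negatives: (B, K, D); all L2-normalized."""
    pos_logits = (anchor * positive).sum(dim=-1, keepdim=True)         # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)         # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0
```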
Usage
This is a standard Hugging Face BERT model and can be used like any other BERT model:
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")

# Use for embeddings
text = "Magandang umaga sa lahat"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get token embeddings
token_embeddings = outputs.last_hidden_state
```
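If a single sentence-level vector is needed, a common model-agnostic pattern is masked mean pooling over the token embeddings; this is an optional convenience, not something the model requires:

```python
# Optional: mean-pool token embeddings into one sentence vector,
# ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()               # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```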
For Downstream Tasks
Fine-tune on your Tagalog NLP task:
```python
from transformers import AutoModelForSequenceClassification

# For classification tasks
model = AutoModelForSequenceClassification.from_pretrained(
    "paulbontempo/bert-tagalog-dependency-cl",
    num_labels=3,
)

# Train on your task
# ...
```
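As one possible way to fill in the training step, here is a hedged sketch using the Hugging Face Trainer; the dataset name, column names, and hyperparameters are placeholders, not recommendations from the model authors.

```python
# Hypothetical fine-tuning loop with the Hugging Face Trainer.
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
dataset = load_dataset("your_tagalog_dataset")  # placeholder: needs "text" and "label" columns

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,  # the classification model loaded above
    args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=dataset["train"],
)
trainer.train()
```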
Training Details
Stage 2: Contrastive Learning
- Dataset: 94 Tagalog sentences with dependency annotations
- Positive samples: ~600 true dependency triples
- Negative samples: ~10,000 artificially generated incorrect triples
- Batch strategy: Relation-aware batching for efficient positive pair sampling
- Optimizer: AdamW (lr=3e-5, weight_decay=0.01); see the setup sketch after this list
- Warmup steps: 300
- Training time: ~30 minutes on an H100 GPU
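For concreteness, the snippet below shows one way to configure the reported optimizer and warmup with PyTorch and transformers; the decay schedule and total step count are assumptions, since only the warmup steps are documented above.

```python
# Assumed setup: AdamW with lr=3e-5, weight_decay=0.01, 300 warmup steps.
# The linear-decay schedule and the 5,000 total steps are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=300, num_training_steps=5000
)
```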
Contrastive Learning Strategy
- Positive pairs: Triples with the same dependency relation from true parses
- Negative pairs: Artificially corrupted (grammatically incorrect) triples, or triples with a different relation (see the pair-construction sketch below)
- Goal: Cluster tokens by syntactic role to improve representation quality
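The sketch below illustrates one way such positive and negative pairs could be built from extracted triples; the sampling counts and the corruption strategy (replacing the head with a random vocabulary word) are assumptions, not the documented training procedure.

```python
# Hypothetical positive/negative pair construction from dependency triples.
import random

def make_pairs(triples, vocab, num_negatives=5):
    by_relation = {}
    for head, rel, dep in triples:
        by_relation.setdefault(rel, []).append((head, rel, dep))

    pairs = []  # (anchor_triple, other_triple, label), label 1 = positive
    for rel, group in by_relation.items():
        for triple in group:
            # Positive: another true triple with the same relation
            others = [t for t in group if t is not triple]
            if others:
                pairs.append((triple, random.choice(others), 1))
            # Negatives: corrupt the head, keeping relation and dependent
            head, _, dep = triple
            for _ in range(num_negatives):
                pairs.append((triple, (random.choice(vocab), rel, dep), 0))
    return pairs
```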
Evaluation
This model is designed as a pre-trained base for downstream Tagalog NLP tasks. The quality of embeddings can be evaluated through:
- Dependency parsing accuracy
- Named entity recognition
- Sentiment analysis
- Other token-level classification tasks
Limitations
- Trained on only 94 sentences with dependency annotations (very small dataset)
- May not generalize to all Tagalog language varieties
- Best used as a starting point for further task-specific fine-tuning
Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert-tagalog-dependency-cl,
  author       = {Paul Bontempo},
  title        = {Tagalog BERT with Dependency-Aware Contrastive Learning},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/paulbontempo/bert-tagalog-dependency-cl}}
}
```
Acknowledgments
- Built on top of Stage 1 MLM training: paulbontempo/bert-tagalog-mlm-stage1
- Developed at University of Colorado Boulder
- Part of neural-symbolic (NeSy) research for low-resource language processing