Tagalog BERT with Dependency-Aware Contrastive Learning

This is a BERT model for Tagalog with token embeddings fine-tuned using contrastive learning on dependency parse tree structures.

Model Description

  • Base Model: paulbontempo/bert-tagalog-mlm-stage1 (the Stage 1 model was itself fine-tuned from base BERT on the FakeNewsFilipino dataset)
  • Language: Tagalog (Filipino)
  • Training Approach: Two-stage fine-tuning for low-resource language processing
    1. Stage 1: Masked Language Modeling (MLM) on Tagalog corpus (FakeNewsFilipino)
    2. Stage 2: Contrastive learning with InfoNCE loss on a dependency-triple corpus (UD-Ugnayan)

Our Contributions

We use a novel approach to encode syntactic structure directly into token embeddings:

  • Dependency triples (head, relation, dependent) were extracted from 94 UD-annotated Tagalog sentences (extraction sketched after this list)
  • Contrastive learning with InfoNCE loss trained tokens to cluster by their syntactic roles
  • Tokens appearing as heads of the same dependency relation become similar in embedding space
  • The aim is to improve downstream NLP performance for low-resource Tagalog
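
As a rough illustration of the extraction step, the sketch below reads (head, relation, dependent) triples out of CoNLL-U annotations such as those in UD-Ugnayan. The file path and helper name are hypothetical; the actual preprocessing script is not part of this repository.

# Minimal sketch: pull (head, relation, dependent) triples from a CoNLL-U file.
# The path "ud-ugnayan.conllu" and this helper are illustrative only.
def extract_triples(conllu_path):
    triples = []
    sentence, rows = {}, []  # token id -> surface form, and (dep id, head id, relation) rows
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("#"):
                continue
            if not line:  # sentence boundary (CoNLL-U ends each sentence with a blank line)
                for dep_id, head_id, rel in rows:
                    if head_id != "0":  # skip the root relation
                        triples.append((sentence[head_id], rel, sentence[dep_id]))
                sentence, rows = {}, []
                continue
            cols = line.split("\t")  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword-token ranges and empty nodes
            sentence[cols[0]] = cols[1]
            rows.append((cols[0], cols[6], cols[7]))
    return triples

triples = extract_triples("ud-ugnayan.conllu")
print(len(triples), triples[:3])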

Architecture

Standard BERT architecture with fine-tuned token embeddings:

  • Hidden size: 768
  • Attention heads: 12
  • Layers: 12
  • Vocabulary size: ~50,000 tokens (WordPiece)
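
These hyperparameters can also be read directly from the released config:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 12
print(config.vocab_size)           # ~50,000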

The contrastive learning stage used:

  • Loss: InfoNCE (temperature=0.07); an illustrative implementation is sketched after this list
  • Projection dimension: 256
  • Training epochs: 50
  • Final loss: 0.076
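
The training code is not bundled with the checkpoint, so the snippet below is only an illustrative InfoNCE implementation under the listed settings (768-to-256 projection, temperature 0.07). The interface (one positive and a set of negatives per anchor) is an assumption about the setup, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEHead(nn.Module):
    """Illustrative projection head + InfoNCE loss over token embeddings."""
    def __init__(self, hidden_size=768, proj_dim=256, temperature=0.07):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_dim)
        self.temperature = temperature

    def forward(self, anchor, positive, negatives):
        # anchor, positive: (batch, hidden); negatives: (batch, n_neg, hidden)
        a = F.normalize(self.proj(anchor), dim=-1)
        p = F.normalize(self.proj(positive), dim=-1)
        n = F.normalize(self.proj(negatives), dim=-1)
        pos_sim = (a * p).sum(dim=-1, keepdim=True)  # (batch, 1)
        neg_sim = torch.einsum("bd,bkd->bk", a, n)   # (batch, n_neg)
        logits = torch.cat([pos_sim, neg_sim], dim=1) / self.temperature
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)       # the positive sits at index 0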

Usage

This is a standard Hugging Face BERT model and can be used like any other BERT checkpoint:

from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")
tokenizer = AutoTokenizer.from_pretrained("paulbontempo/bert-tagalog-dependency-cl")

# Use for embeddings
text = "Magandang umaga sa lahat"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get token embeddings
token_embeddings = outputs.last_hidden_state
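
To collapse token embeddings into a single sentence vector, one common option (not specific to this model) is attention-mask-aware mean pooling, reusing the inputs and outputs variables from the snippet above:

import torch

# Mean-pool over real tokens only, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()          # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)         # (1, hidden)
sentence_embedding = summed / mask.sum(dim=1).clamp(min=1e-9)  # (1, 768)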

For Downstream Tasks

Fine-tune on your Tagalog NLP task:

from transformers import AutoModelForSequenceClassification

# For classification tasks
model = AutoModelForSequenceClassification.from_pretrained(
    "paulbontempo/bert-tagalog-dependency-cl",
    num_labels=3
)

# Train on your task
# ...
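
As a quick sanity check before wiring up a full training loop, a single forward pass with labels returns the classification loss. The example text and label id are illustrative, and the tokenizer is the one loaded in the Usage section.

import torch

inputs = tokenizer("Magandang umaga sa lahat", return_tensors="pt")
labels = torch.tensor([1])                 # illustrative label id for a 3-class task
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)  # scalar loss and (1, 3) logits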

Training Details

Stage 2: Contrastive Learning

  • Dataset: 94 Tagalog sentences with dependency annotations
  • Positive samples: ~600 true dependency triples
  • Negative samples: ~10,000 artificially generated incorrect triples
  • Batch strategy: Relation-aware batching for efficient positive pair sampling
  • Optimizer: AdamW (lr=3e-5, weight_decay=0.01); a setup sketch follows this list
  • Warmup steps: 300
  • Training time: ~30 minutes on H100 GPU
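
The Stage 2 training script is not released with the model; the sketch below shows one way to reproduce the listed optimizer settings. The linear warmup schedule and the number of training steps are assumptions, not documented values.

import torch
from transformers import get_linear_schedule_with_warmup

# "model" here stands for the BERT encoder plus the contrastive projection head
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
num_training_steps = 5000  # placeholder; the real value depends on batch size and the 50 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=300,
    num_training_steps=num_training_steps,
)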

Contrastive Learning Strategy

  • Positive pairs: Triples with the same dependency relation from true parses
  • Negative pairs: artificially created, grammatically incorrect triples or triples with a different relation (a sampling sketch follows this list)
  • Goal: Cluster tokens by syntactic role to improve representation quality
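
The exact corruption scheme used to build the incorrect negatives is not released; the sketch below is one plausible version that swaps the head, the relation, or the dependent of a gold triple.

import random

def corrupt_triples(gold_triples, vocab, relations, n_per_gold=2, seed=0):
    """Illustrative negative sampling: perturb one element of each gold triple."""
    rng = random.Random(seed)
    gold = set(gold_triples)
    negatives = []
    for head, rel, dep in gold_triples:
        for _ in range(n_per_gold):
            slot = rng.choice(["head", "rel", "dep"])
            if slot == "head":
                cand = (rng.choice(vocab), rel, dep)
            elif slot == "rel":
                cand = (head, rng.choice(relations), dep)
            else:
                cand = (head, rel, rng.choice(vocab))
            if cand not in gold:  # keep only genuinely incorrect triples
                negatives.append(cand)
    return negatives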

Evaluation

This model is designed as a pre-trained base for downstream Tagalog NLP tasks. The quality of embeddings can be evaluated through:

  • Dependency parsing accuracy
  • Named entity recognition
  • Sentiment analysis
  • Other token-level classification tasks
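
For token-level tasks such as NER, the same checkpoint can be loaded with a token classification head (the label count below is illustrative):

from transformers import AutoModelForTokenClassification

ner_model = AutoModelForTokenClassification.from_pretrained(
    "paulbontempo/bert-tagalog-dependency-cl",
    num_labels=9,  # illustrative: BIO tags for four entity types plus O
)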

Limitations

  • Trained on only 94 sentences with dependency annotations (very small dataset)
  • May not generalize to all Tagalog language varieties
  • Best used as a starting point for further task-specific fine-tuning

Citation

If you use this model in your research, please cite:

@misc{bert-tagalog-dependency-cl,
  author = {Paul Bontempo},
  title = {Tagalog BERT with Dependency-Aware Contrastive Learning},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/paulbontempo/bert-tagalog-dependency-cl}}
}

Acknowledgments

  • Built on top of Stage 1 MLM training: paulbontempo/bert-tagalog-mlm-stage1
  • Developed at University of Colorado Boulder
  • Part of neural-symbolic (NeSy) research for low-resource language processing