license: mit
tags:
  - toxicity-detection
  - rntn
  - mistral
pipeline_tag: text-classification
language:
  - en
library_name: pytorch

RNTN - MISTRAL - Classification (5 classes)

A Recursive Neural Tensor Network (RNTN) that classifies English text into 5 toxicity levels, trained on the MISTRAL dataset.

Property	Value
Model	RNTN
Task	Classification (5 classes)
Dataset	mistral
Framework	PyTorch / PyTorch Lightning

Class: `RNTNLightning`

from src.models.rntn import RNTNLightning

# Constructor signature with defaults (for reference; not a runnable call)
RNTNLightning(
    vocab: dict,                      # Token-to-index mapping
    hidden_dim: int = 300,            # Hidden dimension
    use_tensor: bool = True,          # Use tensor composition
    use_linear: bool = True,          # Use linear composition
    dropout: float = 0.2,
    num_classes: int = 1,             # 1=regression, 2+=classification
    loss_type: str = 'mse',           # 'mse', 'bce', 'cross_entropy'
    lr: float = 5e-4,
    gradient_clip_norm: float = 1.0,
    use_residual: bool = True,
    residual_weight: float = 0.2,
    activation: str = 'tanh'          # 'tanh', 'relu', 'gelu'
)
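For orientation, the `use_tensor` and `use_linear` flags read like the two terms of the textbook RNTN composition step, p = f(cᵀVc + Wc + b), where c is the concatenation of the two child vectors. The sketch below shows that formula in plain Python. It is an assumption that `RNTNLightning` follows it; the function name `compose` and its arguments are hypothetical, and dropout and the residual connection are left out.

```python
import math

def compose(left, right, V, W, b, use_tensor=True, use_linear=True):
    """Combine two child vectors into a parent vector (textbook RNTN step).

    left, right: child vectors of length d
    V: d slices, each a (2d x 2d) matrix for the bilinear tensor term
    W: (d x 2d) matrix for the linear term
    b: bias vector of length d
    Illustrative sketch only, not the ToxicThesis implementation.
    """
    c = list(left) + list(right)  # concatenated children, length 2d
    n = len(c)
    parent = []
    for k in range(len(left)):
        z = b[k]
        if use_tensor:
            # Bilinear term: c^T V[k] c
            z += sum(c[i] * V[k][i][j] * c[j] for i in range(n) for j in range(n))
        if use_linear:
            # Linear term: W[k] . c
            z += sum(W[k][i] * c[i] for i in range(n))
        parent.append(math.tanh(z))  # 'tanh' is the listed default activation
    return parent
```

With both flags off, only the bias passes through the activation, which is why at least one of the two terms is normally enabled.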

Methods

Method	Description
`forward(batch)`	Process batch of constituency trees. Returns dict with 'predictions'.
`predict_score(text)`	Predict toxicity score for raw text string. Handles parsing internally.
`predict_batch(texts)`	Predict scores for a list of texts.
`load_from_checkpoint(path)`	Load model from checkpoint file.

Required Files

vocab_stanza_hybrid.pkl: Vocabulary mapping (token -> index)
label_mappings.pkl: Constituency label mappings
cc.en.300.bin: FastText embeddings (300-dim)
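
Before loading the model, it can help to confirm that these files are in place. A minimal sketch, assuming all three sit in one directory (the card does not say where each file is expected); the helper name `missing_required_files` is hypothetical.

```python
from pathlib import Path

# File names are taken from the list above.
REQUIRED_FILES = ("vocab_stanza_hybrid.pkl", "label_mappings.pkl", "cc.en.300.bin")

def missing_required_files(directory):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

An empty list means every file was found; otherwise the list names what still has to be downloaded or copied.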

Usage with ToxicThesis (Recommended)

# 1. Clone ToxicThesis repository
# git clone https://github.com/simo-corbo/ToxicThesis
# cd ToxicThesis && pip install -r requirements.txt

from huggingface_hub import snapshot_download
import torch
import pickle

# 2. Download model files
model_dir = snapshot_download(
    repo_id="simocorbo/toxicthesis-mistral-rntn-classification-5",
    allow_patterns=["checkpoints/*", "*.pkl"]
)

# 3. Load vocabulary
with open(f"{model_dir}/vocab_stanza_hybrid.pkl", 'rb') as f:
    vocab = pickle.load(f)

# 4. Import and load model from ToxicThesis
from src.models.rntn import RNTNLightning

model = RNTNLightning.load_from_checkpoint(
    f"{model_dir}/checkpoints/best.pt",
    vocab=vocab,
    offline_init=False  # Set True to skip loading FastText/Stanza at init
)
model.eval()

# 5. Predict score for a single text (handles parsing internally)
with torch.no_grad():
    score = model.predict_score("Your text here")
    print(f"Toxicity score: {score}")

# 6. Predict for multiple texts
texts = ["Hello friend", "You are terrible", "Have a nice day"]
with torch.no_grad():
    for text in texts:
        score = model.predict_score(text)
        # No float format spec: a classification output may not be a single float
        print(f"{text}: {score}")

Note on Standalone Usage

RNTN requires constituency parsing via Stanza and tree processing. For standalone usage without ToxicThesis, you would need to implement the full tree preprocessing pipeline. We recommend using ToxicThesis directly. See src/models/rntn.py for the complete implementation.
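
To give a sense of what the parsing step involves, here is a minimal sketch that obtains constituency trees from Stanza and walks one to collect its tokens. It assumes the English Stanza models are already downloaded. The helper names `parse_constituency` and `tree_tokens` are hypothetical, and this is not the ToxicThesis preprocessing pipeline, which also maps tokens and labels to indices.

```python
def parse_constituency(text):
    """Parse text with Stanza and return one constituency tree per sentence."""
    # Imported lazily so the traversal helper below works without Stanza installed.
    import stanza
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,constituency")
    return [sentence.constituency for sentence in nlp(text).sentences]

def tree_tokens(tree):
    """Collect leaf tokens left to right from any node exposing .label and .children."""
    if not tree.children:
        return [tree.label]  # a leaf's label is the word itself
    tokens = []
    for child in tree.children:
        tokens.extend(tree_tokens(child))
    return tokens
```

Building the pipeline is slow, so in practice it would be created once and reused across texts rather than rebuilt on every call as in this sketch.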

Score Interpretation

Output	Type / Range	Meaning
`probabilities`	List[float]	Probability distribution over the 5 classes (sums to 1).
`class`	0 to 4	Predicted class (argmax of probabilities).

Classes: 5 toxicity levels (0 to 4); a higher class index means more toxic.
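
The relationship between the two outputs can be sketched as follows. The helper `interpret` and the `confidence` key are hypothetical; how the model actually packages its outputs is not specified here.

```python
def interpret(probabilities):
    """Return the predicted class and its probability from a 5-class distribution."""
    if len(probabilities) != 5:
        raise ValueError("expected 5 class probabilities")
    # argmax over class indices 0..4
    predicted = max(range(5), key=lambda i: probabilities[i])
    return {"class": predicted, "confidence": probabilities[predicted]}
```

For example, `interpret([0.05, 0.10, 0.15, 0.50, 0.20])` picks class 3, the second most toxic level.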

Files

File	Description
`checkpoints/best.pt`	Model checkpoint (best validation loss)
`hparams.yaml`	Hyperparameters used for training
`train.csv`	Training metrics per epoch
`val.csv`	Validation metrics per epoch
`vocab_stanza_hybrid.pkl`	Vocabulary (for tree-based models)
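
The per-epoch metric files can be inspected with the standard library. A minimal sketch for finding the epoch with the lowest validation loss; the column names `epoch` and `val_loss` are assumptions, so check the header of `val.csv` first. The helper name `best_epoch` is hypothetical.

```python
import csv

def best_epoch(csv_file, metric="val_loss"):
    """Return the row (as a dict) with the lowest value of `metric`.

    `metric` is a hypothetical column name; adjust it to the real header.
    Rows where the metric is missing or empty are skipped.
    """
    rows = [r for r in csv.DictReader(csv_file) if r.get(metric) not in (None, "")]
    if not rows:
        raise ValueError(f"no values found for column {metric!r}")
    return min(rows, key=lambda r: float(r[metric]))
```

Pass an open file handle, for example one obtained with `open(path, newline="")`; values in the returned row are strings, as read by `csv.DictReader`.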

Installation

# Clone ToxicThesis for full model implementations
git clone https://github.com/simo-corbo/ToxicThesis
cd ToxicThesis
pip install -r requirements.txt

# Or install only the third-party dependencies
# (the model code in src/models/rntn.py still comes from the ToxicThesis repository)
pip install torch transformers huggingface_hub fasttext-wheel stanza

Citation

@software{toxicthesis2025,
  title={ToxicThesis},
  author={Corbo, Simone},
  year={2025},
  url={https://github.com/simo-corbo/ToxicThesis}
}