---
license: apache-2.0
base_model: VietAI/vit5-base
tags:
- vietnamese
- spam-detection
- text-classification
- e-commerce
datasets:
- ViSpamReviews
metrics:
- accuracy
- macro-f1
- macro-precision
- macro-recall
model-index:
- name: vit5-spam-binary
  results:
  - task:
      type: text-classification
      name: Spam Review Detection
    dataset:
      name: ViSpamReviews
      type: ViSpamReviews
    metrics:
    - type: accuracy
      value: 0.9073
    - type: macro-f1
      value: 0.8815
---
# vit5-spam-binary: Spam Review Detection for Vietnamese Text

This model is a fine-tuned version of [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.

## Model Details

* **Base Model**: `VietAI/vit5-base`
* **Description**: ViT5 - Vietnamese T5 model for text generation and classification
* **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
* **Fine-tuning Framework**: HuggingFace Transformers
* **Task**: Spam Review Detection (binary)
* **Number of Classes**: 2

### Hyperparameters

* Max sequence length: `256`
* Learning rate: `5e-5`
* Batch size: `32`
* Epochs: `100`
* Early stopping patience: `5`

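With an epoch budget of 100 and an early stopping patience of 5, training usually ends well before the budget is used up. The sketch below illustrates how a patience rule of this kind typically works. It assumes that a higher validation score is better and that any non-improving epoch counts against the patience; the card does not state which metric was monitored, and `EarlyStopper`, `epochs_run`, and the per-epoch scores are hypothetical names and values, not taken from the actual training run.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Tracks the best validation score and signals when to stop."""
    patience: int = 5                    # value from the hyperparameter list
    best_score: float = float("-inf")
    bad_epochs: int = 0                  # consecutive epochs without improvement

    def should_stop(self, score: float) -> bool:
        if score > self.best_score:
            self.best_score = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_scores, max_epochs=100, patience=5):
    """Return the number of epochs executed before training stops."""
    stopper = EarlyStopper(patience=patience)
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if stopper.should_stop(score):
            return epoch
    return min(len(val_scores), max_epochs)


# Made-up validation scores per epoch, for illustration only
toy_scores = [0.80, 0.85, 0.87, 0.86, 0.87, 0.86, 0.85, 0.86, 0.84, 0.83]
stopped_at = epochs_run(toy_scores)
print(f"Training would stop after epoch {stopped_at}")
```

In the Transformers `Trainer`, the same behavior is normally obtained with `EarlyStoppingCallback`; whether this model was trained that way is not stated in the card.
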
## Dataset

The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:

* **Train set**: 14,299 samples (72%)
* **Validation set**: 1,590 samples (8%)
* **Test set**: 3,971 samples (20%)

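As a quick sanity check, the split sizes can be summed and converted to percentages. This minimal sketch uses only the counts reported in the list; the variable names are illustrative.

```python
# Split sizes as reported in this model card
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}
total = sum(splits.values())

# Share of each split as a percentage of the whole dataset
shares = {name: 100 * n / total for name, n in splits.items()}

for name, n in splits.items():
    print(f"{name:<10} {n:>6,}  {shares[name]:5.1f}%")
print(f"{'total':<10} {total:>6,}")
```
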
### Label Distribution

* **Non-spam** (0): Genuine product reviews
* **Spam** (1): Fake or promotional reviews

## Results

The model was evaluated on the test set with the following metrics:

* **Accuracy**: `0.9073`
* **Macro-F1**: `0.8815`

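Macro-F1 is the unweighted mean of the per-class F1 scores, so it can fall below accuracy when one class is rarer or harder to predict. The sketch below shows how the two metrics are typically computed for a binary task. It is a plain-Python illustration on made-up labels, not the evaluation script used for this model, and `macro_f1` and `accuracy` are hypothetical helper names.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): 8 non-spam (0) and 2 spam (1) reviews, one spam missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}  macro-F1={mf1:.4f}")
```

On this toy data a single missed spam review costs little accuracy but noticeably lowers macro-F1, because the spam class contributes half of the average.
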
## Usage

The example below loads the model and classifies a single Vietnamese review:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/vit5-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example review text ("This product is very good, the shop delivers quickly!")
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

# Tokenize, truncating to the max sequence length used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    probabilities = torch.softmax(outputs.logits, dim=-1)

# Map the class index to its label
label_map = {0: "Non-spam", 1: "Spam"}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
```

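The prediction step in the usage example reduces to a softmax over the two logits followed by an argmax. The dependency-free sketch below mirrors that step so the arithmetic is easy to inspect. The logit values are hypothetical and are not real model output, and `softmax` and `decode` are illustrative helper names.

```python
import math

LABEL_MAP = {0: "Non-spam", 1: "Spam"}  # same mapping as in the usage example


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, confidence) for one review's two-class logits."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[idx], probs[idx]


# Hypothetical logits for one review (illustrative values only)
label, confidence = decode([2.0, -1.0])
print(f"Predicted: {label} (confidence: {confidence:.2%})")
```
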
## Citation

If you use this model, please cite:

```bibtex
@misc{vit5_spam_detection,
  title={vit5-spam-binary: Spam Review Detection for Vietnamese Text},
  author={ViSoLex Team},
  year={2025},
  howpublished={\url{https://huggingface.co/visolex/vit5-spam-binary}}
}
```

## License

This model is released under the Apache-2.0 license.

## Acknowledgments

* Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base)
* Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
* ViSoLex Toolkit for Vietnamese NLP