Update README.md with comprehensive SVC performance metrics

- Update model-index with correct SVC performance: 92.80% (VNTC), 72.47% (UTS2017_Bank)
- Add SVC and support-vector-classification tags to metadata
- Update model description to highlight SVC performance improvements
- Update algorithm description to cover the SVC/Logistic Regression pipeline
- Update performance metrics sections with a detailed SVC vs. Logistic Regression comparison
- Document per-class SVC gains on UTS2017_Bank: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
- Update training examples to recommend SVC for optimal performance
- Replace model file references with latest SVC model (uts2017_bank_classifier_20250928_060819.joblib)
- Add training time comparisons and computational trade-offs documentation
- Remove outdated system card file (replaced by updated version in paper/ directory)
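
For reviewers who want to see the shape of the pipeline the README now describes, here is a minimal sketch. It assumes `svc_linear` corresponds to scikit-learn's `SVC(kernel="linear")`; the actual `train.py` is not part of this diff, and `build_pipeline` is a hypothetical helper name used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(model: str = "svc_linear") -> Pipeline:
    """Assemble CountVectorizer -> TfidfTransformer -> classifier."""
    if model == "svc_linear":
        # Assumption: a linear-kernel SVC with default regularization.
        classifier = SVC(kernel="linear")
    elif model == "logistic":
        # 1,000 max iterations, as stated in the README's architecture list.
        classifier = LogisticRegression(max_iter=1000)
    else:
        raise ValueError(f"unknown model type: {model!r}")

    return Pipeline([
        # 20,000 max features with unigrams and bigrams, per the README.
        ("vect", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("clf", classifier),
    ])


# Toy banking-style examples, invented here purely to exercise the pipeline.
texts = [
    "lãi suất vay quá cao",
    "thẻ tín dụng bị khóa",
    "lãi suất tiết kiệm thấp",
    "thẻ ATM bị nuốt",
]
labels = ["INTEREST_RATE", "CARD", "INTEREST_RATE", "CARD"]

pipeline = build_pipeline("svc_linear").fit(texts, labels)
print(pipeline.predict(["lãi suất tháng này"]))
```

Swapping `"svc_linear"` for `"logistic"` changes only the final step, which is what makes the side-by-side comparison in the README straightforward.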

Performance improvements documented:
- VNTC: 92.33% → 92.80% accuracy with SVC
- UTS2017_Bank: 70.96% → 72.47% accuracy with SVC
- UTS2017_Bank weighted F1-score: 0.63 → 0.66 with SVC
- Comprehensive model comparison and recommendations added
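
To put the trade-off in concrete terms, the short calculation below derives the accuracy gain in percentage points and the training-time slowdown from the figures reported in this commit's README changes. The helper names `gain_in_points` and `slowdown` are illustrative only.

```python
# (Logistic Regression, SVC) test accuracy in percent, from this commit.
accuracy = {
    "VNTC": (92.33, 92.80),
    "UTS2017_Bank": (70.96, 72.47),
}

# (Logistic Regression, SVC) training time in seconds, from the README diff.
# The VNTC SVC figure is reported as ~54.6 minutes.
training_seconds = {
    "VNTC": (31.40, 54.6 * 60),
    "UTS2017_Bank": (0.78, 5.3),
}


def gain_in_points(logistic: float, svc: float) -> float:
    """Accuracy gain of SVC over Logistic Regression, in percentage points."""
    return round(svc - logistic, 2)


def slowdown(logistic: float, svc: float) -> float:
    """How many times longer SVC takes to train than Logistic Regression."""
    return svc / logistic


for dataset in accuracy:
    gain = gain_in_points(*accuracy[dataset])
    factor = slowdown(*training_seconds[dataset])
    print(f"{dataset}: +{gain:.2f} points accuracy, {factor:.1f}x training time")
```

On VNTC the gain is under half a percentage point for roughly a hundredfold increase in training time, so Logistic Regression may remain the practical choice when retraining frequently.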

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Files changed (2)

README.md +37 -26
paper/Sonar Core 1 - System Card.md +0 -175

README.md CHANGED

@@ -10,6 +10,8 @@ tags:
   - sonar
   - tf-idf
   - logistic-regression
 datasets:
   - vntc
   - undertheseanlp/UTS2017_Bank
@@ -29,8 +31,8 @@ model-index:
           type: vntc
         metrics:
           - type: accuracy
-            value: 0.9233
-            name: Test Accuracy
           - type: precision
             value: 0.92
             name: Weighted Precision
@@ -48,17 +50,17 @@ model-index:
           type: undertheseanlp/UTS2017_Bank
         metrics:
           - type: accuracy
-            value: 0.7096
-            name: Test Accuracy
           - type: precision
-            value: 0.64
-            name: Weighted Precision
           - type: recall
-            value: 0.71
-            name: Weighted Recall
           - type: f1-score
-            value: 0.63
-            name: Weighted F1-Score
 language:
   - vi
 pipeline_tag: text-classification
@@ -66,7 +68,7 @@ pipeline_tag: text-classification
 # Sonar Core 1 - Vietnamese Text Classification Model
-A machine learning-based text classification model designed for Vietnamese language processing. Built on TF-IDF feature extraction pipeline combined with Logistic Regression, achieving **92.33% accuracy** on VNTC (news) and **70.96% accuracy** on UTS2017_Bank (banking) datasets.
 📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
@@ -76,11 +78,11 @@ A machine learning-based text classification model designed for Vietnamese langu
 ### Model Architecture
-- **Algorithm**: TF-IDF + Logistic Regression Pipeline
 - **Feature Extraction**: CountVectorizer with 20,000 max features
 - **N-gram Support**: Unigram and bigram (1-2)
 - **TF-IDF**: Transformation with IDF weighting
-- **Classifier**: Logistic Regression with 1,000 max iterations
 - **Framework**: scikit-learn ≥1.6
 - **Caching System**: Hash-based caching for efficient processing
@@ -135,11 +137,14 @@ python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min
 #### UTS2017_Bank Dataset (Banking Text Classification)
 ```bash
-# Train with UTS2017_Bank dataset
 python train.py --dataset uts2017 --model logistic
-# With specific parameters
-python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
 # Compare multiple configurations
 python train.py --dataset uts2017 --compare
@@ -173,20 +178,25 @@ bank_results = train_notebook(
 ### VNTC Dataset Performance
 - **Training Accuracy**: 95.39%
-- **Test Accuracy**: 92.33%
 - **Training Samples**: 33,759
 - **Test Samples**: 50,373
-- **Training Time**: ~31.40 seconds
 - **Best Performing**: Sports (98% F1-score)
 - **Challenging Category**: Lifestyle (76% F1-score)
 ### UTS2017_Bank Dataset Performance
-- **Training Accuracy**: 76.22%
-- **Test Accuracy**: 70.96%
 - **Training Samples**: 1,581
 - **Test Samples**: 396
-- **Training Time**: ~0.78 seconds
-- **Best Performing**: TRADEMARK (88% F1-score), CUSTOMER_SUPPORT (76% F1-score)
 - **Challenges**: Many minority classes with insufficient training data
 ## Using the Pre-trained Models
@@ -219,7 +229,7 @@ import joblib
 # Download and load UTS2017_Bank model
 bank_model = joblib.load(
-    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250927_161733.joblib")
 )
 # Make prediction on banking text
@@ -242,7 +252,7 @@ vntc_model = joblib.load(
     hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
 )
 bank_model = joblib.load(
-    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250927_161733.joblib")
 )
 # Function to classify any Vietnamese text
@@ -292,7 +302,7 @@ for text in examples:
 ## Model Parameters
 - `dataset`: Dataset to use ("vntc" or "uts2017")
-- `model`: Model type ("logistic" or "svc")
 - `max_features`: Maximum number of TF-IDF features (default: 20000)
 - `ngram_min/max`: N-gram range (default: 1-2)
 - `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
@@ -306,7 +316,8 @@ for text in examples:
 4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
 5. **Specific Weaknesses**:
    - VNTC: Lower performance on lifestyle category (71% recall)
-   - UTS2017_Bank: Poor performance on minority classes
 ## Ethical Considerations

   - sonar
   - tf-idf
   - logistic-regression
+  - svc
+  - support-vector-classification
 datasets:
   - vntc
   - undertheseanlp/UTS2017_Bank
           type: vntc
         metrics:
           - type: accuracy
+            value: 0.9280
+            name: Test Accuracy (SVC)
           - type: precision
             value: 0.92
             name: Weighted Precision
           type: undertheseanlp/UTS2017_Bank
         metrics:
           - type: accuracy
+            value: 0.7247
+            name: Test Accuracy (SVC)
           - type: precision
+            value: 0.65
+            name: Weighted Precision (SVC)
           - type: recall
+            value: 0.72
+            name: Weighted Recall (SVC)
           - type: f1-score
+            value: 0.66
+            name: Weighted F1-Score (SVC)
 language:
   - vi
 pipeline_tag: text-classification
 # Sonar Core 1 - Vietnamese Text Classification Model
+A machine learning-based text classification model designed for Vietnamese language processing. Built on TF-IDF feature extraction pipeline combined with Support Vector Classification (SVC) and Logistic Regression, achieving **92.80% accuracy** on VNTC (news) and **72.47% accuracy** on UTS2017_Bank (banking) datasets with SVC.
 📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
 ### Model Architecture
+- **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline
 - **Feature Extraction**: CountVectorizer with 20,000 max features
 - **N-gram Support**: Unigram and bigram (1-2)
 - **TF-IDF**: Transformation with IDF weighting
+- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
 - **Framework**: scikit-learn ≥1.6
 - **Caching System**: Hash-based caching for efficient processing
 #### UTS2017_Bank Dataset (Banking Text Classification)
 ```bash
+# Train with UTS2017_Bank dataset (SVC recommended)
+python train.py --dataset uts2017 --model svc_linear
+# Train with Logistic Regression
 python train.py --dataset uts2017 --model logistic
+# With specific parameters (SVC)
+python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2
 # Compare multiple configurations
 python train.py --dataset uts2017 --compare
 ### VNTC Dataset Performance
 - **Training Accuracy**: 95.39%
+- **Test Accuracy (SVC)**: 92.80%
+- **Test Accuracy (Logistic Regression)**: 92.33%
 - **Training Samples**: 33,759
 - **Test Samples**: 50,373
+- **Training Time (SVC)**: ~54.6 minutes
+- **Training Time (Logistic Regression)**: ~31.40 seconds
 - **Best Performing**: Sports (98% F1-score)
 - **Challenging Category**: Lifestyle (76% F1-score)
 ### UTS2017_Bank Dataset Performance
+- **Training Accuracy (SVC)**: 95.07%
+- **Test Accuracy (SVC)**: 72.47%
+- **Test Accuracy (Logistic Regression)**: 70.96%
 - **Training Samples**: 1,581
 - **Test Samples**: 396
+- **Training Time (SVC)**: ~5.3 seconds
+- **Training Time (Logistic Regression)**: ~0.78 seconds
+- **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
+- **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
 - **Challenges**: Many minority classes with insufficient training data
 ## Using the Pre-trained Models
 # Download and load UTS2017_Bank model
 bank_model = joblib.load(
+    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
 )
 # Make prediction on banking text
     hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
 )
 bank_model = joblib.load(
+    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
 )
 # Function to classify any Vietnamese text
 ## Model Parameters
 - `dataset`: Dataset to use ("vntc" or "uts2017")
+- `model`: Model type ("logistic" or "svc" - SVC recommended for best performance)
 - `max_features`: Maximum number of TF-IDF features (default: 20000)
 - `ngram_min/max`: N-gram range (default: 1-2)
 - `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
 4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
 5. **Specific Weaknesses**:
    - VNTC: Lower performance on lifestyle category (71% recall)
+   - UTS2017_Bank: Poor performance on minority classes despite SVC improvements
+   - SVC requires longer training time compared to Logistic Regression
 ## Ethical Considerations

paper/Sonar Core 1 - System Card.md DELETED

@@ -1,175 +0,0 @@
-<h1 align="center">Sonar Core 1 - Model Card</h1>
-<p align="center"><b>Vietnamese Text Classification Model</b></p>
-<p align="center"><b>Underthesea NLP Team</b></p>
-<p align="center"><i>September 2025</i></p>
----
-## Model Overview
-**Sonar Core 1** is a Vietnamese text classification model built on traditional machine learning techniques (TF-IDF + SVC/Logistic Regression) optimized for production deployment. The model achieves **92.80% accuracy** on Vietnamese news classification with SVC and **72.47% accuracy** on banking text classification with SVC, offering a computationally efficient alternative to deep learning approaches.
-### Quick Facts
-- **Model Type**: Text Classification (Multi-class)
-- **Language**: Vietnamese
-- **Architecture**: TF-IDF + Logistic Regression
-- **Framework**: scikit-learn
-- **Model Size**: ~2.4MB (VNTC), ~3MB (UTS2017_Bank)
-- **Inference Speed**: 0.38ms per sample (VNTC), 0.025ms per sample (banking)
-### Intended Use
-- Vietnamese news article categorization
-- Banking/financial text classification
-- Content moderation and organization
-- Document routing and tagging
-- Educational and research purposes
-## Model Details
-**Sonar Core 1** is a Vietnamese text classification model built on **scikit-learn >=1.6**, utilizing a TF-IDF pipeline with Logistic Regression to classify text across multiple domains including news categories and banking services. The architecture employs:
-- CountVectorizer with **20,000 max features** (optimized from the initial 10,000)
-- N-gram extraction: unigram and bigram support
-- TF-IDF transformation with IDF weighting
-- Logistic Regression classifier with 1,000 max iterations
-- **Hash-based caching system** for efficient processing
-Released on **2025-09-21**, the model achieves **92.80% test accuracy** with SVC and **95.39% training accuracy** with optimized training time using the hash-based caching system. The model features a dedicated VNTCDataset class for efficient data handling and improved modular architecture.
-## Training Data
-The model supports two Vietnamese text classification tasks:
-**VNTC Dataset (News Classification)** - 10 categories:
-Politics, Lifestyle, Science, Business, Law, Health, World News, Sports, Culture, Information Technology
-**UTS2017_Bank Dataset (Banking Services)** - 14 categories:
-Account, Card, Customer Support, Discount, Interest Rate, Internet Banking, Loan, Money Transfer, Payment, Promotion, Saving, Security, Trademark, and Other services
-### Dataset Statistics
-| Dataset | Categories | Training Samples | Test Samples | Best Accuracy |
-|---------|------------|------------------|--------------|---------------|
-| VNTC (News) | 10 | 33,759 | 50,373 | 92.80% (SVC) |
-| UTS2017_Bank | 14 | 1,581 | 396 | 72.47% (SVC) |
-## Performance Metrics
-### Model Performance
-| Dataset | Test Accuracy | Training Time | Best Categories (F1-Score) |
-|---------|---------------|---------------|------------------------------|
-| **VNTC (News)** | **92.80% (SVC)** | ~54 minutes (SVC) | Sports (98%), Health (94%) |
-| **UTS2017_Bank** | **72.47% (SVC)** | ~5.3 seconds (SVC) | Trademark (88%), Customer Support (76%) |
-### Key Performance Highlights
-- **VNTC Dataset**: Excellent performance across all 10 news categories with macro F1-score of 0.91
-- **UTS2017_Bank Dataset**: Good performance on dominant categories but struggles with minority classes due to data imbalance
-- **Inference Speed**: Very fast predictions - 0.38ms per sample (news) and 0.025ms per sample (banking)
-- **Training Efficiency**: Quick training times with hash-based caching system
-## Limitations
-### Known Limitations
-- **Language**: Only supports Vietnamese text
-- **Domain Scope**: Optimized for news articles and banking text; may not perform well on social media, conversational text, or other domains
-- **Class Imbalance**: Performance degrades on datasets with severely imbalanced classes
-- **Vocabulary**: Limited to 20,000 most frequent features, may miss rare but important terms
-- **Formal Text Bias**: Trained on formal writing styles (news and banking), may not handle informal text well
-### Ethical Considerations
-- Model reflects biases present in training datasets
-- Performance varies significantly across categories
-- Users should validate performance on their specific use case before deployment
-## Future Improvements
-- Experiment with advanced models (XGBoost, Neural Networks)
-- Increase vocabulary size for better coverage
-- Add support for longer documents and confidence thresholds
-- Address class imbalance through oversampling and class weighting
-- Expand to additional Vietnamese text domains
-## Usage
-### Installation
-```bash
-pip install scikit-learn>=1.6 joblib
-```
-### Training
-```bash
-# Train on VNTC dataset (default)
-uv run python train.py
-# Train on banking dataset
-uv run python train.py --dataset uts2017
-# Compare multiple models
-uv run python train.py --compare
-# Train with specific parameters
-uv run python train.py --model logistic --max-features 20000
-```
-### Inference
-```bash
-# Single prediction
-uv run python predict.py --text "Your Vietnamese text here"
-# Interactive mode
-uv run python predict.py --interactive
-# Show examples
-uv run python predict.py --examples
-```
-### Python API
-```python
-import joblib
-# Load model
-model = joblib.load('vntc_classifier.pkl')
-# Make prediction
-text = "Việt Nam giành chiến thắng trong trận bán kết"
-prediction = model.predict([text])[0]
-probabilities = model.predict_proba([text])[0]
-```
-## References
-1. VNTC Dataset: Hoang, Cong Duy Vu, Dien Dinh, Le Nguyen Nguyen, and Quoc Hung Ngo. (2007). A Comparative Study on Vietnamese Text Classification Methods. In Proceedings of IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2007), pp. 267-273. IEEE. DOI: 10.1109/RIVF.2007.369167
-2. UTS2017_Bank Dataset: Available from Hugging Face Datasets: https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
-3. TF-IDF (Term Frequency-Inverse Document Frequency): Salton, Gerard, and Michael J. McGill. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York. ISBN: 978-0070544840
-4. Logistic Regression for Text Classification: Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer Series in Statistics. Springer, New York. DOI: 10.1007/978-0-387-84858-7
-5. Scikit-learn: Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825-2830. Retrieved from https://www.jmlr.org/papers/v12/pedregosa11a.html
-6. N-gram Language Models: Brown, Peter F., Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. (1992). Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4), 467-480. Retrieved from https://aclanthology.org/J92-4003/
-## License
-Model trained on publicly available VNTC and UTS2017_Bank datasets. Please refer to original dataset licenses for usage terms.
-## Citation
-If you use this model, please cite:
-```bibtex
-@misc{undertheseanlp_2025,
-    author       = { undertheseanlp },
-    title        = { Sonar Core 1 - Vietnamese Text Classification Model },
-    year         = 2025,
-    url          = { https://huggingface.co/undertheseanlp/sonar_core_1 },
-    doi          = { 10.57967/hf/6599 },
-    publisher    = { Hugging Face }
-}
-```