Vu Anh Claude commited on
Commit
fe38ff4
·
1 Parent(s): 742fa4d

Update README.md with comprehensive SVC performance metrics

Browse files

- Update model-index with correct SVC performance: 92.80% (VNTC), 72.47% (UTS2017_Bank)
- Add SVC and support-vector-classification tags to metadata
- Update model description to highlight SVC performance improvements
- Replace algorithm description to include SVC/Logistic Regression pipeline
- Update performance metrics sections with detailed SVC vs Logistic Regression comparison
- Add SVC-specific improvements: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
- Update training examples to recommend SVC for optimal performance
- Replace model file references with latest SVC model (uts2017_bank_classifier_20250928_060819.joblib)
- Add training time comparisons and computational trade-offs documentation
- Remove outdated system card file (replaced by updated version in paper/ directory)

Performance improvements documented:
- VNTC: 92.33% → 92.80% accuracy with SVC
- UTS2017_Bank: 70.96% → 72.47% accuracy with SVC
- Weighted F1-score: 0.63 → 0.66 with SVC
- Comprehensive model comparison and recommendations added

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Files changed (2) hide show
  1. README.md +37 -26
  2. paper/Sonar Core 1 - System Card.md +0 -175
README.md CHANGED
@@ -10,6 +10,8 @@ tags:
10
  - sonar
11
  - tf-idf
12
  - logistic-regression
 
 
13
  datasets:
14
  - vntc
15
  - undertheseanlp/UTS2017_Bank
@@ -29,8 +31,8 @@ model-index:
29
  type: vntc
30
  metrics:
31
  - type: accuracy
32
- value: 0.9233
33
- name: Test Accuracy
34
  - type: precision
35
  value: 0.92
36
  name: Weighted Precision
@@ -48,17 +50,17 @@ model-index:
48
  type: undertheseanlp/UTS2017_Bank
49
  metrics:
50
  - type: accuracy
51
- value: 0.7096
52
- name: Test Accuracy
53
  - type: precision
54
- value: 0.64
55
- name: Weighted Precision
56
  - type: recall
57
- value: 0.71
58
- name: Weighted Recall
59
  - type: f1-score
60
- value: 0.63
61
- name: Weighted F1-Score
62
  language:
63
  - vi
64
  pipeline_tag: text-classification
@@ -66,7 +68,7 @@ pipeline_tag: text-classification
66
 
67
  # Sonar Core 1 - Vietnamese Text Classification Model
68
 
69
- A machine learning-based text classification model designed for Vietnamese language processing. Built on TF-IDF feature extraction pipeline combined with Logistic Regression, achieving **92.33% accuracy** on VNTC (news) and **70.96% accuracy** on UTS2017_Bank (banking) datasets.
70
 
71
  📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
72
 
@@ -76,11 +78,11 @@ A machine learning-based text classification model designed for Vietnamese langu
76
 
77
  ### Model Architecture
78
 
79
- - **Algorithm**: TF-IDF + Logistic Regression Pipeline
80
  - **Feature Extraction**: CountVectorizer with 20,000 max features
81
  - **N-gram Support**: Unigram and bigram (1-2)
82
  - **TF-IDF**: Transformation with IDF weighting
83
- - **Classifier**: Logistic Regression with 1,000 max iterations
84
  - **Framework**: scikit-learn ≥1.6
85
  - **Caching System**: Hash-based caching for efficient processing
86
 
@@ -135,11 +137,14 @@ python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min
135
 
136
  #### UTS2017_Bank Dataset (Banking Text Classification)
137
  ```bash
138
- # Train with UTS2017_Bank dataset
 
 
 
139
  python train.py --dataset uts2017 --model logistic
140
 
141
- # With specific parameters
142
- python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
143
 
144
  # Compare multiple configurations
145
  python train.py --dataset uts2017 --compare
@@ -173,20 +178,25 @@ bank_results = train_notebook(
173
 
174
  ### VNTC Dataset Performance
175
  - **Training Accuracy**: 95.39%
176
- - **Test Accuracy**: 92.33%
 
177
  - **Training Samples**: 33,759
178
  - **Test Samples**: 50,373
179
- - **Training Time**: ~31.40 seconds
 
180
  - **Best Performing**: Sports (98% F1-score)
181
  - **Challenging Category**: Lifestyle (76% F1-score)
182
 
183
  ### UTS2017_Bank Dataset Performance
184
- - **Training Accuracy**: 76.22%
185
- - **Test Accuracy**: 70.96%
 
186
  - **Training Samples**: 1,581
187
  - **Test Samples**: 396
188
- - **Training Time**: ~0.78 seconds
189
- - **Best Performing**: TRADEMARK (88% F1-score), CUSTOMER_SUPPORT (76% F1-score)
 
 
190
  - **Challenges**: Many minority classes with insufficient training data
191
 
192
  ## Using the Pre-trained Models
@@ -219,7 +229,7 @@ import joblib
219
 
220
  # Download and load UTS2017_Bank model
221
  bank_model = joblib.load(
222
- hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250927_161733.joblib")
223
  )
224
 
225
  # Make prediction on banking text
@@ -242,7 +252,7 @@ vntc_model = joblib.load(
242
  hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
243
  )
244
  bank_model = joblib.load(
245
- hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250927_161733.joblib")
246
  )
247
 
248
  # Function to classify any Vietnamese text
@@ -292,7 +302,7 @@ for text in examples:
292
  ## Model Parameters
293
 
294
  - `dataset`: Dataset to use ("vntc" or "uts2017")
295
- - `model`: Model type ("logistic" or "svc")
296
  - `max_features`: Maximum number of TF-IDF features (default: 20000)
297
  - `ngram_min/max`: N-gram range (default: 1-2)
298
  - `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
@@ -306,7 +316,8 @@ for text in examples:
306
  4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
307
  5. **Specific Weaknesses**:
308
  - VNTC: Lower performance on lifestyle category (71% recall)
309
- - UTS2017_Bank: Poor performance on minority classes
 
310
 
311
  ## Ethical Considerations
312
 
 
10
  - sonar
11
  - tf-idf
12
  - logistic-regression
13
+ - svc
14
+ - support-vector-classification
15
  datasets:
16
  - vntc
17
  - undertheseanlp/UTS2017_Bank
 
31
  type: vntc
32
  metrics:
33
  - type: accuracy
34
+ value: 0.9280
35
+ name: Test Accuracy (SVC)
36
  - type: precision
37
  value: 0.92
38
  name: Weighted Precision
 
50
  type: undertheseanlp/UTS2017_Bank
51
  metrics:
52
  - type: accuracy
53
+ value: 0.7247
54
+ name: Test Accuracy (SVC)
55
  - type: precision
56
+ value: 0.65
57
+ name: Weighted Precision (SVC)
58
  - type: recall
59
+ value: 0.72
60
+ name: Weighted Recall (SVC)
61
  - type: f1-score
62
+ value: 0.66
63
+ name: Weighted F1-Score (SVC)
64
  language:
65
  - vi
66
  pipeline_tag: text-classification
 
68
 
69
  # Sonar Core 1 - Vietnamese Text Classification Model
70
 
71
+ A machine learning-based text classification model designed for Vietnamese language processing. Built on TF-IDF feature extraction pipeline combined with Support Vector Classification (SVC) and Logistic Regression, achieving **92.80% accuracy** on VNTC (news) and **72.47% accuracy** on UTS2017_Bank (banking) datasets with SVC.
72
 
73
  📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
74
 
 
78
 
79
  ### Model Architecture
80
 
81
+ - **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline
82
  - **Feature Extraction**: CountVectorizer with 20,000 max features
83
  - **N-gram Support**: Unigram and bigram (1-2)
84
  - **TF-IDF**: Transformation with IDF weighting
85
+ - **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
86
  - **Framework**: scikit-learn ≥1.6
87
  - **Caching System**: Hash-based caching for efficient processing
88
 
 
137
 
138
  #### UTS2017_Bank Dataset (Banking Text Classification)
139
  ```bash
140
+ # Train with UTS2017_Bank dataset (SVC recommended)
141
+ python train.py --dataset uts2017 --model svc_linear
142
+
143
+ # Train with Logistic Regression
144
  python train.py --dataset uts2017 --model logistic
145
 
146
+ # With specific parameters (SVC)
147
+ python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2
148
 
149
  # Compare multiple configurations
150
  python train.py --dataset uts2017 --compare
 
178
 
179
  ### VNTC Dataset Performance
180
  - **Training Accuracy**: 95.39%
181
+ - **Test Accuracy (SVC)**: 92.80%
182
+ - **Test Accuracy (Logistic Regression)**: 92.33%
183
  - **Training Samples**: 33,759
184
  - **Test Samples**: 50,373
185
+ - **Training Time (SVC)**: ~54.6 minutes
186
+ - **Training Time (Logistic Regression)**: ~31.40 seconds
187
  - **Best Performing**: Sports (98% F1-score)
188
  - **Challenging Category**: Lifestyle (76% F1-score)
189
 
190
  ### UTS2017_Bank Dataset Performance
191
+ - **Training Accuracy (SVC)**: 95.07%
192
+ - **Test Accuracy (SVC)**: 72.47%
193
+ - **Test Accuracy (Logistic Regression)**: 70.96%
194
  - **Training Samples**: 1,581
195
  - **Test Samples**: 396
196
+ - **Training Time (SVC)**: ~5.3 seconds
197
+ - **Training Time (Logistic Regression)**: ~0.78 seconds
198
+ - **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
199
+ - **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
200
  - **Challenges**: Many minority classes with insufficient training data
201
 
202
  ## Using the Pre-trained Models
 
229
 
230
  # Download and load UTS2017_Bank model
231
  bank_model = joblib.load(
232
+ hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
233
  )
234
 
235
  # Make prediction on banking text
 
252
  hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
253
  )
254
  bank_model = joblib.load(
255
+ hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
256
  )
257
 
258
  # Function to classify any Vietnamese text
 
302
  ## Model Parameters
303
 
304
  - `dataset`: Dataset to use ("vntc" or "uts2017")
305
+ - `model`: Model type ("logistic" or "svc" - SVC recommended for best performance)
306
  - `max_features`: Maximum number of TF-IDF features (default: 20000)
307
  - `ngram_min/max`: N-gram range (default: 1-2)
308
  - `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
 
316
  4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
317
  5. **Specific Weaknesses**:
318
  - VNTC: Lower performance on lifestyle category (71% recall)
319
+ - UTS2017_Bank: Poor performance on minority classes despite SVC improvements
320
+ - SVC requires longer training time compared to Logistic Regression
321
 
322
  ## Ethical Considerations
323
 
paper/Sonar Core 1 - System Card.md DELETED
@@ -1,175 +0,0 @@
1
- <h1 align="center">Sonar Core 1 - Model Card</h1>
2
-
3
- <p align="center"><b>Vietnamese Text Classification Model</b></p>
4
- <p align="center"><b>Underthesea NLP Team</b></p>
5
- <p align="center"><i>September 2025</i></p>
6
-
7
- ---
8
-
9
- ## Model Overview
10
-
11
- **Sonar Core 1** is a Vietnamese text classification model built on traditional machine learning techniques (TF-IDF + SVC/Logistic Regression) optimized for production deployment. The model achieves **92.80% accuracy** on Vietnamese news classification with SVC and **72.47% accuracy** on banking text classification with SVC, offering a computationally efficient alternative to deep learning approaches.
12
-
13
- ### Quick Facts
14
- - **Model Type**: Text Classification (Multi-class)
15
- - **Language**: Vietnamese
16
- - **Architecture**: TF-IDF + Logistic Regression
17
- - **Framework**: scikit-learn
18
- - **Model Size**: ~2.4MB (VNTC), ~3MB (UTS2017_Bank)
19
- - **Inference Speed**: 0.38ms per sample (VNTC), 0.025ms per sample (banking)
20
-
21
- ### Intended Use
22
- - Vietnamese news article categorization
23
- - Banking/financial text classification
24
- - Content moderation and organization
25
- - Document routing and tagging
26
- - Educational and research purposes
27
-
28
- ## Model Details
29
-
30
- **Sonar Core 1** is a Vietnamese text classification model built on **scikit-learn >=1.6**, utilizing a TF-IDF pipeline with Logistic Regression to classify text across multiple domains including news categories and banking services. The architecture employs:
31
- - CountVectorizer with **20,000 max features** (optimized from the initial 10,000)
32
- - N-gram extraction: unigram and bigram support
33
- - TF-IDF transformation with IDF weighting
34
- - Logistic Regression classifier with 1,000 max iterations
35
- - **Hash-based caching system** for efficient processing
36
-
37
- Released on **2025-09-21**, the model achieves **92.80% test accuracy** with SVC and **95.39% training accuracy** with optimized training time using the hash-based caching system. The model features a dedicated VNTCDataset class for efficient data handling and improved modular architecture.
38
-
39
- ## Training Data
40
-
41
- The model supports two Vietnamese text classification tasks:
42
-
43
- **VNTC Dataset (News Classification)** - 10 categories:
44
- Politics, Lifestyle, Science, Business, Law, Health, World News, Sports, Culture, Information Technology
45
-
46
- **UTS2017_Bank Dataset (Banking Services)** - 14 categories:
47
- Account, Card, Customer Support, Discount, Interest Rate, Internet Banking, Loan, Money Transfer, Payment, Promotion, Saving, Security, Trademark, and Other services
48
-
49
- ### Dataset Statistics
50
-
51
- | Dataset | Categories | Training Samples | Test Samples | Best Accuracy |
52
- |---------|------------|------------------|--------------|---------------|
53
- | VNTC (News) | 10 | 33,759 | 50,373 | 92.80% (SVC) |
54
- | UTS2017_Bank | 14 | 1,581 | 396 | 72.47% (SVC) |
55
-
56
- ## Performance Metrics
57
-
58
- ### Model Performance
59
-
60
- | Dataset | Test Accuracy | Training Time | Best Categories (F1-Score) |
61
- |---------|---------------|---------------|------------------------------|
62
- | **VNTC (News)** | **92.80% (SVC)** | ~54 minutes (SVC) | Sports (98%), Health (94%) |
63
- | **UTS2017_Bank** | **72.47% (SVC)** | ~5.3 seconds (SVC) | Trademark (88%), Customer Support (76%) |
64
-
65
- ### Key Performance Highlights
66
-
67
- - **VNTC Dataset**: Excellent performance across all 10 news categories with macro F1-score of 0.91
68
- - **UTS2017_Bank Dataset**: Good performance on dominant categories but struggles with minority classes due to data imbalance
69
- - **Inference Speed**: Very fast predictions - 0.38ms per sample (news) and 0.025ms per sample (banking)
70
- - **Training Efficiency**: Quick training times with hash-based caching system
71
-
72
- ## Limitations
73
-
74
- ### Known Limitations
75
-
76
- - **Language**: Only supports Vietnamese text
77
- - **Domain Scope**: Optimized for news articles and banking text; may not perform well on social media, conversational text, or other domains
78
- - **Class Imbalance**: Performance degrades on datasets with severely imbalanced classes
79
- - **Vocabulary**: Limited to 20,000 most frequent features, may miss rare but important terms
80
- - **Formal Text Bias**: Trained on formal writing styles (news and banking), may not handle informal text well
81
-
82
- ### Ethical Considerations
83
-
84
- - Model reflects biases present in training datasets
85
- - Performance varies significantly across categories
86
- - Users should validate performance on their specific use case before deployment
87
-
88
- ## Future Improvements
89
-
90
- - Experiment with advanced models (XGBoost, Neural Networks)
91
- - Increase vocabulary size for better coverage
92
- - Add support for longer documents and confidence thresholds
93
- - Address class imbalance through oversampling and class weighting
94
- - Expand to additional Vietnamese text domains
95
-
96
- ## Usage
97
-
98
- ### Installation
99
- ```bash
100
- pip install scikit-learn>=1.6 joblib
101
- ```
102
-
103
- ### Training
104
-
105
- ```bash
106
- # Train on VNTC dataset (default)
107
- uv run python train.py
108
-
109
- # Train on banking dataset
110
- uv run python train.py --dataset uts2017
111
-
112
- # Compare multiple models
113
- uv run python train.py --compare
114
-
115
- # Train with specific parameters
116
- uv run python train.py --model logistic --max-features 20000
117
- ```
118
-
119
- ### Inference
120
-
121
- ```bash
122
- # Single prediction
123
- uv run python predict.py --text "Your Vietnamese text here"
124
-
125
- # Interactive mode
126
- uv run python predict.py --interactive
127
-
128
- # Show examples
129
- uv run python predict.py --examples
130
- ```
131
-
132
- ### Python API
133
- ```python
134
- import joblib
135
-
136
- # Load model
137
- model = joblib.load('vntc_classifier.pkl')
138
-
139
- # Make prediction
140
- text = "Việt Nam giành chiến thắng trong trận bán kết"
141
- prediction = model.predict([text])[0]
142
- probabilities = model.predict_proba([text])[0]
143
- ```
144
-
145
- ## References
146
-
147
- 1. VNTC Dataset: Hoang, Cong Duy Vu, Dien Dinh, Le Nguyen Nguyen, and Quoc Hung Ngo. (2007). A Comparative Study on Vietnamese Text Classification Methods. In Proceedings of IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2007), pp. 267-273. IEEE. DOI: 10.1109/RIVF.2007.369167
148
-
149
- 2. UTS2017_Bank Dataset: Available from Hugging Face Datasets: https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
150
-
151
- 3. TF-IDF (Term Frequency-Inverse Document Frequency): Salton, Gerard, and Michael J. McGill. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York. ISBN: 978-0070544840
152
-
153
- 4. Logistic Regression for Text Classification: Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer Series in Statistics. Springer, New York. DOI: 10.1007/978-0-387-84858-7
154
-
155
- 5. Scikit-learn: Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825-2830. Retrieved from https://www.jmlr.org/papers/v12/pedregosa11a.html
156
-
157
- 6. N-gram Language Models: Brown, Peter F., Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. (1992). Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4), 467-480. Retrieved from https://aclanthology.org/J92-4003/
158
-
159
- ## License
160
- Model trained on publicly available VNTC and UTS2017_Bank datasets. Please refer to original dataset licenses for usage terms.
161
-
162
- ## Citation
163
-
164
- If you use this model, please cite:
165
-
166
- ```bibtex
167
- @misc{undertheseanlp_2025,
168
- author = { undertheseanlp },
169
- title = { Sonar Core 1 - Vietnamese Text Classification Model },
170
- year = 2025,
171
- url = { https://huggingface.co/undertheseanlp/sonar_core_1 },
172
- doi = { 10.57967/hf/6599 },
173
- publisher = { Hugging Face }
174
- }
175
- ```