---
sdk: streamlit
sdk_version: 1.50.0
---

# 🧪 Advanced ML Sentiment Lab

[![Streamlit](https://img.shields.io/badge/Powered%20by-Streamlit-FF4B4B)](https://streamlit.io/)<br>
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-orange.svg)](LICENSE)<br>
[![Made by Tarek Masryo](https://img.shields.io/badge/Made%20by-Tarek%20Masryo-blue)](https://github.com/tarekmasryo)

---

## 📌 Overview

Interactive **Streamlit + Plotly** app for **binary sentiment analysis**.

Upload any CSV with a **text column** and a **binary label**, then:

- Run quick EDA on text lengths, tokens, and class balance  
- Build TF-IDF word + optional char features  
- Train multiple classical models (Logistic Regression, Random Forest, Gradient Boosting, Naive Bayes)  
- Tune the decision threshold with **FP/FN business costs**  
- Inspect misclassified samples and test arbitrary texts live

Works well with the classic **IMDB 50K Reviews** dataset, but is generic enough for product reviews, tickets, surveys, etc.
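
As a rough illustration of the expected input, the sketch below reads a CSV with one text column and one label column and maps the label values to 1 (positive) and 0 (negative). The function name `load_binary_dataset` and the column names `review` / `sentiment` are assumptions made for this example; the app lets you pick the columns and the positive/negative values in the UI, and its internal loading code may differ.

```python
import csv
import io


def load_binary_dataset(csv_text, text_col, label_col, positive_values):
    """Return (texts, labels) where labels are 1 (positive) or 0 (negative).

    Hypothetical helper for illustration; not part of the app's code.
    """
    positives = {str(v).strip().lower() for v in positive_values}
    texts, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = (row.get(text_col) or "").strip()
        raw = (row.get(label_col) or "").strip().lower()
        if not text or not raw:
            continue  # skip rows with a missing text or label
        texts.append(text)
        labels.append(1 if raw in positives else 0)
    return texts, labels


# Tiny in-memory CSV; the column names are assumed for this example.
sample = "review,sentiment\nGreat film,positive\nDull plot,negative\n"
texts, labels = load_binary_dataset(sample, "review", "sentiment", ["positive"])
print(texts, labels)
```

Any value not listed in `positive_values` is treated as negative here, so a dataset with more than two label values would need filtering first.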

---

## 📊 Dashboard Preview

### EDA & KPIs  
![EDA](assets/eda-hero.png)

### Train & Validation  
![Train & Validation](assets/train-validation.png)

### Error Analysis  
![Error Analysis](assets/error-analysis.png)

### Deploy & Interactive Prediction  
![Deploy](assets/deploy.png)

---

## 🚀 How to use (in this Space)

1. **Load data**
   - Upload a CSV file  
   - Or place `IMDB Dataset.csv` / `imdb.csv` in the Space and reload  

2. **Map columns**
   - Choose the **text** column  
   - Choose the **label** column and map which values are *positive* vs *negative*  

3. **Train models**
   - Go to **“Train & Validation”**  
   - Set TF-IDF options, pick models, click **Train models**  

4. **Analyse & deploy**
   - Use **“Threshold & Cost”** to pick a business-aware threshold  
   - Check **“Compare Models”** + **“Error Analysis”**  
   - In **“Deploy”**, try any text and see the predicted sentiment + confidence bar  

No data is stored server-side beyond the current session.
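
The idea behind the **“Threshold & Cost”** step can be sketched in a few lines: given predicted probabilities for the positive class, scan candidate thresholds and keep the one with the lowest total cost, where each false positive and false negative carries its own price. This is a minimal sketch under assumed names (`pick_threshold`, `fp_cost`, `fn_cost`) and an assumed grid of 101 thresholds, not the app's actual implementation.

```python
def pick_threshold(y_true, y_prob, fp_cost, fn_cost, steps=101):
    """Return (threshold, total_cost) minimizing fp_cost * FP + fn_cost * FN.

    A sample is predicted positive when its probability is >= threshold.
    Hypothetical helper for illustration only.
    """
    best_t, best_cost = 0.5, float("inf")
    for i in range(steps):
        t = i / (steps - 1)  # evenly spaced thresholds in [0, 1]
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:  # strict '<' keeps the lowest threshold on ties
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy validation set: a missed positive is five times as costly as a false alarm.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.55]
t, cost = pick_threshold(y_true, y_prob, fp_cost=1.0, fn_cost=5.0)
print(f"threshold={t:.2f} total_cost={cost:.1f}")
```

When false negatives are expensive, the chosen threshold tends to drop below 0.5 so that fewer positives are missed, at the price of more false positives.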

---

## 🧠 Under the hood

- **Features**
  - Word-level TF-IDF (n-grams of 1–3 words)  
  - Optional character-level TF-IDF (n-grams of 3–6 characters)  

- **Models**
  - Logistic Regression (balanced)
  - Random Forest
  - Gradient Boosting
  - Multinomial Naive Bayes  

- **Artifacts**
  - Saved under `models_sentiment_lab/`:
    - `vectorizers.joblib`, `models.joblib`, `results.joblib`, `metadata.joblib`  
  - Reused by Threshold, Compare, Error Analysis, and Deploy tabs
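
For orientation, the feature and model setup listed above could look roughly like this in scikit-learn. This is a sketch under assumptions, not the app's code: the library choice, the `build_pipeline` helper, the `analyzer="char"` setting, and `max_iter=1000` are all illustrative, and only the n-gram ranges and the balanced Logistic Regression come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_pipeline(use_char=True):
    """Word TF-IDF (1-3 n-grams) plus optional char TF-IDF (3-6 n-grams).

    Hypothetical helper; the app may wire its vectorizers differently.
    """
    features = [("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3)))]
    if use_char:
        features.append(
            ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 6)))
        )
    return Pipeline([
        ("tfidf", FeatureUnion(features)),  # concatenates the sparse matrices
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


# Toy data, invented for this example.
texts = [
    "a wonderful and moving film",
    "great acting and a fun story",
    "boring plot and terrible pacing",
    "a dull and awful movie",
]
labels = [1, 1, 0, 0]

model = build_pipeline().fit(texts, labels)
proba = model.predict_proba(["a great and fun film"])[0, 1]
print(f"P(positive) = {proba:.3f}")
```

Swapping the `clf` step for `RandomForestClassifier`, `GradientBoostingClassifier`, or `MultinomialNB` would cover the other listed models with the same features.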

---

## 🖥 Run locally

```bash
git clone https://github.com/tarekmasryo/advanced-ml-sentiment-lab.git
cd advanced-ml-sentiment-lab

python -m venv .venv
source .venv/bin/activate   # macOS/Linux
# Windows: .venv\Scripts\activate

pip install -r requirements.txt
streamlit run app.py
```

---

## 📄 License & credit

Code: **Apache 2.0**  
Space & dashboard by **Tarek Masryo** 🚀