tarekmasryo committed on
Commit d6b6802 · verified · 1 Parent(s): d0dfbad

Upload 9 files

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/deploy.png filter=lfs diff=lfs merge=lfs -text
+ assets/eda-hero.png filter=lfs diff=lfs merge=lfs -text
+ assets/error-analysis.png filter=lfs diff=lfs merge=lfs -text
+ assets/train-validation.png filter=lfs diff=lfs merge=lfs -text
+ src/data/IMDB[[:space:]]Dataset.csv filter=lfs diff=lfs merge=lfs -text
LICENSE.txt ADDED
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
README_space.md ADDED
@@ -0,0 +1,107 @@
---
sdk: streamlit
sdk_version: 1.50.0
---

# 🧪 Advanced ML Sentiment Lab

[![Streamlit](https://img.shields.io/badge/Powered%20by-Streamlit-FF4B4B)](https://streamlit.io/)<br>
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-orange.svg)](LICENSE)<br>
[![Made by Tarek Masryo](https://img.shields.io/badge/Made%20by-Tarek%20Masryo-blue)](https://github.com/tarekmasryo)

---

## 📌 Overview

Interactive **Streamlit + Plotly** app for **binary sentiment analysis**.

Upload any CSV with a **text column** and a **binary label** (sketched below), then:

- Run quick EDA on text lengths, tokens, and class balance
- Build TF-IDF word + optional char features
- Train multiple classical models (LogReg / RF / GB / Naive Bayes)
- Tune the decision threshold with **FP/FN business costs**
- Inspect misclassified samples and test arbitrary texts live

Works well with the classic **IMDB 50K Reviews** dataset, but is generic enough for product reviews, tickets, surveys, etc.

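For reference, a minimal sketch of the expected input shape (column and file names here are illustrative; you map the real ones in the sidebar):

```python
import pandas as pd

# Any CSV with one free-text column and one binary label column will do.
df = pd.DataFrame({
    "review": ["Loved every minute of it.", "Terrible pacing and a flat ending."],
    "sentiment": ["positive", "negative"],
})
df.to_csv("my_reviews.csv", index=False)  # ready to upload in the app
```
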
---

## 📊 Dashboard Preview

### EDA & KPIs
![EDA](assets/eda-hero.png)

### Train & Validation
![Train & Validation](assets/train-validation.png)

### Error Analysis
![Error Analysis](assets/error-analysis.png)

### Deploy & Interactive Prediction
![Deploy](assets/deploy.png)

---

## 🚀 How to use (in this Space)

1. **Load data**
   - Upload a CSV file
   - Or place `IMDB Dataset.csv` / `imdb.csv` in the Space and reload

2. **Map columns**
   - Choose the **text** column
   - Choose the **label** column and map which values are *positive* vs *negative*

3. **Train models**
   - Go to **“Train & Validation”**
   - Set TF-IDF options, pick models, click **Train models**

4. **Analyse & deploy**
   - Use **“Threshold & Cost”** to pick a business-aware threshold (see the sketch below)
   - Check **“Compare Models”** + **“Error Analysis”**
   - In **“Deploy”**, try any text and see the predicted sentiment + confidence bar

No data is stored server-side beyond the current session.

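A minimal sketch of the threshold selection behind **“Threshold & Cost”**, assuming validation labels `y_val` and positive-class probabilities `y_proba` are already in hand (the helper name is illustrative; the grid and cost formula mirror `compute_threshold_view` in `src/streamlit_app.py`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def pick_threshold(y_val, y_proba, cost_fp=1.0, cost_fn=5.0):
    """Return the grid threshold that minimises expected business cost."""
    best_t, best_cost = 0.5, float("inf")
    for t in np.linspace(0.05, 0.95, 37):  # same grid as the app
        tn, fp, fn, tp = confusion_matrix(y_val, (y_proba >= t).astype(int)).ravel()
        cost = fp * cost_fp + fn * cost_fn  # weighted cost of the two error types
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```
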
---

## 🧠 Under the hood

- **Features** *(sketched below)*
  - Word TF-IDF (1–3 n-grams)
  - Optional char TF-IDF (3–6 n-grams)

- **Models**
  - Logistic Regression (balanced)
  - Random Forest
  - Gradient Boosting
  - Multinomial Naive Bayes

- **Artifacts**
  - Saved under `models_sentiment_lab/`:
    - `vectorizers.joblib`, `models.joblib`, `results.joblib`, `metadata.joblib`
  - Reused by Threshold, Compare, Error Analysis, and Deploy tabs

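A minimal sketch of the feature pipeline, assuming a list of cleaned strings `texts` (this mirrors `build_advanced_features` in `src/streamlit_app.py`, with `min_df` relaxed for the toy input):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["loved it", "hated it", "best film in years", "worst film in years"]

word_vec = TfidfVectorizer(ngram_range=(1, 3), max_features=20000, min_df=1)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 6), max_features=20000, min_df=1)

# Word and char matrices are concatenated column-wise into one sparse matrix,
# which is what the classical models above are trained on.
X = hstack([word_vec.fit_transform(texts), char_vec.fit_transform(texts)])
```
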
---

## 🖥 Run locally

```bash
git clone https://github.com/tarekmasryo/advanced-ml-sentiment-lab.git
cd advanced-ml-sentiment-lab

python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate

pip install -r requirements.txt
streamlit run app.py
```
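
If the IMDB CSV lives somewhere else, `load_default_sentiment_dataset` in `src/streamlit_app.py` also checks a few environment variables before the default `data/` locations; a sketch of the programmatic equivalent (the path is a placeholder):

```python
import os

# DATA_PATH and CSV_PATH are also recognised; *_URL variants accept an http(s) URL.
os.environ["SENTIMENT_DATA_PATH"] = "/path/to/IMDB Dataset.csv"
```
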
---

## 📄 License & credit

Code: **Apache 2.0**
Space & dashboard by **Tarek Masryo** 🚀
assets/deploy.png ADDED

Git LFS Details

  • SHA256: a354f4908c975da0bac7dc923e808f90ad65fe8bd3e31a9546041f266338b0e9
  • Pointer size: 131 Bytes
  • Size of remote file: 583 kB
assets/eda-hero.png ADDED

Git LFS Details

  • SHA256: 81c87ec87a9df69d8581a99babb1ed035fcd44a1bdd42cc26f159d139f429f43
  • Pointer size: 131 Bytes
  • Size of remote file: 775 kB
assets/error-analysis.png ADDED

Git LFS Details

  • SHA256: 70b4aa910724f21d1f2d1cf3c32121435ac2f62ed40267b8576c2c85a8252233
  • Pointer size: 131 Bytes
  • Size of remote file: 494 kB
assets/train-validation.png ADDED

Git LFS Details

  • SHA256: 4aeb71071541ae4491d069fb0bc7d2898135e7b7bec322e209a56b9b47af2128
  • Pointer size: 131 Bytes
  • Size of remote file: 460 kB
requirements.txt CHANGED
@@ -1,3 +1,7 @@
- altair
- pandas
- streamlit
+ streamlit==1.51.0
+ pandas==2.3.3
+ numpy==2.3.5
+ scikit-learn==1.7.2
+ scipy>=1.15.2
+ plotly==5.24.0
+ joblib==1.4.2
src/data/IMDB Dataset.csv ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dfc447764f82be365fa9c2beef4e8df89d3919e3da95f5088004797d79695aa2
size 66212309
src/streamlit_app.py CHANGED
@@ -1,40 +1,1573 @@
- import altair as alt
  import numpy as np
  import pandas as pd
  import streamlit as st

- """
- # Welcome to Streamlit!

- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).

- In the meantime, below is an example of what you can do with just a few lines of code:
  """

- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
-
- indices = np.linspace(0, 1, num_points)
- theta = 2 * np.pi * num_turns * indices
- radius = indices
-
- x = radius * np.cos(theta)
- y = radius * np.sin(theta)
-
- df = pd.DataFrame({
-     "x": x,
-     "y": y,
-     "idx": indices,
-     "rand": np.random.randn(num_points),
- })
-
- st.altair_chart(alt.Chart(df, height=700, width=700)
-     .mark_point(filled=True)
-     .encode(
-         x=alt.X("x", axis=None),
-         y=alt.Y("y", axis=None),
-         color=alt.Color("idx", legend=None, scale=alt.Scale()),
-         size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
-     ))

1
+ # Advanced ML Sentiment Lab - Streamlit App
2
+
3
+ import warnings
4
+ warnings.filterwarnings("ignore")
5
+
6
+ import os
7
+ from pathlib import Path
8
+ from collections import Counter
9
+ from typing import List, Dict, Tuple, Optional
10
+ from urllib.parse import urlparse
11
+
12
  import numpy as np
13
  import pandas as pd
14
  import streamlit as st
15
+ import plotly.express as px
16
+ import plotly.graph_objects as go
17
 
18
+ from sklearn.model_selection import train_test_split
19
+ from sklearn.metrics import (
20
+ f1_score,
21
+ accuracy_score,
22
+ precision_score,
23
+ recall_score,
24
+ average_precision_score,
25
+ roc_auc_score,
26
+ roc_curve,
27
+ precision_recall_curve,
28
+ confusion_matrix,
29
+ )
30
+ from sklearn.feature_extraction.text import TfidfVectorizer
31
+ from sklearn.linear_model import LogisticRegression
32
+ from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
33
+ from sklearn.naive_bayes import MultinomialNB
34
+ from scipy.sparse import hstack
35
+ import joblib
36
+
37
+
38
+ # =========================================================
39
+ # App configuration
40
+ # =========================================================
41
+
42
+ st.set_page_config(
43
+ page_title="Advanced ML Sentiment Lab",
44
+ page_icon="🚀",
45
+ layout="wide",
46
+ initial_sidebar_state="expanded",
47
+ )
48
+
49
+ # Base paths (works locally and on Hugging Face Spaces)
50
+ BASE_DIR = Path(__file__).resolve().parent
51
+ MODELS_DIR = BASE_DIR / "models_sentiment_lab"
52
+ MODELS_DIR.mkdir(exist_ok=True)
53
+
54
+
55
+ # =========================================================
56
+ # Premium CSS (SaaS-style)
57
+ # =========================================================
58
+
59
+ APP_CSS = """
60
+ <style>
61
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap');
62
+
63
+ .stApp {
64
+ background: radial-gradient(circle at top, #151a2f 0, #020617 45%, #020617 100%);
65
+ color: #e5e7eb;
66
+ font-family: 'Inter', sans-serif;
67
+ }
68
+
69
+ .main .block-container {
70
+ max-width: 1600px;
71
+ padding-top: 1.5rem;
72
+ }
73
+
74
+ /* Hero */
75
+ .hero-premium {
76
+ padding: 30px 34px;
77
+ border-radius: 24px;
78
+ background: linear-gradient(
79
+ 135deg,
80
+ rgba(88, 80, 236, 0.35) 0%,
81
+ rgba(236, 72, 153, 0.22) 55%,
82
+ rgba(15, 23, 42, 0.98) 100%
83
+ );
84
+ border: 1px solid rgba(129, 140, 248, 0.55);
85
+ box-shadow:
86
+ 0 20px 70px rgba(15, 23, 42, 0.9),
87
+ 0 0 120px rgba(129, 140, 248, 0.4);
88
+ backdrop-filter: blur(16px);
89
+ margin-bottom: 26px;
90
+ }
91
+
92
+ .hero-title-pro {
93
+ font-size: 34px;
94
+ font-weight: 800;
95
+ letter-spacing: 0.02em;
96
+ background: linear-gradient(120deg, #e5e7eb 0%, #e0f2fe 30%, #f9a8d4 60%, #a5b4fc 100%);
97
+ -webkit-background-clip: text;
98
+ -webkit-text-fill-color: transparent;
99
+ margin-bottom: 10px;
100
+ }
101
+
102
+ .hero-subtitle-pro {
103
+ font-size: 15px;
104
+ color: #cbd5e1;
105
+ line-height: 1.7;
106
+ max-width: 840px;
107
+ }
108
+
109
+ .hero-badges {
110
+ margin-top: 16px;
111
+ display: flex;
112
+ flex-wrap: wrap;
113
+ gap: 8px;
114
+ }
115
+
116
+ /* Badges */
117
+ .badge-pill {
118
+ display: inline-flex;
119
+ align-items: center;
120
+ gap: 6px;
121
+ padding: 6px 14px;
122
+ border-radius: 999px;
123
+ font-size: 12px;
124
+ font-weight: 600;
125
+ background: radial-gradient(circle at top left, #6366f1, #8b5cf6);
126
+ color: #f9fafb;
127
+ box-shadow: 0 4px 16px rgba(129, 140, 248, 0.5);
128
+ }
129
+
130
+ .badge-soft {
131
+ background: linear-gradient(135deg, rgba(15, 23, 42, 0.96), rgba(55, 65, 81, 0.96));
132
+ border: 1px solid rgba(148, 163, 184, 0.6);
133
+ color: #cbd5e1;
134
+ box-shadow: none;
135
+ }
136
+
137
+ /* KPI cards */
138
+ .kpi-premium {
139
+ padding: 22px 20px;
140
+ border-radius: 20px;
141
+ background: radial-gradient(circle at top left, rgba(129, 140, 248, 0.16), rgba(15, 23, 42, 0.96));
142
+ border: 1px solid rgba(148, 163, 184, 0.5);
143
+ box-shadow: 0 14px 40px rgba(15, 23, 42, 0.9);
144
+ backdrop-filter: blur(12px);
145
+ transition: all 0.22s ease;
146
+ }
147
+
148
+ .kpi-premium:hover {
149
+ transform: translateY(-3px);
150
+ box-shadow: 0 20px 60px rgba(30, 64, 175, 0.7);
151
+ }
152
+
153
+ .kpi-icon {
154
+ font-size: 26px;
155
+ margin-bottom: 6px;
156
+ }
157
+
158
+ .kpi-label-pro {
159
+ font-size: 11px;
160
+ color: #94a3b8;
161
+ text-transform: uppercase;
162
+ letter-spacing: 0.12em;
163
+ font-weight: 600;
164
+ margin-bottom: 6px;
165
+ }
166
+
167
+ .kpi-value-pro {
168
+ font-size: 26px;
169
+ font-weight: 800;
170
+ background: linear-gradient(130deg, #e5e7eb 0%, #bfdbfe 40%, #c4b5fd 100%);
171
+ -webkit-background-clip: text;
172
+ -webkit-text-fill-color: transparent;
173
+ }
174
+
175
+ .kpi-trend {
176
+ font-size: 11px;
177
+ color: #22c55e;
178
+ margin-top: 2px;
179
+ }
180
 
181
+ /* Section headers */
182
+ .section-header-pro {
183
+ font-size: 22px;
184
+ font-weight: 800;
185
+ color: #e5e7eb;
186
+ margin: 18px 0 6px 0;
187
+ padding-bottom: 8px;
188
+ border-bottom: 1px solid rgba(148, 163, 184, 0.5);
189
+ }
190
 
191
+ .section-desc-pro {
192
+ font-size: 13px;
193
+ color: #9ca3af;
194
+ margin-bottom: 16px;
195
+ }
196
+
197
+ /* Tabs */
198
+ .stTabs [data-baseweb="tab-list"] {
199
+ gap: 6px;
200
+ background: radial-gradient(circle at top, rgba(15, 23, 42, 0.96), rgba(15, 23, 42, 0.9));
201
+ padding: 8px;
202
+ border-radius: 999px;
203
+ }
204
+
205
+ .stTabs [data-baseweb="tab"] {
206
+ border-radius: 999px;
207
+ padding: 8px 20px;
208
+ background: transparent;
209
+ color: #9ca3af;
210
+ font-size: 13px;
211
+ font-weight: 600;
212
+ }
213
+
214
+ .stTabs [aria-selected="true"] {
215
+ background: linear-gradient(135deg, #6366f1 0%, #8b5cf6 50%, #ec4899 100%);
216
+ color: #f9fafb !important;
217
+ box-shadow: 0 4px 16px rgba(129, 140, 248, 0.6);
218
+ }
219
+
220
+ /* Model cards */
221
+ .model-card {
222
+ padding: 18px 16px;
223
+ border-radius: 16px;
224
+ background: radial-gradient(circle at top left, rgba(15, 23, 42, 0.97), rgba(15, 23, 42, 0.94));
225
+ border: 1px solid rgba(148, 163, 184, 0.5);
226
+ margin-bottom: 12px;
227
+ }
228
+
229
+ .model-name {
230
+ font-size: 16px;
231
+ font-weight: 700;
232
+ margin-bottom: 10px;
233
+ }
234
+
235
+ .model-metrics {
236
+ display: grid;
237
+ grid-template-columns: repeat(4, minmax(0, 1fr));
238
+ gap: 8px;
239
+ }
240
+
241
+ .metric-box {
242
+ padding: 6px 8px;
243
+ border-radius: 10px;
244
+ background: rgba(30, 64, 175, 0.35);
245
+ }
246
+
247
+ .metric-label {
248
+ font-size: 10px;
249
+ color: #cbd5e1;
250
+ text-transform: uppercase;
251
+ letter-spacing: 0.08em;
252
+ }
253
+
254
+ .metric-value {
255
+ font-size: 14px;
256
+ font-weight: 700;
257
+ }
258
+
259
+ /* Info box */
260
+ .info-box {
261
+ padding: 14px 16px;
262
+ border-radius: 14px;
263
+ background: rgba(37, 99, 235, 0.14);
264
+ border-left: 4px solid #3b82f6;
265
+ margin: 10px 0 16px 0;
266
+ }
267
+
268
+ .info-box-title {
269
+ font-size: 13px;
270
+ font-weight: 700;
271
+ color: #93c5fd;
272
+ margin-bottom: 4px;
273
+ }
274
+
275
+ .info-box-text {
276
+ font-size: 12px;
277
+ color: #e5e7eb;
278
+ line-height: 1.6;
279
+ }
280
+
281
+ /* Threshold card */
282
+ .threshold-card {
283
+ padding: 18px;
284
+ border-radius: 18px;
285
+ background: radial-gradient(circle at top left, rgba(15, 23, 42, 0.97), rgba(15, 23, 42, 0.94));
286
+ border: 1px solid rgba(148, 163, 184, 0.5);
287
+ box-shadow: 0 12px 36px rgba(15, 23, 42, 0.9);
288
+ }
289
+
290
+ /* Prediction card */
291
+ .prediction-card {
292
+ padding: 20px 18px;
293
+ border-radius: 18px;
294
+ background: radial-gradient(circle at top left, rgba(15, 23, 42, 0.97), rgba(15, 23, 42, 0.95));
295
+ border: 1px solid rgba(129, 140, 248, 0.6);
296
+ box-shadow: 0 12px 40px rgba(30, 64, 175, 0.8);
297
+ margin-top: 8px;
298
+ }
299
+
300
+ .prediction-label {
301
+ font-size: 12px;
302
+ color: #9ca3af;
303
+ text-transform: uppercase;
304
+ letter-spacing: 0.12em;
305
+ }
306
+
307
+ .prediction-result {
308
+ font-size: 26px;
309
+ font-weight: 800;
310
+ margin: 6px 0;
311
+ }
312
+
313
+ .prediction-positive {
314
+ color: #22c55e;
315
+ }
316
+
317
+ .prediction-negative {
318
+ color: #f97373;
319
+ }
320
+
321
+ .prediction-confidence {
322
+ font-size: 14px;
323
+ color: #e5e7eb;
324
+ }
325
+
326
+ /* Progress bar */
327
+ .progress-bar {
328
+ width: 100%;
329
+ height: 8px;
330
+ border-radius: 999px;
331
+ background: rgba(15, 23, 42, 0.8);
332
+ overflow: hidden;
333
+ margin-top: 6px;
334
+ }
335
+
336
+ .progress-fill {
337
+ height: 100%;
338
+ border-radius: 999px;
339
+ background: linear-gradient(90deg, #22c55e 0%, #a3e635 50%, #facc15 100%);
340
+ }
341
+
342
+ /* Animation */
343
+ @keyframes pulse {
344
+ 0%, 100% { opacity: 1; }
345
+ 50% { opacity: 0.45; }
346
+ }
347
+
348
+ .loading-pulse {
349
+ animation: pulse 1.6s ease-in-out infinite;
350
+ }
351
+ </style>
352
  """
353
+ st.markdown(APP_CSS, unsafe_allow_html=True)
354
+
355
+
356
+ # =========================================================
357
+ # Utility functions
358
+ # =========================================================
359
+
360
+ def basic_clean(s: str) -> str:
361
+ import re, html
362
+ if not isinstance(s, str):
363
+ s = str(s)
364
+ s = html.unescape(s).lower()
365
+ s = re.sub(r"<br\s*/?>", " ", s)
366
+ s = re.sub(r"http\S+|www\S+", " ", s)
367
+ s = re.sub(r"[^a-z0-9\s']", " ", s)
368
+ s = re.sub(r"\s+", " ", s).strip()
369
+ return s
370
+
371
+
372
+ def _is_url(path: str) -> bool:
373
+ try:
374
+ parsed = urlparse(path)
375
+ return parsed.scheme in ("http", "https")
376
+ except Exception:
377
+ return False
378
+
379
+
380
+ @st.cache_data(show_spinner=True)
381
+ def load_default_sentiment_dataset() -> pd.DataFrame:
382
+ """
383
+ Try to automatically load IMDB Dataset from the repo or environment.
384
+ Priority:
385
+ 1) SENTIMENT_DATA_PATH / DATA_PATH / CSV_PATH env vars (file path)
386
+ 2) SENTIMENT_DATA_URL / DATA_URL / CSV_URL env vars (URL)
387
+ 3) data/IMDB Dataset.csv in common locations relative to this file.
388
+ """
389
+ # 1) Env path hints
390
+ env_path = None
391
+ for k in ("SENTIMENT_DATA_PATH", "DATA_PATH", "CSV_PATH"):
392
+ v = os.getenv(k)
393
+ if v:
394
+ env_path = v.strip()
395
+ break
396
+
397
+ env_url = None
398
+ for k in ("SENTIMENT_DATA_URL", "DATA_URL", "CSV_URL"):
399
+ v = os.getenv(k)
400
+ if v:
401
+ env_url = v.strip()
402
+ break
403
+
404
+ candidates: List[str] = []
405
+
406
+ if env_path:
407
+ candidates.append(env_path)
408
+ if env_url:
409
+ candidates.append(env_url)
410
+
411
+ rel_default = "data/IMDB Dataset.csv"
412
+ candidates.append(rel_default)
413
+
414
+ cwd = Path.cwd()
415
+ candidates.append(str(cwd / rel_default))
416
+
417
+ # When file is under src/data or repo/data
418
+ candidates.append(str(BASE_DIR / "data" / "IMDB Dataset.csv"))
419
+ candidates.append(str(BASE_DIR.parent / "data" / "IMDB Dataset.csv"))
420
+
421
+ # Directly next to the app
422
+ candidates.append(str(BASE_DIR / "IMDB Dataset.csv"))
423
+ candidates.append(str(BASE_DIR.parent / "IMDB Dataset.csv"))
424
+
425
+ tried: List[str] = []
426
+ last_err: Optional[Exception] = None
427
+
428
+ for src in candidates:
429
+ if not src or src in tried:
430
+ continue
431
+ tried.append(src)
432
+ try:
433
+ if _is_url(src):
434
+ df = pd.read_csv(src)
435
+ else:
436
+ p = Path(src)
437
+ if not p.exists():
438
+ continue
439
+ df = pd.read_csv(p)
440
+ if df is not None and not df.empty:
441
+ return df
442
+ except Exception as e:
443
+ last_err = e
444
+ continue
445
+
446
+ msg_lines = [
447
+ "Could not find dataset at 'data/IMDB Dataset.csv'. Tried:",
448
+ *[f"- {t}" for t in tried],
449
+ ]
450
+ if last_err is not None:
451
+ msg_lines.append(f"Last error: {last_err}")
452
+ raise FileNotFoundError("\n".join(msg_lines))
453
+
454
+
455
+ @st.cache_data(show_spinner=False)
456
+ def clean_df(
457
+ df: pd.DataFrame,
458
+ text_col: str,
459
+ label_col: str,
460
+ pos_label_str: str,
461
+ neg_label_str: str,
462
+ ) -> Tuple[pd.DataFrame, np.ndarray]:
463
+ out = df.copy()
464
+ out["text_raw"] = out[text_col].astype(str)
465
+ out["text_clean"] = out["text_raw"].map(basic_clean)
466
+ lab = out[label_col].astype(str)
467
+ y = np.where(lab == pos_label_str, 1, 0).astype(int)
468
+ return out, y
469
+
470
+
471
+ def build_advanced_features(
472
+ texts: List[str],
473
+ max_word_features: int,
474
+ use_char: bool,
475
+ char_max: int,
476
+ ):
477
+ word_vec = TfidfVectorizer(
478
+ ngram_range=(1, 3),
479
+ max_features=max_word_features,
480
+ min_df=2,
481
+ max_df=0.95,
482
+ )
483
+ Xw = word_vec.fit_transform(texts)
484
+
485
+ vecs = [word_vec]
486
+ mats = [Xw]
487
+
488
+ if use_char:
489
+ char_vec = TfidfVectorizer(
490
+ analyzer="char",
491
+ ngram_range=(3, 6),
492
+ max_features=char_max,
493
+ min_df=2,
494
+ )
495
+ Xc = char_vec.fit_transform(texts)
496
+ vecs.append(char_vec)
497
+ mats.append(Xc)
498
+
499
+ X_all = hstack(mats) if len(mats) > 1 else mats[0]
500
+ return X_all, tuple(vecs)
501
+
502
+
503
+ def train_multiple_models(X_train, y_train, models_config: Dict) -> Dict:
504
+ models = {}
505
+ for name, cfg in models_config.items():
506
+ if not cfg.get("enabled", False):
507
+ continue
508
+
509
+ if name == "Logistic Regression":
510
+ model = LogisticRegression(
511
+ C=cfg["C"],
512
+ max_iter=1000,
513
+ solver="liblinear",
514
+ n_jobs=-1,
515
+ class_weight="balanced",
516
+ random_state=42,
517
+ )
518
+ elif name == "Random Forest":
519
+ model = RandomForestClassifier(
520
+ n_estimators=cfg["n_estimators"],
521
+ max_depth=cfg["max_depth"],
522
+ min_samples_split=cfg["min_samples_split"],
523
+ n_jobs=-1,
524
+ class_weight="balanced",
525
+ random_state=42,
526
+ )
527
+ elif name == "Gradient Boosting":
528
+ model = GradientBoostingClassifier(
529
+ n_estimators=cfg["n_estimators"],
530
+ learning_rate=cfg["learning_rate"],
531
+ max_depth=cfg["max_depth"],
532
+ random_state=42,
533
+ )
534
+ elif name == "Naive Bayes":
535
+ model = MultinomialNB(alpha=cfg["alpha"])
536
+ else:
537
+ continue
538
+
539
+ model.fit(X_train, y_train)
540
+ models[name] = model
541
+
542
+ return models
543
+
544
+
545
+ def evaluate_model(model, X_val, y_val) -> Dict:
546
+ y_pred = model.predict(X_val)
547
+ try:
548
+ y_proba = model.predict_proba(X_val)[:, 1]
549
+ except Exception:
550
+ scores = model.decision_function(X_val)
551
+ y_proba = (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)
552
+
553
+ metrics = {
554
+ "accuracy": accuracy_score(y_val, y_pred),
555
+ "precision": precision_score(y_val, y_pred, zero_division=0),
556
+ "recall": recall_score(y_val, y_pred, zero_division=0),
557
+ "f1": f1_score(y_val, y_pred, zero_division=0),
558
+ "roc_auc": roc_auc_score(y_val, y_proba),
559
+ "pr_auc": average_precision_score(y_val, y_proba),
560
+ "y_pred": y_pred,
561
+ "y_proba": y_proba,
562
+ }
563
+ return metrics
564
+
565
+
566
+ def compute_threshold_view(
567
+ y_true: np.ndarray,
568
+ y_proba: np.ndarray,
569
+ threshold: float,
570
+ cost_fp: float,
571
+ cost_fn: float,
572
+ ) -> Tuple[Dict, pd.DataFrame]:
573
+ y_pred_thr = (y_proba >= threshold).astype(int)
574
+ tn, fp, fn, tp = confusion_matrix(y_true, y_pred_thr).ravel()
575
+
576
+ metrics = {
577
+ "threshold": threshold,
578
+ "accuracy": accuracy_score(y_true, y_pred_thr),
579
+ "precision": precision_score(y_true, y_pred_thr, zero_division=0),
580
+ "recall": recall_score(y_true, y_pred_thr, zero_division=0),
581
+ "f1": f1_score(y_true, y_pred_thr, zero_division=0),
582
+ "specificity": tn / (tn + fp + 1e-9),
583
+ "fp": int(fp),
584
+ "fn": int(fn),
585
+ "tp": int(tp),
586
+ "tn": int(tn),
587
+ }
588
+ metrics["cost"] = metrics["fp"] * cost_fp + metrics["fn"] * cost_fn
589
+
590
+ grid = np.linspace(0.05, 0.95, 37)
591
+ rows = []
592
+ for t in grid:
593
+ y_pred_g = (y_proba >= t).astype(int)
594
+ tn_g, fp_g, fn_g, tp_g = confusion_matrix(y_true, y_pred_g).ravel()
595
+ f1_g = f1_score(y_true, y_pred_g, zero_division=0)
596
+ cost_g = fp_g * cost_fp + fn_g * cost_fn
597
+ rows.append(
598
+ {
599
+ "threshold": t,
600
+ "f1": f1_g,
601
+ "fp": fp_g,
602
+ "fn": fn_g,
603
+ "cost": cost_g,
604
+ }
605
+ )
606
+ df_curve = pd.DataFrame(rows)
607
+ return metrics, df_curve
608
+
609
+
610
+ # =========================================================
611
+ # Sidebar & dataset loading
612
+ # =========================================================
613
+
614
+ st.sidebar.markdown("### 🚀 Advanced ML Sentiment Lab")
615
+ st.sidebar.markdown("---")
616
+
617
+ st.sidebar.markdown("### Dataset source")
618
+ dataset_mode = st.sidebar.radio(
619
+ "How do you want to provide the dataset?",
620
+ options=["Auto (IMDB from repo)", "Upload CSV"],
621
+ index=0,
622
+ )
623
+
624
+ df: Optional[pd.DataFrame] = None
625
+
626
+ if dataset_mode == "Upload CSV":
627
+ upload = st.sidebar.file_uploader(
628
+ "Upload CSV dataset",
629
+ type=["csv"],
630
+ help="Small custom datasets work best here.",
631
+ )
632
+ if upload is not None:
633
+ try:
634
+ df = pd.read_csv(upload)
635
+ except Exception as e:
636
+ st.sidebar.error(f"Could not read uploaded CSV: {e}")
637
+ else:
638
+ try:
639
+ df = load_default_sentiment_dataset()
640
+ except Exception as e:
641
+ st.markdown(
642
+ """
643
+ <div class="hero-premium">
644
+ <div class="hero-title-pro">Advanced ML Sentiment Lab</div>
645
+ <div class="hero-subtitle-pro">
646
+ Dataset could not be loaded automatically.
647
+ Make sure <code>data/IMDB Dataset.csv</code> exists in the repo
648
+ (or set SENTIMENT_DATA_PATH / DATA_PATH), or switch to "Upload CSV"
649
+ in the sidebar.
650
+ </div>
651
+ <div class="hero-badges">
652
+ <span class="badge-pill">Text + binary label</span>
653
+ <span class="badge-pill">TF-IDF word &amp; character features</span>
654
+ <span class="badge-soft">Threshold tuning with business cost</span>
655
+ <span class="badge-soft">Artifacts saved under models_sentiment_lab/</span>
656
+ </div>
657
+ </div>
658
+ """,
659
+ unsafe_allow_html=True,
660
+ )
661
+ st.error(f"Dataset error: {e}")
662
+ st.stop()
663
+
664
+ if df is None or df.empty:
665
+ st.error("No dataset available. Provide a CSV via the sidebar.")
666
+ st.stop()
667
+
668
+ all_cols = list(df.columns)
669
+
670
+ st.sidebar.markdown("### Column mapping")
671
+
672
+ # Guess text column
673
+ default_text_idx = 0
674
+ for i, c in enumerate(all_cols):
675
+ if str(c).lower() in ["review", "text", "comment", "content", "message", "body"]:
676
+ default_text_idx = i
677
+ break
678
+
679
+ text_col = st.sidebar.selectbox("Text column", all_cols, index=default_text_idx)
680
+
681
+ label_candidates = [c for c in all_cols if c != text_col]
682
+ if not label_candidates:
683
+ st.error("Dataset must have at least 2 columns (text + label).")
684
+ st.stop()
685
+
686
+ default_label_idx = 0
687
+ for i, c in enumerate(label_candidates):
688
+ if str(c).lower() in ["sentiment", "label", "target", "y", "class"]:
689
+ default_label_idx = i
690
+ break
691
+
692
+ label_col = st.sidebar.selectbox("Label column", label_candidates, index=default_label_idx)
693
+
694
+ label_values = df[label_col].astype(str).dropna().value_counts().index.tolist()
695
+ if len(label_values) < 2:
696
+ st.error("Label column must have at least 2 distinct values.")
697
+ st.stop()
698
+
699
+ st.sidebar.markdown("### Label mapping")
700
+ pos_label_str = st.sidebar.selectbox("Positive class (1)", label_values, index=0)
701
+ neg_label_str = st.sidebar.selectbox(
702
+ "Negative class (0)", label_values, index=1 if len(label_values) > 1 else 0
703
+ )
704
+
705
+ # Training sample size (to keep it fast)
706
+ st.sidebar.markdown("### Training subset")
707
+ max_train_rows = st.sidebar.slider(
708
+ "Max rows used for training",
709
+ min_value=5000,
710
+ max_value=50000,
711
+ value=10000,
712
+ step=5000,
713
+ help="Training uses a stratified subset to keep runtime under control.",
714
+ )
715
+
716
+ # =========================================================
717
+ # Data processing & dataset KPIs
718
+ # =========================================================
719
+
720
+ dfc, y = clean_df(
721
+ df,
722
+ text_col=text_col,
723
+ label_col=label_col,
724
+ pos_label_str=pos_label_str,
725
+ neg_label_str=neg_label_str,
726
+ )
727
+
728
+ n_rows = len(dfc)
729
+ n_pos = int((y == 1).sum())
730
+ n_neg = int((y == 0).sum())
731
+ pos_ratio = n_pos / max(1, n_rows)
732
+ avg_len = dfc["text_clean"].str.len().mean()
733
+ sample_vocab = len(set(" ".join(dfc["text_clean"].head(5000)).split()))
734
+
735
+ # =========================================================
736
+ # Hero + KPI cards
737
+ # =========================================================
738
+
739
+ st.markdown(
740
+ f"""
741
+ <div class="hero-premium">
742
+ <div class="hero-title-pro">Advanced ML Sentiment Lab</div>
743
+ <div class="hero-subtitle-pro">
744
+ Production-style sentiment analytics on <b>{n_rows:,}</b> samples.
745
+ Configure TF-IDF features, train multiple models, tune the decision threshold
746
+ under custom business costs, and inspect model errors.
747
+ </div>
748
+ <div class="hero-badges">
749
+ <span class="badge-pill">Text column: {text_col}</span>
750
+ <span class="badge-pill">Label column: {label_col}</span>
751
+ <span class="badge-soft">Binary labels: {pos_label_str} / {neg_label_str}</span>
752
+ </div>
753
+ </div>
754
+ """,
755
+ unsafe_allow_html=True,
756
+ )
757
+
758
+ k1, k2, k3, k4 = st.columns(4)
759
+ with k1:
760
+ st.markdown(
761
+ f"""
762
+ <div class="kpi-premium">
763
+ <div class="kpi-icon">📊</div>
764
+ <div class="kpi-label-pro">Total samples</div>
765
+ <div class="kpi-value-pro">{n_rows:,}</div>
766
+ <div class="kpi-trend">Cleaned for modeling</div>
767
+ </div>
768
+ """,
769
+ unsafe_allow_html=True,
770
+ )
771
+
772
+ with k2:
773
+ st.markdown(
774
+ f"""
775
+ <div class="kpi-premium">
776
+ <div class="kpi-icon">✅</div>
777
+ <div class="kpi-label-pro">Positive share</div>
778
+ <div class="kpi-value-pro">{pos_ratio*100:.1f}%</div>
779
+ <div class="kpi-trend">{n_pos:,} positive / {n_neg:,} negative</div>
780
+ </div>
781
+ """,
782
+ unsafe_allow_html=True,
783
+ )
784
+
785
+ with k3:
786
+ st.markdown(
787
+ f"""
788
+ <div class="kpi-premium">
789
+ <div class="kpi-icon">📝</div>
790
+ <div class="kpi-label-pro">Avg text length</div>
791
+ <div class="kpi-value-pro">{avg_len:.0f}</div>
792
+ <div class="kpi-trend">characters per record</div>
793
+ </div>
794
+ """,
795
+ unsafe_allow_html=True,
796
+ )
797
+
798
+ with k4:
799
+ st.markdown(
800
+ f"""
801
+ <div class="kpi-premium">
802
+ <div class="kpi-icon">📚</div>
803
+ <div class="kpi-label-pro">Sample vocabulary</div>
804
+ <div class="kpi-value-pro">{sample_vocab:,}</div>
805
+ <div class="kpi-trend">unique tokens (first 5k rows)</div>
806
+ </div>
807
+ """,
808
+ unsafe_allow_html=True,
809
+ )
810
+
811
+ # =========================================================
812
+ # Tabs
813
+ # =========================================================
814
+
815
+ tab_eda, tab_train, tab_threshold, tab_compare, tab_errors, tab_deploy = st.tabs(
816
+ ["EDA", "Train & Validation", "Threshold & Cost", "Compare Models", "Error Analysis", "Deploy"]
817
+ )
818
+
819
+ # =========================================================
820
+ # TAB 1: EDA
821
+ # =========================================================
822
+
823
+ with tab_eda:
824
+ st.markdown(
825
+ '<div class="section-header-pro">Exploratory data analysis</div>',
826
+ unsafe_allow_html=True,
827
+ )
828
+ st.markdown(
829
+ '<div class="section-desc-pro">Quick checks on class balance, text lengths, and token distribution.</div>',
830
+ unsafe_allow_html=True,
831
+ )
832
+
833
+ col1, col2 = st.columns(2)
834
+
835
+ with col1:
836
+ dfc["len_tokens"] = dfc["text_clean"].str.split().map(len)
837
+
838
+ fig_len = px.histogram(
839
+ dfc,
840
+ x="len_tokens",
841
+ nbins=50,
842
+ title="Token length distribution",
843
+ )
844
+ fig_len.update_layout(
845
+ plot_bgcolor="rgba(0,0,0,0)",
846
+ paper_bgcolor="rgba(0,0,0,0)",
847
+ font=dict(color="#e5e7eb"),
848
+ xaxis_title="Tokens per text",
849
+ yaxis_title="Count",
850
+ )
851
+ st.plotly_chart(fig_len, width="stretch")
852
+
853
+ dist_data = pd.DataFrame(
854
+ {
855
+ "Class": [neg_label_str, pos_label_str],
856
+ "Count": [n_neg, n_pos],
857
+ }
858
+ )
859
+ fig_class = px.pie(
860
+ dist_data,
861
+ values="Count",
862
+ names="Class",
863
+ title="Class distribution",
864
+ )
865
+ fig_class.update_layout(
866
+ plot_bgcolor="rgba(0,0,0,0)",
867
+ paper_bgcolor="rgba(0,0,0,0)",
868
+ font=dict(color="#e5e7eb"),
869
+ )
870
+ st.plotly_chart(fig_class, width="stretch")
871
+
872
+ with col2:
873
+ sample_size = min(10000, len(dfc))
874
+ cnt = Counter()
875
+ for t in dfc["text_clean"].sample(sample_size, random_state=42):
876
+ cnt.update(t.split())
877
+ top_tokens = pd.DataFrame(cnt.most_common(25), columns=["Token", "Frequency"])
878
+
879
+ fig_tokens = px.bar(
880
+ top_tokens,
881
+ x="Frequency",
882
+ y="Token",
883
+ orientation="h",
884
+ title="Top tokens (sample)",
885
+ )
886
+ fig_tokens.update_layout(
887
+ plot_bgcolor="rgba(0,0,0,0)",
888
+ paper_bgcolor="rgba(0,0,0,0)",
889
+ font=dict(color="#e5e7eb"),
890
+ showlegend=False,
891
+ yaxis={"categoryorder": "total ascending"},
892
+ )
893
+ st.plotly_chart(fig_tokens, width="stretch")
894
+
895
+ st.markdown("**Length statistics by class**")
896
+ st.dataframe(
897
+ dfc.groupby(label_col)["len_tokens"].describe().round(2),
898
+ width="stretch",
899
+ )
900
+
901
+ # =========================================================
902
+ # TAB 2: Train & Validation
903
+ # =========================================================
904
+
905
+ with tab_train:
906
+ st.markdown(
907
+ '<div class="section-header-pro">Multi-model training (single split)</div>',
908
+ unsafe_allow_html=True,
909
+ )
910
+ st.markdown(
911
+ '<div class="section-desc-pro">Configure TF-IDF, select models, then run a stratified train/validation split on a capped subset for fast turnaround.</div>',
912
+ unsafe_allow_html=True,
913
+ )
914
+
915
+ fe1, fe2, fe3 = st.columns(3)
916
+ with fe1:
917
+ max_word_features = st.slider(
918
+ "Max word features",
919
+ min_value=5000,
920
+ max_value=60000,
921
+ value=20000,
922
+ step=5000,
923
+ )
924
+ with fe2:
925
+ use_char = st.checkbox("Add character n-grams", value=True)
926
+ with fe3:
927
+ test_size = st.slider("Validation split (%)", 10, 40, 20, 5) / 100.0
928
+
929
+ st.markdown("---")
930
+ st.markdown("#### Model configuration")
931
+
932
+ models_config: Dict[str, Dict] = {}
933
+ mc1, mc2 = st.columns(2)
934
+
935
+ with mc1:
936
+ with st.expander("Logistic Regression", expanded=True):
937
+ en = st.checkbox("Enable Logistic Regression", value=True, key="lr_en_ultra")
938
+ C_val = st.slider(
939
+ "Regularization C", 0.1, 10.0, 2.0, 0.5, key="lr_C_ultra"
940
+ )
941
+ models_config["Logistic Regression"] = {"enabled": en, "C": C_val}
942
+
943
+ with st.expander("Random Forest"):
944
+ en = st.checkbox("Enable Random Forest", value=False, key="rf_en_ultra")
945
+ est = st.slider(
946
+ "n_estimators", 50, 300, 120, 50, key="rf_est_ultra"
947
+ )
948
+ depth = st.slider("max_depth", 5, 40, 18, 5, key="rf_depth_ultra")
949
+ split = st.slider(
950
+ "min_samples_split", 2, 20, 5, 1, key="rf_split_ultra"
951
+ )
952
+ models_config["Random Forest"] = {
953
+ "enabled": en,
954
+ "n_estimators": est,
955
+ "max_depth": depth,
956
+ "min_samples_split": split,
957
+ }
958
+
959
+ with mc2:
960
+ with st.expander("Gradient Boosting"):
961
+ en = st.checkbox("Enable Gradient Boosting", value=False, key="gb_en_ultra")
962
+ est = st.slider(
963
+ "n_estimators", 50, 300, 120, 50, key="gb_est_ultra"
964
+ )
965
+ lr = st.slider(
966
+ "learning_rate", 0.01, 0.3, 0.08, 0.01, key="gb_lr_ultra"
967
+ )
968
+ depth = st.slider("max_depth", 2, 8, 3, 1, key="gb_depth_ultra")
969
+ models_config["Gradient Boosting"] = {
970
+ "enabled": en,
971
+ "n_estimators": est,
972
+ "learning_rate": lr,
973
+ "max_depth": depth,
974
+ }
975
+
976
+ with st.expander("Naive Bayes"):
977
+ en = st.checkbox("Enable Naive Bayes", value=True, key="nb_en_ultra")
978
+ alpha = st.slider(
979
+ "alpha (smoothing)", 0.1, 3.0, 1.0, 0.1, key="nb_alpha_ultra"
980
+ )
981
+ models_config["Naive Bayes"] = {"enabled": en, "alpha": alpha}
982
+
983
+ st.markdown("---")
984
+
985
+ random_state = 42
986
+
987
+ if st.button("Train models", type="primary"):
988
+ enabled_models = [m for m, cfg in models_config.items() if cfg["enabled"]]
989
+ if not enabled_models:
990
+ st.warning("Enable at least one model before training.", icon="⚠️")
991
+ else:
992
+ progress = st.progress(0)
993
+ status = st.empty()
994
+
995
+ # Stratified subset for training
996
+ progress.progress(5)
997
+ status.markdown("Sampling rows for training (stratified)…")
998
+
999
+ n_total = len(dfc)
1000
+ train_rows = min(max_train_rows, n_total)
1001
+ indices = np.arange(n_total)
1002
+
1003
+ if train_rows < n_total:
1004
+ sample_idx, _ = train_test_split(
1005
+ indices,
1006
+ train_size=train_rows,
1007
+ stratify=y,
1008
+ random_state=random_state,
1009
+ )
1010
+ else:
1011
+ sample_idx = indices
1012
+
1013
+ df_train = dfc.iloc[sample_idx].copy()
1014
+ y_sample = y[sample_idx]
1015
+
1016
+ status.markdown("Cleaning and vectorising text…")
1017
+ progress.progress(20)
1018
+
1019
+ texts = df_train["text_clean"].tolist()
1020
+ X_all, vecs = build_advanced_features(
1021
+ texts,
1022
+ max_word_features=max_word_features,
1023
+ use_char=use_char,
1024
+ char_max=20000,
1025
+ )
1026
+
1027
+ status.markdown("Creating stratified train/validation split…")
1028
+ progress.progress(40)
1029
+
1030
+ local_idx = np.arange(len(df_train))
1031
+ train_loc, val_loc, y_train, y_val = train_test_split(
1032
+ local_idx,
1033
+ y_sample,
1034
+ test_size=test_size,
1035
+ stratify=y_sample,
1036
+ random_state=random_state,
1037
+ )
1038
+ X_train = X_all[train_loc]
1039
+ X_val = X_all[val_loc]
1040
+
1041
+ status.markdown("Training models…")
1042
+ progress.progress(65)
1043
+
1044
+ trained_models = train_multiple_models(X_train, y_train, models_config)
1045
+
1046
+ status.markdown("Evaluating models on validation set…")
1047
+ progress.progress(80)
1048
+
1049
+ all_results: Dict[str, Dict] = {}
1050
+ for name, model in trained_models.items():
1051
+ metrics = evaluate_model(model, X_val, y_val)
1052
+ all_results[name] = {"model": model, "metrics": metrics}
1053
+
1054
+ status.markdown("Saving artifacts…")
1055
+ progress.progress(92)
1056
+
1057
+ val_idx_global = df_train.index[val_loc]
1058
+
1059
+ joblib.dump(vecs, MODELS_DIR / "vectorizers.joblib")
1060
+ joblib.dump(trained_models, MODELS_DIR / "models.joblib")
1061
+ joblib.dump(all_results, MODELS_DIR / "results.joblib")
1062
+ joblib.dump(
1063
+ {
1064
+ "pos_label": pos_label_str,
1065
+ "neg_label": neg_label_str,
1066
+ "val_idx": val_idx_global,
1067
+ "y_val": y_val,
1068
+ "text_col": text_col,
1069
+ "label_col": label_col,
1070
+ },
1071
+ MODELS_DIR / "metadata.joblib",
1072
+ )
1073
+
1074
+ progress.progress(100)
1075
+ status.markdown("Training complete.")
1076
+
1077
+ st.success(f"Trained {len(trained_models)} model(s) on {len(df_train):,} rows.")
1078
+
1079
+ rows = []
1080
+ for name, res in all_results.items():
1081
+ m = res["metrics"]
1082
+ rows.append(
1083
+ {
1084
+ "Model": name,
1085
+ "Accuracy": f"{m['accuracy']:.4f}",
1086
+ "Precision": f"{m['precision']:.4f}",
1087
+ "Recall": f"{m['recall']:.4f}",
1088
+ "F1 (validation)": f"{m['f1']:.4f}",
1089
+ "ROC-AUC": f"{m['roc_auc']:.4f}",
1090
+ "PR-AUC": f"{m['pr_auc']:.4f}",
1091
+ }
1092
+ )
1093
+ res_df = pd.DataFrame(rows)
1094
+ st.markdown("#### Training summary")
1095
+ st.dataframe(res_df, width="stretch", hide_index=True)
1096
+
1097
+ # =========================================================
1098
+ # TAB 3: Threshold & Cost
1099
+ # =========================================================
1100
+
1101
+ with tab_threshold:
1102
+ st.markdown(
1103
+ '<div class="section-header-pro">Threshold tuning and business cost</div>',
1104
+ unsafe_allow_html=True,
1105
+ )
1106
+ st.markdown(
1107
+ '<div class="section-desc-pro">Pick a model, move the decision threshold, and inspect how metrics and expected cost change.</div>',
1108
+ unsafe_allow_html=True,
1109
+ )
1110
+
1111
+ results_path = MODELS_DIR / "results.joblib"
1112
+ meta_path = MODELS_DIR / "metadata.joblib"
1113
+
1114
+ if not results_path.exists() or not meta_path.exists():
1115
+ st.info("Train models in the previous tab to unlock threshold tuning.")
1116
+ else:
1117
+ all_results = joblib.load(results_path)
1118
+ metadata = joblib.load(meta_path)
1119
+ y_val = metadata["y_val"]
1120
+
1121
+ best_name = max(
1122
+ all_results.keys(),
1123
+ key=lambda n: all_results[n]["metrics"]["f1"],
1124
+ )
1125
+
1126
+ model_name = st.selectbox(
1127
+ "Model to analyse",
1128
+ options=list(all_results.keys()),
1129
+ index=list(all_results.keys()).index(best_name),
1130
+ )
1131
+
1132
+ metrics_base = all_results[model_name]["metrics"]
1133
+ y_proba = metrics_base["y_proba"]
1134
+
1135
+ col_thr, col_cost = st.columns([1.2, 1])
1136
+ with col_thr:
1137
+ threshold = st.slider(
1138
+ "Decision threshold for positive class",
1139
+ min_value=0.05,
1140
+ max_value=0.95,
1141
+ value=0.5,
1142
+ step=0.01,
1143
+ )
1144
+ with col_cost:
1145
+ cost_fp = st.number_input(
1146
+ "Cost of a false positive (FP)", min_value=0.0, value=1.0, step=0.5
1147
+ )
1148
+ cost_fn = st.number_input(
1149
+ "Cost of a false negative (FN)", min_value=0.0, value=5.0, step=0.5
1150
+ )
1151
+
1152
+ thr_metrics, df_curve = compute_threshold_view(
1153
+ y_true=y_val,
1154
+ y_proba=y_proba,
1155
+ threshold=threshold,
1156
+ cost_fp=cost_fp,
1157
+ cost_fn=cost_fn,
1158
+ )
1159
+
1160
+ c1, c2, c3, c4 = st.columns(4)
1161
+ with c1:
1162
+ st.metric("Accuracy", f"{thr_metrics['accuracy']:.4f}")
1163
+ with c2:
1164
+ st.metric("Precision", f"{thr_metrics['precision']:.4f}")
1165
+ with c3:
1166
+ st.metric("Recall", f"{thr_metrics['recall']:.4f}")
1167
+ with c4:
1168
+ st.metric("F1", f"{thr_metrics['f1']:.4f}")
1169
+
1170
+ c5, c6, c7, c8 = st.columns(4)
1171
+ with c5:
1172
+ st.metric("Specificity", f"{thr_metrics['specificity']:.4f}")
1173
+ with c6:
1174
+ st.metric("FP", thr_metrics["fp"])
1175
+ with c7:
1176
+ st.metric("FN", thr_metrics["fn"])
1177
+ with c8:
1178
+ st.metric("Total cost", f"{thr_metrics['cost']:.2f}")
1179
+
1180
+ st.markdown("##### F1 over threshold")
1181
+ fig_thr = px.line(
1182
+ df_curve,
1183
+ x="threshold",
1184
+ y="f1",
1185
+ title="F1 vs threshold",
1186
+ )
1187
+ fig_thr.update_layout(
1188
+ plot_bgcolor="rgba(0,0,0,0)",
1189
+ paper_bgcolor="rgba(0,0,0,0)",
1190
+ font=dict(color="#e5e7eb"),
1191
+ )
1192
+ st.plotly_chart(fig_thr, width="stretch")
1193
+
1194
+ fig_cost = px.line(
1195
+ df_curve,
1196
+ x="threshold",
1197
+ y="cost",
1198
+ title=f"Estimated cost (FP cost={cost_fp}, FN cost={cost_fn})",
1199
+ )
1200
+ fig_cost.update_layout(
1201
+ plot_bgcolor="rgba(0,0,0,0)",
1202
+ paper_bgcolor="rgba(0,0,0,0)",
1203
+ font=dict(color="#e5e7eb"),
1204
+ )
1205
+ st.plotly_chart(fig_cost, width="stretch")
1206
+
1207
+# =========================================================
+# TAB 4: Compare models
+# =========================================================
+
+with tab_compare:
+    st.markdown(
+        '<div class="section-header-pro">Model comparison</div>',
+        unsafe_allow_html=True,
+    )
+    st.markdown(
+        '<div class="section-desc-pro">Side-by-side comparison of metrics, ROC / PR curves, and confusion matrices.</div>',
+        unsafe_allow_html=True,
+    )
+
+    results_path = MODELS_DIR / "results.joblib"
+    meta_path = MODELS_DIR / "metadata.joblib"
+
+    if not results_path.exists() or not meta_path.exists():
+        st.info("Train models first to unlock comparison.")
+    else:
+        all_results = joblib.load(results_path)
+        metadata = joblib.load(meta_path)
+        y_val = metadata["y_val"]
+
+ st.markdown("#### Model cards")
1232
+ cols = st.columns(len(all_results))
1233
+ for (name, res), col in zip(all_results.items(), cols):
1234
+ m = res["metrics"]
1235
+ with col:
1236
+ st.markdown(
1237
+ f"""
1238
+ <div class="model-card">
1239
+ <div class="model-name">{name}</div>
1240
+ <div class="model-metrics">
1241
+ <div class="metric-box">
1242
+ <div class="metric-label">ACC</div>
1243
+ <div class="metric-value">{m['accuracy']:.3f}</div>
1244
+ </div>
1245
+ <div class="metric-box">
1246
+ <div class="metric-label">F1</div>
1247
+ <div class="metric-value">{m['f1']:.3f}</div>
1248
+ </div>
1249
+ <div class="metric-box">
1250
+ <div class="metric-label">ROC</div>
1251
+ <div class="metric-value">{m['roc_auc']:.3f}</div>
1252
+ </div>
1253
+ <div class="metric-box">
1254
+ <div class="metric-label">PR</div>
1255
+ <div class="metric-value">{m['pr_auc']:.3f}</div>
1256
+ </div>
1257
+ </div>
1258
+ </div>
1259
+ """,
1260
+ unsafe_allow_html=True,
1261
+ )
1262
+
1263
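+        # ROC is threshold-independent; with class imbalance the PR curve is
+        # usually the more honest view, so both are shown side by side.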
+        r1, r2 = st.columns(2)
+        with r1:
+            st.markdown("##### ROC curves")
+            fig_roc = go.Figure()
+            for name, res in all_results.items():
+                fpr, tpr, _ = roc_curve(y_val, res["metrics"]["y_proba"])
+                auc_score = res["metrics"]["roc_auc"]
+                fig_roc.add_trace(
+                    go.Scatter(
+                        x=fpr,
+                        y=tpr,
+                        mode="lines",
+                        name=f"{name} (AUC={auc_score:.3f})",
+                    )
+                )
+            fig_roc.add_trace(
+                go.Scatter(
+                    x=[0, 1],
+                    y=[0, 1],
+                    mode="lines",
+                    name="Random",
+                    line=dict(dash="dash", color="gray"),
+                )
+            )
+            fig_roc.update_layout(
+                xaxis_title="False positive rate",
+                yaxis_title="True positive rate",
+                plot_bgcolor="rgba(0,0,0,0)",
+                paper_bgcolor="rgba(0,0,0,0)",
+                font=dict(color="#e5e7eb"),
+            )
+            st.plotly_chart(fig_roc, width="stretch")
+
+        with r2:
+            st.markdown("##### Precision-Recall curves")
+            fig_pr = go.Figure()
+            for name, res in all_results.items():
+                prec, rec, _ = precision_recall_curve(
+                    y_val, res["metrics"]["y_proba"]
+                )
+                pr_auc = res["metrics"]["pr_auc"]
+                # No area fill here: with several overlaid curves, fill="tonexty"
+                # shades the region between successive models, which is misleading.
+                fig_pr.add_trace(
+                    go.Scatter(
+                        x=rec,
+                        y=prec,
+                        mode="lines",
+                        name=f"{name} (AUC={pr_auc:.3f})",
+                    )
+                )
+            fig_pr.update_layout(
+                xaxis_title="Recall",
+                yaxis_title="Precision",
+                plot_bgcolor="rgba(0,0,0,0)",
+                paper_bgcolor="rgba(0,0,0,0)",
+                font=dict(color="#e5e7eb"),
+            )
+            st.plotly_chart(fig_pr, width="stretch")
+
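+        # These matrices use the y_pred stored at training time, not the threshold
+        # tuned in the previous tab.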
+ st.markdown("##### Confusion matrices (validation set)")
1323
+ cm_cols = st.columns(len(all_results))
1324
+ for (name, res), col in zip(all_results.items(), cm_cols):
1325
+ m = res["metrics"]
1326
+ cm = confusion_matrix(y_val, m["y_pred"])
1327
+ fig_cm = px.imshow(
1328
+ cm,
1329
+ labels=dict(x="Predicted", y="Actual", color="Count"),
1330
+ x=[metadata["neg_label"], metadata["pos_label"]],
1331
+ y=[metadata["neg_label"], metadata["pos_label"]],
1332
+ text_auto=True,
1333
+ title=name,
1334
+ )
1335
+ fig_cm.update_layout(
1336
+ plot_bgcolor="rgba(0,0,0,0)",
1337
+ paper_bgcolor="rgba(0,0,0,0)",
1338
+ font=dict(color="#e5e7eb"),
1339
+ )
1340
+ with col:
1341
+ st.plotly_chart(fig_cm, width="stretch")
1342
+
1343
+# =========================================================
+# TAB 5: Error analysis
+# =========================================================
+
+with tab_errors:
+    st.markdown(
+        '<div class="section-header-pro">Error analysis</div>',
+        unsafe_allow_html=True,
+    )
+    st.markdown(
+        '<div class="section-desc-pro">Browse misclassified texts to see where the model struggles and how confident it was.</div>',
+        unsafe_allow_html=True,
+    )
+
+    results_path = MODELS_DIR / "results.joblib"
+    meta_path = MODELS_DIR / "metadata.joblib"
+
+    if not results_path.exists() or not meta_path.exists():
+        st.info("Train models first to unlock error analysis.")
+    else:
+        all_results = joblib.load(results_path)
+        metadata = joblib.load(meta_path)
+        y_val = metadata["y_val"]
+        val_idx = metadata["val_idx"]
+
+        best_name = max(
+            all_results.keys(),
+            key=lambda n: all_results[n]["metrics"]["f1"],
+        )
+        model_name = st.selectbox(
+            "Model to inspect",
+            options=list(all_results.keys()),
+            index=list(all_results.keys()).index(best_name),
+        )
+
+        m = all_results[model_name]["metrics"]
+        y_pred = m["y_pred"]
+        y_proba = m["y_proba"]
+
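+        # Rebuild a readable validation frame: raw text plus true/predicted labels,
+        # positive-class probability, and an FP/FN error type for filtering below.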
+        # Use .loc because val_idx holds labels from the original DataFrame index.
+        val_df = dfc.loc[val_idx].copy()
+        val_df["true_label"] = np.where(
+            y_val == 1, metadata["pos_label"], metadata["neg_label"]
+        )
+        val_df["pred_label"] = np.where(
+            y_pred == 1, metadata["pos_label"], metadata["neg_label"]
+        )
+        val_df["proba_pos"] = y_proba
+        val_df["correct"] = y_val == y_pred
+        val_df["error_type"] = np.where(
+            val_df["correct"],
+            "Correct",
+            np.where(y_val == 1, "False negative", "False positive"),
+        )
+
+        col_f1, col_f2 = st.columns([1, 1])
+        with col_f1:
+            only_errors = st.checkbox("Show only misclassified samples", value=True)
+        with col_f2:
+            sort_mode = st.selectbox(
+                "Sort by",
+                options=[
+                    "Most confident errors",
+                    "Least confident predictions",
+                    "Random",
+                ],
+            )
+
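+        # Confidence is the distance of p(positive) from the 0.5 decision boundary;
+        # confident mistakes are usually the most instructive failures to read first.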
+        df_view = val_df.copy()
+        if only_errors:
+            df_view = df_view[~df_view["correct"]]
+
+        if sort_mode == "Most confident errors":
+            df_view["conf"] = np.abs(df_view["proba_pos"] - 0.5)
+            df_view = df_view.sort_values("conf", ascending=False)
+        elif sort_mode == "Least confident predictions":
+            df_view["conf"] = np.abs(df_view["proba_pos"] - 0.5)
+            df_view = df_view.sort_values("conf", ascending=True)
+        else:
+            df_view = df_view.sample(frac=1, random_state=42)
+
+        top_n = st.slider("Rows to show", 10, 200, 50, 10)
+        cols_show = [
+            "text_raw",
+            "true_label",
+            "pred_label",
+            "proba_pos",
+            "error_type",
+        ]
+        st.dataframe(
+            df_view[cols_show].head(top_n),
+            width="stretch",
+        )
+
+# =========================================================
+# TAB 6: Deploy
+# =========================================================
+
+with tab_deploy:
+    st.markdown(
+        '<div class="section-header-pro">Deployment & interactive prediction</div>',
+        unsafe_allow_html=True,
+    )
+    st.markdown(
+        '<div class="section-desc-pro">Pick the best model, test arbitrary texts, and reuse the same logic in an API or batch job.</div>',
+        unsafe_allow_html=True,
+    )
+
+    models_path = MODELS_DIR / "models.joblib"
+    vecs_path = MODELS_DIR / "vectorizers.joblib"
+    results_path = MODELS_DIR / "results.joblib"
+    meta_path = MODELS_DIR / "metadata.joblib"
+
+    if not (
+        models_path.exists()
+        and vecs_path.exists()
+        and results_path.exists()
+        and meta_path.exists()
+    ):
+        st.info("Train models first to enable deployment.")
+    else:
+        models = joblib.load(models_path)
+        vecs = joblib.load(vecs_path)
+        all_results = joblib.load(results_path)
+        metadata = joblib.load(meta_path)
+
+        best_name = max(
+            all_results.keys(),
+            key=lambda n: all_results[n]["metrics"]["f1"],
+        )
+
+        model_choice = st.selectbox(
+            "Model for deployment",
+            options=["Best (by F1)"] + list(models.keys()),
+            index=0,
+        )
+
+        if model_choice == "Best (by F1)":
+            deploy_name = best_name
+            st.info(f"Using {best_name} (best F1 on validation).")
+        else:
+            deploy_name = model_choice
+
+        model = models[deploy_name]
+        word_vec = vecs[0]
+        char_vec = vecs[1] if len(vecs) > 1 else None
+
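+        # Keep the input text in session_state so the example buttons below can
+        # pre-fill the text area across Streamlit reruns.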
+ if "deploy_text" not in st.session_state:
1491
+ st.session_state["deploy_text"] = ""
1492
+
1493
+        c_in, c_out = st.columns([1.4, 1.1])
+        with c_in:
+            st.markdown("#### Input text")
+            example_col1, example_col2, example_col3 = st.columns(3)
+            with example_col1:
+                if st.button("Positive example"):
+                    st.session_state["deploy_text"] = (
+                        "Absolutely loved this. Great quality, fast delivery, and "
+                        "I would happily buy again."
+                    )
+            with example_col2:
+                if st.button("Mixed example"):
+                    st.session_state["deploy_text"] = (
+                        "Some parts were decent, but overall it felt overpriced and a bit disappointing."
+                    )
+            with example_col3:
+                if st.button("Negative example"):
+                    st.session_state["deploy_text"] = (
+                        "Terrible experience. Support was unhelpful and the product broke quickly."
+                    )
+
+            text_input = st.text_area(
+                "Write or paste any text",
+                height=160,
+                value=st.session_state["deploy_text"],
+            )
+            predict_btn = st.button("Predict sentiment")
+
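+        # Inference mirrors training: clean the text, transform with the fitted
+        # word (and optional char) vectorizers, then score with the chosen model.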
+        with c_out:
+            if predict_btn and text_input.strip():
+                clean_text = basic_clean(text_input)
+                Xw = word_vec.transform([clean_text])
+                if char_vec is not None:
+                    Xc = char_vec.transform([clean_text])
+                    X_test = hstack([Xw, Xc])
+                else:
+                    X_test = Xw
+
+                try:
+                    proba = float(model.predict_proba(X_test)[0, 1])
+                except AttributeError:
+                    # Fallback for models without predict_proba (e.g. linear SVMs):
+                    # map the decision score through a sigmoid. Min-max scaling a
+                    # single score would always collapse to 0.
+                    score = float(model.decision_function(X_test)[0])
+                    proba = 1.0 / (1.0 + float(np.exp(-score)))
+
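+                # Threshold at 0.5 and report confidence for the predicted class,
+                # so the displayed confidence is always at least 50%.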
+                label_int = int(proba >= 0.5)
+                label_str = (
+                    metadata["pos_label"] if label_int == 1 else metadata["neg_label"]
+                )
+                conf_pct = proba * 100.0 if label_int == 1 else (1.0 - proba) * 100.0
+
+                st.markdown(
+                    """
+                    <div class="prediction-card">
+                        <div class="prediction-label">Predicted sentiment</div>
+                    """,
+                    unsafe_allow_html=True,
+                )
+
+                cls = "prediction-positive" if label_int == 1 else "prediction-negative"
+                st.markdown(
+                    f'<div class="prediction-result {cls}">{label_str}</div>',
+                    unsafe_allow_html=True,
+                )
+                st.markdown(
+                    f'<div class="prediction-confidence">{conf_pct:.1f}% confidence</div>',
+                    unsafe_allow_html=True,
+                )
+
+                width_pct = int(conf_pct)
+
+                st.markdown(
+                    f"""
+                    <div class="progress-bar">
+                        <div class="progress-fill" style="width:{width_pct}%;"></div>
+                    </div>
+                    </div>
+                    """,
+                    unsafe_allow_html=True,
+                )