❤️ Heart Disease Prediction (scikit-learn Random Forest)
A classic machine learning model that predicts the likelihood of heart disease from structured patient medical attributes (tabular data).
This repository contains a joblib-serialized scikit-learn model trained and evaluated in an end-to-end Jupyter Notebook workflow.
Model Details
- Developed by: brej-29
- Model type: RandomForestClassifier (scikit-learn)
- Task: Binary classification (tabular)
- Output labels:
  - 0 → No heart disease
  - 1 → Heart disease present
- Saved artifact: heart_disease_model.joblib
- Training notebook: HeartDiseasePredictionProject.ipynb
- Source code / project repo: https://github.com/brej-29/Logicmojo-AIML-Assignments-heart-disease-prediction-ml
- License: MIT
Intended Use
Direct Use
- Educational / portfolio demonstration of an end-to-end ML pipeline:
  - EDA → modeling → hyperparameter tuning → evaluation → persistence
- Research prototyping and experimentation with classical ML on healthcare-like tabular data
Out-of-Scope Use (Important)
- Not for clinical diagnosis
- Not a medical device
- Not validated for real-world patient care
- Do not use this model as the sole basis for medical decisions.
Training Data
The model was trained on a tabular dataset included in the project repository as heart-disease.csv.
- Rows: 303
- Columns: 14
- Features: 13
- Target: 1 (target)
- Target distribution:
  - 1: 165
  - 0: 138
Features (Input Schema)
The model expects 13 columns:
| Feature | Description |
|---|---|
| age | Age in years |
| sex | Sex (commonly encoded as 1 = male, 0 = female) |
| cp | Chest pain type (categorical encoded as integers) |
| trestbps | Resting blood pressure |
| chol | Serum cholesterol |
| fbs | Fasting blood sugar (binary) |
| restecg | Resting ECG results (categorical encoded as integers) |
| thalach | Maximum heart rate achieved |
| exang | Exercise-induced angina (binary) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | Slope of peak exercise ST segment (categorical encoded as integers) |
| ca | Number of major vessels (categorical encoded as integers) |
| thal | Thalassemia category (categorical encoded as integers) |
Training Procedure
Data Split
- train_test_split(test_size=0.2)
- Randomness controlled via np.random.seed(42) in the notebook
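As a minimal sketch of the split described above, the snippet below seeds NumPy's global RNG and performs an 80/20 split. The DataFrame is a synthetic stand-in with random values (an assumption, used only so the example runs without heart-disease.csv); only the row count, column names, and split arguments come from this card.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seed NumPy's global RNG, as the notebook is described as doing.
np.random.seed(42)

# Synthetic stand-in for heart-disease.csv: 303 rows, 13 features, 1 target.
# Assumption: the values are random and carry no medical meaning.
feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
df = pd.DataFrame(np.random.rand(303, 13), columns=feature_names)
df["target"] = np.random.randint(0, 2, size=303)

X = df.drop(columns="target")
y = df["target"]

# With no random_state argument, train_test_split draws from the global
# NumPy RNG, so the seed above makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)  # (242, 13) (61, 13)
```

With 303 rows and test_size=0.2, scikit-learn rounds the test set up to 61 rows and leaves 242 for training.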
Candidate Models Explored
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Random Forest
Hyperparameter Tuning
- RandomizedSearchCV used to tune Random Forest
  - cv=5, n_iter=20
- Best Random Forest hyperparameters found in the notebook:
  - n_estimators=210
  - max_depth=3
  - min_samples_split=4
  - min_samples_leaf=19
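The tuning step can be sketched roughly as follows. The search space PARAM_DISTRIBUTIONS, the helper name tune_random_forest, and the random_state values are illustrative assumptions; this card does not list the grid the notebook actually searched. Only cv=5 and n_iter=20 are taken from the text above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space (assumption): chosen so that it contains the
# reported best values, not copied from the notebook.
PARAM_DISTRIBUTIONS = {
    "n_estimators": np.arange(10, 500, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}


def tune_random_forest(X, y, n_iter=20, cv=5, random_state=42):
    """Run a randomized hyperparameter search over a Random Forest.

    Returns the fitted RandomizedSearchCV object; the winning settings
    are available as `.best_params_` and the refit model as
    `.best_estimator_`.
    """
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=n_iter,   # number of sampled parameter combinations
        cv=cv,           # folds used to score each combination
        random_state=random_state,
    )
    search.fit(X, y)
    return search
```

Calling `tune_random_forest(X_train, y_train)` and reading `search.best_params_` would return one sampled combination; whether it matches the values listed above depends on the real grid, data, and seed.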
Final Model
The saved model (heart_disease_model.joblib) corresponds to:
RandomForestClassifier(n_estimators=210, max_depth=3, min_samples_split=4, min_samples_leaf=19)
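For illustration, a model with these hyperparameters can be fitted, saved with joblib, and reloaded as sketched below. The training data here is synthetic (make_classification), and random_state=42 and the temporary output directory are assumptions; the notebook's own persistence code may differ.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 303 rows, 13 features.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Hyperparameters as listed in this card; random_state is an assumption.
model = RandomForestClassifier(
    n_estimators=210,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=19,
    random_state=42,
)
model.fit(X, y)

# Persist to a temporary directory and load the artifact back.
path = os.path.join(tempfile.mkdtemp(), "heart_disease_model.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same = bool((model.predict(X) == reloaded.predict(X)).all())
print(same)
```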
Evaluation
Baseline Test Accuracy (single 80/20 split)
- KNN: ~0.689
- Logistic Regression: ~0.885
- Random Forest: ~0.836
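A baseline comparison of this kind can be sketched with a small helper that fits each candidate and scores it on the held-out split. The helper name fit_and_score, the default model settings, max_iter=1000, and the synthetic data are all assumptions, so the printed scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training split and return test accuracy by name."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy
    return scores


# Candidate models with mostly default settings (assumption).
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Synthetic stand-in data (assumption): 303 rows, 13 features.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = fit_and_score(models, X_train, X_test, y_train, y_test)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

On a 61-row test set each misclassified row moves accuracy by about 0.016, so single-split differences between models should be read with caution.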
Final Model Performance
- Test accuracy of the saved model after reloading: 0.87
Cross-Validated Metrics (5-fold mean): Final Random Forest
- Accuracy: 0.8479781421
- Precision: 0.8215873016
- Recall: 0.9272727273
- F1: 0.8705403543
Note: The notebook also visualizes confusion matrices and ROC curves for model comparison.
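The 5-fold means reported above can be computed along the following lines. This is a sketch on synthetic data; the helper name cross_validated_metrics and random_state=42 are assumptions, and the notebook may compute the metrics differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def cross_validated_metrics(model, X, y, cv=5):
    """Return the mean of each metric across the cross-validation folds."""
    metrics = {}
    for scoring in ("accuracy", "precision", "recall", "f1"):
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
        metrics[scoring] = float(np.mean(fold_scores))
    return metrics


# Synthetic stand-in data (assumption): 303 rows, 13 features.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Final hyperparameters from this card; random_state is an assumption.
final_model = RandomForestClassifier(
    n_estimators=210,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=19,
    random_state=42,
)

cv_metrics = cross_validated_metrics(final_model, X, y, cv=5)
for name, value in cv_metrics.items():
    print(f"{name}: {value:.4f}")
```

Because each metric is averaged over folds, the mean F1 need not equal the harmonic mean of the mean precision and mean recall.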
How to Use
1) Install dependencies
pip install scikit-learn joblib pandas numpy huggingface_hub
2) Load the model from Hugging Face Hub
from huggingface_hub import hf_hub_download
import joblib
import pandas as pd

# Replace with your HF repo id, e.g. "brej-29/heart-disease-prediction-rf"
repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"
model_path = hf_hub_download(
    repo_id=repo_id,
    filename="heart_disease_model.joblib",
)
model = joblib.load(model_path)

# Example input (values are placeholders; use correctly-encoded values)
sample = pd.DataFrame([{
    "age": 57,
    "sex": 1,
    "cp": 0,
    "trestbps": 120,
    "chol": 354,
    "fbs": 0,
    "restecg": 1,
    "thalach": 163,
    "exang": 1,
    "oldpeak": 0.6,
    "slope": 2,
    "ca": 0,
    "thal": 2,
}])

pred = model.predict(sample)[0]
proba = model.predict_proba(sample)[0, 1]  # probability of class "1"
print("Prediction:", int(pred))
print("P(heart disease):", float(proba))
Input Requirements
- Provide all 13 feature columns
- Ensure categorical features (cp, restecg, slope, ca, thal) follow the same integer encoding as used in training
- Numeric types should be valid numbers (int/float); no missing values
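These requirements can be checked before calling the model with a small helper like the one below. The name validate_input is hypothetical, and the column order is an assumption taken from the feature table in this card; it does not verify that categorical codes match the training encoding.

```python
import pandas as pd

# Feature names from this card; the order is assumed to match training.
FEATURE_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]


def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Check an input frame and return it with columns in the expected order.

    Raises ValueError if a feature column is missing, if any value is
    missing, or if any feature column is not numeric.
    """
    missing = [col for col in FEATURE_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")

    ordered = df[FEATURE_COLUMNS]

    if ordered.isna().any().any():
        raise ValueError("Input contains missing values")

    non_numeric = [
        col for col in FEATURE_COLUMNS
        if not pd.api.types.is_numeric_dtype(ordered[col])
    ]
    if non_numeric:
        raise ValueError(f"Non-numeric feature columns: {non_numeric}")

    return ordered
```

A typical call would be `model.predict(validate_input(sample))`, which fails early with a readable message instead of an error from inside scikit-learn.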
Bias, Risks, and Limitations
- Small dataset (303 rows): results may not generalize to broader populations
- Encoding-dependent: categorical values must match training conventions
- No clinical validation: metrics are from offline evaluation only
- False negatives are possible (missed risk): do not use for medical screening without rigorous validation
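To make the false-negative point concrete: recall is TP / (TP + FN), so a mean recall of about 0.927 corresponds to roughly 7% of positive cases being missed on average across folds. The sketch below counts false negatives from a confusion matrix; the labels are made-up toy values, not results from this model.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (assumption): 1 = heart disease present, 0 = not present.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# With labels=[0, 1], ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = recall_score(y_true, y_pred)
false_negative_rate = fn / (fn + tp)  # equals 1 - recall

print("False negatives:", fn)
print("Recall:", round(recall, 3))
print("False negative rate:", round(false_negative_rate, 3))
```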
Environmental Impact
Training and evaluation were performed using classical ML methods on a small tabular dataset and are expected to have minimal compute and carbon impact (CPU-only, short runtime).
Technical Specifications
- Framework: scikit-learn
- Model format: joblib (heart_disease_model.joblib)
- Inference type: CPU-friendly tabular prediction
Model Card Authors
- BrejBala
Contact
For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/Logicmojo-AIML-Assignments-heart-disease-prediction-ml
Evaluation results (self-reported, on heart-disease.csv included in the project repo)
- Test Accuracy: 0.870
- 5-fold CV Accuracy (mean): 0.848
- 5-fold CV Precision (mean): 0.822
- 5-fold CV Recall (mean): 0.927
- 5-fold CV F1 (mean): 0.871