❤️ Heart Disease Prediction (scikit-learn Random Forest)

A classical machine learning model that predicts the likelihood of heart disease from structured patient medical attributes (tabular data).
This repository contains a joblib-serialized scikit-learn model trained and evaluated in an end-to-end Jupyter Notebook workflow.

Model Details

Intended Use

Direct Use

  • Educational / portfolio demonstration of an end-to-end ML pipeline:
    • EDA → modeling → hyperparameter tuning → evaluation → persistence
  • Research prototyping and experimentation with classical ML on healthcare-like tabular data

Out-of-Scope Use (Important)

  • Not for clinical diagnosis
  • Not a medical device
  • Not validated for real-world patient care
  • Do not use this model as the sole basis for medical decisions.

Training Data

The model was trained on a tabular dataset included in the project repository as heart-disease.csv.

  • Rows: 303
  • Columns: 14
    • Features: 13
    • Target: 1 (target)
  • Target distribution:
    • 1: 165
    • 0: 138
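
For a quick sanity check of these numbers, the dataset can be loaded and inspected with pandas. This is a minimal sketch that assumes heart-disease.csv sits in the working directory:

import pandas as pd

# Load the project dataset (assumes heart-disease.csv is in the current directory)
df = pd.read_csv("heart-disease.csv")

print(df.shape)                      # expected: (303, 14)
print(df["target"].value_counts())   # expected: 1 -> 165, 0 -> 138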

Features (Input Schema)

The model expects 13 columns:

  • age: Age in years
  • sex: Sex (commonly encoded as 1 = male, 0 = female)
  • cp: Chest pain type (categorical, integer-encoded)
  • trestbps: Resting blood pressure (mm Hg)
  • chol: Serum cholesterol (mg/dl)
  • fbs: Fasting blood sugar (binary; 1 = above 120 mg/dl)
  • restecg: Resting ECG results (categorical, integer-encoded)
  • thalach: Maximum heart rate achieved
  • exang: Exercise-induced angina (binary)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: Slope of the peak exercise ST segment (categorical, integer-encoded)
  • ca: Number of major vessels (categorical, integer-encoded)
  • thal: Thalassemia category (categorical, integer-encoded)

Training Procedure

Data Split

  • train_test_split(test_size=0.2)
  • Randomness controlled via np.random.seed(42) in the notebook
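
A minimal sketch of this split, assuming the column names from the schema above (the notebook sets the seed globally rather than passing random_state):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart-disease.csv")

np.random.seed(42)  # global seed, as described above

X = df.drop("target", axis=1)  # 13 feature columns
y = df["target"]               # binary target (1 = disease, 0 = no disease)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)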

Candidate Models Explored

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Random Forest
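
A sketch of how this comparison is typically run on the single 80/20 split, continuing from the split sketch above; the estimator settings below are default assumptions, not the notebook's exact configuration:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Fit each candidate on the training split and report test-split accuracy
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, candidate in models.items():
    candidate.fit(X_train, y_train)
    print(name, candidate.score(X_test, y_test))  # mean accuracy on the test split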

Hyperparameter Tuning

  • RandomizedSearchCV used to tune Random Forest
    • cv=5, n_iter=20
  • Best Random Forest hyperparameters found in the notebook:
    • n_estimators=210
    • max_depth=3
    • min_samples_split=4
    • min_samples_leaf=19
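
A sketch of the randomized search with the cv and n_iter values listed above, continuing from the split sketch; the parameter distributions are illustrative assumptions rather than the notebook's exact grid:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space (the notebook's actual grid may differ)
param_distributions = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

rs_rf = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    cv=5,
    n_iter=20,
    verbose=1,
)
rs_rf.fit(X_train, y_train)

# Reported best: n_estimators=210, max_depth=3, min_samples_split=4, min_samples_leaf=19
print(rs_rf.best_params_)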

Final Model

The saved model (heart_disease_model.joblib) corresponds to:

  • RandomForestClassifier(n_estimators=210, max_depth=3, min_samples_split=4, min_samples_leaf=19)
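
A minimal sketch of refitting this configuration and persisting it with joblib, continuing from the split sketch above:

from joblib import dump
from sklearn.ensemble import RandomForestClassifier

# Refit the tuned configuration and serialize it to disk
final_rf = RandomForestClassifier(
    n_estimators=210,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=19,
)
final_rf.fit(X_train, y_train)

dump(final_rf, "heart_disease_model.joblib")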

Evaluation

Baseline Test Accuracy (single 80/20 split)

  • KNN: ~0.689
  • Logistic Regression: ~0.885
  • Random Forest: ~0.836

Final Model Performance

  • Test accuracy of the saved model after reloading: 0.87

Cross-Validated Metrics (5-fold mean) for the Final Random Forest

  • Accuracy: 0.848
  • Precision: 0.822
  • Recall: 0.927
  • F1: 0.871
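
These means can be reproduced with cross_val_score over the full dataset; a minimal sketch, continuing from the sketches above (final_rf, X, y):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated means for the tuned Random Forest
for scoring in ["accuracy", "precision", "recall", "f1"]:
    scores = cross_val_score(final_rf, X, y, cv=5, scoring=scoring)
    print(scoring, scores.mean())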

Note: The notebook also visualizes confusion matrices and ROC curves for model comparison.
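
A sketch of producing those plots with scikit-learn's display helpers (available in scikit-learn 1.0+); whether the notebook uses these exact helpers is an assumption:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Confusion matrix and ROC curve for the fitted final model on the test split
ConfusionMatrixDisplay.from_estimator(final_rf, X_test, y_test)
RocCurveDisplay.from_estimator(final_rf, X_test, y_test)
plt.show()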

How to Use

1) Install dependencies

  • pip install scikit-learn joblib pandas numpy huggingface_hub

2) Load the model from Hugging Face Hub

from huggingface_hub import hf_hub_download
import joblib
import pandas as pd

# Replace with your HF repo id, e.g. "brej-29/heart-disease-prediction-rf"
repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

model_path = hf_hub_download(
    repo_id=repo_id,
    filename="heart_disease_model.joblib"
)

model = joblib.load(model_path)

# Example input (values are placeholders; use correctly-encoded values)
sample = pd.DataFrame([{
    "age": 57,
    "sex": 1,
    "cp": 0,
    "trestbps": 120,
    "chol": 354,
    "fbs": 0,
    "restecg": 1,
    "thalach": 163,
    "exang": 1,
    "oldpeak": 0.6,
    "slope": 2,
    "ca": 0,
    "thal": 2
}])

pred = model.predict(sample)[0]
proba = model.predict_proba(sample)[0, 1]  # probability of class "1"

print("Prediction:", int(pred))
print("P(heart disease):", float(proba))

Input Requirements

  • Provide all 13 feature columns
  • Ensure categorical features (cp, restecg, slope, ca, thal) follow the same integer encoding as used in training
  • Numeric types should be valid numbers (int/float); no missing values
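
A small validation sketch that enforces these requirements before calling predict, reusing model and sample from the snippet above; validate_input is a hypothetical helper, not part of the saved model:

# Hypothetical helper: checks column presence, order, and missing values before predicting
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

def validate_input(frame):
    missing = set(EXPECTED_COLUMNS) - set(frame.columns)
    if missing:
        raise ValueError(f"Missing feature columns: {sorted(missing)}")
    if frame[EXPECTED_COLUMNS].isnull().any().any():
        raise ValueError("Input contains missing values")
    # Reorder columns to the training order before prediction
    return frame[EXPECTED_COLUMNS]

pred = model.predict(validate_input(sample))[0]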

Bias, Risks, and Limitations

  • Small dataset (303 rows) β†’ results may not generalize to broader populations
  • Encoding-dependent: categorical values must match training conventions
  • No clinical validation: metrics are from offline evaluation only
  • False negatives are possible (missed risk); do not use this model for medical screening without rigorous validation

Environmental Impact

Training and evaluation were performed using classical ML methods on a small tabular dataset and are expected to have minimal compute and carbon impact (CPU-only, short runtime).

Technical Specifications

  • Framework: scikit-learn
  • Model format: joblib (heart_disease_model.joblib)
  • Inference type: CPU-friendly tabular prediction

Model Card Authors

  • BrejBala

Contact

For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/Logicmojo-AIML-Assignments-heart-disease-prediction-ml

Evaluation results

  • Test Accuracy on heart-disease.csv (included in project repo): 0.870 (self-reported)
  • 5-fold CV Accuracy (mean) on heart-disease.csv (included in project repo): 0.848 (self-reported)
  • 5-fold CV Precision (mean) on heart-disease.csv (included in project repo): 0.822 (self-reported)
  • 5-fold CV Recall (mean) on heart-disease.csv (included in project repo): 0.927 (self-reported)
  • 5-fold CV F1 (mean) on heart-disease.csv (included in project repo): 0.871 (self-reported)