Exploring Heart Disease Predictions with Decision Trees ✨🌳

Published on Amaka’s Analytical Hub✨

Welcome back to my blog! If you're a data enthusiast or someone curious about using machine learning to tackle real-world problems, you’re in for a treat. In this post, we’ll dive deep into predicting heart disease using the powerful and intuitive Decision Tree Classifier. Ready to explore?


🔍 Why This Project?

Heart disease is a global health issue, with millions affected yearly. Predicting it early can save lives. This project aims to explore the predictive power of Decision Trees using the Indicators of Heart Disease dataset. With clear explanations and visualizations, I’ll show you how to:

  1. Understand the dataset: What factors indicate heart disease?

  2. Preprocess data: Clean, encode, and prepare the data.

  3. Build and fine-tune a Decision Tree: From selecting features to optimizing the tree.

  4. Visualize results: Making predictions and interpreting the tree.


🔠 Exploring the Dataset

The dataset we’re working with contains demographic, behavioral, and health-related information. Here are some key columns:

  • HeartDisease: Target variable (Yes/No).

  • BMI: Body Mass Index.

  • Smoking: Smoking habits (Yes/No).

  • GenHealth: General health status (Poor, Fair, Good, Very Good, Excellent).

Here’s a quick look:

import pandas as pd

df = pd.read_csv("heartDisease_2020_sampling.csv")
print(df.head())
HeartDisease    BMI    Smoking    GenHealth    ...
No              29.12  No         Very Good    ...
Yes             33.91  Yes        Good         ...

Size of the dataset: 306,000 rows × 18 columns
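
Before preprocessing, it helps to check the shape, the class balance of the target, and any missing values. Here is a minimal sketch on a tiny stand-in frame. The column names come from this post and the first two rows match the preview above, but the other values are made up for illustration. On the real data you would run the same calls on df.

```python
import pandas as pd

# Tiny stand-in frame (partly hypothetical values) with the post's column names
toy = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No", "No"],
    "BMI": [29.12, 33.91, 24.50, 27.80],
    "Smoking": ["No", "Yes", "No", "Yes"],
    "GenHealth": ["Very Good", "Good", "Excellent", "Fair"],
})

n_rows, n_cols = toy.shape                                       # (rows, columns)
class_share = toy["HeartDisease"].value_counts(normalize=True)   # fraction per class
missing = toy.isna().sum()                                       # missing values per column

print(n_rows, n_cols)
print(class_share)
print(missing)
```

The class share matters later: if one class dominates, plain accuracy can look good even for a weak model.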


🌐 Preprocessing the Data

Data cleaning is the secret sauce of great models. Let’s transform our dataset for analysis:

1. Label Encoding

Convert categorical features (like HeartDisease and GenHealth) into numerical values:

from sklearn.preprocessing import LabelEncoder

def label_encoder(df, columns):
    for col in columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    return df

# Apply encoding
df = label_encoder(df, ["HeartDisease", "GenHealth"])
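
One thing to keep in mind: LabelEncoder numbers the categories in sorted (alphabetical) order, so GenHealth would come out as Excellent=0, Fair=1, Good=2, Poor=3, Very Good=4, which is not the natural health order. Since a tree splits on thresholds, an explicit ordinal mapping may be a better fit. Here is a sketch of that alternative. It assumes the labels are spelled exactly as listed earlier in this post, so check df["GenHealth"].unique() first.

```python
import pandas as pd

# Assumed category labels, taken from the column description in this post
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
health_map = {label: rank for rank, label in enumerate(health_order)}

# Stand-in column for illustration
toy = pd.DataFrame({"GenHealth": ["Very Good", "Good", "Poor", "Excellent", "Fair"]})

# Map each label to its rank; any label missing from the map becomes NaN
toy["GenHealth_ordinal"] = toy["GenHealth"].map(health_map)
print(toy)
```

With this mapping, a split like "GenHealth_ordinal <= 1" cleanly means "Poor or Fair".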

2. One-Hot Encoding

For features with multiple categories, create binary columns:

df = pd.get_dummies(df, drop_first=True)
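
To see what get_dummies actually does with drop_first=True, here is a small sketch on a stand-in frame. The "Activity" column is hypothetical and only there to show a feature with more than two categories.

```python
import pandas as pd

toy = pd.DataFrame({
    "BMI": [29.12, 33.91, 24.50],
    "Smoking": ["No", "Yes", "No"],
    "Activity": ["Low", "High", "Medium"],  # hypothetical multi-category column
})

# Numeric columns pass through untouched; each text column becomes
# one binary column per category, minus the first (sorted) category
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

Dropping the first category avoids a redundant column: if a row is neither Activity_Low nor Activity_Medium, it must be High.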

🌳 Building a Decision Tree Classifier

Let’s build a model to predict whether someone has heart disease based on their features. Here’s the process:

Splitting the Dataset

We divide the data into training (80%) and testing (20%) sets:

from sklearn.model_selection import train_test_split

X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
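
If the target classes are imbalanced, a plain random split can leave the test set with a different share of positives than the training set. Passing stratify=y keeps the proportions the same in both. This is an optional tweak, not what the split above does, and the sketch below uses synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: 100 rows, 20% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # keep class shares equal
)

# Both means should match the overall positive rate
print(y_train.mean(), y_test.mean())
```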

Training the Model

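We’ll train a Decision Tree Classifier at every depth from 1 to 20 and keep the depth with the highest test accuracy. Limiting the depth curbs overfitting. Note that choosing the depth on the test set makes the reported test accuracy slightly optimistic: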

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

best_depth, best_test_acc = None, 0

for depth in range(1, 21):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)

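    # Evaluate on both sets; a widening train/test gap signals overfitting
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")

    # Keep the depth with the best test accuracy so far
    if test_acc > best_test_acc:
        best_depth, best_test_acc = depth, test_acc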
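
An alternative worth trying is to pick the depth with cross-validation on the training set only, then score the chosen tree once on the held-out test set. Here is a sketch using GridSearchCV. It runs on synthetic data from make_classification as a stand-in, so swap in the X and y from this post to use it for real.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 21))},
    cv=5,                 # 5-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)  # the test set plays no part in choosing the depth

best_depth = search.best_params_["max_depth"]
test_acc = search.score(X_test, y_test)  # best tree, scored once on held-out data
print(best_depth, round(test_acc, 3))
```

Because the test set is untouched during tuning, the final accuracy is a fairer estimate of how the tree would do on new patients.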

🔍 Results and Insights

After fine-tuning, here are the results:

  • Optimal Tree Depth: 7

  • Testing Accuracy: 87% 🎉

Here’s the confusion matrix for predictions:

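from sklearn.metrics import confusion_matrix

# Refit at the best depth; after the loop, clf is still the depth-20 tree
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
clf.fit(X_train, y_train)

cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)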

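Interpretation: In the matrix, rows are true labels and columns are predicted labels, so false negatives (patients with heart disease predicted as healthy) sit in the bottom-left cell. In my run, the model distinguished well between individuals with and without heart disease and kept false negatives low, which is a critical factor in medical predictions.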
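
To put numbers on false negatives, you can unpack the confusion matrix and compute recall and precision. The labels below are hypothetical and exist only to show the arithmetic. They are not results from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only (1 = heart disease, 0 = none)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of true cases the model catches
precision = tp / (tp + fp)   # share of positive predictions that are correct
print(tn, fp, fn, tp, recall, precision)
```

For a screening task, recall is usually the number to watch, since every false negative is a missed patient.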


📊 Visualizing the Decision Tree

Decision Trees are easy to interpret. Here’s how you can visualize yours:

from sklearn.tree import export_text

print(export_text(clf, feature_names=list(X.columns)))

For a graphical view of the tree:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=list(X.columns), class_names=["No", "Yes"], filled=True)
plt.show()

🌟 Fun Discoveries

Exploring the leaf nodes revealed some interesting patterns:

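leaf_nodes = clf.apply(X_test)  # index of the leaf each test patient lands in
leaf_id = pd.Series(leaf_nodes).value_counts().idxmax()  # e.g. the most populated leaf
patients_in_leaf = X_test[leaf_nodes == leaf_id]
print(patients_in_leaf.head())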
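
Rather than inspecting one leaf at a time, you can summarize every leaf at once: how many samples land in it and what share of them are positive. Here is a sketch on synthetic stand-in data. On the real project you would use the fitted clf with X_test and y_test instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

leaf_ids = clf.apply(X)  # index of the leaf each sample lands in

# One row per leaf: sample count and fraction of positives
summary = (
    pd.DataFrame({"leaf": leaf_ids, "positive": y})
    .groupby("leaf")["positive"]
    .agg(["size", "mean"])
    .rename(columns={"size": "n_samples", "mean": "positive_rate"})
    .sort_values("positive_rate", ascending=False)
)
print(summary)
```

Leaves with a high positive rate and a decent sample count are the ones worth tracing back to their split rules.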

Key Observations:

  1. BMI and GenHealth were consistent indicators in predicting heart disease.

  2. Patients with poor general health and high BMI were more likely to fall into the “positive” category.
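
One way to back up observations like these is the tree's feature_importances_ attribute, which scores how much each feature contributed to the splits. Here is a sketch on synthetic data with made-up feature names. On the real project you would read the importances from the fitted clf and index them by X.columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with made-up feature names
X_arr, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Features the tree never splits on get an importance of 0, so this list also shows which columns the model ignored.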


💡 Lessons Learned

  1. Preprocessing is Key: Encoding techniques like label and one-hot encoding made the dataset model-ready.

  2. Model Tuning: Experimenting with hyperparameters (like tree depth) is essential to prevent overfitting.

  3. Data Science is Fun: Especially when you uncover meaningful patterns that can make a difference in real life.


🌍 GitHub Repository

Want to explore the code? Check out the complete project on my GitHub Repository. Feel free to clone it, play around, and share your insights!


🙏 Thanks for Reading!

I’d love to hear your thoughts. Have you worked with Decision Trees before? Got ideas on how to improve this model? Let’s chat! Drop a comment below or connect with me on LinkedIn.