Published on Amaka’s Analytical Hub✨
Welcome back to my blog! If you're a data enthusiast or someone curious about using machine learning to tackle real-world problems, you’re in for a treat. In this post, we’ll dive deep into predicting heart disease using the powerful and intuitive Decision Tree Classifier. Ready to explore?
🔍 Why This Project?
Heart disease is a global health issue, with millions affected yearly. Predicting it early can save lives. This project aims to explore the predictive power of Decision Trees using the Indicators of Heart Disease dataset. With clear explanations and visualizations, I’ll show you how to:
Understand the dataset: What factors indicate heart disease?
Preprocess data: Clean, encode, and prepare the data.
Build and fine-tune a Decision Tree: From selecting features to optimizing the tree.
Visualize results: Making predictions and interpreting the tree.
🔠 Exploring the Dataset
The dataset we’re working with contains demographic, behavioral, and health-related information. Here are some key columns:
HeartDisease: Target variable (Yes/No).
BMI: Body Mass Index.
Smoking: Smoking habits (Yes/No).
GenHealth: General health status (Poor, Fair, Good, Very Good, Excellent).
Here’s a quick look:
import pandas as pd
df = pd.read_csv("heartDisease_2020_sampling.csv")
print(df.head())
HeartDisease | BMI | Smoking | GenHealth | ... |
No | 29.12 | No | Very Good | ... |
Yes | 33.91 | Yes | Good | ... |
Size of the dataset: 306,000 rows × 18 columns
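Before modeling, it helps to check how balanced the target is, because accuracy can look high on a skewed target even for a model that always answers "No". Here is a minimal sketch on a tiny made-up frame (the `toy` name and its values are illustrative assumptions, not rows from the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "HeartDisease": ["No", "No", "No", "Yes"],
    "BMI": [29.12, 24.50, 31.00, 33.91],
    "Smoking": ["No", "Yes", "No", "Yes"],
})

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # text columns show up as object and still need encoding

# Share of each class in the target column
class_share = toy["HeartDisease"].value_counts(normalize=True)
print(class_share)
```

Running the same three checks on `df` should show whether the sampled file is balanced or skewed toward "No".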
🌐 Preprocessing the Data
Data cleaning is the secret sauce of great models. Let’s transform our dataset for analysis:
1. Label Encoding
Convert categorical features (like HeartDisease and GenHealth) into numerical values:
from sklearn.preprocessing import LabelEncoder
def label_encoder(df, columns):
    # Replace each listed column with integer codes
    for col in columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    return df
# Apply encoding
df = label_encoder(df, ["HeartDisease", "GenHealth"])
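One thing to keep in mind: LabelEncoder numbers the classes in alphabetical order, so an ordered column like GenHealth does not come out ranked from worst to best. If you want the codes to follow the natural order, an explicit mapping is one option. The sketch below is only an illustration; the `health_order` dictionary is my own assumption, built from the five categories listed earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

health = pd.Series(["Poor", "Fair", "Good", "Very Good", "Excellent"])

# LabelEncoder sorts the class labels alphabetically before numbering them.
le = LabelEncoder()
alphabetical_codes = le.fit_transform(health).tolist()
print(dict(zip(health, alphabetical_codes)))

# Hypothetical explicit mapping that keeps the worst-to-best order.
health_order = {"Poor": 0, "Fair": 1, "Good": 2, "Very Good": 3, "Excellent": 4}
ordinal_codes = health.map(health_order).tolist()
print(dict(zip(health, ordinal_codes)))
```

A tree can split on either set of codes, but a threshold like "GenHealth <= 1.5" is much easier to read when the numbers follow the real ranking.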
2. One-Hot Encoding
For features with multiple categories, create binary columns:
df = pd.get_dummies(df, drop_first=True)
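To see what this call does, here is a small sketch on a made-up frame. The `AgeGroup` column name and all values are assumptions for illustration only. Numeric columns pass through untouched, while each text column is expanded into indicator columns, minus the first category because of `drop_first=True`.

```python
import pandas as pd

# Toy frame; "AgeGroup" is a hypothetical column used only for this example.
toy = pd.DataFrame({
    "BMI": [29.12, 33.91, 22.40],
    "Smoking": ["No", "Yes", "No"],
    "AgeGroup": ["18-24", "40-44", "65-69"],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Here Smoking collapses into a single Smoking_Yes column, and the three age groups become two columns, since the dropped category is implied when both are false.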
🌳 Building a Decision Tree Classifier
Let’s build a model to predict whether someone has heart disease based on their features. Here’s the process:
Splitting the Dataset
We divide the data into training (80%) and testing (20%) sets:
from sklearn.model_selection import train_test_split
X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
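If the target classes are uneven, a plain random split can leave the test set with a slightly different share of positives than the training set. Passing `stratify` to `train_test_split` keeps the ratio the same in both parts. A minimal sketch on made-up data (the 80/20 class split below is an assumption for illustration, not the real dataset's ratio):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives and 20 positives (made-up numbers).
X_toy = pd.DataFrame({"BMI": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in each part; stratification keeps them equal.
print(y_tr.mean(), y_te.mean())
```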
Training the Model
We’ll train a Decision Tree Classifier and find the optimal tree depth to avoid overfitting by testing depths from 1 to 20:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
best_depth, best_test_acc = None, 0
for depth in range(1, 21):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    # Evaluate performance
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    if test_acc > best_test_acc:
        best_depth, best_test_acc = depth, test_acc
# Refit at the best depth so clf is the tuned tree, not the depth-20 one
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
clf.fit(X_train, y_train)
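A caveat on this loop: it picks the depth by looking at test accuracy, so the reported test score is a little optimistic. One common alternative is to choose the depth with cross-validation on the training set and touch the test set only once at the end. The sketch below is not the code used for this post; it runs on synthetic data from `make_classification` so it works on its own, and the 5-fold setting is my assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training set chooses the depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Depth chosen by cross-validation:", search.best_params_["max_depth"])
# The test set is used only here, for the final estimate.
print("Held-out accuracy:", search.score(X_te, y_te))
```

By default GridSearchCV refits the best model on the full training set, so `search.best_estimator_` can be used like any fitted tree.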
🔍 Results and Insights
After fine-tuning, here are the results:
Optimal Tree Depth: 7
Testing Accuracy: 87% 🎉
Here’s the confusion matrix for predictions:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
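Interpretation: In scikit-learn’s confusion matrix, rows are the true classes and columns are the predicted ones, so cm[1, 0] counts the false negatives (patients with heart disease predicted as healthy). That cell matters most in medical predictions, since a missed case is costlier than a false alarm. In this run, the model separates the two groups well and keeps false negatives low.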
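To put a number on false negatives, you can unpack the four cells of a binary confusion matrix and compute recall, the share of actual positives the model catches. A small sketch with made-up labels (`y_true` and `y_pred` below are invented for illustration, not model output):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = heart disease, 0 = no heart disease.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true, y_pred)
# For binary labels, ravel() returns the cells in this order.
tn, fp, fn, tp = cm_demo.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Recall = TP / (TP + FN): how many true cases were caught.
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2f}")
```

On the real model, the same calls with `y_test` and `clf.predict(X_test)` would show whether the accuracy figure hides a low recall on the "Yes" class.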
📊 Visualizing the Decision Tree
Decision Trees are easy to interpret. Here’s how you can visualize yours:
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))
For a graphical view of the tree:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=list(X.columns), class_names=["No", "Yes"], filled=True)
plt.show()
🌟 Fun Discoveries
Exploring the leaf nodes revealed some interesting patterns:
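leaf_nodes = clf.apply(X_test)  # index of the leaf each test patient lands in
# Example choice: inspect the leaf that holds the most test patients
leaf_id = pd.Series(leaf_nodes).value_counts().idxmax()
patients_in_leaf = X_test[leaf_nodes == leaf_id]
print(patients_in_leaf.head())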
Key Observations:
BMI and GenHealth were consistent indicators in predicting heart disease.
Patients with poor general health and high BMI were more likely to fall into the “positive” category.
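One way to double-check observations like these is the tree's `feature_importances_` attribute, which scores how much each feature contributed to the splits. The sketch below uses synthetic data with placeholder column names (`f0` to `f3` are assumptions, not columns from the heart disease dataset), so it only shows the mechanics:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names are placeholders.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Importances are normalized, so they sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With the fitted `clf` and `X.columns` from this post, the same two lines would rank the real features.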
💡 Lessons Learned
Preprocessing is Key: Encoding techniques like label and one-hot encoding made the dataset model-ready.
Model Tuning: Experimenting with hyperparameters (like tree depth) is essential to prevent overfitting.
Data Science is Fun: Especially when you uncover meaningful patterns that can make a difference in real life.
🌍 GitHub Repository
Want to explore the code? Check out the complete project on my GitHub Repository. Feel free to clone it, play around, and share your insights!
🙏 Thanks for Reading!
I’d love to hear your thoughts. Have you worked with Decision Trees before? Got ideas on how to improve this model? Let’s chat! Drop a comment below or connect with me on LinkedIn.