Checking Python packages

# import os
# # Set working directory manually on Gadi to be able to load csv files
# user = os.getenv('USER')
# os.chdir('/scratch/cd82/'+user+'/notebooks/')
# !which python
# !where python
# Package installation if not already installed
# !pip install numpy
# !pip install scikit-learn
# import importlib
# import os

# packages = ['numpy', 'pandas', 'matplotlib', 'seaborn', 'sklearn', 'xgboost', 'shap']

# for pkg in packages:
#     try:
#         module = importlib.import_module(pkg)
#         path = os.path.dirname(module.__file__)
#         print(f"{pkg}: {path}")
#     except ImportError:
#         print(f"{pkg}: Not installed")

Decision Tree Basics

Decision Trees are supervised learning models used for both classification and regression tasks. They work by recursively splitting the dataset based on feature values to reduce impurity.

How It Works

  • For classification, trees use metrics like Gini impurity or Entropy to decide the best split.
  • For regression, they typically minimize Mean Squared Error (MSE).

The tree starts at a root and splits the data into branches based on feature thresholds, creating a path to a decision leaf.

🔍 How Splitting Works in Decision Trees

🧪 Classification: Gini Impurity and Entropy

To decide the best feature and threshold to split on, decision trees evaluate impurity at each possible split. Lower impurity means a better split.

✅ Gini Impurity

Gini measures how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution in the node:

\[ \text{Gini} = 1 - \sum_{i=1}^{C} p_i^2 \]

Where: - $ C $ is the number of classes
- $ p_i $ is the proportion of class $ i $

✅ Entropy (Information Gain)

Entropy measures the disorder or uncertainty of the classes:

\[ \text{Entropy} = - \sum_{i=1}^{C} p_i \log_2(p_i) \]

A split is chosen to minimize the weighted impurity (Gini or Entropy) of the resulting child nodes.


📈 Regression: Mean Squared Error (MSE)

In regression trees, the quality of a split is measured using Mean Squared Error, which calculates how far predictions are from actual values.

✅ MSE Formula

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

Where: - $ y_i $ are the true values
- $ {y} $ is the mean value of the current region
- $ n $ is the number of samples

The best split minimizes the total MSE across the child nodes.


🎯 Final Prediction

  • Classification Tree: predicts the majority class in a leaf.
  • Regression Tree: predicts the mean target value of samples in a leaf.

Key Hyperparameters

  • max_depth: Maximum number of splits down any path.
  • min_samples_split: Minimum samples needed to split a node.
  • min_samples_leaf: Minimum samples in a leaf node.
  • criterion: Splitting metric (gini, entropy, squared_error).

Importing packages

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, fetch_openml
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
import seaborn as sns
%matplotlib inline

Classification on Heart Disease dataset

# Load dataset 
df = pd.read_csv("heart.csv")

# Display dataframe
display(df.head())
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0

target (or sometimes named num in original datasets):

Binary classification:

0 → No heart disease

1 → Heart disease present

Column Name Description
age Age in years
sex Sex (1 = male; 0 = female)
cp Chest pain type (0–3, categorical: typical angina to asymptomatic)
trestbps Resting blood pressure (in mm Hg)
chol Serum cholesterol in mg/dl
fbs Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg Resting electrocardiographic results (0, 1, 2)
thalach Maximum heart rate achieved
exang Exercise-induced angina (1 = yes; 0 = no)
oldpeak ST depression induced by exercise relative to rest
slope Slope of the peak exercise ST segment (0–2)
ca Number of major vessels colored by fluoroscopy (0–3)
thal Thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect)
# show numerical columns
display(df.describe())
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
count 1025.000000 1025.000000 1025.000000 1025.000000 1025.00000 1025.000000 1025.000000 1025.000000 1025.000000 1025.000000 1025.000000 1025.000000 1025.000000 1025.000000
mean 54.434146 0.695610 0.942439 131.611707 246.00000 0.149268 0.529756 149.114146 0.336585 1.071512 1.385366 0.754146 2.323902 0.513171
std 9.072290 0.460373 1.029641 17.516718 51.59251 0.356527 0.527878 23.005724 0.472772 1.175053 0.617755 1.030798 0.620660 0.500070
min 29.000000 0.000000 0.000000 94.000000 126.00000 0.000000 0.000000 71.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 48.000000 0.000000 0.000000 120.000000 211.00000 0.000000 0.000000 132.000000 0.000000 0.000000 1.000000 0.000000 2.000000 0.000000
50% 56.000000 1.000000 1.000000 130.000000 240.00000 0.000000 1.000000 152.000000 0.000000 0.800000 1.000000 0.000000 2.000000 1.000000
75% 61.000000 1.000000 2.000000 140.000000 275.00000 0.000000 1.000000 166.000000 1.000000 1.800000 2.000000 1.000000 3.000000 1.000000
max 77.000000 1.000000 3.000000 200.000000 564.00000 1.000000 2.000000 202.000000 1.000000 6.200000 2.000000 4.000000 3.000000 1.000000
df.hist(figsize=(10,6))
plt.suptitle('Iris Feature Distributions')
plt.tight_layout()
plt.show()

# Load dataset 
df = pd.read_csv("heart.csv")

X = df.drop("target", axis=1)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

plt.figure(figsize=(20,10))
plot_tree(clf, feature_names=X.columns, class_names=["No Disease", "Disease"], filled=True)
plt.title("Decision Tree (max_depth=3)")
plt.show()

from sklearn.metrics import classification_report, ConfusionMatrixDisplay

from sklearn.metrics import accuracy_score

# Predict on test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Display
print(f"Accuracy: {accuracy:.2f}")

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.title("Normalized Confusion Matrix")
plt.show()
Accuracy: 0.81

Regression with California Housing

# Load regression dataset (California housing)
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# California Housing dataset
display(df.head())
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

MedHouseVal: Median house value (in hundreds of thousands of dollars). For example, a value of 2.5 means $250,000.

Feature Name Description
MedInc Median income in block group (in tens of thousands of dollars)
HouseAge Median age of houses in the block group
AveRooms Average number of rooms per household
AveBedrms Average number of bedrooms per household
Population Block group population
AveOccup Average number of household members
Latitude Latitude of the block group (geographic location)
Longitude Longitude of the block group (geographic location)
df.hist(figsize=(14,8))
plt.suptitle('California Housing Feature Distributions')
plt.show()

# Train a Decision Tree Regressor
X = housing.data
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=3, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
plt.figure(figsize=(12, 6))
plot_tree(reg, feature_names=housing.feature_names, filled=True)
plt.title('Decision Tree Regressor')
plt.show()

What is R² (Coefficient of Determination)?

The R² score measures how well the predictions of a regression model approximate the actual data. It is defined as:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]

Where: - $ y_i $ are the actual values, - $ _i $ are the predicted values, - $ {y} $ is the mean of the actual values.

Interpretation: - $ R^2 = 1 $: perfect prediction - $ R^2 = 0 $: model predicts no better than the mean - $ R^2 < 0 $: model performs worse than predicting the mean

from sklearn.metrics import r2_score

# Predict on test set
y_pred = reg.predict(X_test)

# Calculate R²
r2 = r2_score(y_test, y_pred)

# Display
print(f"R² Score: {r2:.2f}")
R² Score: 0.52