Content from 01 Introduction


Last updated on 2025-07-10

Introduction to Classification

What is Classification?

Classification is a type of supervised learning where the goal is to predict categorical class labels. Given input data, a classification model attempts to assign it to one of several predefined classes.

Some examples include:

  • Email spam detection (spam vs. not spam)
  • Disease diagnosis (positive vs. negative)
  • Image recognition (cat, dog, or other)

Workshop Goals

By the end of this workshop, you will be able to:

  • Understand common classification algorithms
  • Apply them using Scikit-Learn, NumPy, Pandas, and Matplotlib
  • Evaluate and optimise models

Topics Covered

  1. Logistic Regression
  2. Support Vector Machines (SVM)
  3. Model Evaluation: Accuracy, Precision, Recall, F1-Score, ROC-AUC
  4. Neural Networks (MLPClassifier)
  5. Random Forest Classifier
  6. Optimisation and Tuning

Required Libraries

We will use the following Python libraries throughout the workshop:

  • NumPy – numerical operations
  • Pandas – data manipulation
  • Scikit-Learn – machine learning models and tools
  • Matplotlib – data visualisation
  • Seaborn – statistical data visualisation


Let’s get started! 🚀

Installing Libraries

Uncomment and run the commands below only if packages are not installed.

PYTHON

# !pip install numpy
# !pip install pandas
# !pip install scikit-learn
# !pip install matplotlib
# !pip install seaborn

Check that your environment has the necessary libraries installed:

PYTHON

import numpy
print("NumPy version:", numpy.__version__)

import pandas
print("Pandas version:", pandas.__version__)

import sklearn
print("sklearn version:", sklearn.__version__)

import matplotlib
print("matplotlib version:", matplotlib.__version__)

import seaborn
print("seaborn version:", seaborn.__version__)

SH

NumPy version: 2.2.6
Pandas version: 2.2.3
sklearn version: 1.7.0
matplotlib version: 3.10.3
seaborn version: 0.13.2

PYTHON

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

Preview Example Dataset

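We use the Breast Cancer Wisconsin dataset, loaded with load_breast_cancer() from Scikit-Learn. It contains 569 samples with 30 numeric features extracted from breast mass images, plus a binary target (0 = malignant, 1 = benign).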

PYTHON

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X = data.data
y = data.target

df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

PYTHON

df.describe()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946 0.627417
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061 0.483918
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040 0.000000
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460 0.000000
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040 1.000000
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080 1.000000
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500 1.000000

8 rows × 31 columns

PYTHON

df.hist(bins=20, figsize=(15, 10))
plt.tight_layout()

[Figure: grid of histograms, one per column of df]

PYTHON

from sklearn.preprocessing import StandardScaler

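# Apply StandardScaler to every column. Note that this also rescales 'target',
# which is done here only to preview the effect; scale features only when modelling.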
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_scaled.describe()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
count 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 569.000000 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02
mean -1.373633e-16 6.868164e-17 -1.248757e-16 -2.185325e-16 -8.366672e-16 1.873136e-16 4.995028e-17 -4.995028e-17 1.748260e-16 4.745277e-16 1.248757e-17 -3.746271e-16 0.000000 -2.372638e-16 -3.371644e-16 7.492542e-17 2.247763e-16 2.622390e-16 -5.744282e-16 -4.995028e-17
std 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00
min -2.029648e+00 -2.229249e+00 -1.984504e+00 -1.454443e+00 -3.112085e+00 -1.610136e+00 -1.114873e+00 -1.261820e+00 -2.744117e+00 -1.819865e+00 -2.223994e+00 -1.693361e+00 -1.222423 -2.682695e+00 -1.443878e+00 -1.305831e+00 -1.745063e+00 -2.160960e+00 -1.601839e+00 -1.297676e+00
25% -6.893853e-01 -7.259631e-01 -6.919555e-01 -6.671955e-01 -7.109628e-01 -7.470860e-01 -7.437479e-01 -7.379438e-01 -7.032397e-01 -7.226392e-01 -7.486293e-01 -6.895783e-01 -0.642136 -6.912304e-01 -6.810833e-01 -7.565142e-01 -7.563999e-01 -6.418637e-01 -6.919118e-01 -1.297676e+00
50% -2.150816e-01 -1.046362e-01 -2.359800e-01 -2.951869e-01 -3.489108e-02 -2.219405e-01 -3.422399e-01 -3.977212e-01 -7.162650e-02 -1.782793e-01 -4.351564e-02 -2.859802e-01 -0.341181 -4.684277e-02 -2.695009e-01 -2.182321e-01 -2.234689e-01 -1.274095e-01 -2.164441e-01 7.706085e-01
75% 4.693926e-01 5.841756e-01 4.996769e-01 3.635073e-01 6.361990e-01 4.938569e-01 5.260619e-01 6.469351e-01 5.307792e-01 4.709834e-01 6.583411e-01 5.402790e-01 0.357589 5.975448e-01 5.396688e-01 5.311411e-01 7.125100e-01 4.501382e-01 4.507624e-01 7.706085e-01
max 3.971288e+00 4.651889e+00 3.976130e+00 5.250529e+00 4.770911e+00 4.568425e+00 4.243589e+00 3.927930e+00 4.484751e+00 4.910919e+00 3.885905e+00 4.287337e+00 5.930172 3.955374e+00 5.112877e+00 4.700669e+00 2.685877e+00 6.046041e+00 6.846856e+00 7.706085e-01

8 rows × 31 columns
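The std row above reads 1.000880 rather than exactly 1. A likely explanation is that StandardScaler divides by the population standard deviation (denominator n), while describe() reports the sample standard deviation (denominator n - 1). The sketch below reproduces the effect with the standard library only; the toy values are hypothetical and chosen purely for illustration.

```python
import math
import statistics

# Hypothetical toy column; any non-constant numeric values would do.
values = [6.98, 11.7, 13.37, 15.78, 28.11]

mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)      # divides by n, as StandardScaler does
scaled = [(v - mean) / pop_std for v in values]

n = len(scaled)
sample_std = statistics.stdev(scaled)    # divides by n - 1, as describe() does

# After standardizing, the sample std equals sqrt(n / (n - 1)), not 1.
print(sample_std, math.sqrt(n / (n - 1)))

# For the 569-row dataset the same ratio is about 1.00088.
print(math.sqrt(569 / 568))
```

The gap shrinks as n grows, so for practical purposes the scaled columns can be treated as having unit variance.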

Content from 02 Logistic Regression


Last updated on 2025-07-10

Logistic Regression with Breast Cancer Dataset

This notebook demonstrates how to use Logistic Regression, a fundamental classification algorithm, to predict whether a tumor is malignant or benign using the Breast Cancer Wisconsin dataset.

What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for binary classification.

It models the probability that an input \(\mathbf{x}\) belongs to class \(y=1\) using the logistic (sigmoid) function:

\[ P(y=1 | \mathbf{x}) = \frac{1}{1 + e^{- (\mathbf{w}^T \mathbf{x} + b)}} \]

The output is a probability between 0 and 1. We classify an observation as class 1 if the predicted probability exceeds a threshold (typically 0.5).
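To make the formula concrete, here is a minimal sketch of the sigmoid and the thresholding rule in plain Python. The weights, intercept, and sample are hypothetical values chosen for illustration; they are not taken from a fitted model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y=1 | x) for a single sample x, given weights w and intercept b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, intercept, and sample (illustration only).
w = [0.8, -1.2]
b = 0.1
x = [1.0, 0.5]

p = predict_proba(x, w, b)      # z = 0.8 - 0.6 + 0.1 = 0.3
label = int(p > 0.5)            # class 1 if the probability exceeds the threshold
print(f"P(y=1 | x) = {p:.3f}, predicted class = {label}")
```

Because the sigmoid equals 0.5 exactly at z = 0, thresholding the probability at 0.5 is the same as checking the sign of the linear score.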

Step 1: Load and Explore the Data

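We load the dataset with load_breast_cancer() from Scikit-Learn (30 numeric features extracted from breast mass images), hold out 30% of the samples as a test set, and standardize the features. The scaler is fitted on the training set only, so no information from the test set leaks into training.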

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train the Model

The data were already split and standardized in the code above, so here we only fit a logistic regression model to the training set.

PYTHON

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

Step 4: Evaluate the Model

Metrics used:

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Confusion Matrix
  • ROC Curve and AUC

PYTHON

from sklearn.metrics import (
    classification_report,
    roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9766081871345029
Precision: 0.981651376146789
Recall: 0.981651376146789
F1 Score: 0.981651376146789

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        62
           1       0.98      0.98      0.98       109

    accuracy                           0.98       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix is a table used to describe the performance of a classification model. For binary classification:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall (Sensitivity) = TP / (TP + FN)
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
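As a worked example, the formulas above can be applied to a set of counts. The counts below are inferred from the metrics and class supports printed earlier (62 samples of class 0, 109 of class 1); they are an assumption that is consistent with that output, not values read from the plot.

```python
# Counts inferred from the reported metrics (an assumption, see the text above).
tp, fn = 107, 2   # actual positives (class 1): 109 samples
fp, tn = 2, 60    # actual negatives (class 0): 62 samples

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

These counts reproduce the accuracy (0.9766) and the precision, recall, and F1 (0.9817) reported above. Note that Scikit-Learn orders confusion matrix rows and columns by sorted label, so for labels 0 and 1 its layout is [[TN, FP], [FN, TP]], which differs from the positive-first table shown here.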

PYTHON

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Confusion Matrix')
plt.show()

[Figure: confusion matrix for the logistic regression model on the test set]

What is an ROC Curve?

An ROC Curve (Receiver Operating Characteristic) plots:

  • True Positive Rate (Recall) on the Y-axis
  • False Positive Rate (1 - Specificity) on the X-axis

Each point on the curve corresponds to a different classification threshold. A perfect classifier passes through the top-left corner, where the false positive rate is 0 and the true positive rate is 1.

AUC (Area Under Curve) summarizes the ROC curve into a single value between 0 and 1:

  • AUC = 1: Perfect classifier
  • AUC = 0.5: No better than random guessing
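The idea of one ROC point per threshold can be sketched without any library. The helper names roc_points and trapezoid_auc and the four-sample toy data are hypothetical, introduced only for this illustration; Scikit-Learn's roc_curve and auc are what the workshop code actually uses.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thresh and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thresh and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical toy labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print("AUC =", trapezoid_auc(points))
```

On this toy data one negative sample outranks one positive sample, so the curve falls short of the top-left corner and the area comes out at 0.75.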

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

[Figure: ROC curve with the AUC in the legend; x-axis: False Positive Rate, y-axis: True Positive Rate]

Content from 03 Logistic Regression Optimization


Last updated on 2025-07-10

Optimising a Logistic Regression Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Logistic Regression model to improve performance.

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 2: Define a Function to Train and Evaluate

This function will:

  • Train the model
  • Return training and test accuracy

PYTHON

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

def train_and_evaluate(max_iter=5000, C=1, penalty='l2', solver='liblinear'):
    model = LogisticRegression(max_iter=max_iter, C=C, penalty=penalty, solver=solver)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc

Step 3: Explore Effect of C (Inverse of Regularization Strength)

C is the inverse of the regularization strength, so it controls how strongly large coefficients are penalized:

  • Smaller C → stronger regularization → helps prevent overfitting but may underfit.
  • Larger C → weaker regularization → fits the training data more closely but may overfit.
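One way to see regularization at work is to look at the size of the fitted coefficients rather than at accuracy. The sketch below assumes the same split settings used in this notebook and three illustrative values of C; the coef_norms dictionary is a hypothetical helper for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=31)
X_train = StandardScaler().fit_transform(X_train)

# Fit the default (L2-penalized) model at three illustrative values of C
# and record the Euclidean norm of the coefficient vector.
coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} ||w|| = {coef_norms[C]:.2f}")
```

The coefficient norm should grow as C increases, which is the weaker regularization described above.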

PYTHON

import matplotlib.pyplot as plt

values = [ 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000 ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(C=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('C (inverse of regularization strength)')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

[Figure: train and test accuracy versus C]

Step 4: Explore Effect of penalty (L1, L2 Regularization)

Penalty    Key Effect                     When to Use
L1         Sparsity, feature selection    You want simpler models or automatic feature selection
L2         Shrinkage, no sparsity         You want to reduce overfitting without dropping features
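The sparsity difference can be checked by counting coefficients that are exactly zero after fitting. This is a sketch under stated assumptions: it reuses the notebook's split settings and the liblinear solver, and picks C=0.1 purely for illustration; zero_counts is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=31)
X_train = StandardScaler().fit_transform(X_train)

# Count how many of the 30 coefficients each penalty drives exactly to zero.
zero_counts = {}
for penalty in ["l1", "l2"]:
    model = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
    model.fit(X_train, y_train)
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {model.coef_.size} coefficients are zero")
```

With L1 a number of features typically drop out entirely, while L2 keeps every feature with a shrunken weight.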

PYTHON

import matplotlib.pyplot as plt

values = [ 'l1', 'l2' ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(penalty=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('penalty')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

[Figure: train and test accuracy for the L1 and L2 penalties]

Step 5: Explore Effect of max_iter

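The effect of max_iter depends on whether the solver converges within the limit:

  • Too low: the model may not converge. You will see a convergence warning such as “STOP: TOTAL NO. OF F, G EVALUATIONS EXCEEDS LIMIT” or, with the liblinear solver used here, “Liblinear failed to converge, increase the number of iterations”. Model coefficients may be inaccurate or unstable.
  • Sufficient (or slightly high): the model converges properly. No warnings. Coefficients stabilize at optimal values.
  • Very high (but converges early): no harm to the result, because most solvers stop automatically once convergence is reached, before hitting max_iter. A very large limit only costs extra training time when the solver does not converge early, which matters mostly on very large datasets.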
PYTHON

import matplotlib.pyplot as plt

values = [3, 4, 5, 6, 7, 10, 100 ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(max_iter=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('max_iter')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

SH

C:\Users\moji1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\sklearn\svm\_base.py:1250: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(

[Figure: train and test accuracy versus max_iter]

Changing the Classification Threshold

Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:

  • If predicted probability ≥ 0.5 → classify as positive
  • Else → classify as negative

Changing the Threshold:

  • Lower threshold → more positives predicted → higher recall, more false positives
  • Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.
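The trade-off can be illustrated on a small hand-made example. The labels and probabilities below are hypothetical, constructed only so that the pattern is easy to follow; precision_recall_at is a helper introduced for this sketch.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall when predicting positive for probabilities >= threshold."""
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities (five positives, five negatives).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_proba = [0.95, 0.85, 0.65, 0.45, 0.35, 0.60, 0.40, 0.32, 0.25, 0.10]

results = {}
for threshold in [0.3, 0.5, 0.7]:
    results[threshold] = precision_recall_at(y_true, y_proba, threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision from 0.625 to 1.0 while recall falls from 1.0 to 0.4. Real data need not be this tidy: recall can only fall or stay level as the threshold rises, but precision may fluctuate.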

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Retrain model
manual_model = LogisticRegression()
manual_model.fit(X_train, y_train)

# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()

[Figure: confusion matrices at thresholds 0.3, 0.5, and 0.7, side by side]

Content from 04 Svm


Last updated on 2025-07-10

Support Vector Machine (SVM) with Breast Cancer Dataset

This notebook demonstrates the use of Support Vector Machines (SVM) for classifying tumors in the Breast Cancer dataset.

What is an SVM?

Support Vector Machines are powerful supervised learning models for classification. An SVM finds the hyperplane that best separates data points from two classes.

It maximizes the margin, which is the distance between the hyperplane and the nearest points from each class (support vectors).
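For a linear SVM the margin has a simple closed form: if the hyperplane is w·x + b = 0, the margin width is 2 / ||w||. The sketch below checks this on a hypothetical four-point dataset, using a linear kernel and a very large C to approximate a hard margin. Both choices are assumptions made for illustration and differ from the RBF model trained later.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 at x1 = 0, class 1 at x1 = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM.
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]               # coef_ is only available for the linear kernel
b = model.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", model.support_vectors_)
```

The two classes are 2 units apart along the first axis, so the widest possible margin is about 2, with the separating line halfway between them.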

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train an SVM Model

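We use the SVC class from sklearn.svm with its default kernel (RBF). Setting probability=True enables predict_proba, at the cost of slower training, because the probabilities are calibrated with an internal cross-validation.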

PYTHON

from sklearn.svm import SVC

model = SVC(probability=True)
model.fit(X_train, y_train)

#sk-container-id-1 { /* Definition of color scheme common for light and dark mode / –sklearn-color-text: #000; –sklearn-color-text-muted: #666; –sklearn-color-line: gray; / Definition of color scheme for unfitted estimators / –sklearn-color-unfitted-level-0: #fff5e6; –sklearn-color-unfitted-level-1: #f6e4d2; –sklearn-color-unfitted-level-2: #ffe0b3; –sklearn-color-unfitted-level-3: chocolate; / Definition of color scheme for fitted estimators */ –sklearn-color-fitted-level-0: #f0f8ff; –sklearn-color-fitted-level-1: #d4ebff; –sklearn-color-fitted-level-2: #b3dbfd; –sklearn-color-fitted-level-3: cornflowerblue;

/* Specific color for light theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, white))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-icon: #696969;

@media (prefers-color-scheme: dark) { /* Redefinition of color scheme for dark theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, #111))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-icon: #878787; } }

#sk-container-id-1 { color: var(–sklearn-color-text); }

#sk-container-id-1 pre { padding: 0; }

#sk-container-id-1 input.sk-hidden–visually { border: 0; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px); height: 1px; margin: -1px; overflow: hidden; padding: 0; position: absolute; width: 1px; }

#sk-container-id-1 div.sk-dashed-wrapped { border: 1px dashed var(–sklearn-color-line); margin: 0 0.4em 0.5em 0.4em; box-sizing: border-box; padding-bottom: 0.4em; background-color: var(–sklearn-color-background); }

#sk-container-id-1 div.sk-container { /* jupyter’s normalize.less sets [hidden] { display: none; } but bootstrap.min.css set [hidden] { display: none !important; } so we also need the !important here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */ display: inline-block !important; position: relative; }

#sk-container-id-1 div.sk-text-repr-fallback { display: none; }

div.sk-parallel-item, div.sk-serial, div.sk-item { /* draw centered vertical line to link estimators */ background-image: linear-gradient(var(–sklearn-color-text-on-default-background), var(–sklearn-color-text-on-default-background)); background-size: 2px 100%; background-repeat: no-repeat; background-position: center center; }

/* Parallel-specific style estimator block */

#sk-container-id-1 div.sk-parallel-item::after { content: ““; width: 100%; border-bottom: 2px solid var(–sklearn-color-text-on-default-background); flex-grow: 1; }

#sk-container-id-1 div.sk-parallel { display: flex; align-items: stretch; justify-content: center; background-color: var(–sklearn-color-background); position: relative; }

#sk-container-id-1 div.sk-parallel-item { display: flex; flex-direction: column; }

#sk-container-id-1 div.sk-parallel-item:first-child::after { align-self: flex-end; width: 50%; }

#sk-container-id-1 div.sk-parallel-item:last-child::after { align-self: flex-start; width: 50%; }

#sk-container-id-1 div.sk-parallel-item:only-child::after { width: 0; }

/* Serial-specific style estimator block */

#sk-container-id-1 div.sk-serial { display: flex; flex-direction: column; align-items: center; background-color: var(–sklearn-color-background); padding-right: 1em; padding-left: 1em; }

/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is clickable and can be expanded/collapsed. - Pipeline and ColumnTransformer use this feature and define the default style - Estimators will overwrite some part of the style using the sk-estimator class */

/* Pipeline and ColumnTransformer style (default) */

#sk-container-id-1 div.sk-toggleable { /* Default theme specific background. It is overwritten whether we have a specific estimator or a Pipeline/ColumnTransformer */ background-color: var(–sklearn-color-background); }

/* Toggleable label */ #sk-container-id-1 label.sk-toggleable__label { cursor: pointer; display: flex; width: 100%; margin-bottom: 0; padding: 0.5em; box-sizing: border-box; text-align: center; align-items: start; justify-content: space-between; gap: 0.5em; }

#sk-container-id-1 label.sk-toggleable__label .caption { font-size: 0.6rem; font-weight: lighter; color: var(–sklearn-color-text-muted); }

#sk-container-id-1 label.sk-toggleable__label-arrow:before { /* Arrow on the left of the label */ content: “▸”; float: left; margin-right: 0.25em; color: var(–sklearn-color-icon); }

#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before { color: var(–sklearn-color-text); }

/* Toggleable content - dropdown */

#sk-container-id-1 div.sk-toggleable__content { display: none; text-align: left; /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content pre { margin: 0.2em; border-radius: 0.25em; color: var(–sklearn-color-text); /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content.fitted pre { /* unfitted */ background-color: var(–sklearn-color-fitted-level-0); }

#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content { /* Expand drop-down */ display: block; width: 100%; overflow: visible; }

#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before { content: “▾”; }

/* Pipeline/ColumnTransformer-specific style */

#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { background-color: var(–sklearn-color-fitted-level-2); }

/* Estimator-specific style */

/* Colorize estimator box */ #sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label { /* unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }

#sk-container-id-1 div.sk-label label.sk-toggleable__label, #sk-container-id-1 div.sk-label label { /* The background is the default theme color */ color: var(–sklearn-color-text-on-default-background); }

/* On hover, darken the color of the background */ #sk-container-id-1 div.sk-label:hover label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }

/* Label box, darken color on hover, fitted */ #sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted { color: var(–sklearn-color-text); background-color: var(–sklearn-color-fitted-level-2); }

/* Estimator label */

#sk-container-id-1 div.sk-label label { font-family: monospace; font-weight: bold; display: inline-block; line-height: 1.2em; }

#sk-container-id-1 div.sk-label-container { text-align: center; }

/* Estimator-specific / #sk-container-id-1 div.sk-estimator { font-family: monospace; border: 1px dotted var(–sklearn-color-border-box); border-radius: 0.25em; box-sizing: border-box; margin-bottom: 0.5em; / unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-estimator.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }

/* on hover / #sk-container-id-1 div.sk-estimator:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-estimator.fitted:hover { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }

/* Specification for estimator info (e.g. “i” and “?”) */

/* Common style for “i” and “?” */

.sk-estimator-doc-link, a:link.sk-estimator-doc-link, a:visited.sk-estimator-doc-link { float: right; font-size: smaller; line-height: 1em; font-family: monospace; background-color: var(–sklearn-color-background); border-radius: 1em; height: 1em; width: 1em; text-decoration: none !important; margin-left: 0.5em; text-align: center; /* unfitted */ border: var(–sklearn-color-unfitted-level-1) 1pt solid; color: var(–sklearn-color-unfitted-level-1); }

.sk-estimator-doc-link.fitted, a:link.sk-estimator-doc-link.fitted, a:visited.sk-estimator-doc-link.fitted { /* fitted */ border: var(–sklearn-color-fitted-level-1) 1pt solid; color: var(–sklearn-color-fitted-level-1); }

/* On hover / div.sk-estimator:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover, div.sk-label-container:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-3); color: var(–sklearn-color-background); text-decoration: none; }

SVC(probability=True)

Parameters

C 1.0
kernel ‘rbf’
degree 3
gamma ‘scale’
coef0 0.0
shrinking True
probability True
tol 0.001
cache_size 200
class_weight None
verbose False
max_iter -1
decision_function_shape ‘ovr’
break_ties False
random_state None


Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9766081871345029
Precision: 0.972972972972973
Recall: 0.9908256880733946
F1 Score: 0.9818181818181818

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.97        62
           1       0.97      0.99      0.98       109

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

  • Accuracy, Precision, Recall, and F1 Score are all derived from this table.
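To make the link between the table and the metrics concrete, the sketch below recomputes the four scores from raw counts. The counts are not printed by the notebook; they are inferred from the reported precision, recall, and class supports (62 and 109), so treat them as an assumption.

```python
# Counts inferred from the printed report (an assumption, not notebook output)
tp, fn, fp, tn = 108, 1, 3, 59

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

With these counts the values agree with the scores printed above to four decimal places.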

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('SVM Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.
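Each point on an ROC curve comes from one classification threshold. The sketch below computes a few such points by hand on a small made-up set of labels and scores (illustrative values, not the breast cancer data); `roc_point` is a hypothetical helper name.

```python
import numpy as np

# Toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

def roc_point(threshold):
    """Return (FPR, TPR) when every score >= threshold is called positive."""
    y_hat = scores >= threshold
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)

# Lowering the threshold moves the point up and to the right
for t in [0.9, 0.5, 0.3, 0.0]:
    fpr, tpr = roc_point(t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

`roc_curve` performs the same sweep over every distinct score, which is why it returns arrays of FPR and TPR values.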

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM ROC Curve')
plt.legend()
plt.show()

Content from 05 Svm Optimization


Last updated on 2025-07-10

Optimising a Support Vector Machine (SVM) Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Support Vector Machine (SVM) classifier to improve classification performance.

We will use the Breast Cancer Wisconsin dataset to:

  • Train an initial SVM model.
  • Explore the effects of the most important hyperparameters:
      • C: regularization parameter controlling the trade-off between a wide margin (a simpler decision boundary) and low training error.
      • gamma: kernel coefficient controlling how far the influence of each training point reaches.
      • kernel: the type of decision boundary (linear or nonlinear), with options such as 'linear', 'rbf', and 'poly'.

Support Vector Machines are powerful models for classification tasks and can handle both linear and non-linear relationships through the use of kernel functions. By tuning these hyperparameters, we can significantly improve model performance.
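The sections below vary one hyperparameter at a time. As a complement, here is a minimal sketch of searching all three together with `GridSearchCV`. The grid values are illustrative assumptions, and wrapping the scaler in a pipeline is a choice made here so that scaling is refit inside each cross-validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefix
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],            # illustrative values
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

The test set is only scored once, after the search, so it still gives an unbiased estimate for the chosen configuration.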

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exploring the Effect of the C Hyperparameter

The C parameter in SVM controls the regularization strength:

  • Low C values → More regularization → Smoother decision boundary, possibly underfitting.
  • High C values → Less regularization → Fits the training data more closely, potentially overfitting.

We will train SVM models with different C values and observe how it affects accuracy.

PYTHON

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Test different values of C
C_values = [0.01, 0.1, 1, 10, 100]
train_scores = []
test_scores = []

for C in C_values:
    model = SVC(C=C, kernel='rbf', gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(C_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(C_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('C (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Parameter C on SVM')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the gamma Hyperparameter

The gamma parameter controls the influence of each individual training example:

  • Low gamma values → Far-reaching influence → Smoother decision boundary, possibly underfitting.
  • High gamma values → Very localized influence → Complex decision boundary, potentially overfitting.

We will now examine how varying gamma affects the model’s performance, while keeping C fixed.
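For the RBF kernel, the similarity between two points is K(a, b) = exp(-gamma · ||a - b||²). The sketch below evaluates it for two made-up points (illustrative values) to show how quickly the influence of a point fades as gamma grows, and checks the manual formula against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up points whose squared distance is 4
a = np.array([[0.0, 0.0]])
b = np.array([[2.0, 0.0]])

for gamma in [0.01, 0.1, 1, 10]:
    manual = np.exp(-gamma * np.sum((a - b) ** 2))
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(f"gamma={gamma:<5}  K(a, b) = {manual:.6f}  (sklearn: {library:.6f})")
```

With a small gamma the two points still count as similar; with a large gamma the kernel value is close to zero, so each training point only affects its immediate neighbourhood.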

PYTHON

# Test different values of gamma
gamma_values = [0.001, 0.01, 0.1, 1, 10]
train_scores = []
test_scores = []

for gamma in gamma_values:
    model = SVC(C=1, kernel='rbf', gamma=gamma)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(gamma_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(gamma_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('gamma (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of gamma on SVM')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the kernel Hyperparameter

The kernel parameter in SVM specifies the type of transformation applied to the input data:

  • 'linear' → No transformation; linear decision boundary.
  • 'rbf' (Radial Basis Function) → Maps data into higher dimensions for non-linear boundaries.
  • 'poly' → Polynomial transformations, allowing more flexible decision boundaries depending on degree.

We will now compare different kernels.

PYTHON

# Test different kernel types
kernel_types = ['linear', 'rbf', 'poly']
train_scores = []
test_scores = []

for kernel in kernel_types:
    model = SVC(C=1, kernel=kernel, gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Show numeric results
for k, tr, te in zip(kernel_types, train_scores, test_scores):
    print(f"Kernel: {k} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")

# Plot the results
import numpy as np

x = np.arange(len(kernel_types))
width = 0.35

plt.figure(figsize=(8, 6))
plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
plt.xticks(x, kernel_types)
plt.xlabel('Kernel Type')
plt.ylabel('Accuracy')
plt.title('Comparison of SVM Kernels')
plt.legend()
plt.grid(True, axis='y')
plt.show()

SH

Kernel: linear → Train Accuracy: 0.987, Test Accuracy: 0.988
Kernel: rbf → Train Accuracy: 0.985, Test Accuracy: 0.977
Kernel: poly → Train Accuracy: 0.910, Test Accuracy: 0.883

Changing the Classification Threshold

Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:

  • If predicted probability ≥ 0.5 → classify as positive
  • Else → classify as negative

Changing the Threshold:

  • Lower threshold → more positives predicted → higher recall, more false positives
  • Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.
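The trade-off can be checked numerically before plotting anything. The sketch below uses a small set of made-up labels and probabilities (not the breast cancer data) and reports precision and recall at three thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_proba = np.array([0.05, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.95])

results = {}
for thresh in [0.3, 0.5, 0.7]:
    y_pred = (y_proba >= thresh).astype(int)
    results[thresh] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    p, r = results[thresh]
    print(f"threshold={thresh}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 increases precision and lowers recall, which is the pattern described in the list above.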

We’ll now visualize how the confusion matrix changes across three different thresholds.

PYTHON

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # Three thresholds to compare

# Train SVM with probability estimates enabled
model = SVC(C=1, kernel='rbf', gamma='scale', probability=True)
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()

Content from 06 Model Evaluation


Last updated on 2025-07-10

Model Evaluation: Comparing Logistic Regression and SVM

We compare two classifiers:

  • Logistic Regression
  • SVM

Using metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, and ROC-AUC.

Step 1: Load the Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 2: Train the Models

PYTHON

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

lr_model = LogisticRegression(max_iter=5000)
lr_model.fit(X_train, y_train)

svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)

Step 3: Evaluation Metrics

What is Accuracy?

Accuracy is the proportion of correct predictions:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

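Accuracy can be misleading when classes are imbalanced: if 95% of samples are negative, a model that always predicts negative scores 95% accuracy while detecting no positives.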
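A minimal sketch of the imbalance problem, using synthetic labels (95 negatives and 5 positives, made up for illustration) and scikit-learn's `DummyClassifier` as a baseline that always predicts the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily imbalanced labels (not the breast cancer data)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))
print("Recall:", recall_score(y, y_pred))
```

The baseline looks strong on accuracy yet never finds a positive case, which is why precision, recall, and F1 are reported alongside accuracy below.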

PYTHON

from sklearn.metrics import accuracy_score

lr_acc = accuracy_score(y_test, lr_model.predict(X_test))
svm_acc = accuracy_score(y_test, svm_model.predict(X_test))

print(f"Logistic Regression Accuracy: {lr_acc:.2f}")
print(f"SVM Accuracy: {svm_acc:.2f}")

SH

Logistic Regression Accuracy: 0.98
SVM Accuracy: 0.98

What are Precision, Recall and F1 Score?

  • Precision: \(\frac{TP}{TP + FP}\)
  • Recall: \(\frac{TP}{TP + FN}\)
  • F1 Score: Harmonic mean of precision and recall

\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
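Because F1 is a harmonic mean, it stays low whenever either precision or recall is low. The sketch below compares it with the plain average for a few made-up precision/recall pairs; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs, not model output
for p, r in [(0.9, 0.9), (1.0, 0.5), (1.0, 0.1)]:
    mean = (p + r) / 2
    print(f"precision={p:.1f} recall={r:.1f}  average={mean:.2f}  F1={f1_from(p, r):.2f}")
```

A model with perfect precision but 10% recall still averages 0.55, while its F1 score is far lower, so F1 is harder to inflate with one good number.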

PYTHON

from sklearn.metrics import precision_score, recall_score, f1_score

for name, model in [("SVM", svm_model), ("Logistic Regression", lr_model)]:
    y_pred = model.predict(X_test)
    print(f"\n{name} Metrics:")
    print(f"Precision: {precision_score(y_test, y_pred):.2f}")
    print(f"Recall: {recall_score(y_test, y_pred):.2f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.2f}")

SH

SVM Metrics:
Precision: 0.97
Recall: 0.99
F1 Score: 0.98

Logistic Regression Metrics:
Precision: 0.98
Recall: 0.98
F1 Score: 0.98

What is a Confusion Matrix?

A confusion matrix shows the breakdown of correct and incorrect classifications.

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_estimator(svm_model, X_test, y_test, ax=axs[0])
axs[0].set_title("SVM Confusion Matrix")
ConfusionMatrixDisplay.from_estimator(lr_model, X_test, y_test, ax=axs[1])
axs[1].set_title("Logistic Regression Confusion Matrix")
plt.tight_layout()
plt.show()

What is the ROC Curve?

ROC = Receiver Operating Characteristic Curve

  • Plots the True Positive Rate (TPR) against the False Positive Rate (FPR)
  • AUC = Area Under the ROC Curve; the closer to 1, the better the model.
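One way to read the AUC: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this on made-up labels and scores (illustrative values) against scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; ties count as half
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
pairwise_auc = wins + 0.5 * ties

print("Pairwise ranking estimate:", round(float(pairwise_auc), 4))
print("roc_auc_score:            ", round(float(roc_auc_score(y_true, scores)), 4))
```

An AUC of 0.5 therefore means the ranking is no better than chance, which is what the dashed diagonal in the ROC plot represents.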

PYTHON

from sklearn.metrics import roc_curve, auc

svm_probs = svm_model.predict_proba(X_test)[:, 1]
lr_probs = lr_model.predict_proba(X_test)[:, 1]

svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
svm_auc = auc(svm_fpr, svm_tpr)
lr_auc = auc(lr_fpr, lr_tpr)

plt.figure(figsize=(8, 6))
plt.plot(svm_fpr, svm_tpr, label=f"SVM (AUC = {svm_auc:.2f})")
plt.plot(lr_fpr, lr_tpr, label=f"Logistic Regression (AUC = {lr_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

Conclusion

Both models perform well, but:

  • SVM achieves slightly higher recall on this split (0.99 vs 0.98)
  • Logistic Regression achieves slightly higher precision (0.98 vs 0.97)

Evaluation metrics guide us to choose the best model for our real-world use case.

Content from 07 Neural Networks


Last updated on 2025-07-10

Neural Network (MLPClassifier) with Breast Cancer Dataset

In this notebook, we will use a simple Multi-layer Perceptron (MLP) neural network to classify breast tumors.

What is an MLP?

An MLP is a type of feedforward neural network consisting of one or more hidden layers. Each neuron computes a weighted sum of its inputs and passes the result through a nonlinear activation function.

MLPs are suitable for classification tasks and are trained using backpropagation to minimize loss.
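To illustrate the "weighted sum plus activation" idea, here is a minimal NumPy sketch of one forward pass through a tiny network. The layer sizes and random weights are hypothetical; a real MLPClassifier learns its weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """ReLU activation (also the MLPClassifier default for hidden layers)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash a value into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical shapes: 4 input features, 3 hidden neurons, 1 output
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

hidden = relu(W1 @ x + b1)         # weighted sum, then nonlinearity
prob = sigmoid(W2 @ hidden + b2)   # output layer for binary classification

print("Hidden activations:", hidden)
print("P(class 1) =", float(prob[0]))
```

Backpropagation then adjusts `W1`, `b1`, `W2`, and `b2` to reduce the loss between such outputs and the true labels.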

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train an MLPClassifier

We use MLPClassifier from Scikit-Learn with one hidden layer.

PYTHON

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=42)
model.fit(X_train, y_train)

Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9824561403508771
Precision: 0.9732142857142857
Recall: 1.0
F1 Score: 0.9864253393665159

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98        62
           1       0.97      1.00      0.99       109

    accuracy                           0.98       171
   macro avg       0.99      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix shows how well the model distinguishes between classes:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

This matrix lets us compute metrics like accuracy, precision, recall, and F1 score.

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Neural Network Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate and False Positive Rate, and the AUC summarizes it in a single number: the closer to 1.0, the better the classifier.

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Neural Network ROC Curve')
plt.legend()
plt.show()

Content from 08 Neural Networks Optimization


Last updated on 2025-07-10

Optimising a Neural Network Classifier

In this notebook, we demonstrate how to tune hyperparameters in a neural network model to improve performance.

We will focus on:

  • hidden_layer_sizes
  • alpha (regularization)
  • learning_rate_init

We’ll also visualize how these parameters affect accuracy, and look for signs of overfitting or underfitting.

Step 1: Load Breast Cancer Data

Load with Normalisation

PYTHON

# from sklearn.datasets import load_breast_cancer
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# import pandas as pd

# # Load dataset
# data = load_breast_cancer()
# X = data.data
# y = data.target

# # Split dataset
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.3, random_state=31
# )

# # Normalize (Standardize) features
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

Load without Normalisation

To see the effects of different network sizes and other hyperparameters, we use the non-normalised dataset. On the easier task of classifying normalised data, most hyperparameter values give very good accuracy.
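Why the raw data is harder for a neural network: the features live on very different scales. The short sketch below, an illustration rather than part of the original notebook, compares per-feature standard deviations before and after `StandardScaler`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

raw_std = X.std(axis=0)
print(f"Raw feature std: min={raw_std.min():.4f}, max={raw_std.max():.1f}")

X_scaled = StandardScaler().fit_transform(X)
scaled_std = X_scaled.std(axis=0)
print(f"Scaled feature std: min={scaled_std.min():.2f}, max={scaled_std.max():.2f}")
```

Without scaling, the features with the largest spread dominate the weighted sums, so training depends much more on network size, regularization, and learning rate.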

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

Step 2: Define a Function to Train and Evaluate

This function will:

  • Train the MLP model
  • Return training and test accuracy

PYTHON

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

def train_and_evaluate(hidden_layer_sizes=(200, 400, 400, 200), alpha=0.0001, lr=0.001):
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                           alpha=alpha,
                           learning_rate_init=lr,
                           max_iter=2000,
                           random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc

Step 3: Explore Effect of hidden_layer_sizes

The number of neurons and layers controls the model’s capacity.

  • Too small: underfitting
  • Too large: overfitting

PYTHON

import matplotlib.pyplot as plt

values = [
    (50, 50),
    (50, 50, 50),
    (200, 400, 400, 200),
    (200, 400, 400, 400, 400, 200),
    (200, 400, 800, 800, 800, 400, 400, 200),
]
train_scores, test_scores = [], []

for s in values:
    tr, te = train_and_evaluate(hidden_layer_sizes=s)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Hidden Layer Sizes')
plt.ylabel('Accuracy')
plt.title('Effect of Network Size')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

Step 4: Explore Effect of alpha (L2 Regularization)

alpha prevents overfitting by penalising large weights.

  • Low alpha: can overfit
  • High alpha: can underfit

PYTHON

alphas =  [1e-1, 3e-1, 5e-1, 7e-1, 1e1]
train_scores, test_scores = [], []

for a in alphas:
    tr, te = train_and_evaluate(alpha=a)
    train_scores.append(tr)
    test_scores.append(te)

plt.semilogx(alphas, train_scores, marker='o', label='Train Acc')
plt.semilogx(alphas, test_scores, marker='s', label='Test Acc')
plt.xlabel('alpha (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Strength')
plt.legend()
plt.grid(True)
plt.show()

Step 5: Explore Effect of learning_rate_init

This controls how fast the model updates its weights.

  • Too small: slow convergence
  • Too large: may never converge
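Both failure modes show up even on a one-dimensional toy problem. The sketch below runs plain gradient descent on f(w) = w² with three illustrative learning rates. This is a simplification: MLPClassifier uses the Adam optimizer by default, but the intuition carries over. `gradient_descent` is a hypothetical helper name.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) starting from w0; return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# The minimum is at w = 0
for lr in [0.01, 0.4, 1.1]:
    print(f"lr={lr:<5} final w = {gradient_descent(lr):.4g}")
```

The smallest rate has barely moved after 20 steps, the middle one has essentially reached the minimum, and the largest one overshoots further on every step and diverges.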

PYTHON

lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
train_scores, test_scores = [], []

for lr in lrs:
    tr, te = train_and_evaluate(lr=lr)
    train_scores.append(tr)
    test_scores.append(te)

plt.plot(lrs, train_scores, marker='o', label='Train Acc')
plt.plot(lrs, test_scores, marker='s', label='Test Acc')
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate')
plt.legend()
plt.grid(True)
plt.show()

Conclusion

  • Neural networks are sensitive to hyperparameters
  • Use visualisation to find the sweet spot
  • Avoid overfitting by tuning alpha and hidden_layer_sizes
  • Don’t pick hyperparameters blindly: use grid search with cross-validation
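As a sketch of the last point, the example below compares two alpha values with 5-fold cross-validation on the training split only. The candidate values and the single 50-neuron hidden layer are illustrative assumptions, and, unlike the experiments above, the features are standardised inside a pipeline to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

results = {}
for alpha in [1e-4, 1e-1]:  # illustrative candidates
    pipe = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(50,), alpha=alpha,
                      max_iter=2000, random_state=42),
    )
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    results[alpha] = scores.mean()
    print(f"alpha={alpha}: mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Choosing between candidates on cross-validation scores keeps the test set untouched until the final evaluation.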

Changing the Classification Threshold

Most classifiers like neural networks output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:

  • If predicted probability ≥ 0.5 → classify as positive
  • Else → classify as negative

Changing the Threshold:

  • Lower threshold → more positives predicted → higher recall, more false positives
  • Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.

We’ll now visualize how the confusion matrix changes across three different thresholds.

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # Three thresholds to compare

# Retrain model
manual_model = MLPClassifier()
manual_model.fit(X_train, y_train)

# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()

SH

ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

Content from 09 Random Forest


Last updated on 2025-07-10

Random Forest Classifier with Breast Cancer Dataset

This notebook demonstrates the use of Random Forest Classifier for classifying tumors in the Breast Cancer dataset.

What is a Random Forest?

Random Forest is a powerful ensemble learning method for classification. It builds multiple decision trees and combines their predictions for improved accuracy and robustness.

Each tree is trained on a bootstrap sample of the training data and considers only a random subset of features at each split. This decorrelates the trees, which reduces overfitting and improves generalization.
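The idea can be sketched by hand with individual decision trees. The code below is a simplified illustration, not how `RandomForestClassifier` is implemented: it draws bootstrap samples, fits 25 trees (an arbitrary choice) with random feature subsets, and combines them by majority vote, whereas scikit-learn averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(X_test) for t in trees])   # shape (25, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)       # majority vote on 0/1 labels

print("Single tree accuracy:", round(float((votes[0] == y_test).mean()), 3))
print("Ensemble accuracy:   ", round(float((y_pred == y_test).mean()), 3))
```

An odd number of trees avoids tied votes in this binary problem.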

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train a Random Forest Model

We use the RandomForestClassifier class from sklearn.ensemble with its default settings.

PYTHON

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9590643274853801
Precision: 0.9636363636363636
Recall: 0.9724770642201835
F1 Score: 0.9680365296803652

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94        62
           1       0.96      0.97      0.97       109

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

  • Accuracy, Precision, Recall, and F1 Score are all derived from this table.

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Random Forest Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve')
plt.legend()
plt.show()

Content from 10 Random Forest Optimization


Last updated on 2025-07-10

Optimising a Random Forest Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Random Forest Classifier to improve classification performance.

We will use the Breast Cancer Wisconsin dataset to:

  • Train an initial Random Forest model.
  • Explore the effects of the most important hyperparameters:
      • n_estimators: number of trees in the forest.
      • max_depth: maximum depth of each tree.
      • min_samples_split: minimum number of samples required to split an internal node.

Random Forest is a powerful ensemble method for classification tasks that builds multiple decision trees and merges their outputs for improved accuracy and robustness. By tuning these hyperparameters, we can significantly improve model performance.

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exploring the Effect of the n_estimators Hyperparameter

The n_estimators parameter in Random Forest controls the number of trees in the forest:

  • Low n_estimators values → Fewer trees → Faster training but noisier, less stable predictions.
  • High n_estimators values → More trees → More stable and usually better performance, with diminishing returns and increased computation.

We will train Random Forest models with different n_estimators values and observe how it affects accuracy.
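Why more trees help can be illustrated with a small simulation. It assumes each voter is independently correct 70% of the time, a deliberately idealised assumption: trees in a real forest are correlated, so the gain is smaller and levels off sooner. `majority_vote_accuracy` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote_accuracy(n_voters, p_correct=0.7, n_trials=5000):
    """Fraction of trials in which more than half of the voters are correct."""
    correct = rng.random((n_trials, n_voters)) < p_correct
    return float((correct.sum(axis=1) > n_voters / 2).mean())

# Odd voter counts avoid ties
for n in [1, 5, 25, 101]:
    print(f"{n:>3} voters: majority-vote accuracy = {majority_vote_accuracy(n):.3f}")
```

The improvement is steep at first and then flattens, which matches the usual shape of an accuracy-versus-n_estimators curve.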

PYTHON

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Test different values of n_estimators
n_estimators_values = [1, 2, 4, 8, 10, 20, 30, 50, 70, 90]
train_scores = []
test_scores = []

for n in n_estimators_values:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(n_estimators_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(n_estimators_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('n_estimators (Number of Trees)')
plt.ylabel('Accuracy')
plt.title('Effect of n_estimators on Random Forest')
plt.legend()
plt.grid(True)
plt.show()
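
A single run per n_estimators value depends on the random seed, so the curve above can look jagged for small forests. The sketch below repeats the experiment over several seeds and reports the mean and standard deviation of test accuracy. The seed range and the three n_estimators values are assumptions chosen for illustration, and the scaling step is skipped for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

seeds = range(10)  # assumption: 10 repeats give a rough picture of the spread
summary = {}

for n in [1, 10, 100]:
    # Train one forest per seed and record its test accuracy
    accs = [
        RandomForestClassifier(n_estimators=n, random_state=seed)
        .fit(X_train, y_train)
        .score(X_test, y_test)
        for seed in seeds
    ]
    summary[n] = (np.mean(accs), np.std(accs))
    print(f"n_estimators={n:>3}: mean test accuracy = {summary[n][0]:.3f}, "
          f"std = {summary[n][1]:.3f}")
```

Comparing the standard deviations across the three settings gives a sense of how much of the variation in the plot is due to the seed rather than to n_estimators itself.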

Exploring the Effect of the max_depth Hyperparameter

The max_depth parameter controls the maximum depth of each tree:

  • Low max_depth values → Shallow trees → Simpler models, possibly underfitting.
  • High max_depth values → Deeper trees → More complex models, potentially overfitting.

We will now examine how varying max_depth affects the model’s performance, while keeping n_estimators fixed.

PYTHON

# Test different values of max_depth
max_depth_values = [2, 4, 6, 8, 10, 20]
train_scores = []
test_scores = []

for depth in max_depth_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(max_depth_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(max_depth_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Random Forest')
plt.legend()
plt.grid(True)
plt.show()
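
Accuracy often stops changing once max_depth exceeds the depth the trees would reach on their own. One way to check this is to inspect the actual depth of each fitted tree with get_depth(). The sketch below compares a capped forest (max_depth=4) with an unrestricted one (max_depth=None); both settings are assumptions picked for illustration, and scaling is skipped for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

depth_report = {}
for max_depth in [4, None]:  # None lets each tree grow until its leaves are pure
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=max_depth, random_state=42
    )
    forest.fit(X_train, y_train)

    # get_depth() returns the depth actually reached by a fitted tree
    depths = [tree.get_depth() for tree in forest.estimators_]
    depth_report[max_depth] = depths
    print(f"max_depth={max_depth}: tree depths range from "
          f"{min(depths)} to {max(depths)}")
```

If the unrestricted trees rarely grow beyond a certain depth, any max_depth value above that depth leaves the model essentially unchanged.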

Exploring the Effect of the min_samples_split Hyperparameter

The min_samples_split parameter controls the minimum number of samples required to split an internal node:

  • Low min_samples_split values → More splits → Complex trees, potentially overfitting.
  • High min_samples_split values → Fewer splits → Simpler trees, possibly underfitting.

We will now compare different min_samples_split values to observe their impact on performance.

PYTHON

# Test different values of min_samples_split
min_samples_split_values = [2, 5, 10, 20, 50]
train_scores = []
test_scores = []

for min_split in min_samples_split_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=min_split, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Show numeric results
for m, tr, te in zip(min_samples_split_values, train_scores, test_scores):
    print(f"min_samples_split: {m} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")

# Plot the results as a line chart
plt.figure(figsize=(8, 6))
plt.plot(min_samples_split_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(min_samples_split_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('min_samples_split')
plt.ylabel('Accuracy')
plt.title('Effect of min_samples_split on Random Forest')
plt.legend()
plt.grid(True)
plt.show()

# Optional alternative: grouped bar chart (uncomment to use)
# import numpy as np
# x = np.arange(len(min_samples_split_values))
# width = 0.35
# plt.figure(figsize=(8, 6))
# plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
# plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
# plt.xticks(x, min_samples_split_values)
# plt.xlabel('min_samples_split')
# plt.ylabel('Accuracy')
# plt.title('Effect of min_samples_split on Random Forest')
# plt.legend()
# plt.grid(True, axis='y')
# plt.show()

SH

min_samples_split: 2 → Train Accuracy: 0.997, Test Accuracy: 0.965
min_samples_split: 5 → Train Accuracy: 0.995, Test Accuracy: 0.959
min_samples_split: 10 → Train Accuracy: 0.980, Test Accuracy: 0.947
min_samples_split: 20 → Train Accuracy: 0.972, Test Accuracy: 0.947
min_samples_split: 50 → Train Accuracy: 0.972, Test Accuracy: 0.942
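
The drop in training accuracy as min_samples_split grows reflects trees that are allowed fewer splits. A rough way to see this directly is to count the nodes in each fitted tree through its tree_.node_count attribute. The sketch below reuses the forest settings from the experiment above; the three min_samples_split values are an illustrative subset, and scaling is skipped for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

node_counts = {}
for min_split in [2, 10, 50]:
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=6,
        min_samples_split=min_split, random_state=42
    )
    forest.fit(X_train, y_train)

    # Average number of nodes (internal nodes plus leaves) per tree
    node_counts[min_split] = np.mean(
        [tree.tree_.node_count for tree in forest.estimators_]
    )
    print(f"min_samples_split={min_split:>2}: "
          f"average nodes per tree = {node_counts[min_split]:.1f}")
```

Smaller trees have less capacity to memorise the training set, which is consistent with the narrowing gap between train and test accuracy in the output above.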

Changing the Classification Threshold

Most classifiers can output a predicted probability between 0 and 1 for the positive class. For binary classification, the default decision threshold is 0.5. This means:

  • If predicted probability ≥ 0.5 → classify as positive
  • Else → classify as negative

Changing the Threshold:

  • Lower threshold → more positives predicted → higher recall, more false positives
  • Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Train Random Forest with probability estimates
model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=5, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
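
To complement the confusion matrices, the trade-off can also be summarised numerically. The sketch below computes precision and recall for class 1 (benign in this dataset) at the same three thresholds, using the same model settings as above. Skipping the scaling step and passing zero_division=0 are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

model = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_split=5, random_state=42
)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

results = []
for thresh in [0.3, 0.5, 0.7]:
    # Apply the custom threshold instead of calling model.predict()
    y_pred = (y_proba >= thresh).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    results.append((thresh, precision, recall))
    print(f"Threshold = {thresh}: precision = {precision:.3f}, recall = {recall:.3f}")
```

Recall can only stay the same or fall as the threshold rises, because every sample predicted positive at a higher threshold is also predicted positive at a lower one. Precision usually moves in the opposite direction, although that is not guaranteed.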