Content from 01 Introduction
Last updated on 2025-07-10 | Edit this page
Introduction to Classification
What is Classification?
Classification is a type of supervised learning where the goal is to predict categorical class labels. Given input data, a classification model attempts to assign it to one of several predefined classes.
Some examples include:
- Email spam detection (spam vs. not spam)
- Disease diagnosis (positive vs. negative)
- Image recognition (cat, dog, or other)
Workshop Goals
By the end of this workshop, you will be able to:
- Understand common classification algorithms
- Apply them using Scikit-Learn, NumPy, Pandas, and Matplotlib
- Evaluate and optimise models
Topics Covered
- Logistic Regression
- Support Vector Machines (SVM)
- Model Evaluation: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Neural Networks (MLPClassifier)
- Random Forest Classifier
- Optimisation and Tuning
Required Libraries
We will use the following Python libraries throughout the workshop:
- NumPy – numerical operations
- Pandas – data manipulation
- Scikit-Learn – machine learning models and tools
- Matplotlib – data visualisation
- Seaborn – data visualisation
Let’s get started! 🚀
Check that your environment has the necessary libraries installed:
PYTHON
import numpy
print("NumPy version:", numpy.__version__)
import pandas
print("Pandas version:", pandas.__version__)
import sklearn
print("sklearn version:", sklearn.__version__)
import matplotlib
print("matplotlib version:", matplotlib.__version__)
import seaborn
print("seaborn version:", seaborn.__version__)
Preview Example Dataset
We use the load_breast_cancer() dataset from
Scikit-Learn. It includes 30 numeric features extracted from breast mass
images.
PYTHON
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
X = data.data
y = data.target
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | … | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | … | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | … | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | … | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | … | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
Summary statistics for each column, as returned by df.describe():
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | … | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | … | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | … | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | … | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | 0.000000 |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | … | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | 0.000000 |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | … | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | 1.000000 |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | … | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | 1.000000 |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | … | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | 1.000000 |
8 rows × 31 columns

PYTHON
from sklearn.preprocessing import StandardScaler
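# Standardise every column to zero mean and unit variance.
# Note: this also rescales the 'target' column; when training a model, scale only the features.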
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_scaled.describe()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | … | 5.690000e+02 | 5.690000e+02 | 569.000000 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 | 5.690000e+02 |
| mean | -1.373633e-16 | 6.868164e-17 | -1.248757e-16 | -2.185325e-16 | -8.366672e-16 | 1.873136e-16 | 4.995028e-17 | -4.995028e-17 | 1.748260e-16 | 4.745277e-16 | … | 1.248757e-17 | -3.746271e-16 | 0.000000 | -2.372638e-16 | -3.371644e-16 | 7.492542e-17 | 2.247763e-16 | 2.622390e-16 | -5.744282e-16 | -4.995028e-17 |
| std | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | … | 1.000880e+00 | 1.000880e+00 | 1.000880 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 | 1.000880e+00 |
| min | -2.029648e+00 | -2.229249e+00 | -1.984504e+00 | -1.454443e+00 | -3.112085e+00 | -1.610136e+00 | -1.114873e+00 | -1.261820e+00 | -2.744117e+00 | -1.819865e+00 | … | -2.223994e+00 | -1.693361e+00 | -1.222423 | -2.682695e+00 | -1.443878e+00 | -1.305831e+00 | -1.745063e+00 | -2.160960e+00 | -1.601839e+00 | -1.297676e+00 |
| 25% | -6.893853e-01 | -7.259631e-01 | -6.919555e-01 | -6.671955e-01 | -7.109628e-01 | -7.470860e-01 | -7.437479e-01 | -7.379438e-01 | -7.032397e-01 | -7.226392e-01 | … | -7.486293e-01 | -6.895783e-01 | -0.642136 | -6.912304e-01 | -6.810833e-01 | -7.565142e-01 | -7.563999e-01 | -6.418637e-01 | -6.919118e-01 | -1.297676e+00 |
| 50% | -2.150816e-01 | -1.046362e-01 | -2.359800e-01 | -2.951869e-01 | -3.489108e-02 | -2.219405e-01 | -3.422399e-01 | -3.977212e-01 | -7.162650e-02 | -1.782793e-01 | … | -4.351564e-02 | -2.859802e-01 | -0.341181 | -4.684277e-02 | -2.695009e-01 | -2.182321e-01 | -2.234689e-01 | -1.274095e-01 | -2.164441e-01 | 7.706085e-01 |
| 75% | 4.693926e-01 | 5.841756e-01 | 4.996769e-01 | 3.635073e-01 | 6.361990e-01 | 4.938569e-01 | 5.260619e-01 | 6.469351e-01 | 5.307792e-01 | 4.709834e-01 | … | 6.583411e-01 | 5.402790e-01 | 0.357589 | 5.975448e-01 | 5.396688e-01 | 5.311411e-01 | 7.125100e-01 | 4.501382e-01 | 4.507624e-01 | 7.706085e-01 |
| max | 3.971288e+00 | 4.651889e+00 | 3.976130e+00 | 5.250529e+00 | 4.770911e+00 | 4.568425e+00 | 4.243589e+00 | 3.927930e+00 | 4.484751e+00 | 4.910919e+00 | … | 3.885905e+00 | 4.287337e+00 | 5.930172 | 3.955374e+00 | 5.112877e+00 | 4.700669e+00 | 2.685877e+00 | 6.046041e+00 | 6.846856e+00 | 7.706085e-01 |
8 rows × 31 columns
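The std row above reads 1.000880 rather than exactly 1. A likely explanation, sketched below on the assumption that default settings are in use: StandardScaler divides by the population standard deviation (ddof=0), whereas describe() reports the sample standard deviation (ddof=1). With 569 rows the two differ by a factor of sqrt(569/568). The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Standardise the feature columns only
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

n = len(df)
pop_std = scaled["mean radius"].std(ddof=0)     # the quantity StandardScaler sets to 1
sample_std = scaled["mean radius"].std(ddof=1)  # the quantity describe() reports
expected_ratio = np.sqrt(n / (n - 1))

print("population std:", pop_std)
print("sample std:    ", sample_std)
print("sqrt(n/(n-1)): ", expected_ratio)
```

The small gap is a reporting convention, not a sign that the scaling went wrong.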
Content from 02 Logistic Regression
Logistic Regression with Breast Cancer Dataset
This notebook demonstrates how to use Logistic Regression, a fundamental classification algorithm, to predict whether a tumor is malignant or benign using the Breast Cancer Wisconsin dataset.
What is Logistic Regression?
Logistic Regression is a supervised learning algorithm used for binary classification.
It models the probability that an input \(\mathbf{x}\) belongs to class \(y=1\) using the logistic (sigmoid) function:
\[ P(y=1 | \mathbf{x}) = \frac{1}{1 + e^{- (\mathbf{w}^T \mathbf{x} + b)}} \]
The output is a probability between 0 and 1. We classify an
observation as class 1 if the predicted probability exceeds
a threshold (typically 0.5).
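The formula and threshold rule above can be sketched in a few lines of NumPy. The weights, bias, and inputs are hypothetical values chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and inputs (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
X_demo = np.array([[2.0, 1.0],
                   [-1.0, 3.0]])

proba = sigmoid(X_demo @ w + b)       # P(y=1 | x) for each row
labels = (proba >= 0.5).astype(int)   # apply the 0.5 threshold

print("probabilities:", proba)
print("labels:", labels)
```

The first row has a positive linear score, so its probability is above 0.5 and it is assigned to class 1; the second has a negative score and is assigned to class 0.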
Step 1: Load and Explore the Data
We use the load_breast_cancer() dataset from
Scikit-Learn. It includes 30 numeric features extracted from breast mass
images.
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 3: Split the Data and Train the Model
We split the dataset into training and testing sets, and fit a logistic regression model.
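The evaluation code in the next step uses a fitted estimator named model, but the cell that creates it does not appear on this page. A minimal sketch of that step follows, assuming LogisticRegression with default settings apart from a raised max_iter. The original hyperparameters are not shown, so the scores may differ slightly from those reported below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preparation as Step 1, repeated so this block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Assumed settings: the original hyperparameters are not shown
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```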
Step 4: Evaluate the Model
Metrics Used:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
- ROC Curve and AUC
PYTHON
from sklearn.metrics import (
classification_report,
roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
SH
Accuracy: 0.9766081871345029
Precision: 0.981651376146789
Recall: 0.981651376146789
F1 Score: 0.981651376146789
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        62
           1       0.98      0.98      0.98       109

    accuracy                           0.98       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171
What is a Confusion Matrix?
A confusion matrix is a table used to describe the performance of a classification model. For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity) = TP / (TP + FN)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
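As a worked example, the formulas can be applied to raw counts. The counts below are not printed on this page; they are inferred from the classification report above (109 actual positives, 62 actual negatives, and precision equal to recall for class 1), so treat them as an assumption.

```python
# Counts inferred from the classification report above, with class 1 as "positive"
tp, fn = 107, 2   # 109 actual positives
fp, tn = 2, 60    # 62 actual negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

These counts give the same accuracy, precision, recall, and F1 values that the scikit-learn functions printed earlier.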
PYTHON
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Confusion Matrix')
plt.show()

What is an ROC Curve?
An ROC Curve (Receiver Operating Characteristic) plots:
- True Positive Rate (Recall) on the Y-axis
- False Positive Rate (1 - Specificity) on the X-axis
Each point on the curve corresponds to a different classification threshold. The curve of a perfect classifier passes through the top-left corner (True Positive Rate = 1, False Positive Rate = 0).
AUC (Area Under Curve) summarizes the ROC curve into a single value between 0 and 1:
- AUC = 1: Perfect classifier
- AUC = 0.5: No better than random guessing
PYTHON
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
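To see where individual ROC points come from, the sketch below computes the false positive rate and true positive rate by hand at a few thresholds. The labels and scores are a made-up toy example, and roc_point is a hypothetical helper, not part of Scikit-Learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    pred = scores >= threshold
    tpr = float(np.sum(pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr = float(np.sum(pred & (y_true == 0)) / np.sum(y_true == 0))
    return fpr, tpr

# Lowering the threshold moves the point up and to the right
for t in [0.8, 0.4, 0.35, 0.1]:
    print("threshold", t, "-> (FPR, TPR) =", roc_point(t))

toy_auc = roc_auc_score(y_true, scores)
print("AUC:", toy_auc)
```

Joining these points, starting from (0, 0), traces the ROC curve; the area under it is the AUC.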

Content from 03 Logistic Regression Optimization
Optimising a Logistic Regression Classifier
In this notebook, we demonstrate how to tune hyperparameters in a Logistic Regression model to improve performance.
Step 1: Load Breast Cancer Data
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 2: Define a Function to Train and Evaluate
This function will:
- Train the model
- Return training and test accuracy
PYTHON
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
def train_and_evaluate(max_iter=5000, C=1, penalty='l2', solver='liblinear'):
    # Fit on the training split, then score both splits
    model = LogisticRegression(max_iter=max_iter, C=C, penalty=penalty, solver=solver)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc
Step 3: Explore Effect of C (Inverse of Regularization Strength)
C is the inverse of the regularization strength, so it controls how strongly large coefficients are penalized:
- Smaller C → stronger regularization → helps prevent overfitting but may underfit.
- Larger C → weaker regularization → fits the training data more closely but may overfit.
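One way to observe the regularization effect directly is to look at the size of the fitted coefficients. The sketch below is an illustration under assumptions: it fits on the full standardized dataset rather than the training split, and coef_norm is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

def coef_norm(C):
    """Fit a model with the default L2 penalty and return the norm of its weights."""
    model = LogisticRegression(C=C, solver="liblinear", max_iter=5000)
    model.fit(X_std, y)
    return float(np.linalg.norm(model.coef_))

norms = {C: coef_norm(C) for C in [0.01, 1, 100]}
for C, norm in norms.items():
    print("C =", C, "-> ||w|| =", round(norm, 3))
```

Smaller values of C pull the weights towards zero, which is what limits the model's flexibility.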
PYTHON
import matplotlib.pyplot as plt
values = [ 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000 ]
train_scores, test_scores = [], []
for v in values:
    tr, te = train_and_evaluate(C=v)
    train_scores.append(tr)
    test_scores.append(te)
labels = [str(s) for s in values]
plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Values')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

Step 4: Explore Effect of penalty (L1, L2 Regularization)
| Penalty | Key Effect | When to Use |
|---|---|---|
| L1 | Sparsity, feature selection | You want simpler models or auto-feature selection |
| L2 | Shrinkage, no sparsity | You want to reduce overfitting without dropping features |
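The sparsity difference in the table can be checked by counting coefficients that are exactly zero. This is a sketch under assumptions: C=0.1 is an arbitrary choice made to keep the L1 effect visible, the model is fit on the full standardized dataset, and count_zero_coefs is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

def count_zero_coefs(penalty, C=0.1):
    """Fit with the given penalty and count coefficients that are exactly zero."""
    model = LogisticRegression(penalty=penalty, C=C, solver="liblinear", max_iter=5000)
    model.fit(X_std, y)
    return int(np.sum(model.coef_ == 0))

zeros_l1 = count_zero_coefs("l1")
zeros_l2 = count_zero_coefs("l2")
print("Zero coefficients with L1:", zeros_l1, "of", X_std.shape[1])
print("Zero coefficients with L2:", zeros_l2, "of", X_std.shape[1])
```

Features whose L1 coefficient is zero are effectively dropped from the model, which is why L1 is described as performing feature selection.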
PYTHON
import matplotlib.pyplot as plt
values = [ 'l1', 'l2' ]
train_scores, test_scores = [], []
for v in values:
    tr, te = train_and_evaluate(penalty=v)
    train_scores.append(tr)
    test_scores.append(te)
labels = [str(s) for s in values]
plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Values')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

Step 5: Explore Effect of max_iter
| Scenario | Effect |
|---|---|
| Too low | Model may not converge → You’ll see warnings like “STOP: TOTAL NO. OF F, G EVALUATIONS EXCEEDS LIMIT”. Model coefficients may be inaccurate or unstable. |
| Sufficient (or slightly high) | Model converges properly. No warnings. Coefficients stabilize at optimal values. |
| Very high (but converges early) | No harm: most solvers stop automatically once convergence is reached (before hitting max_iter). However, unnecessarily large values can increase training time for very large datasets. |
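Convergence can also be checked programmatically rather than by reading the console. The sketch below records warnings during fitting and reports whether a ConvergenceWarning was raised. The helper name converged is hypothetical, and the model is fit on the full standardized dataset for brevity.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

def converged(max_iter):
    """Return True if fitting finished without raising a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression(solver="liblinear", max_iter=max_iter).fit(X_std, y)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

for m in [3, 10, 5000]:
    print("max_iter =", m, "-> converged:", converged(m))
```

Which small values trigger the warning can vary with the library version, so the exact cut-off should be checked locally.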
PYTHON
import matplotlib.pyplot as plt
values = [3, 4, 5, 6, 7, 10, 100 ]
train_scores, test_scores = [], []
for v in values:
    tr, te = train_and_evaluate(max_iter=v)
    train_scores.append(tr)
    test_scores.append(te)
labels = [str(s) for s in values]
plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Values')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()
SH
C:\Users\moji1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\sklearn\svm\_base.py:1250: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn(

Changing the Classification Threshold
Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:
- If predicted probability ≥ 0.5 → classify as positive
- Else → classify as negative
Changing the Threshold:
- Lower threshold → more positives predicted → higher recall, but more false positives
- Higher threshold → fewer positives predicted → usually higher precision, but more false negatives
Choosing the right threshold depends on your application’s goals.
We’ll now visualize how the confusion matrix changes across three different thresholds.
PYTHON
from sklearn.metrics import ConfusionMatrixDisplay
thresholds = [0.3, 0.5, 0.7] # List of three thresholds to compare
# Retrain model
manual_model = LogisticRegression()
manual_model.fit(X_train, y_train)
# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]
# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
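The same comparison can be made numerically by computing precision and recall at each threshold. This is a self-contained sketch that repeats the data preparation; the max_iter value and the variable names are assumptions, not the notebook's own settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

results = []
for thresh in [0.3, 0.5, 0.7]:
    pred = (proba >= thresh).astype(int)  # positive when probability >= threshold
    results.append(
        (thresh, precision_score(y_test, pred), recall_score(y_test, pred))
    )

for thresh, prec, rec in results:
    print(f"threshold={thresh}: precision={prec:.3f}, recall={rec:.3f}")
```

Recall can only stay the same or fall as the threshold rises, because raising it never adds predicted positives. Precision usually rises, but that is not guaranteed.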

Content from 04 Svm
Support Vector Machine (SVM) with Breast Cancer Dataset
This notebook demonstrates the use of Support Vector Machines (SVM) for classifying tumors in the Breast Cancer dataset.
What is an SVM?
Support Vector Machines are powerful supervised learning models for classification. An SVM finds the hyperplane that best separates data points from two classes.
It maximizes the margin, which is the distance between the hyperplane and the nearest points from each class (support vectors).
Step 1: Load the Breast Cancer Dataset
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 3: Train an SVM Model
We use the SVC class from sklearn.svm with
default kernel (RBF).
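The training cell itself does not appear on this page. A minimal sketch of it follows, assuming the same data preparation as Step 1. The setting probability=True is taken from the estimator output shown below; the variable name svm_model is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same preparation as Step 1, repeated so this block runs on its own
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Default RBF kernel; probability=True enables predict_proba
svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)

print("Kernel:", svm_model.kernel)
print("Test accuracy:", svm_model.score(X_test, y_test))
```

Enabling probability=True makes fitting slower because it runs an internal cross-validation to calibrate the probabilities, so it is worth setting only when predict_proba is needed, for example to draw an ROC curve.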
#sk-container-id-1 { /* Definition of color scheme common for light and dark mode / –sklearn-color-text: #000; –sklearn-color-text-muted: #666; –sklearn-color-line: gray; / Definition of color scheme for unfitted estimators / –sklearn-color-unfitted-level-0: #fff5e6; –sklearn-color-unfitted-level-1: #f6e4d2; –sklearn-color-unfitted-level-2: #ffe0b3; –sklearn-color-unfitted-level-3: chocolate; / Definition of color scheme for fitted estimators */ –sklearn-color-fitted-level-0: #f0f8ff; –sklearn-color-fitted-level-1: #d4ebff; –sklearn-color-fitted-level-2: #b3dbfd; –sklearn-color-fitted-level-3: cornflowerblue;
/* Specific color for light theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, white))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-icon: #696969;
@media (prefers-color-scheme: dark) { /* Redefinition of color scheme for dark theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, #111))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-icon: #878787; } }
#sk-container-id-1 { color: var(–sklearn-color-text); }
#sk-container-id-1 pre { padding: 0; }
#sk-container-id-1 input.sk-hidden–visually { border: 0; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px); height: 1px; margin: -1px; overflow: hidden; padding: 0; position: absolute; width: 1px; }
#sk-container-id-1 div.sk-dashed-wrapped { border: 1px dashed var(–sklearn-color-line); margin: 0 0.4em 0.5em 0.4em; box-sizing: border-box; padding-bottom: 0.4em; background-color: var(–sklearn-color-background); }
#sk-container-id-1 div.sk-container { /* jupyter’s
normalize.less sets
[hidden] { display: none; } but bootstrap.min.css set
[hidden] { display: none !important; } so we also need the
!important here to be able to override the default hidden
behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755
*/ display: inline-block !important; position: relative; }
#sk-container-id-1 div.sk-text-repr-fallback { display: none; }
div.sk-parallel-item, div.sk-serial, div.sk-item { /* draw centered vertical line to link estimators */ background-image: linear-gradient(var(–sklearn-color-text-on-default-background), var(–sklearn-color-text-on-default-background)); background-size: 2px 100%; background-repeat: no-repeat; background-position: center center; }
/* Parallel-specific style estimator block */
#sk-container-id-1 div.sk-parallel-item::after { content: ““; width: 100%; border-bottom: 2px solid var(–sklearn-color-text-on-default-background); flex-grow: 1; }
#sk-container-id-1 div.sk-parallel { display: flex; align-items: stretch; justify-content: center; background-color: var(–sklearn-color-background); position: relative; }
#sk-container-id-1 div.sk-parallel-item { display: flex; flex-direction: column; }
#sk-container-id-1 div.sk-parallel-item:first-child::after { align-self: flex-end; width: 50%; }
#sk-container-id-1 div.sk-parallel-item:last-child::after { align-self: flex-start; width: 50%; }
#sk-container-id-1 div.sk-parallel-item:only-child::after { width: 0; }
/* Serial-specific style estimator block */
#sk-container-id-1 div.sk-serial { display: flex; flex-direction: column; align-items: center; background-color: var(–sklearn-color-background); padding-right: 1em; padding-left: 1em; }
/* Toggleable style: style used for
estimator/Pipeline/ColumnTransformer box that is clickable and can be
expanded/collapsed. - Pipeline and ColumnTransformer use this feature
and define the default style - Estimators will overwrite some part of
the style using the sk-estimator class */
/* Pipeline and ColumnTransformer style (default) */
#sk-container-id-1 div.sk-toggleable { /* Default theme specific background. It is overwritten whether we have a specific estimator or a Pipeline/ColumnTransformer */ background-color: var(–sklearn-color-background); }
/* Toggleable label */ #sk-container-id-1 label.sk-toggleable__label { cursor: pointer; display: flex; width: 100%; margin-bottom: 0; padding: 0.5em; box-sizing: border-box; text-align: center; align-items: start; justify-content: space-between; gap: 0.5em; }
#sk-container-id-1 label.sk-toggleable__label .caption { font-size: 0.6rem; font-weight: lighter; color: var(–sklearn-color-text-muted); }
#sk-container-id-1 label.sk-toggleable__label-arrow:before { /* Arrow on the left of the label */ content: “▸”; float: left; margin-right: 0.25em; color: var(–sklearn-color-icon); }
#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before { color: var(–sklearn-color-text); }
/* Toggleable content - dropdown */
#sk-container-id-1 div.sk-toggleable__content { display: none; text-align: left; /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }
#sk-container-id-1 div.sk-toggleable__content.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }
#sk-container-id-1 div.sk-toggleable__content pre { margin: 0.2em; border-radius: 0.25em; color: var(–sklearn-color-text); /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }
#sk-container-id-1 div.sk-toggleable__content.fitted pre { /* unfitted */ background-color: var(–sklearn-color-fitted-level-0); }
#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content { /* Expand drop-down */ display: block; width: 100%; overflow: visible; }
#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before { content: “▾”; }
/* Pipeline/ColumnTransformer-specific style */
#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }
#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { background-color: var(–sklearn-color-fitted-level-2); }
/* Estimator-specific style */
/* Colorize estimator box */ #sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label { /* unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }
#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }
#sk-container-id-1 div.sk-label label.sk-toggleable__label, #sk-container-id-1 div.sk-label label { /* The background is the default theme color */ color: var(–sklearn-color-text-on-default-background); }
/* On hover, darken the color of the background */ #sk-container-id-1 div.sk-label:hover label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }
/* Label box, darken color on hover, fitted */ #sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted { color: var(–sklearn-color-text); background-color: var(–sklearn-color-fitted-level-2); }
/* Estimator label */
#sk-container-id-1 div.sk-label label { font-family: monospace; font-weight: bold; display: inline-block; line-height: 1.2em; }
#sk-container-id-1 div.sk-label-container { text-align: center; }
/* Estimator-specific / #sk-container-id-1 div.sk-estimator { font-family: monospace; border: 1px dotted var(–sklearn-color-border-box); border-radius: 0.25em; box-sizing: border-box; margin-bottom: 0.5em; / unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }
#sk-container-id-1 div.sk-estimator.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }
/* on hover / #sk-container-id-1 div.sk-estimator:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }
#sk-container-id-1 div.sk-estimator.fitted:hover { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }
/* Specification for estimator info (e.g. “i” and “?”) */
/* Common style for “i” and “?” */
.sk-estimator-doc-link, a:link.sk-estimator-doc-link, a:visited.sk-estimator-doc-link { float: right; font-size: smaller; line-height: 1em; font-family: monospace; background-color: var(–sklearn-color-background); border-radius: 1em; height: 1em; width: 1em; text-decoration: none !important; margin-left: 0.5em; text-align: center; /* unfitted */ border: var(–sklearn-color-unfitted-level-1) 1pt solid; color: var(–sklearn-color-unfitted-level-1); }
.sk-estimator-doc-link.fitted, a:link.sk-estimator-doc-link.fitted, a:visited.sk-estimator-doc-link.fitted { /* fitted */ border: var(–sklearn-color-fitted-level-1) 1pt solid; color: var(–sklearn-color-fitted-level-1); }
/* On hover / div.sk-estimator:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover, div.sk-label-container:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-3); color: var(–sklearn-color-background); text-decoration: none; }
SVC(probability=True)
Parameters
| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'rbf' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | True |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | None |
| verbose | False |
| max_iter | -1 |
| decision_function_shape | 'ovr' |
| break_ties | False |
| random_state | None |
Step 4: Evaluate the Model
PYTHON
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
ConfusionMatrixDisplay, classification_report, roc_curve, auc
)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
SH
Accuracy: 0.9766081871345029
Precision: 0.972972972972973
Recall: 0.9908256880733946
F1 Score: 0.9818181818181818
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.97        62
           1       0.97      0.99      0.98       109

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171
What is a Confusion Matrix?
A confusion matrix is a summary of prediction results:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
- Accuracy, Precision, Recall, F1 Score are all derived from this table.
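As a worked check, the four counts below are inferred from the scores printed above rather than printed by the notebook itself: 108 true positives, 1 false negative, 3 false positives and 59 true negatives reproduce the reported accuracy, precision, recall and F1 score. A minimal sketch in plain Python, assuming class 1 is treated as the positive class:

```python
# Counts inferred from the reported scores (assumption: class 1 is "positive").
tp, fn, fp, tn = 108, 1, 3, 59

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Note that `sklearn.metrics.confusion_matrix` orders rows and columns by label, so for labels 0 and 1 it returns `[[TN, FP], [FN, TP]]`, which is the table above with its rows and columns reversed.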
PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('SVM Confusion Matrix')
plt.show()

What is an ROC Curve?
The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.
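To make the construction concrete, here is a small sketch on four hypothetical samples (the labels and scores are invented for illustration). Each threshold on the score gives one (FPR, TPR) point, and the AUC is the area under the resulting curve:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # trapezoidal area under the curve

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc)
print("roc_auc_score:", roc_auc_score(y_true, y_score))
```

`roc_auc_score` computes the same area directly from labels and scores, so it can replace the `roc_curve` plus `auc` pair when the curve itself is not needed.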
PYTHON
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM ROC Curve')
plt.legend()
plt.show()

Content from 05 Svm Optimization
Last updated on 2025-07-10
Optimising a Support Vector Machine (SVM) Classifier
In this notebook, we demonstrate how to tune hyperparameters in a Support Vector Machine (SVM) classifier to improve classification performance.
We will use the Breast Cancer Wisconsin dataset to:
- Train an initial SVM model.
- Explore the effects of the most important hyperparameters:
  - C: Regularization parameter controlling the trade-off between fitting the training data closely and keeping a wide margin (a simpler decision boundary).
  - gamma: Kernel coefficient controlling how far the influence of each training point reaches.
  - kernel: Defines the type of decision boundary (linear or nonlinear), with options like 'linear', 'rbf', and 'poly'.
Support Vector Machines are powerful models for classification tasks and can handle both linear and non-linear relationships through the use of kernel functions. By tuning these hyperparameters, we can significantly improve model performance.
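The three hyperparameters can also be searched jointly with cross-validation instead of one at a time. The sketch below is one possible setup, not the workshop's own procedure: the grid values are illustrative assumptions, and the scaler is placed inside a pipeline so that each cross-validation fold is scaled using only its own training part.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Scaling happens inside the pipeline, so no test-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid (an assumption, not a recommendation).
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_test, y_test), 3))
```

The test set is touched only once, after the search, so the final score is not biased by the tuning itself.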
Step 1: Load Breast Cancer Data
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Exploring the Effect of the C Hyperparameter
The C parameter in SVM controls the regularization strength:
- Low C values → More regularization → Smoother decision boundary, possibly underfitting.
- High C values → Less regularization → Fits the training data more closely, potentially overfitting.
We will train SVM models with different C values and observe how they affect accuracy.
PYTHON
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Test different values of C
C_values = [0.01, 0.1, 1, 10, 100]
train_scores = []
test_scores = []
for C in C_values:
    model = SVC(C=C, kernel='rbf', gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(C_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(C_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('C (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Parameter C on SVM')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the gamma Hyperparameter
The gamma parameter controls the influence of each individual training example:
- Low gamma values → Far-reaching influence → Smoother decision boundary, possibly underfitting.
- High gamma values → Very localized influence → Complex decision boundary, potentially overfitting.
We will now examine how varying gamma affects the model’s performance, while keeping C fixed.
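The "reach" of a training point can be read directly off the RBF kernel formula, K(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates it for two hypothetical points at a fixed distance; the helper name `rbf_kernel_value` is made up for this example.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Two hypothetical points whose squared distance is 2.
x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])

for gamma in [0.01, 1.0, 10.0]:
    # Small gamma: similarity stays near 1 (far-reaching influence).
    # Large gamma: similarity drops towards 0 (very local influence).
    print(f"gamma={gamma}: K(x, z) = {rbf_kernel_value(x, z, gamma):.6f}")
```

With `gamma='scale'`, Scikit-Learn sets gamma to `1 / (n_features * X.var())`, which is why the default adapts to the scale of the training data.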
PYTHON
# Test different values of gamma
gamma_values = [0.001, 0.01, 0.1, 1, 10]
train_scores = []
test_scores = []
for gamma in gamma_values:
    model = SVC(C=1, kernel='rbf', gamma=gamma)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(gamma_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(gamma_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('gamma (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of gamma on SVM')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the kernel Hyperparameter
The kernel parameter in SVM specifies the type of transformation applied to the input data:
- 'linear' → No transformation; linear decision boundary.
- 'rbf' (Radial Basis Function) → Maps data into higher dimensions for non-linear boundaries.
- 'poly' → Polynomial transformations, allowing more flexible decision boundaries depending on degree.
We will now compare different kernel types.
PYTHON
# Test different kernel types
kernel_types = ['linear', 'rbf', 'poly']
train_scores = []
test_scores = []
for kernel in kernel_types:
    model = SVC(C=1, kernel=kernel, gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Show numeric results
for k, tr, te in zip(kernel_types, train_scores, test_scores):
    print(f"Kernel: {k} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")
# Plot the results
import numpy as np
x = np.arange(len(kernel_types))
width = 0.35
plt.figure(figsize=(8, 6))
plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
plt.xticks(x, kernel_types)
plt.xlabel('Kernel Type')
plt.ylabel('Accuracy')
plt.title('Comparison of SVM Kernels')
plt.legend()
plt.grid(True, axis='y')
plt.show()
SH
Kernel: linear → Train Accuracy: 0.987, Test Accuracy: 0.988
Kernel: rbf → Train Accuracy: 0.985, Test Accuracy: 0.977
Kernel: poly → Train Accuracy: 0.910, Test Accuracy: 0.883

Changing the Classification Threshold
Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:
- If predicted probability ≥ 0.5 → classify as positive
- Else → classify as negative
Changing the Threshold:
- Lower threshold → more positives predicted → higher recall, more false positives
- Higher threshold → fewer positives predicted → higher precision, more false negatives
Choosing the right threshold depends on your application’s goals.
We’ll now visualize how the confusion matrix changes for three different thresholds.
PYTHON
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
thresholds = [0.3, 0.5, 0.7]  # Three thresholds to compare
# Train SVM with probability estimates enabled
model = SVC(C=1, kernel='rbf', gamma='scale', probability=True)
model.fit(X_train, y_train)
# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1] # Probability of class 1
# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
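The same trade-off can be read as numbers rather than plots. The sketch below uses eight hypothetical labels and probabilities (invented for illustration, not taken from the breast cancer data) and reports precision and recall at the same three thresholds:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_proba = np.array([0.10, 0.20, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90])

results = {}
for thresh in [0.3, 0.5, 0.7]:
    # Predict positive whenever the probability reaches the threshold.
    y_pred = (y_proba >= thresh).astype(int)
    results[thresh] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    print(f"threshold={thresh}: precision={results[thresh][0]:.2f}, "
          f"recall={results[thresh][1]:.2f}")
```

In this toy example precision rises and recall falls as the threshold increases. Recall can never increase when the threshold is raised, whereas precision usually, but not always, improves.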

Content from 06 Model Evaluation
Last updated on 2025-07-10
Model Evaluation: Comparing Logistic Regression and SVM
We compare two classifiers:
- Logistic Regression
- SVM
Using metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, and ROC-AUC.
Step 1: Load the Data
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
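The metric code in this section uses two fitted models, svm_model and lr_model, whose training step is not included here. The following is a minimal sketch of that step under stated assumptions: an RBF SVC with probability=True (so that predict_proba is available for the ROC curve) and a LogisticRegression with max_iter=1000. These settings are assumptions and may differ from the ones used to produce the workshop's figures.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data preparation as in Step 1.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Assumed settings; the original training step is not shown in this section.
svm_model = SVC(probability=True)             # enables predict_proba
lr_model = LogisticRegression(max_iter=1000)  # generous iteration budget

svm_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)

print("SVM test accuracy:", svm_model.score(X_test, y_test))
print("Logistic Regression test accuracy:", lr_model.score(X_test, y_test))
```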
Step 3: Evaluation Metrics
What is Accuracy?
Accuracy is the proportion of correct predictions:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Accuracy can be misleading if the classes are imbalanced, because a model that always predicts the majority class still scores highly.
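A small hypothetical example shows the problem. With 95 negatives and 5 positives (numbers chosen only for illustration), a "model" that always predicts the negative class reaches 95% accuracy while finding none of the positives:

```python
# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print("Accuracy:", accuracy)  # high, despite the model being useless
print("Recall:", recall)      # no positive case is detected
```

This is why the sections below report precision, recall and F1 alongside accuracy.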
What are Precision, Recall and F1 Score?
- Precision: \(\frac{TP}{TP + FP}\)
- Recall: \(\frac{TP}{TP + FN}\)
- F1 Score: Harmonic mean of precision and recall
\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
PYTHON
from sklearn.metrics import precision_score, recall_score, f1_score
for name, model in [("SVM", svm_model), ("Logistic Regression", lr_model)]:
    y_pred = model.predict(X_test)
    print(f"\n{name} Metrics:")
    print(f"Precision: {precision_score(y_test, y_pred):.2f}")
    print(f"Recall: {recall_score(y_test, y_pred):.2f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.2f}")
What is a Confusion Matrix?
A confusion matrix shows the breakdown of correct and incorrect classifications.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
PYTHON
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_estimator(svm_model, X_test, y_test, ax=axs[0])
axs[0].set_title("SVM Confusion Matrix")
ConfusionMatrixDisplay.from_estimator(lr_model, X_test, y_test, ax=axs[1])
axs[1].set_title("Logistic Regression Confusion Matrix")
plt.tight_layout()
plt.show()

What is the ROC Curve?
ROC = Receiver Operating Characteristic Curve
- Plots TPR vs FPR
- AUC = Area Under the ROC Curve. Closer to 1 = better model.
PYTHON
from sklearn.metrics import roc_curve, auc
svm_probs = svm_model.predict_proba(X_test)[:, 1]
lr_probs = lr_model.predict_proba(X_test)[:, 1]
svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
svm_auc = auc(svm_fpr, svm_tpr)
lr_auc = auc(lr_fpr, lr_tpr)
plt.figure(figsize=(8, 6))
plt.plot(svm_fpr, svm_tpr, label=f"SVM (AUC = {svm_auc:.2f})")
plt.plot(lr_fpr, lr_tpr, label=f"Logistic Regression (AUC = {lr_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

Content from 07 Neural Networks
Last updated on 2025-07-10
Neural Network (MLPClassifier) with Breast Cancer Dataset
In this notebook, we will use a simple Multi-layer Perceptron (MLP) neural network to classify breast tumors.
What is an MLP?
An MLP is a type of feedforward neural network consisting of one or more hidden layers. Each neuron computes a weighted sum of its inputs and passes the result through a nonlinear activation function.
MLPs are suitable for classification tasks and are trained using backpropagation to minimize loss.
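The "weighted sum followed by a nonlinear activation" can be sketched in a few lines of NumPy. The weights below are random and untrained, and the layer sizes are hypothetical; the point is only to show the shape of the computation. ReLU is used for the hidden layer because it is MLPClassifier's default activation, and a sigmoid turns the output into a value between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 30 input features, one hidden layer of 8 neurons, 1 output.
X = rng.normal(size=(5, 30))                       # 5 samples
W1, b1 = rng.normal(size=(30, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

hidden = relu(X @ W1 + b1)          # weighted sum of inputs, then nonlinearity
proba = sigmoid(hidden @ W2 + b2)   # output squashed into (0, 1)

print("Hidden layer shape:", hidden.shape)
print("Predicted probabilities:", proba.ravel())
```

Training with backpropagation would adjust W1, b1, W2 and b2 to reduce the loss; here they stay at their random initial values.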
Step 1: Load the Breast Cancer Dataset
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 4: Evaluate the Model
PYTHON
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
ConfusionMatrixDisplay, classification_report, roc_curve, auc
)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
SH
Accuracy: 0.9824561403508771
Precision: 0.9732142857142857
Recall: 1.0
F1 Score: 0.9864253393665159
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98        62
           1       0.97      1.00      0.99       109

    accuracy                           0.98       171
   macro avg       0.99      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171
What is a Confusion Matrix?
A confusion matrix shows how well the model distinguishes between classes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
This matrix lets us compute metrics like accuracy, precision, recall, and F1 score.
PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Neural Network Confusion Matrix')
plt.show()

What is an ROC Curve?
The ROC Curve shows the trade-off between True Positive Rate and False Positive Rate. AUC quantifies this performance. Closer to 1.0 = better classifier.
PYTHON
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Neural Network ROC Curve')
plt.legend()
plt.show()

Content from 08 Neural Networks Optimization
Last updated on 2025-07-10
Optimising a Neural Network Classifier
In this notebook, we demonstrate how to tune hyperparameters in a neural network model to improve performance.
We will focus on:
- hidden_layer_sizes
- alpha (regularization)
- learning_rate_init
We’ll also visualize how these parameters affect accuracy, and look for signs of overfitting or underfitting.
Step 1: Load Breast Cancer Data
Load with Normalisation
PYTHON
# from sklearn.datasets import load_breast_cancer
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# import pandas as pd
# # Load dataset
# data = load_breast_cancer()
# X = data.data
# y = data.target
# # Split dataset
# X_train, X_test, y_train, y_test = train_test_split(
# X, y, test_size=0.3, random_state=31
# )
# # Normalize (Standardize) features
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
Load without Normalisation
To see the effects of different network sizes and other hyperparameters, we use the non-normalised dataset. For the easier task of classifying normalised data, most hyperparameter values give very good accuracy.
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
Step 2: Define a Function to Train and Evaluate
This function will:
- Train the MLP model
- Return training and test accuracy
PYTHON
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np
def train_and_evaluate(hidden_layer_sizes=(200, 400, 400, 200), alpha=0.0001, lr=0.001):
    """Train an MLP and return (train accuracy, test accuracy)."""
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                          alpha=alpha,
                          learning_rate_init=lr,
                          max_iter=2000,
                          random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc
Step 3: Explore Effect of hidden_layer_sizes
The number of neurons and layers controls the model’s capacity.
- Too small: underfitting
- Too large: overfitting
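One rough way to quantify capacity is to count the trainable weights and biases. The helper below is a hypothetical illustration (it is not part of Scikit-Learn) and assumes 30 input features and a single output unit, which is how a binary MLPClassifier is typically laid out:

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Number of weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (inputs * outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

for layers in [(50, 50), (200, 400, 400, 200)]:
    print(layers, "→", count_parameters(30, layers), "parameters")
```

With roughly 400 training samples (70% of 569), even the smallest network here has about ten times more parameters than samples, which is why regularization and the train/test gap deserve attention.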
PYTHON
import matplotlib.pyplot as plt
values = [ (50,50), (50, 50, 50), (200, 400, 400, 200), (200, 400, 400, 400, 400, 200), (200, 400, 800, 800, 800, 400, 400, 200)]
train_scores, test_scores = [], []
for s in values:
    tr, te = train_and_evaluate(hidden_layer_sizes=s)
    train_scores.append(tr)
    test_scores.append(te)
labels = [str(s) for s in values]
plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Hidden Layer Sizes')
plt.ylabel('Accuracy')
plt.title('Effect of Network Size')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

Step 4: Explore Effect of alpha (L2 Regularization)
alpha prevents overfitting by penalising large weights.
- Low alpha: can overfit
- High alpha: can underfit
PYTHON
alphas = [1e-1, 3e-1, 5e-1, 7e-1, 1e1]
train_scores, test_scores = [], []
for a in alphas:
    tr, te = train_and_evaluate(alpha=a)
    train_scores.append(tr)
    test_scores.append(te)
plt.semilogx(alphas, train_scores, marker='o', label='Train Acc')
plt.semilogx(alphas, test_scores, marker='s', label='Test Acc')
plt.xlabel('alpha (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Strength')
plt.legend()
plt.grid(True)
plt.show()

Step 5: Explore Effect of learning_rate_init
This controls how fast the model updates its weights.
- Too small: slow convergence
- Too large: may never converge
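Both failure modes show up even on a one-dimensional toy problem. The sketch below runs plain gradient descent on f(w) = w², which is far simpler than the optimiser MLPClassifier uses, so the learning rates here are illustrative and not comparable to learning_rate_init values:

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimise f(w) = w**2 (gradient 2*w), starting from w."""
    for _ in range(steps):
        w -= lr * 2 * w   # step against the gradient
    return w

for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4g}")
```

The smallest rate barely moves away from the start, the middle one gets very close to the minimum at 0, and the largest overshoots further on every step and diverges.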
PYTHON
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
train_scores, test_scores = [], []
for lr in lrs:
    tr, te = train_and_evaluate(lr=lr)
    train_scores.append(tr)
    test_scores.append(te)
plt.plot(lrs, train_scores, marker='o', label='Train Acc')
plt.plot(lrs, test_scores, marker='s', label='Test Acc')
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate')
plt.legend()
plt.grid(True)
plt.show()

Conclusion
- Neural networks are sensitive to hyperparameters
- Use visualisation to find the sweet spot
- Avoid overfitting by tuning alpha and hidden_layer_sizes
- Don’t pick hyperparameters blindly: use grid search or cross-validation
Changing the Classification Threshold
Most classifiers, including neural networks, output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:
- If predicted probability ≥ 0.5 → classify as positive
- Else → classify as negative
Changing the Threshold:
- Lower threshold → more positives predicted → higher recall, more false positives
- Higher threshold → fewer positives predicted → higher precision, more false negatives
Choosing the right threshold depends on your application’s goals.
We’ll now visualize how the confusion matrix changes for three different thresholds.
PYTHON
from sklearn.metrics import ConfusionMatrixDisplay
thresholds = [0.3, 0.5, 0.7]  # Three thresholds to compare
# Retrain model
manual_model = MLPClassifier()
manual_model.fit(X_train, y_train)
# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]
# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
SH
C:\Users\moji1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\sklearn\neural_network\_multilayer_perceptron.py:780: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
warnings.warn(

Content from 09 Random Forest
Last updated on 2025-07-10
Random Forest Classifier with Breast Cancer Dataset
This notebook demonstrates the use of Random Forest Classifier for classifying tumors in the Breast Cancer dataset.
What is a Random Forest?
Random Forest is a powerful ensemble learning method for classification. It builds multiple decision trees and combines their predictions for improved accuracy and robustness.
Each tree is trained on a random subset of the data and features, reducing overfitting and improving generalization.
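In Scikit-Learn, "combining predictions" means averaging the class probabilities of the individual trees. The sketch below checks this on the breast cancer data; the forest size of 25 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Small forest, size chosen only for illustration.
forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_train, y_train)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0
)

print("Number of trees:", len(forest.estimators_))
print("Matches forest.predict_proba:",
      np.allclose(tree_probas, forest.predict_proba(X_test)))
```

Averaging many trees that each saw a different bootstrap sample and feature subset is what smooths out the errors of any single tree.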
Step 1: Load the Breast Cancer Dataset
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 4: Evaluate the Model
PYTHON
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
ConfusionMatrixDisplay, classification_report, roc_curve, auc
)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
SH
Accuracy: 0.9590643274853801
Precision: 0.9636363636363636
Recall: 0.9724770642201835
F1 Score: 0.9680365296803652
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94        62
           1       0.96      0.97      0.97       109

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171
What is a Confusion Matrix?
A confusion matrix is a summary of prediction results:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
- Accuracy, Precision, Recall, F1 Score are all derived from this table.
PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Random Forest Confusion Matrix')
plt.show()

What is an ROC Curve?
The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.
PYTHON
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve')
plt.legend()
plt.show()

Content from 10 Random Forest Optimization
Last updated on 2025-07-10
Optimising a Random Forest Classifier
In this notebook, we demonstrate how to tune hyperparameters in a Random Forest Classifier to improve classification performance.
We will use the Breast Cancer Wisconsin dataset to:
- Train an initial Random Forest model.
- Explore the effects of the most important hyperparameters:
  - n_estimators: Number of trees in the forest.
  - max_depth: Maximum depth of each tree.
  - min_samples_split: Minimum number of samples required to split an internal node.
Random Forest is a powerful ensemble method for classification tasks that builds multiple decision trees and merges their outputs for improved accuracy and robustness. By tuning these hyperparameters, we can significantly improve model performance.
Step 1: Load Breast Cancer Data
PYTHON
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=31
)
# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Exploring the Effect of the n_estimators Hyperparameter
The n_estimators parameter in Random Forest controls the number of trees in the forest:
- Low n_estimators values → Fewer trees → Faster training but noisier, less stable predictions.
- High n_estimators values → More trees → Usually better and more stable performance, with diminishing returns and increased computation.
We will train Random Forest models with different n_estimators values and observe how they affect accuracy.
PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Test different values of n_estimators
n_estimators_values = [1, 2, 4, 8, 10, 20, 30, 50, 70, 90]
train_scores = []
test_scores = []
for n in n_estimators_values:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(n_estimators_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(n_estimators_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('n_estimators (Number of Trees)')
plt.ylabel('Accuracy')
plt.title('Effect of n_estimators on Random Forest')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the max_depth Hyperparameter
The max_depth parameter controls the maximum depth of each tree:
- Low max_depth values → Shallow trees → Simpler models, possibly underfitting.
- High max_depth values → Deeper trees → More complex models, potentially overfitting.
We will now examine how varying max_depth affects the model’s performance, while keeping n_estimators fixed.
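When interpreting such a sweep, it can help to know how deep the trees grow when no limit is set, since max_depth values above that depth leave the forest unchanged. A minimal sketch, assuming a small forest of 20 trees purely to keep it quick:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

depth_by_setting = {}
for max_depth in [None, 3]:   # None means "grow until the leaves are pure"
    forest = RandomForestClassifier(
        n_estimators=20, max_depth=max_depth, random_state=42
    )
    forest.fit(X_train, y_train)
    # get_depth() reports the actual depth reached by each fitted tree.
    depth_by_setting[max_depth] = [t.get_depth() for t in forest.estimators_]
    print(f"max_depth={max_depth}: deepest tree = {max(depth_by_setting[max_depth])}")
```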
PYTHON
# Test different values of max_depth
max_depth_values = [2, 4, 6, 8, 10, 20]
train_scores = []
test_scores = []
for depth in max_depth_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(max_depth_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(max_depth_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Random Forest')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the min_samples_split Hyperparameter
The min_samples_split parameter controls the minimum number of samples required to split an internal node:
- Low min_samples_split values → More splits → Complex trees, potentially overfitting.
- High min_samples_split values → Fewer splits → Simpler trees, possibly underfitting.
We will now compare different min_samples_split values to observe their impact on performance.
PYTHON
# Test different values of min_samples_split
min_samples_split_values = [2, 5, 10, 20, 50]
train_scores = []
test_scores = []
for min_split in min_samples_split_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=min_split, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)
# Show numeric results
for m, tr, te in zip(min_samples_split_values, train_scores, test_scores):
    print(f"min_samples_split: {m} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")
# Optional alternative: grouped bar chart (uncomment the plt lines below to use it)
import numpy as np
x = np.arange(len(min_samples_split_values))
width = 0.35
# plt.figure(figsize=(8, 6))
# plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
# plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
# plt.xticks(x, min_samples_split_values)
# plt.xlabel('min_samples_split')
# plt.ylabel('Accuracy')
# plt.title('Effect of min_samples_split on Random Forest')
# plt.legend()
# plt.grid(True, axis='y')
# plt.show()
# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(min_samples_split_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(min_samples_split_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('min_samples_split')
plt.ylabel('Accuracy')
plt.title('Effect of min_samples_split on Random Forest')
plt.legend()
plt.grid(True)
plt.show()
SH
min_samples_split: 2 → Train Accuracy: 0.997, Test Accuracy: 0.965
min_samples_split: 5 → Train Accuracy: 0.995, Test Accuracy: 0.959
min_samples_split: 10 → Train Accuracy: 0.980, Test Accuracy: 0.947
min_samples_split: 20 → Train Accuracy: 0.972, Test Accuracy: 0.947
min_samples_split: 50 → Train Accuracy: 0.972, Test Accuracy: 0.942
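The dataset has only 569 samples, so on a single test split a difference of one or two percentage points can come down to a handful of samples. One way to get a steadier picture is to average over several folds. The sketch below uses Scikit-Learn's validation_curve for this; the choice of 5-fold cross-validation and the names param_range, cv_train, and cv_valid are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_breast_cancer(return_X_y=True)
param_range = [2, 5, 10, 20, 50]

# 5-fold cross-validation is an assumed choice for this sketch
cv_train, cv_valid = validation_curve(
    RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42),
    X,
    y,
    param_name="min_samples_split",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Rows correspond to parameter values, columns to folds
for value, tr, va in zip(param_range, cv_train.mean(axis=1), cv_valid.mean(axis=1)):
    print(f"min_samples_split: {value} → CV train: {tr:.3f}, CV validation: {va:.3f}")
```

The per-fold spread (for example cv_valid.std(axis=1)) gives a rough sense of how much of the difference between settings is noise.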

Changing the Classification Threshold
Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:
- If predicted probability ≥ 0.5 → classify as positive
- Else → classify as negative
Changing the Threshold:
- Lower threshold → more positives predicted → higher recall, more false positives
- Higher threshold → fewer positives predicted → higher precision, more false negatives
Choosing the right threshold depends on your application’s goals.
We’ll now visualise how the confusion matrix changes across three different thresholds.
PYTHON
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
thresholds = [0.3, 0.5, 0.7]  # Thresholds to compare
# Train Random Forest with probability estimates
model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=5, random_state=42)
model.fit(X_train, y_train)
# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1] # Probability of class 1
# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, len(thresholds), figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
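The precision and recall trade-off can also be read off numerically. The sketch below computes both metrics at each threshold. It is self-contained, so it reloads the data and makes its own split: the test_size, stratify, and random_state values, and the names clf, proba_pos, and threshold_metrics, are assumptions for illustration and may differ from the split used earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Assumed split for illustration; the workshop's own split may differ
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_split=5, random_state=42
)
clf.fit(X_tr, y_tr)

# Probability of class 1 (in this dataset, class 1 is "benign")
proba_pos = clf.predict_proba(X_te)[:, 1]

threshold_metrics = []  # (threshold, precision, recall) tuples
for thresh in [0.3, 0.5, 0.7]:
    pred = (proba_pos >= thresh).astype(int)
    precision = precision_score(y_te, pred, zero_division=0)
    recall = recall_score(y_te, pred)
    threshold_metrics.append((thresh, precision, recall))
    print(f"Threshold: {thresh} → Precision: {precision:.3f}, Recall: {recall:.3f}")
```

Recall for class 1 can only stay the same or fall as the threshold rises, because fewer samples are labelled positive. Note that in this dataset class 1 is benign and class 0 is malignant, so if the goal is to avoid missing malignant cases, the metrics for class 0 are the ones to watch.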
