Last updated on 2025-07-10

Introduction to Classification

What is Classification?

Classification is a type of supervised learning where the goal is to predict categorical class labels. Given input data, a classification model attempts to assign it to one of several predefined classes.

Some examples include:
- Email spam detection (spam vs. not spam)
- Disease diagnosis (positive vs. negative)
- Image recognition (cat, dog, or other)

Workshop Goals

By the end of this workshop, you will be able to:
- Understand common classification algorithms
- Apply them using Scikit-Learn, NumPy, Pandas, and Matplotlib
- Evaluate and optimise models

Topics Covered

Logistic Regression
Support Vector Machines (SVM)
Model Evaluation: Accuracy, Precision, Recall, F1-Score, ROC-AUC
Neural Networks (MLPClassifier)
Random Forest Classifier
Optimisation and Tuning

Required Libraries

We will use the following Python libraries throughout the workshop:
- NumPy: numerical operations
- Pandas: data manipulation
- Scikit-Learn: machine learning models and tools
- Matplotlib: data visualisation
- Seaborn: statistical data visualisation built on Matplotlib

Let’s get started! 🚀

Installing Libraries

Uncomment and run the commands below only if packages are not installed.

PYTHON

# !pip install numpy
# !pip install pandas
# !pip install scikit-learn
# !pip install matplotlib
# !pip install seaborn

Check that your environment has the necessary libraries installed:

PYTHON

import numpy
print("NumPy version:", numpy.__version__)

import pandas
print("Pandas version:", pandas.__version__)

import sklearn
print("sklearn version:", sklearn.__version__)

import matplotlib
print("matplotlib version:", matplotlib.__version__)

import seaborn
print("seaborn version:", seaborn.__version__)

SH

NumPy version: 2.2.6
Pandas version: 2.2.3
sklearn version: 1.7.0
matplotlib version: 3.10.3
seaborn version: 0.13.2

PYTHON

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

Preview Example Dataset

We use the load_breast_cancer() dataset from Scikit-Learn. It includes 30 numeric features extracted from breast mass images.

PYTHON

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
X = data.data
y = data.target

df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	…	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	…	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	…	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	…	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	…	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	…	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns

PYTHON

df.describe()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	…	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension	target
count	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	…	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000
mean	14.127292	19.289649	91.969033	654.889104	0.096360	0.104341	0.088799	0.048919	0.181162	0.062798	…	25.677223	107.261213	880.583128	0.132369	0.254265	0.272188	0.114606	0.290076	0.083946	0.627417
std	3.524049	4.301036	24.298981	351.914129	0.014064	0.052813	0.079720	0.038803	0.027414	0.007060	…	6.146258	33.602542	569.356993	0.022832	0.157336	0.208624	0.065732	0.061867	0.018061	0.483918
min	6.981000	9.710000	43.790000	143.500000	0.052630	0.019380	0.000000	0.000000	0.106000	0.049960	…	12.020000	50.410000	185.200000	0.071170	0.027290	0.000000	0.000000	0.156500	0.055040	0.000000
25%	11.700000	16.170000	75.170000	420.300000	0.086370	0.064920	0.029560	0.020310	0.161900	0.057700	…	21.080000	84.110000	515.300000	0.116600	0.147200	0.114500	0.064930	0.250400	0.071460	0.000000
50%	13.370000	18.840000	86.240000	551.100000	0.095870	0.092630	0.061540	0.033500	0.179200	0.061540	…	25.410000	97.660000	686.500000	0.131300	0.211900	0.226700	0.099930	0.282200	0.080040	1.000000
75%	15.780000	21.800000	104.100000	782.700000	0.105300	0.130400	0.130700	0.074000	0.195700	0.066120	…	29.720000	125.400000	1084.000000	0.146000	0.339100	0.382900	0.161400	0.317900	0.092080	1.000000
max	28.110000	39.280000	188.500000	2501.000000	0.163400	0.345400	0.426800	0.201200	0.304000	0.097440	…	49.540000	251.200000	4254.000000	0.222600	1.058000	1.252000	0.291000	0.663800	0.207500	1.000000

8 rows × 31 columns
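
The target column above has a mean of about 0.627, which only makes sense once the label encoding is known. The following is a minimal sketch for checking how the classes are encoded and how many samples each has; the variable names `counts` and `benign_fraction` are illustrative and not part of the workshop code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples carry each integer label
counts = np.bincount(data.target)

# target_names[i] is the human-readable name of label i
for label, name in enumerate(data.target_names):
    print(label, name, counts[label])

# Fraction of samples in class 1; this is the mean of the target column
benign_fraction = counts[1] / counts.sum()
print("Fraction of class 1:", round(benign_fraction, 6))
```

In this dataset label 1 means benign, so metrics such as precision and recall, which treat label 1 as the positive class by default, describe the benign class unless `pos_label` is changed.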

PYTHON

df.hist(bins=20, figsize=(15, 10))
plt.tight_layout()

PYTHON

from sklearn.preprocessing import StandardScaler

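# Apply StandardScaler so that every column has mean 0 and unit variance.
# Note: this also rescales the 0/1 'target' column, which is done here only
# for illustration; when modelling, scale the features only.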
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
df_scaled.describe()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	…	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension	target
count	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	…	5.690000e+02	5.690000e+02	569.000000	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02	5.690000e+02
mean	-1.373633e-16	6.868164e-17	-1.248757e-16	-2.185325e-16	-8.366672e-16	1.873136e-16	4.995028e-17	-4.995028e-17	1.748260e-16	4.745277e-16	…	1.248757e-17	-3.746271e-16	0.000000	-2.372638e-16	-3.371644e-16	7.492542e-17	2.247763e-16	2.622390e-16	-5.744282e-16	-4.995028e-17
std	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	…	1.000880e+00	1.000880e+00	1.000880	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00	1.000880e+00
min	-2.029648e+00	-2.229249e+00	-1.984504e+00	-1.454443e+00	-3.112085e+00	-1.610136e+00	-1.114873e+00	-1.261820e+00	-2.744117e+00	-1.819865e+00	…	-2.223994e+00	-1.693361e+00	-1.222423	-2.682695e+00	-1.443878e+00	-1.305831e+00	-1.745063e+00	-2.160960e+00	-1.601839e+00	-1.297676e+00
25%	-6.893853e-01	-7.259631e-01	-6.919555e-01	-6.671955e-01	-7.109628e-01	-7.470860e-01	-7.437479e-01	-7.379438e-01	-7.032397e-01	-7.226392e-01	…	-7.486293e-01	-6.895783e-01	-0.642136	-6.912304e-01	-6.810833e-01	-7.565142e-01	-7.563999e-01	-6.418637e-01	-6.919118e-01	-1.297676e+00
50%	-2.150816e-01	-1.046362e-01	-2.359800e-01	-2.951869e-01	-3.489108e-02	-2.219405e-01	-3.422399e-01	-3.977212e-01	-7.162650e-02	-1.782793e-01	…	-4.351564e-02	-2.859802e-01	-0.341181	-4.684277e-02	-2.695009e-01	-2.182321e-01	-2.234689e-01	-1.274095e-01	-2.164441e-01	7.706085e-01
75%	4.693926e-01	5.841756e-01	4.996769e-01	3.635073e-01	6.361990e-01	4.938569e-01	5.260619e-01	6.469351e-01	5.307792e-01	4.709834e-01	…	6.583411e-01	5.402790e-01	0.357589	5.975448e-01	5.396688e-01	5.311411e-01	7.125100e-01	4.501382e-01	4.507624e-01	7.706085e-01
max	3.971288e+00	4.651889e+00	3.976130e+00	5.250529e+00	4.770911e+00	4.568425e+00	4.243589e+00	3.927930e+00	4.484751e+00	4.910919e+00	…	3.885905e+00	4.287337e+00	5.930172	3.955374e+00	5.112877e+00	4.700669e+00	2.685877e+00	6.046041e+00	6.846856e+00	7.706085e-01

8 rows × 31 columns
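
The std row above shows 1.000880 rather than exactly 1. A minimal sketch of why: StandardScaler divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1), so the two differ by a factor of sqrt(n / (n - 1)). The names `z_manual`, `z_sklearn`, and `sample_std` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
n = X.shape[0]

# Manual z-score with the population standard deviation (NumPy default, ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)
print("Manual z-score matches StandardScaler:", np.allclose(z_manual, z_sklearn))

# pandas .std() uses ddof=1, which is what describe() reports
sample_std = pd.DataFrame(z_sklearn).std().iloc[0]
print("pandas std of a scaled column:", round(sample_std, 6))
print("sqrt(n / (n - 1)):", round(np.sqrt(n / (n - 1)), 6))
```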

Content from 02 Logistic Regression


Logistic Regression with Breast Cancer Dataset

This notebook demonstrates how to use Logistic Regression, a fundamental classification algorithm, to predict whether a tumor is malignant or benign using the Breast Cancer Wisconsin dataset.

What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for binary classification.

It models the probability that an input $\mathbf{x}$ belongs to class $y=1$ using the logistic (sigmoid) function:

\[ P(y=1 | \mathbf{x}) = \frac{1}{1 + e^{- (\mathbf{w}^T \mathbf{x} + b)}} \]

The output is a probability between 0 and 1. We classify an observation as class 1 if the predicted probability exceeds a threshold (typically 0.5).
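
The formula and threshold rule above can be sketched directly in NumPy. The weights `w`, bias `b`, and inputs `X_demo` below are made-up values for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])   # hypothetical weight vector
b = 0.25                    # hypothetical bias term
X_demo = np.array([[1.0, 0.5],
                   [0.0, 1.0],
                   [2.0, 0.0]])

# P(y=1 | x) for each row, then apply the 0.5 threshold
proba = sigmoid(X_demo @ w + b)
labels = (proba >= 0.5).astype(int)
print(proba.round(3), labels)
```

Because the sigmoid equals 0.5 exactly when its argument is 0, thresholding the probability at 0.5 is the same as checking the sign of $\mathbf{w}^T \mathbf{x} + b$.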

Step 1: Load and Explore the Data

We use the load_breast_cancer() dataset from Scikit-Learn. It includes 30 numeric features extracted from breast mass images.

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

Step 2: Split and Scale the Data

PYTHON

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train the Model

The data were already split into training and test sets above, so here we fit a logistic regression model on the training set.

PYTHON

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
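
As an alternative way to organise the same steps, scaling and model fitting can be bundled into a single estimator. This is a minimal sketch using scikit-learn's `make_pipeline`; it is not how the workshop code is structured, and the names `pipe`, `X_tr`, `X_te`, and `pipe_test_acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=31)

# The scaler is fitted on the training data only, then reused at predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pipe.fit(X_tr, y_tr)

pipe_test_acc = pipe.score(X_te, y_te)
print("Test accuracy:", pipe_test_acc)
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on test data, which matters once cross-validation is involved.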

Step 4: Evaluate the Model

Metrics Used:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
- ROC Curve and AUC

PYTHON

from sklearn.metrics import (
    classification_report,
    roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9766081871345029
Precision: 0.981651376146789
Recall: 0.981651376146789
F1 Score: 0.981651376146789

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        62
           1       0.98      0.98      0.98       109

    accuracy                           0.98       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix is a table used to describe the performance of a classification model. For binary classification:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
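
The formulas above can be checked by hand on a small example. This sketch uses made-up labels (`y_true_demo`, `y_pred_demo`), not the workshop's predictions. Note that for 0/1 labels scikit-learn's `confusion_matrix` orders rows and columns by label value, so it returns [[TN, FP], [FN, TP]], which is a different layout from the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # made-up ground truth
y_pred_demo = np.array([1, 1, 0, 0, 0, 1, 1, 0])  # made-up predictions

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1)
```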

PYTHON

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Confusion Matrix')
plt.show()

What is an ROC Curve?

An ROC Curve (Receiver Operating Characteristic) plots:

True Positive Rate (Recall) on the Y-axis
False Positive Rate (1 - Specificity) on the X-axis

Each point on the curve corresponds to a different classification threshold. A model with perfect classification has a point in the top-left corner.

AUC (Area Under Curve) summarizes the ROC curve into a single value between 0 and 1:
- AUC = 1: Perfect classifier
- AUC = 0.5: No better than random guessing

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
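
To make the idea of "one point per threshold" concrete, the following sketch rebuilds an ROC curve by hand on four made-up scores and integrates it with the trapezoid rule. It is a simplified illustration, not how `roc_curve` is implemented internally; all variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])  # made-up predicted probabilities

# Sweep thresholds from "predict nothing positive" down to the lowest score
thresholds = np.r_[np.inf, np.sort(scores_demo)[::-1]]
tpr_pts, fpr_pts = [], []
for t in thresholds:
    pred_pos = scores_demo >= t
    tpr_pts.append(np.sum(pred_pos & (y_true_demo == 1)) / np.sum(y_true_demo == 1))
    fpr_pts.append(np.sum(pred_pos & (y_true_demo == 0)) / np.sum(y_true_demo == 0))
tpr_pts, fpr_pts = np.array(tpr_pts), np.array(fpr_pts)

# Trapezoid rule: area under TPR as a function of FPR
manual_auc = np.sum(np.diff(fpr_pts) * (tpr_pts[1:] + tpr_pts[:-1]) / 2)
print("Manual AUC:", manual_auc)
print("roc_auc_score:", roc_auc_score(y_true_demo, scores_demo))
```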

Content from 03 Logistic Regression Optimization


Optimising a Logistic Regression Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Logistic Regression model to improve performance.

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 2: Define a Function to Train and Evaluate

This function will:
- Train the model
- Return training and test accuracy

PYTHON

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

def train_and_evaluate(max_iter=5000, C=1, penalty='l2', solver='liblinear'):
    model = LogisticRegression(max_iter=max_iter, C=C, penalty=penalty, solver=solver)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc

Step 3: Explore Effect of C (Inverse of Regularization Strength)

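C controls the amount of regularization. Because it is the inverse of the regularization strength, its effect runs opposite to what the name might suggest:

Smaller C → Stronger regularization → Helps prevent overfitting but may underfit.

Larger C → Weaker regularization → Fits the training data more closely but may overfit.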
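
One way to see the regularization at work is to look at the size of the fitted coefficients. This is a minimal sketch, assuming the default L2 penalty and fitting on the full scaled dataset purely for illustration; `coef_norms` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # illustration only: no train/test split

coef_norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_std, y)
    # Euclidean length of the weight vector: smaller means stronger shrinkage
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: ||w|| = {norm:.3f}")
```

Smaller values of C should produce a shorter weight vector, which is the shrinkage that stronger regularization refers to.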

PYTHON

import matplotlib.pyplot as plt

values = [ 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000 ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(C=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()
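
Choosing C by reading test accuracy off a plot means the test set is no longer an unbiased check. A common alternative, sketched here and not part of the workshop's own code, is to pick C by cross-validation on the training set with `GridSearchCV`. The names `pipe`, `search`, `best_C`, and `held_out_acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=31)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", max_iter=5000),
)

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_C = search.best_params_["logisticregression__C"]
held_out_acc = search.score(X_te, y_te)  # test set is used once, at the end
print("Best C:", best_C, "Held-out accuracy:", held_out_acc)
```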

Step 4: Explore Effect of `penalty` (L1, L2 Regularization)

Penalty	Key Effect	When to Use
L1	Sparsity, feature selection	You want simpler models or auto-feature selection
L2	Shrinkage, no sparsity	You want to reduce overfitting without dropping features

PYTHON

import matplotlib.pyplot as plt

values = [ 'l1', 'l2' ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(penalty=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('penalty')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()
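
Accuracy alone does not show the sparsity effect listed in the table. A minimal sketch, assuming the same `penalty` and `solver` arguments used above and an arbitrarily chosen C=0.1, counts how many coefficients each penalty sets exactly to zero. It fits on the full scaled dataset for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # illustration only: no train/test split

clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)
clf_l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_std, y)

# Coefficients that are exactly zero correspond to features the model ignores
zeros_l1 = int(np.sum(clf_l1.coef_ == 0))
zeros_l2 = int(np.sum(clf_l2.coef_ == 0))
print("Zero coefficients with L1:", zeros_l1, "of", clf_l1.coef_.size)
print("Zero coefficients with L2:", zeros_l2, "of", clf_l2.coef_.size)
```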

Step 5: Explore Effect of `max_iter`

Scenario	Effect
Too low	Model may not converge → You’ll see a ConvergenceWarning. With the liblinear solver used here it reads “Liblinear failed to converge, increase the number of iterations.”; other solvers report messages like “STOP: TOTAL NO. OF F, G EVALUATIONS EXCEEDS LIMIT”. Model coefficients may be inaccurate or unstable.
Sufficient (or slightly high)	Model converges properly. No warnings. Coefficients stabilize at optimal values.
Very high (but converges early)	No harm: most solvers stop automatically once convergence is reached (before hitting `max_iter`). However, unnecessarily large values can increase training time for very large datasets.
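
Convergence can also be checked programmatically instead of by reading the console. This sketch captures warnings during fitting and inspects the fitted model's `n_iter_` attribute; it fits on the full scaled dataset for illustration, and `caught` and `hit_limit` are illustrative names.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # illustration only: no train/test split

# Record warnings raised while fitting with a deliberately small max_iter
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf = LogisticRegression(solver="liblinear", max_iter=3).fit(X_std, y)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarning raised:", hit_limit)
print("Iterations used:", clf.n_iter_)
```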

PYTHON

import matplotlib.pyplot as plt

values = [3, 4, 5, 6, 7, 10, 100 ]
train_scores, test_scores = [], []

for v in values:
    tr, te = train_and_evaluate(max_iter=v)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('max_iter')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

SH

C:\Users\moji1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\sklearn\svm\_base.py:1250: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(

Changing the Classification Threshold

Most classifiers output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:

If predicted probability ≥ 0.5 → classify as positive
Else → classify as negative

Changing the Threshold:

Lower threshold → more positives predicted → higher recall, more false positives
Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Retrain model
manual_model = LogisticRegression()
manual_model.fit(X_train, y_train)

# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
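
The same trade-off can be read as numbers rather than from the plots. This is a minimal sketch that recomputes precision and recall at each threshold, assuming the same split and scaling as above; `threshold_metrics` and the other names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=31)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
proba = clf.predict_proba(scaler.transform(X_te))[:, 1]

threshold_metrics = {}
for t in [0.3, 0.5, 0.7]:
    pred = (proba >= t).astype(int)
    threshold_metrics[t] = (precision_score(y_te, pred), recall_score(y_te, pred))
    print(f"threshold={t}: precision={threshold_metrics[t][0]:.3f}, "
          f"recall={threshold_metrics[t][1]:.3f}")
```

Recall can only stay the same or fall as the threshold rises, because every sample predicted positive at a higher threshold is also predicted positive at a lower one. Precision usually rises but is not guaranteed to.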

Content from 04 Svm


Support Vector Machine (SVM) with Breast Cancer Dataset

This notebook demonstrates the use of Support Vector Machines (SVM) for classifying tumors in the Breast Cancer dataset.

What is an SVM?

Support Vector Machines are powerful supervised learning models for classification. An SVM finds the hyperplane that best separates data points from two classes.

It maximizes the margin, which is the distance between the hyperplane and the nearest points from each class (support vectors).
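
For a linear SVM with weight vector w, the distance from the separating hyperplane to the nearest points is 1 / ||w||. The sketch below checks this on a made-up, linearly separable toy set; a very large C is used as an assumption to approximate a hard margin, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated by a gap of width 3 along the first axis
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm.coef_[0]                      # normal vector of the hyperplane
margin = 1.0 / np.linalg.norm(w)      # distance from hyperplane to nearest points
print("w:", w.round(3), "intercept:", svm.intercept_.round(3))
print("Margin (distance to nearest points):", round(margin, 3))
print("Number of support vectors:", len(svm.support_vectors_))
```

Here the hyperplane should sit roughly midway between the two groups, so the margin comes out close to 1.5, half of the gap.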

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

Step 2: Split and Scale the Data

PYTHON

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train an SVM Model

We use the SVC class from sklearn.svm with the default kernel (RBF). Setting probability=True makes predict_proba available on the fitted model.

PYTHON

from sklearn.svm import SVC

model = SVC(probability=True)
model.fit(X_train, y_train)

#sk-container-id-1 { /* Definition of color scheme common for light and dark mode / –sklearn-color-text: #000; –sklearn-color-text-muted: #666; –sklearn-color-line: gray; / Definition of color scheme for unfitted estimators / –sklearn-color-unfitted-level-0: #fff5e6; –sklearn-color-unfitted-level-1: #f6e4d2; –sklearn-color-unfitted-level-2: #ffe0b3; –sklearn-color-unfitted-level-3: chocolate; / Definition of color scheme for fitted estimators */ –sklearn-color-fitted-level-0: #f0f8ff; –sklearn-color-fitted-level-1: #d4ebff; –sklearn-color-fitted-level-2: #b3dbfd; –sklearn-color-fitted-level-3: cornflowerblue;

/* Specific color for light theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, white))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, black))); –sklearn-color-icon: #696969;

@media (prefers-color-scheme: dark) { /* Redefinition of color scheme for dark theme */ –sklearn-color-text-on-default-background: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-background: var(–sg-background-color, var(–theme-background, var(–jp-layout-color0, #111))); –sklearn-color-border-box: var(–sg-text-color, var(–theme-code-foreground, var(–jp-content-font-color1, white))); –sklearn-color-icon: #878787; } }

#sk-container-id-1 { color: var(–sklearn-color-text); }

#sk-container-id-1 pre { padding: 0; }

#sk-container-id-1 input.sk-hidden–visually { border: 0; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px); height: 1px; margin: -1px; overflow: hidden; padding: 0; position: absolute; width: 1px; }

#sk-container-id-1 div.sk-dashed-wrapped { border: 1px dashed var(–sklearn-color-line); margin: 0 0.4em 0.5em 0.4em; box-sizing: border-box; padding-bottom: 0.4em; background-color: var(–sklearn-color-background); }

#sk-container-id-1 div.sk-container { /* jupyter’s normalize.less sets [hidden] { display: none; } but bootstrap.min.css set [hidden] { display: none !important; } so we also need the !important here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */ display: inline-block !important; position: relative; }

#sk-container-id-1 div.sk-text-repr-fallback { display: none; }

div.sk-parallel-item, div.sk-serial, div.sk-item { /* draw centered vertical line to link estimators */ background-image: linear-gradient(var(–sklearn-color-text-on-default-background), var(–sklearn-color-text-on-default-background)); background-size: 2px 100%; background-repeat: no-repeat; background-position: center center; }

/* Parallel-specific style estimator block */

#sk-container-id-1 div.sk-parallel-item::after { content: ““; width: 100%; border-bottom: 2px solid var(–sklearn-color-text-on-default-background); flex-grow: 1; }

#sk-container-id-1 div.sk-parallel { display: flex; align-items: stretch; justify-content: center; background-color: var(–sklearn-color-background); position: relative; }

#sk-container-id-1 div.sk-parallel-item { display: flex; flex-direction: column; }

#sk-container-id-1 div.sk-parallel-item:first-child::after { align-self: flex-end; width: 50%; }

#sk-container-id-1 div.sk-parallel-item:last-child::after { align-self: flex-start; width: 50%; }

#sk-container-id-1 div.sk-parallel-item:only-child::after { width: 0; }

/* Serial-specific style estimator block */

#sk-container-id-1 div.sk-serial { display: flex; flex-direction: column; align-items: center; background-color: var(–sklearn-color-background); padding-right: 1em; padding-left: 1em; }

/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is clickable and can be expanded/collapsed. - Pipeline and ColumnTransformer use this feature and define the default style - Estimators will overwrite some part of the style using the sk-estimator class */

/* Pipeline and ColumnTransformer style (default) */

#sk-container-id-1 div.sk-toggleable { /* Default theme specific background. It is overwritten whether we have a specific estimator or a Pipeline/ColumnTransformer */ background-color: var(–sklearn-color-background); }

/* Toggleable label */ #sk-container-id-1 label.sk-toggleable__label { cursor: pointer; display: flex; width: 100%; margin-bottom: 0; padding: 0.5em; box-sizing: border-box; text-align: center; align-items: start; justify-content: space-between; gap: 0.5em; }

#sk-container-id-1 label.sk-toggleable__label .caption { font-size: 0.6rem; font-weight: lighter; color: var(–sklearn-color-text-muted); }

#sk-container-id-1 label.sk-toggleable__label-arrow:before { /* Arrow on the left of the label */ content: “▸”; float: left; margin-right: 0.25em; color: var(–sklearn-color-icon); }

#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before { color: var(–sklearn-color-text); }

/* Toggleable content - dropdown */

#sk-container-id-1 div.sk-toggleable__content { display: none; text-align: left; /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content pre { margin: 0.2em; border-radius: 0.25em; color: var(–sklearn-color-text); /* unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-toggleable__content.fitted pre { /* unfitted */ background-color: var(–sklearn-color-fitted-level-0); }

#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content { /* Expand drop-down */ display: block; width: 100%; overflow: visible; }

#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before { content: “▾”; }

/* Pipeline/ColumnTransformer-specific style */

#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { background-color: var(–sklearn-color-fitted-level-2); }

/* Estimator-specific style */

/* Colorize estimator box */ #sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label { /* unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }

#sk-container-id-1 div.sk-label label.sk-toggleable__label, #sk-container-id-1 div.sk-label label { /* The background is the default theme color */ color: var(–sklearn-color-text-on-default-background); }

/* On hover, darken the color of the background */ #sk-container-id-1 div.sk-label:hover label.sk-toggleable__label { color: var(–sklearn-color-text); background-color: var(–sklearn-color-unfitted-level-2); }

/* Label box, darken color on hover, fitted */ #sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted { color: var(–sklearn-color-text); background-color: var(–sklearn-color-fitted-level-2); }

/* Estimator label */

#sk-container-id-1 div.sk-label label { font-family: monospace; font-weight: bold; display: inline-block; line-height: 1.2em; }

#sk-container-id-1 div.sk-label-container { text-align: center; }

/* Estimator-specific / #sk-container-id-1 div.sk-estimator { font-family: monospace; border: 1px dotted var(–sklearn-color-border-box); border-radius: 0.25em; box-sizing: border-box; margin-bottom: 0.5em; / unfitted */ background-color: var(–sklearn-color-unfitted-level-0); }

#sk-container-id-1 div.sk-estimator.fitted { /* fitted */ background-color: var(–sklearn-color-fitted-level-0); }

/* on hover / #sk-container-id-1 div.sk-estimator:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-2); }

#sk-container-id-1 div.sk-estimator.fitted:hover { /* fitted */ background-color: var(–sklearn-color-fitted-level-2); }

/* Specification for estimator info (e.g. “i” and “?”) */

/* Common style for “i” and “?” */

.sk-estimator-doc-link, a:link.sk-estimator-doc-link, a:visited.sk-estimator-doc-link { float: right; font-size: smaller; line-height: 1em; font-family: monospace; background-color: var(–sklearn-color-background); border-radius: 1em; height: 1em; width: 1em; text-decoration: none !important; margin-left: 0.5em; text-align: center; /* unfitted */ border: var(–sklearn-color-unfitted-level-1) 1pt solid; color: var(–sklearn-color-unfitted-level-1); }

.sk-estimator-doc-link.fitted, a:link.sk-estimator-doc-link.fitted, a:visited.sk-estimator-doc-link.fitted { /* fitted */ border: var(–sklearn-color-fitted-level-1) 1pt solid; color: var(–sklearn-color-fitted-level-1); }

/* On hover / div.sk-estimator:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover, div.sk-label-container:hover .sk-estimator-doc-link:hover, .sk-estimator-doc-link:hover { / unfitted */ background-color: var(–sklearn-color-unfitted-level-3); color: var(–sklearn-color-background); text-decoration: none; }

div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover, .sk-estimator-doc-link.fitted:hover, div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover, .sk-estimator-doc-link.fitted:hover { /* fitted */ background-color: var(–sklearn-color-fitted-level-3); color: var(–sklearn-color-background); text-decoration: none; }

SVC(probability=True)

Parameters

	C	1.0
	kernel	‘rbf’
	degree	3
	gamma	‘scale’
	coef0	0.0
	shrinking	True
	probability	True
	tol	0.001
	cache_size	200
	class_weight	None
	verbose	False
	max_iter	-1
	decision_function_shape	‘ovr’
	break_ties	False
	random_state	None


Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9766081871345029
Precision: 0.972972972972973
Recall: 0.9908256880733946
F1 Score: 0.9818181818181818

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.97        62
           1       0.97      0.99      0.98       109

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Accuracy, Precision, Recall, F1 Score are all derived from this table.
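As a worked check, the four counts can be turned into the metrics by hand. The counts below (TP = 108, FN = 1, FP = 3, TN = 59) are inferred from the SVM metrics printed in Step 4, not read from a plotted matrix, so treat them as an assumption; the variable names are illustrative. Note that scikit-learn's `confusion_matrix` lays out a binary matrix as [[TN, FP], [FN, TP]], which differs from the layout of the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed counts, inferred from the reported SVM metrics (171 test samples)
tp, fn, fp, tn = 108, 1, 3, 59

# Rebuild label vectors that produce exactly these counts
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# For binary labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn_, fp_, fn_, tp_ = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp_ + tn_) / (tp_ + tn_ + fp_ + fn_)
precision = tp_ / (tp_ + fp_)
recall = tp_ / (tp_ + fn_)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

With these assumed counts, the hand-computed values agree with the metrics printed in Step 4.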

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('SVM Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM ROC Curve')
plt.legend()
plt.show()
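One way to read the AUC is as a ranking score: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses a tiny hand-made dataset (an illustrative assumption, not the workshop data, and with no tied scores) to show that this pairwise view agrees with the area computed from `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy data (illustrative): two negatives and two positives with model scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve via the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# Ranking view: fraction of (positive, negative) pairs where the positive scores higher
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = float(np.mean([p > n for p in pos for n in neg]))

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC (trapezoid):", area)
print("AUC (pairwise ranking):", pairwise)
print("roc_auc_score:", roc_auc_score(y_true, y_score))
```

Three of the four positive/negative pairs are ranked correctly here, so all three computations give 0.75.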

Content from 05 Svm Optimization

Last updated on 2025-07-10

Optimising a Support Vector Machine (SVM) Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Support Vector Machine (SVM) classifier to improve classification performance.

We will use the Breast Cancer Wisconsin dataset to:
- Train an initial SVM model.
- Explore the effects of the most important hyperparameters:
  - C: Regularization parameter controlling the trade-off between a wide, simple margin and classifying every training point correctly.
  - gamma: Kernel coefficient controlling how much influence each data point has on the decision boundary.
  - kernel: Defines the type of decision boundary (linear or nonlinear), with options like 'linear', 'rbf', and 'poly'.

Support Vector Machines are powerful models for classification tasks and can handle both linear and non-linear relationships through the use of kernel functions. By tuning these hyperparameters, we can significantly improve model performance.

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exploring the Effect of the `C` Hyperparameter

The C parameter in SVM controls the regularization strength:
- Low C values → More regularization → Smoother decision boundary, possibly underfitting.
- High C values → Less regularization → Fits the training data more closely, potentially overfitting.

We will train SVM models with different C values and observe how they affect training and test accuracy.

PYTHON

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Test different values of C
C_values = [0.01, 0.1, 1, 10, 100]
train_scores = []
test_scores = []

for C in C_values:
    model = SVC(C=C, kernel='rbf', gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(C_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(C_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('C (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Parameter C on SVM')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the `gamma` Hyperparameter

The gamma parameter controls the influence of each individual training example:
- Low gamma values → Far-reaching influence → Smoother decision boundary, possibly underfitting.
- High gamma values → Very localized influence → Complex decision boundary, potentially overfitting.

We will now examine how varying gamma affects the model’s performance, while keeping C fixed.

PYTHON

# Test different values of gamma
gamma_values = [0.001, 0.01, 0.1, 1, 10]
train_scores = []
test_scores = []

for gamma in gamma_values:
    model = SVC(C=1, kernel='rbf', gamma=gamma)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.semilogx(gamma_values, train_scores, marker='o', label='Train Accuracy')
plt.semilogx(gamma_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('gamma (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of gamma on SVM')
plt.legend()
plt.grid(True)
plt.show()
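To make "influence" concrete: the RBF kernel measures the similarity of two points as K(a, b) = exp(-gamma * ||a - b||^2). The sketch below evaluates it for two illustrative points (an assumption, not taken from the dataset) at distance 1, using the same gamma values as above, and checks the manual formula against `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points at Euclidean distance 1 from each other
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

sims = {}
for gamma in [0.001, 0.01, 0.1, 1, 10]:
    # RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
    manual = float(np.exp(-gamma * np.sum((a - b) ** 2)))
    sims[gamma] = float(rbf_kernel(a, b, gamma=gamma)[0, 0])
    print(f"gamma={gamma:<6} similarity={sims[gamma]:.5f} (manual: {manual:.5f})")
```

As gamma grows, the similarity between the two points falls towards zero, so each training example only affects predictions in its immediate neighbourhood.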

Exploring the Effect of the `kernel` Hyperparameter

The kernel parameter in SVM specifies the type of transformation applied to the input data:
- 'linear' → No transformation; linear decision boundary.
- 'rbf' (Radial Basis Function) → Maps data into higher dimensions for non-linear boundaries.
- 'poly' → Polynomial transformations, allowing more flexible decision boundaries depending on degree.

We will now compare different kernel types.

PYTHON

# Test different kernel types
kernel_types = ['linear', 'rbf', 'poly']
train_scores = []
test_scores = []

for kernel in kernel_types:
    model = SVC(C=1, kernel=kernel, gamma='scale')
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Show numeric results
for k, tr, te in zip(kernel_types, train_scores, test_scores):
    print(f"Kernel: {k} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")

# Plot the results
import numpy as np

x = np.arange(len(kernel_types))
width = 0.35

plt.figure(figsize=(8, 6))
plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
plt.xticks(x, kernel_types)
plt.xlabel('Kernel Type')
plt.ylabel('Accuracy')
plt.title('Comparison of SVM Kernels')
plt.legend()
plt.grid(True, axis='y')
plt.show()

SH

Kernel: linear → Train Accuracy: 0.987, Test Accuracy: 0.988
Kernel: rbf → Train Accuracy: 0.985, Test Accuracy: 0.977
Kernel: poly → Train Accuracy: 0.910, Test Accuracy: 0.883

Changing the Classification Threshold

Many classifiers can output a probability between 0 and 1 for the positive class. By default, the threshold for classification is 0.5. This means:

If predicted probability ≥ 0.5 → classify as positive
Else → classify as negative

Changing the Threshold:

Lower threshold → more positives predicted → higher recall, more false positives
Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Train SVM with probability estimates enabled
model = SVC(C=1, kernel='rbf', gamma='scale', probability=True)
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
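The precision/recall trade-off can also be tabulated directly. The sketch below uses a small hand-made set of labels and predicted probabilities (illustrative assumptions, not model output) and the same three thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_proba = np.array([0.10, 0.35, 0.45, 0.60, 0.25, 0.40, 0.55, 0.65, 0.80, 0.95])

results = {}
for thresh in [0.3, 0.5, 0.7]:
    # Apply the threshold manually instead of calling predict()
    y_pred = (y_proba >= thresh).astype(int)
    results[thresh] = (
        float(precision_score(y_true, y_pred)),
        float(recall_score(y_true, y_pred)),
    )
    print(f"Threshold {thresh}: precision={results[thresh][0]:.3f}, "
          f"recall={results[thresh][1]:.3f}")
```

In this toy example, raising the threshold increases precision and lowers recall at every step. On real data the overall trend is the same, although precision does not have to rise monotonically.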

Content from 06 Model Evaluation

Last updated on 2025-07-10

Model Evaluation: Comparing Logistic Regression and SVM

We compare two classifiers:
- Logistic Regression
- SVM

Using metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, and ROC-AUC.

Step 1: Load the Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 2: Train the Models

PYTHON

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

lr_model = LogisticRegression(max_iter=5000)
lr_model.fit(X_train, y_train)

svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)

Step 3: Evaluation Metrics

What is Accuracy?

Accuracy is the proportion of correct predictions:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Accuracy can be misleading when the classes are imbalanced.

PYTHON

from sklearn.metrics import accuracy_score

lr_acc = accuracy_score(y_test, lr_model.predict(X_test))
svm_acc = accuracy_score(y_test, svm_model.predict(X_test))

print(f"Logistic Regression Accuracy: {lr_acc:.2f}")
print(f"SVM Accuracy: {svm_acc:.2f}")

SH

Logistic Regression Accuracy: 0.98
SVM Accuracy: 0.98
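The caveat about imbalanced classes can be illustrated with a deliberately useless "classifier" that always predicts the majority class. The labels below are made up for illustration (95 negatives, 5 positives) and are not the workshop data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
```

Accuracy looks high even though not a single positive case is detected, which is why precision, recall, and F1 are reported alongside it.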

What are Precision, Recall and F1 Score?

Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
F1 Score: Harmonic mean of precision and recall

\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

PYTHON

from sklearn.metrics import precision_score, recall_score, f1_score

for name, model in [("SVM", svm_model), ("Logistic Regression", lr_model)]:
    y_pred = model.predict(X_test)
    print(f"\n{name} Metrics:")
    print(f"Precision: {precision_score(y_test, y_pred):.2f}")
    print(f"Recall: {recall_score(y_test, y_pred):.2f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.2f}")

SH

SVM Metrics:
Precision: 0.97
Recall: 0.99
F1 Score: 0.98

Logistic Regression Metrics:
Precision: 0.98
Recall: 0.98
F1 Score: 0.98

What is a Confusion Matrix?

A confusion matrix shows the breakdown of correct and incorrect classifications.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay.from_estimator(svm_model, X_test, y_test, ax=axs[0])
axs[0].set_title("SVM Confusion Matrix")
ConfusionMatrixDisplay.from_estimator(lr_model, X_test, y_test, ax=axs[1])
axs[1].set_title("Logistic Regression Confusion Matrix")
plt.tight_layout()
plt.show()

What is the ROC Curve?

ROC = Receiver Operating Characteristic curve

It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
AUC = Area Under the ROC Curve. Closer to 1 = better model.

PYTHON

from sklearn.metrics import roc_curve, auc

svm_probs = svm_model.predict_proba(X_test)[:, 1]
lr_probs = lr_model.predict_proba(X_test)[:, 1]

svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
svm_auc = auc(svm_fpr, svm_tpr)
lr_auc = auc(lr_fpr, lr_tpr)

plt.figure(figsize=(8, 6))
plt.plot(svm_fpr, svm_tpr, label=f"SVM (AUC = {svm_auc:.2f})")
plt.plot(lr_fpr, lr_tpr, label=f"Logistic Regression (AUC = {lr_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

Conclusion

Both models perform well, but:

SVM achieves slightly higher recall on this test split (0.99 vs 0.98)
Logistic Regression achieves slightly higher precision (0.98 vs 0.97)

Evaluation metrics guide us to choose the best model for our real-world use case.

Content from 07 Neural Networks

Last updated on 2025-07-10

Neural Network (MLPClassifier) with Breast Cancer Dataset

In this notebook, we will use a simple Multi-layer Perceptron (MLP) neural network to classify breast tumors.

What is an MLP?

An MLP is a type of feedforward neural network consisting of one or more hidden layers. Each neuron computes a weighted sum of its inputs and passes the result through a nonlinear activation function.

MLPs are suitable for classification tasks and are trained using backpropagation to minimize loss.
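The "weighted sum followed by a nonlinear activation" can be sketched in a few lines of NumPy. The weights, biases, and input below are hand-picked for illustration (they are not learned and not taken from the workshop model); ReLU in the hidden layer and a logistic output are used here as one common choice for binary classification.

```python
import numpy as np

def relu(z):
    """ReLU activation: keeps positive values, zeroes out negatives."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic activation: squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (hand-picked) network: 3 inputs -> 2 hidden neurons -> 1 output
x = np.array([1.0, 2.0, -1.0])
W1 = np.array([[0.5, -1.0],
               [0.25, 0.5],
               [-1.0, 1.0]])
b1 = np.array([0.0, 0.5])
W2 = np.array([[1.0],
               [-2.0]])
b2 = np.array([0.0])

# Each hidden neuron: weighted sum of inputs plus bias, then activation
hidden = relu(x @ W1 + b1)

# Output neuron: weighted sum of hidden activations, squashed to a probability
output = sigmoid(hidden @ W2 + b2)

print("Hidden activations:", hidden)
print("Output probability:", output)
```

Training with backpropagation adjusts `W1`, `b1`, `W2`, and `b2` so that the output probability moves closer to the true label.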

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train an MLPClassifier

We use MLPClassifier from Scikit-Learn with one hidden layer.

PYTHON

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=42)
model.fit(X_train, y_train)

Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9824561403508771
Precision: 0.9732142857142857
Recall: 1.0
F1 Score: 0.9864253393665159

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98        62
           1       0.97      1.00      0.99       109

    accuracy                           0.98       171
   macro avg       0.99      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171

What is a Confusion Matrix?

A confusion matrix shows how well the model distinguishes between classes:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

This matrix lets us compute metrics like accuracy, precision, recall, and F1 score.

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Neural Network Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate and False Positive Rate. AUC quantifies this performance. Closer to 1.0 = better classifier.

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Neural Network ROC Curve')
plt.legend()
plt.show()

Content from 08 Neural Networks Optimization

Last updated on 2025-07-10

Optimising a Neural Network Classifier

In this notebook, we demonstrate how to tune hyperparameters in a neural network model to improve performance.

We will focus on:
- hidden_layer_sizes
- alpha (regularization)
- learning_rate_init

We’ll also visualize how these parameters affect accuracy, and look for signs of overfitting or underfitting.

Step 1: Load Breast Cancer Data

Load with Normalisation

PYTHON

# from sklearn.datasets import load_breast_cancer
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# import pandas as pd

# # Load dataset
# data = load_breast_cancer()
# X = data.data
# y = data.target

# # Split dataset
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.3, random_state=31
# )

# # Normalize (Standardize) features
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

Load without Normalisation

To see the effects of different network sizes and other hyperparameters, we use the non-normalised dataset. Classifying the normalised data is an easier task, and most hyperparameter values then result in very good accuracy.

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)
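To see why the non-normalised data is harder for a neural network, it helps to look at how differently the raw features are scaled. The sketch below is illustrative only: it fits the scaler on the full dataset for inspection, whereas in the normalised workflow the scaler is fitted on the training split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Spread of each raw feature: some vary by hundreds, others by thousandths
raw_std = X.std(axis=0)
print(f"Raw feature std: min={raw_std.min():.4f}, max={raw_std.max():.1f}")

# After standardisation every feature has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
scaled_std = X_scaled.std(axis=0)
print(f"Scaled feature std: min={scaled_std.min():.4f}, max={scaled_std.max():.4f}")
```

With raw features on such different scales, a few large-valued features dominate the weighted sums, which makes the optimisation more sensitive to the hyperparameters explored below.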

Step 2: Define a Function to Train and Evaluate

This function will:
- Train the MLP model
- Return training and test accuracy

PYTHON

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

def train_and_evaluate(hidden_layer_sizes=(200, 400, 400, 200), alpha=0.0001, lr=0.001):
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                           alpha=alpha,
                           learning_rate_init=lr,
                           max_iter=2000,
                           random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc

Step 3: Explore Effect of `hidden_layer_sizes`

The number of neurons and layers controls the model’s capacity.

Too small: underfitting
Too large: overfitting

PYTHON

import matplotlib.pyplot as plt

values = [
    (50, 50),
    (50, 50, 50),
    (200, 400, 400, 200),
    (200, 400, 400, 400, 400, 200),
    (200, 400, 800, 800, 800, 400, 400, 200),
]
train_scores, test_scores = [], []

for s in values:
    tr, te = train_and_evaluate(hidden_layer_sizes=s)
    train_scores.append(tr)
    test_scores.append(te)

labels = [str(s) for s in values]

plt.plot(labels, train_scores, marker='o', label='Train Acc')
plt.plot(labels, test_scores, marker='s', label='Test Acc')
plt.xlabel('Hidden Layer Sizes')
plt.ylabel('Accuracy')
plt.title('Effect of Network Size')
plt.legend()
plt.grid(True)
plt.xticks(rotation=90, ha='right')
plt.show()

Step 4: Explore Effect of `alpha` (L2 Regularization)

alpha prevents overfitting by penalising large weights.

Low alpha: can overfit
High alpha: can underfit

PYTHON

alphas =  [1e-1, 3e-1, 5e-1, 7e-1, 1e1]
train_scores, test_scores = [], []

for a in alphas:
    tr, te = train_and_evaluate(alpha=a)
    train_scores.append(tr)
    test_scores.append(te)

plt.semilogx(alphas, train_scores, marker='o', label='Train Acc')
plt.semilogx(alphas, test_scores, marker='s', label='Test Acc')
plt.xlabel('alpha (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization Strength')
plt.legend()
plt.grid(True)
plt.show()

Step 5: Explore Effect of `learning_rate_init`

This controls how fast the model updates its weights.

Too small: slow convergence
Too large: may never converge

PYTHON

lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
train_scores, test_scores = [], []

for lr in lrs:
    tr, te = train_and_evaluate(lr=lr)
    train_scores.append(tr)
    test_scores.append(te)

plt.plot(lrs, train_scores, marker='o', label='Train Acc')
plt.plot(lrs, test_scores, marker='s', label='Test Acc')
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate')
plt.legend()
plt.grid(True)
plt.show()
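The "too small" and "too large" cases can be reproduced on a toy problem. The sketch below minimises f(w) = w² with plain gradient descent; this is a simplification chosen for illustration (MLPClassifier uses the Adam optimiser by default), and the function name and learning rates are assumptions.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # step against the gradient, scaled by the learning rate
    return w

# Illustrative learning rates: too small, reasonable, too large
results = {lr: gradient_descent(lr) for lr in [0.001, 0.4, 1.1]}

for lr, w in results.items():
    print(f"lr={lr}: final w = {w:.6g}")
```

With the smallest rate, w has barely moved from its starting value after 50 steps; with the middle rate it is essentially at the minimum (w = 0); with the largest rate each step overshoots and |w| grows without bound.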

Conclusion

Neural networks are sensitive to hyperparameters
Use visualisation to find the sweet spot
Avoid overfitting by tuning alpha and hidden_layer_sizes
Don’t pick hyperparameters blindly – use grid search or cross-validation
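A grid search with cross-validation might look like the sketch below. It is a minimal example under stated assumptions: the grid values are arbitrary illustrations, not recommendations, and the features are standardised inside a `Pipeline` (unlike the unscaled data used earlier in this notebook) so that each cross-validation fold is scaled using only its own training part.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Scaling and the classifier are tuned and validated together as one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=1000, random_state=42)),
])

# Small, illustrative grid; parameter names are prefixed with the pipeline step name
param_grid = {
    "mlp__hidden_layer_sizes": [(20,), (50,)],
    "mlp__alpha": [1e-4, 1e-2],
}

# 3-fold cross-validation on the training split only
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of refit model: {search.score(X_test, y_test):.3f}")
```

Because the hyperparameters are chosen using cross-validation on the training split, the test split is only used once at the end, which gives a less optimistic estimate than selecting hyperparameters by test accuracy.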

Changing the Classification Threshold

Most classifiers like neural networks output probabilities between 0 and 1. By default, the threshold for classification is 0.5. This means:

If predicted probability ≥ 0.5 → classify as positive
Else → classify as negative

Changing the Threshold:

Lower threshold → more positives predicted → higher recall, more false positives
Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Retrain model
manual_model = MLPClassifier()
manual_model.fit(X_train, y_train)

# Predict probabilities
y_proba = manual_model.predict_proba(X_test)[:, 1]

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()

SH

C:\Users\moji1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\sklearn\neural_network\_multilayer_perceptron.py:780: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Content from 09 Random Forest

Last updated on 2025-07-10

Random Forest Classifier with Breast Cancer Dataset

This notebook demonstrates the use of Random Forest Classifier for classifying tumors in the Breast Cancer dataset.

What is a Random Forest?

Random Forest is a powerful ensemble learning method for classification. It builds multiple decision trees and combines their predictions for improved accuracy and robustness.

Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of the features. This decorrelates the trees, which reduces overfitting and improves generalization.
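How the trees' predictions are combined can be checked directly. In scikit-learn, the forest's class probabilities are the average of the individual trees' probabilities. The sketch below verifies this on the Breast Cancer data with a small forest of 10 trees (an illustrative setting, chosen to keep the example fast).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_train, y_train)

# Each fitted decision tree is available in forest.estimators_
tree_probas = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])

# Average the per-tree class probabilities across the 10 trees
mean_proba = tree_probas.mean(axis=0)
matches = np.allclose(mean_proba, forest.predict_proba(X_test))

print("Number of trees:", len(forest.estimators_))
print("Averaged tree probabilities match the forest:", matches)
```

The predicted class is then the one with the highest averaged probability.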

Step 1: Load the Breast Cancer Dataset

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Train a Random Forest Model

We use the RandomForestClassifier class from sklearn.ensemble with its default hyperparameters.

PYTHON

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Step 4: Evaluate the Model

PYTHON

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    ConfusionMatrixDisplay, classification_report, roc_curve, auc
)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

SH

Accuracy: 0.9590643274853801
Precision: 0.9636363636363636
Recall: 0.9724770642201835
F1 Score: 0.9680365296803652

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94        62
           1       0.96      0.97      0.97       109

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Accuracy, Precision, Recall, F1 Score are all derived from this table.

PYTHON

import matplotlib.pyplot as plt
import seaborn as sns

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Random Forest Confusion Matrix')
plt.show()

What is an ROC Curve?

The ROC Curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate. The AUC (Area Under Curve) summarizes the performance into a single number.

PYTHON

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve')
plt.legend()
plt.show()

Content from 10 Random Forest Optimization

Last updated on 2025-07-10

Optimising a Random Forest Classifier

In this notebook, we demonstrate how to tune hyperparameters in a Random Forest Classifier to improve classification performance.

We will use the Breast Cancer Wisconsin dataset to:
- Train an initial Random Forest model.
- Explore the effects of the most important hyperparameters:
  - n_estimators: Number of trees in the forest.
  - max_depth: Maximum depth of each tree.
  - min_samples_split: Minimum number of samples required to split an internal node.

Random Forest is a powerful ensemble method for classification tasks that builds multiple decision trees and merges their outputs for improved accuracy and robustness. By tuning these hyperparameters, we can significantly improve model performance.

Step 1: Load Breast Cancer Data

PYTHON

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Normalize (Standardize) features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exploring the Effect of the `n_estimators` Hyperparameter

The n_estimators parameter in Random Forest controls the number of trees in the forest:
- Low n_estimators values → Fewer trees → Faster training, but noisier and less stable predictions.
- High n_estimators values → More trees → More stable performance (with diminishing returns), but increased computation.

We will train Random Forest models with different n_estimators values and observe how they affect accuracy.

PYTHON

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Test different values of n_estimators
n_estimators_values = [1, 2, 4, 8, 10, 20, 30, 50, 70, 90]
train_scores = []
test_scores = []

for n in n_estimators_values:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(n_estimators_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(n_estimators_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('n_estimators (Number of Trees)')
plt.ylabel('Accuracy')
plt.title('Effect of n_estimators on Random Forest')
plt.legend()
plt.grid(True)
plt.show()

Exploring the Effect of the `max_depth` Hyperparameter

The max_depth parameter controls the maximum depth of each tree:
- Low max_depth values → Shallow trees → Simpler models, possibly underfitting.
- High max_depth values → Deeper trees → More complex models, potentially overfitting.

We will now examine how varying max_depth affects the model’s performance, while keeping n_estimators fixed.

PYTHON

# Test different values of max_depth
max_depth_values = [2, 4, 6, 8, 10, 20]
train_scores = []
test_scores = []

for depth in max_depth_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(max_depth_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(max_depth_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Random Forest')
plt.legend()
plt.grid(True)
plt.show()
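
Beyond some value, increasing max_depth stops changing the forest, because the trees finish growing before they reach the limit. One way to see where that happens is to fit a forest with no depth limit and inspect how deep its trees actually grow. The sketch below does this with the `estimators_` attribute of the fitted forest and the `get_depth()` method of each tree; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Forest with no depth limit: each tree grows until it cannot split further
forest = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
forest.fit(X_train, y_train)
depths = np.array([tree.get_depth() for tree in forest.estimators_])
print(f"Unrestricted trees: min depth = {depths.min()}, "
      f"mean = {depths.mean():.1f}, max = {depths.max()}")

# Forest with a depth cap: no tree may exceed max_depth
capped = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
capped.fit(X_train, y_train)
capped_depths = np.array([tree.get_depth() for tree in capped.estimators_])
print(f"Capped trees (max_depth=4): max depth = {capped_depths.max()}")
```

Any max_depth at or above the largest unrestricted depth printed here places no effective limit on the trees, which is why the accuracy curves flatten out for large max_depth values.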

Exploring the Effect of the `min_samples_split` Hyperparameter

The min_samples_split parameter controls the minimum number of samples required to split an internal node:

- Low min_samples_split values → More splits → Complex trees, potentially overfitting.
- High min_samples_split values → Fewer splits → Simpler trees, possibly underfitting.

We will now compare different min_samples_split values to observe their impact on performance.

PYTHON

# Test different values of min_samples_split
min_samples_split_values = [2, 5, 10, 20, 50]
train_scores = []
test_scores = []

for min_split in min_samples_split_values:
    model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=min_split, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    train_scores.append(train_acc)
    test_scores.append(test_acc)

# Show numeric results
for m, tr, te in zip(min_samples_split_values, train_scores, test_scores):
    print(f"min_samples_split: {m} → Train Accuracy: {tr:.3f}, Test Accuracy: {te:.3f}")

# Optional: grouped bar chart of the same results (uncomment the plt lines below to use it)
import numpy as np

x = np.arange(len(min_samples_split_values))
width = 0.35

# plt.figure(figsize=(8, 6))
# plt.bar(x - width/2, train_scores, width, label='Train Accuracy')
# plt.bar(x + width/2, test_scores, width, label='Test Accuracy')
# plt.xticks(x, min_samples_split_values)
# plt.xlabel('min_samples_split')
# plt.ylabel('Accuracy')
# plt.title('Effect of min_samples_split on Random Forest')
# plt.legend()
# plt.grid(True, axis='y')
# plt.show()

# Line plot of the results
plt.figure(figsize=(8, 6))
plt.plot(min_samples_split_values, train_scores, marker='o', label='Train Accuracy')
plt.plot(min_samples_split_values, test_scores, marker='s', label='Test Accuracy')
plt.xlabel('min_samples_split')
plt.ylabel('Accuracy')
plt.title('Effect of min_samples_split on Random Forest')
plt.legend()
plt.grid(True)
plt.show()

SH

min_samples_split: 2 → Train Accuracy: 0.997, Test Accuracy: 0.965
min_samples_split: 5 → Train Accuracy: 0.995, Test Accuracy: 0.959
min_samples_split: 10 → Train Accuracy: 0.980, Test Accuracy: 0.947
min_samples_split: 20 → Train Accuracy: 0.972, Test Accuracy: 0.947
min_samples_split: 50 → Train Accuracy: 0.972, Test Accuracy: 0.942
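
The experiments in this section vary one hyperparameter at a time and score each setting on the test set. A common alternative is to search combinations of hyperparameters using cross-validation on the training data only, and to keep the test set for a final check. The sketch below uses scikit-learn's `GridSearchCV` for this; the grid is a small, illustrative subset of the values explored above, chosen here to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

# Illustrative grid (assumption): 2 x 2 x 2 = 8 combinations
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [4, 6],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
# By default the best model is refit on the full training set, so it can be scored directly
print(f"Test accuracy of the best model: {search.score(X_test, y_test):.3f}")
```

Because the test set is only used once at the end, the reported test accuracy is a less optimistic estimate than one obtained by picking the setting that scored best on the test set itself.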

Changing the Classification Threshold

Many classifiers can output a predicted probability between 0 and 1 for the positive class. By default, the threshold for classification is 0.5. This means:

If predicted probability ≥ 0.5 → classify as positive
Else → classify as negative

Changing the Threshold:

Lower threshold → more positives predicted → higher recall, more false positives
Higher threshold → fewer positives predicted → higher precision, more false negatives

Choosing the right threshold depends on your application’s goals.
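
To make the trade-off concrete, here is a minimal sketch on a toy example. The labels and probabilities are made up for illustration and do not come from any model in this workshop.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities (made-up values)
y_true = np.array([0, 0, 1, 0, 1, 1])
y_proba = np.array([0.10, 0.35, 0.40, 0.60, 0.65, 0.90])

results = {}
for thresh in [0.3, 0.5, 0.7]:
    # Classify as positive when the probability reaches the threshold
    y_pred = (y_proba >= thresh).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    results[thresh] = (tp, fp, fn)
    print(f"threshold={thresh}: predictions={y_pred.tolist()}, TP={tp}, FP={fp}, FN={fn}")

# threshold=0.3: predictions=[0, 1, 1, 1, 1, 1], TP=3, FP=2, FN=0
# threshold=0.5: predictions=[0, 0, 0, 1, 1, 1], TP=2, FP=1, FN=1
# threshold=0.7: predictions=[0, 0, 0, 0, 0, 1], TP=1, FP=0, FN=2
```

As the threshold rises from 0.3 to 0.7, false positives fall from 2 to 0 while false negatives rise from 0 to 2.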

We’ll now visualize how the confusion matrix changes for three different thresholds.

PYTHON

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

thresholds = [0.3, 0.5, 0.7]         # List of three thresholds to compare

# Train Random Forest with probability estimates
model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=5, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

# Plot side-by-side confusion matrices
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
for i, thresh in enumerate(thresholds):
    y_pred = (y_proba >= thresh).astype(int)
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axs[i])
    axs[i].set_title(f"Threshold = {thresh}")
plt.tight_layout()
plt.show()
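
The confusion matrices can be summarised numerically by computing precision and recall at each threshold. The sketch below refits the same Random Forest configuration so that it runs on its own, and skips feature scaling, which tree-based models do not require. Note that in scikit-learn's breast cancer dataset class 1 is "benign", so "positive" here means a benign prediction.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31
)

model = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_split=5, random_state=42
)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

metrics = {}
for thresh in [0.3, 0.5, 0.7]:
    y_pred = (y_proba >= thresh).astype(int)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    metrics[thresh] = (precision, recall)
    print(f"threshold={thresh}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall can only stay the same or fall as the threshold increases, because raising the threshold never adds predicted positives. Precision usually rises with the threshold, although this is not guaranteed.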
