Measure Model Performance

The distance matrix and the confusion matrix are two common tools for measuring model performance.

 

Distance Matrix

A distance matrix is a square matrix that contains the distances between each pair of points in a dataset.

It is used in machine learning by the following types of algorithms:

  • Clustering Algorithms: Algorithms like K-Means use distance matrices to group similar data points together.
  • Nearest-Neighbor Methods: In K-Nearest Neighbors (KNN) classification or regression, the distance matrix is used to find the closest data points to a given query point.

Common Distance Metrics

Euclidean Distance: The straight-line distance between two points in Euclidean space.

Formula: d(p, q) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Manhattan Distance: The sum of the absolute differences of their Cartesian coordinates.

Formula: d(p, q) = |x2 - x1| + |y2 - y1|
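
As a minimal sketch in plain Python (the function names euclidean_distance and manhattan_distance are illustrative, not from any particular library), both metrics can be computed for 2-D points as follows:

    import math

    def euclidean_distance(p, q):
        # Straight-line distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
        return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

    def manhattan_distance(p, q):
        # Sum of absolute coordinate differences: |x2 - x1| + |y2 - y1|
        return abs(q[0] - p[0]) + abs(q[1] - p[1])

    print(euclidean_distance((1, 2), (4, 6)))  # 5.0
    print(manhattan_distance((1, 2), (4, 6)))  # 7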

Distance Matrix Example

Let us understand the distance matrix with an example.

Points | Coordinate X | Coordinate Y
Point A | 1 | 2
Point B | 4 | 6
Point C | 5 | 2

Distance Matrix using Euclidean Distance formula


Points | A | B | C
A | 0 | 5 | 4
B | 5 | 0 | 4.1231
C | 4 | 4.1231 | 0

d(A, B) = sqrt((4 - 1)^2 + (6 - 2)^2)

d(A, B) = sqrt(9 + 16)

d(A, B) = sqrt(25) = 5

Observations

  • A and B are separated by a distance of 5 units.
  • A and C are closer, with a distance of 4 units.
  • B and C are approximately 4.1231 units apart.
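
A minimal sketch in plain Python (no particular library assumed) that reproduces the 3 x 3 matrix above:

    import math

    points = {"A": (1, 2), "B": (4, 6), "C": (5, 2)}

    def euclidean(p, q):
        return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

    # Pairwise distance matrix, rounded to 4 decimal places.
    matrix = {a: {b: round(euclidean(pa, pb), 4) for b, pb in points.items()}
              for a, pa in points.items()}

    for name, row in matrix.items():
        print(name, row)
    # A {'A': 0.0, 'B': 5.0, 'C': 4.0}
    # B {'A': 5.0, 'B': 0.0, 'C': 4.1231}
    # C {'A': 4.0, 'B': 4.1231, 'C': 0.0}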

Use in Algorithms

Clustering Algorithms

  • Cluster 1: Contains Points A and C as these are closest to each other
  • Cluster 2: Contains Point B

Use in KNN algorithm

  • New point D arrives at (2, 3): compute the distances d(A, D), d(B, D) and d(C, D) (see the sketch below).
  • Use these distances to find the nearest neighbors for classification or regression.
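
A minimal sketch of this nearest-neighbor step, reusing the coordinates from the example above:

    import math

    points = {"A": (1, 2), "B": (4, 6), "C": (5, 2)}
    d = (2, 3)  # new query point D

    def euclidean(p, q):
        return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

    # Distance from D to every stored point, sorted from nearest to farthest.
    neighbors = sorted((euclidean(d, p), name) for name, p in points.items())
    print(neighbors)
    # A is nearest (~1.414), then C (~3.162), then B (~3.606)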

 

Confusion Matrix

 

Confusion Matrix definition

A confusion matrix is a table that describes the performance of a classification model on a set of test data for which the true values are known. It allows us to see where the model is getting confused between classes.

Definition: A matrix that compares actual target values with those predicted by the model.

2X2 Confusion Matrix

Actual / Predicted Values | Positive | Negative
Positive | True Positive (TP) | False Negative (FN)
Negative | False Positive (FP) | True Negative (TN)
  • True Positive (TP): Correctly predicted positive class
  • True Negative (TN): Correctly predicted negative class
  • False Positive (FP): Negative class incorrectly predicted as positive
  • False Negative (FN): Positive class incorrectly predicted as negative
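
A minimal sketch in plain Python (illustrative variable names) of how these four counts are tallied from actual and predicted labels, using 1 for the positive class and 0 for the negative class:

    actual    = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0]

    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # actual positive, predicted positive
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # actual negative, predicted negative
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # actual negative, predicted positive
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # actual positive, predicted negative
    print(tp, tn, fp, fn)  # 2 2 1 1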

 

Confusion Matrix Example

An example of a confusion matrix: a medical test to detect a disease.

Suppose we have a test dataset of 100 patients with the following actual conditions:

  • Actual Positive Cases (Disease Present): 40 patients
  • Actual Negative Cases (Disease Absent): 60 patients

When we run our classification model on this test dataset, we obtain the following predictions:

  • Predicted Positive Cases: 50 patients
    • Correctly Predicted Positive Cases (True Positives, TP): 35 patients
    • Incorrectly Predicted Positive Cases (False Positives, FP): 15 patients
  • Predicted Negative Cases: 50 patients
    • Correctly Predicted Negative Cases (True Negatives, TN): 45 patients
    • Incorrectly Predicted Negative Cases (False Negatives, FN): 5 patients

Confusion Matrix for above example

Actual / Predicted Values | Positive | Negative
Positive: 35 (TP) + 5 (FN) = 40 | TP = 35 | FN = 5
Negative: 15 (FP) + 45 (TN) = 60 | FP = 15 | TN = 45

Interpreting the Confusion Matrix

  • True Positive (TP): The model correctly predicts that the patient has the disease.
  • True Negative (TN): The model correctly predicts that the patient does not have the disease.
  • False Positive (FP): The model incorrectly predicts that the patient has the disease (false alarm).
  • False Negative (FN): The model incorrectly predicts that the patient does not have the disease (missed detection).
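
These counts can also be reproduced in code. The sketch below assumes scikit-learn is available and builds label lists that match the example (40 actual positives, 60 actual negatives):

    from sklearn.metrics import confusion_matrix

    # 35 TP + 5 FN for the 40 actual positives; 45 TN + 15 FP for the 60 actual negatives.
    actual    = [1] * 40 + [0] * 60
    predicted = [1] * 35 + [0] * 5 + [0] * 45 + [1] * 15

    cm = confusion_matrix(actual, predicted, labels=[1, 0])
    print(cm)
    # [[35  5]    rows = actual (positive, negative)
    #  [15 45]]   columns = predicted (positive, negative)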

 

Calculating Performance Metrics

Accuracy

Definition: The proportion of correctly classified instances among all instances.

Formula = (TP+TN) / (TP+TN+FP+FN)

  • (35 + 45) / (35 + 45 + 15 + 5) = 80 / 100 = 80%

Precision (Positive Predictive Value)

Definition: The proportion of positive identifications that were correctly predicted

Formula = TP / (TP + FP), where FP is the number of cases incorrectly predicted as positive

  • 35 / (35+15) = 70%

Recall (True Positive Rate)

Definition: The proportion of actual positives that were correctly identified.

Formula = TP / (TP + FN), where FN is the number of actual positives incorrectly predicted as negative

  • 35 / (35+5) = 87.5%

Specificity (True Negative Rate)

Definition: The proportion of actual negatives that were correctly identified.

Formula = TN / (TN + FP), where FP is the number of actual negatives incorrectly predicted as positive

  • 45 / (45 + 15) = 75%

F1 Score

Definition: Harmonic Mean of precision and recall. Harmonic mean is used instead of the arithmetic mean because it gives more weight to the lower of the two values (precision or recall)

Formula = 2 x (Precision x Recall) / (Precision + Recall)

  • 2 x (0.7 x 0.875) / (0.7 + 0.875) = 1.225 / 1.575 ≈ 77.8%

Interpretation: The F1 Score balances precision and recall, providing a single metric to evaluate the model’s performance.

 

False Positive Rate (FPR)

Definition: The proportion of negatives incorrectly classified as positive.

Formula = FP / (FP+TN)

  • 15 / (15+45) = 25%

False Negative Rate (FNR)

Definition: The proportion of positives incorrectly classified as negative.

Formula = FN / (FN + TP)

  • 5 / (5+35) = 12.5%
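
Putting the formulas together, a short sketch in plain Python that computes every metric above from the example counts TP = 35, TN = 45, FP = 15, FN = 5:

    tp, tn, fp, fn = 35, 45, 15, 5

    accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # 0.80
    precision   = tp / (tp + fp)                                  # 0.70
    recall      = tp / (tp + fn)                                  # 0.875
    specificity = tn / (tn + fp)                                  # 0.75
    f1          = 2 * precision * recall / (precision + recall)   # ~0.778
    fpr         = fp / (fp + tn)                                  # 0.25
    fnr         = fn / (fn + tp)                                  # 0.125

    print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  Recall={recall:.3f}")
    print(f"Specificity={specificity:.3f}  F1={f1:.3f}  FPR={fpr:.3f}  FNR={fnr:.3f}")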

Interpreting the Metrics

  • High Recall (Sensitivity): The model is good at identifying actual positives. It has correctly identified 87.5% of the patients who have the disease.
  • Moderate Precision: Out of all positive predictions, only 70% were correct. There is a significant number of false positives (15 out of 50 predicted positives).
