Model Evaluation Techniques
This notebook only covers commonly used evaluation metrics for classification. The list is not exhaustive; you are encouraged to look at the other metrics that are available.
References:
(1) Scikit-Learn: https://scikit-learn.org/stable/modules/model_evaluation.html
(2) https://github.com/maykulkarni/Machine-Learning-Notebooks
import numpy as np
import matplotlib.pyplot as plt
import random
# Set the seed.
random.seed(0)
np.random.seed(0)
# Make your plot outputs appear and be stored within the notebook.
%matplotlib inline
1. Classification Metrics
1.1 Accuracy Score
# Import the function.
from sklearn.metrics import accuracy_score
"""
Assume y_true and y_pred to be the following.
"""
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
# Compute the accuracy score. Essentially it means that 50% of the test samples have been classified correctly.
accuracy_score(y_true, y_pred)
0.5
# If normalize=False, the number of correctly classified samples is returned instead of the fraction.
accuracy_score(y_true, y_pred, normalize=False)
2
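Under the hood, accuracy is simply the fraction of positions where the prediction matches the true label. A quick sketch with NumPy, using the same y_true and y_pred as above, to confirm this:
# Fraction of matching positions; agrees with accuracy_score above (0.5).
np.mean(np.array(y_true) == np.array(y_pred))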
1.2 Confusion Matrix
# Import the function.
from sklearn.metrics import confusion_matrix
"""
Assumption of y_true and y_pred.
"""
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
"""
To understand a confusion matrix, you'll need to understand the terms : true-positive, true-negative,
false-negative and false-positive. For more information on them, refer the following link :
Link : https://en.wikipedia.org/wiki/Confusion_matrix
"""
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
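In scikit-learn, row i of the matrix counts samples whose true class is i, and column j counts predictions of class j. For a binary problem, the four entries are exactly the true-negative, false-positive, false-negative and true-positive counts mentioned above; a small sketch with made-up binary labels:
# Binary example with made-up labels: flatten the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
(tn, fp, fn, tp)  # -> (0, 2, 1, 1)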
The confusion matrix is usually plotted as a heatmap, since that is much easier to read at a glance. We will not cover plotting in any depth here, but a minimal sketch is shown below for reference; you are free to explore it further.
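A minimal plotting sketch, assuming scikit-learn 1.0 or newer (which provides ConfusionMatrixDisplay):
# Plot the confusion matrix as a heatmap (assumes scikit-learn >= 1.0).
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()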
1.3 Classification Report
The classification_report function builds a text report showing the main classification metrics.
# Import the function.
from sklearn.metrics import classification_report
# Dummy Dataset (Assumptions)
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
# Think about why we used print() here. Why did we not need it anywhere above?
print (classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
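Side note: if you need these numbers programmatically rather than as text, recent versions of scikit-learn let classification_report return a nested dictionary via output_dict=True; a small sketch:
# Returns a nested dict instead of a string (assumes a recent scikit-learn version).
report = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
report['class 1']['recall']  # -> 0.0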
1.4 Precision, Recall and F1 Score
These three metrics are generally reported together, because computing the F1 score requires the values of precision and recall.
$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive } + \text{ False Positive}}
= \frac{\text{True Positive}}{\text{Total Predicted Positive}} $
$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive } + \text{ False Negative}}
= \frac{\text{True Positive}}{\text{Total Actual Positive}}$
$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
1.4.1 Precision
print (y_true)
print (y_pred)
[0, 1, 2, 2, 0]
[0, 0, 2, 1, 0]
Precision is the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.
# Import the function.
from sklearn.metrics import precision_score
# Computing the precision score.
precision_score(y_true, y_pred, average="weighted")
0.6666666666666666
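To see where this number comes from, you can request the per-class precision with average=None and weight each class by its support (2, 1 and 2 samples for classes 0, 1 and 2); a small sketch of how the 'weighted' average is formed:
# Per-class precision: roughly [0.67, 0.0, 1.0] for classes 0, 1 and 2.
per_class = precision_score(y_true, y_pred, average=None)
support = np.array([2, 1, 2])  # number of true samples per class
# Support-weighted average; reproduces the 0.67 above.
np.sum(per_class * support) / support.sum()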
1.4.2 Recall
Recall is the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.
# Import the function.
from sklearn.metrics import recall_score
# Computing the recall score.
recall_score(y_true, y_pred, average="weighted")
0.6
1.4.3 F1 Score
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal; the formula is given at the start of Section 1.4.
# Import the function.
from sklearn.metrics import f1_score
# Computing the f1 score.
f1_score(y_true, y_pred, average="weighted")
0.5866666666666667
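As a sanity check on the formula from Section 1.4, you can recompute the F1 score of class 0 from its per-class precision (0.67) and recall (1.00); a small sketch:
# Per-class precision and recall; index 0 corresponds to class 0.
p = precision_score(y_true, y_pred, average=None)[0]
r = recall_score(y_true, y_pred, average=None)[0]
# Harmonic mean of precision and recall; matches f1_score(y_true, y_pred, average=None)[0] == 0.8.
2 * p * r / (p + r)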
# Why do we need F1 score when we already have accuracy?
y_true = [1,1,1,1,1,0]
y_pred = [1,1,1,1,1,1]
# Accuracy is not a very good metric to use when the data is highly unbalanced or the class distribution is skewed.
np.sum(np.array(y_true) == np.array(y_pred))/len(y_true)*100
83.33333333333334
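Here the classifier predicts class 1 for every sample and still reaches about 83% accuracy, yet it never identifies the single class-0 sample. Looking at the per-class F1 scores makes that failure visible; a short sketch:
# Per-class F1: the minority class (0) scores 0.0 even though accuracy is ~83%.
# (scikit-learn may warn that precision for class 0 is ill-defined, since it is never predicted.)
f1_score(y_true, y_pred, average=None)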
Optional
Another commonly used classification metric is ROC-AUC. You can read more about it here: https://scikit-learn.org/stable/modules/model_evaluation.html. A minimal sketch of computing it follows.
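As a pointer, here is a minimal sketch using scikit-learn's roc_auc_score on made-up binary labels and predicted scores; note that ROC-AUC is computed from scores (e.g. predicted probabilities) rather than hard class labels:
from sklearn.metrics import roc_auc_score

# Made-up binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
roc_auc_score(y_true, y_scores)  # -> 0.75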