
A Walkthrough of Outcomes in Statistical Tests


Introduction

Statistical errors are commonplace in hypothesis testing: they cannot be eliminated, only reduced. They come in two types, type I errors (false positives) and type II errors (false negatives), and complement the two types of correct classification, true positives and true negatives. The same four outcomes appear when classifying observations. In this article, I describe their relationship with the receiver operating characteristic (ROC) curve and the area beneath it, called the area under the curve (AUC) or, more precisely, the area under the receiver operating characteristic curve (AUROC).

Type I and type II errors

Errors are an integral part of statistical testing. In hypothesis testing, the test chooses between two competing propositions: the null hypothesis, denoted by \(H_0\), and the alternative hypothesis, denoted by \(H_1\). If the result of the test is correct, a right decision was made, but if the result of the test does not correspond with reality, an error has occurred. There are two situations in which the decision is wrong. The null hypothesis may be true but we reject \(H_0\), or the alternative hypothesis \(H_1\) may be true but we do not reject \(H_0\). The first is called a type I error, or false positive, while the second is called a type II error, or false negative. The probability of making a type I error is denoted by the parameter \(\alpha\) (conditional on \(H_0\) being true), while the probability of making a type II error is denoted by the parameter \(\beta\) (conditional on \(H_1\) being true).
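These two error probabilities can be made concrete with a small simulation. The sketch below (assuming SciPy is available; the sample size, significance level, and seed are arbitrary illustrative choices) repeatedly runs a two-sample t-test on data where \(H_0\) is true, so the rejection rate estimates \(\alpha\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_trials = 0.05, 5000

# H0 is true: both samples come from the same normal distribution,
# so every rejection is a type I error (false positive)
rejections = 0
for _ in range(n_trials):
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(rejections / n_trials)  # close to alpha = 0.05
```

The empirical rejection rate hovers around the nominal \(\alpha = 0.05\), as expected when the null hypothesis holds.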

| Decision about null hypothesis \(H_0\) | \(H_0\) is true | \(H_0\) is false |
| --- | --- | --- |
| Do not reject | Correct inference (true negative), probability \(1 - \alpha\) | Type II error (false negative), probability \(\beta\) |
| Reject | Type I error (false positive), probability \(\alpha\) | Correct inference (true positive), probability \(1 - \beta\) |

Table 1 – The relation between the truth or falseness of the null hypothesis and the outcomes of the test.

The power of the test is the probability that the test correctly rejects the null hypothesis \(H_0\) when the alternative hypothesis \(H_1\) is true, and is commonly denoted by \(1-\beta\).
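The power can be estimated the same way by simulating data where \(H_1\) is true. In the sketch below (the effect size, sample size, and seed are arbitrary illustrative choices), each rejection is now a correct decision, so the rejection rate estimates \(1-\beta\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, effect, n_trials = 0.05, 0.8, 5000

# H1 is true: the second sample is shifted by `effect`, so each
# rejection is correct and the rejection rate estimates the power
rejections = 0
for _ in range(n_trials):
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(effect, 1.0, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(rejections / n_trials)  # empirical power, roughly 1 - beta
```

With 30 observations per group and an effect of 0.8 standard deviations, the empirical power comes out in the vicinity of 0.86.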

The confusion matrix

Every prediction of the classifier falls into one of four categories. These values can be collected inside a table, commonly called the confusion matrix, which shows the performance of the algorithm, and is illustrated in Table 2. The results bear a strong resemblance to those in Table 1, where the prediction of the classifier reflects the decision to retain or discard the null hypothesis in favor of the alternative hypothesis, generating either true positives or true negatives if the decision was correct, or false positives or false negatives if the decision wasn’t correct.

|  | Actually negative | Actually positive |
| --- | --- | --- |
| Predicted negative | True negative \((TN)\) | False negative \((FN)\) |
| Predicted positive | False positive \((FP)\) | True positive \((TP)\) |
| Total | \(N\) | \(P\) |

Table 2 – The confusion matrix shows all possible outcomes when applying a classifier to a population.
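As a quick illustration, scikit-learn computes these four counts directly with sklearn.metrics.confusion_matrix. The labels below are made-up toy data; note that scikit-learn places actual classes on the rows and predicted classes on the columns, the transpose of the layout in Table 2:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# scikit-learn's convention: rows are actual classes, columns are
# predicted classes, so ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 2 4
```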

By changing the decision threshold \(t\) in the classifier in Figure 1, the outcomes of the four categories change, and one can get an idea of the relationship between these outcomes. In a real-world situation, the erroneous outcomes can’t be eliminated but only minimized, since a decrease in false positives corresponds to an increase in false negatives and vice versa.
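This trade-off can be demonstrated with a small sketch on synthetic scores (the Gaussian means, spread, and seed below are arbitrary choices): raising the threshold reduces false positives while increasing false negatives.

```python
import numpy as np

rng = np.random.default_rng(5)
# 100 negatives scoring around 0.4, 100 positives scoring around 0.6
y = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.4, 0.1, 100),
                         rng.normal(0.6, 0.1, 100)])

fps, fns = [], []
for t in (0.4, 0.5, 0.6):
    pred = scores >= t
    fps.append(int(np.sum(pred & (y == 0))))   # false positives at threshold t
    fns.append(int(np.sum(~pred & (y == 1))))  # false negatives at threshold t
    print(t, fps[-1], fns[-1])
```

As the output shows, moving the threshold up trades false positives for false negatives; neither count can be driven to zero without inflating the other.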

Figure 1 – The threshold–error trade-off and the percentages of predicted and actual outcomes in a binary classifier.

From the cells in Table 2 we can derive several important rates.

| Name | Definition | Synonyms |
| --- | --- | --- |
| False positive rate | \(\frac{FP}{TN + FP}\) | 1 − specificity; type I error rate |
| True positive rate | \(\frac{TP}{TP + FN}\) | Sensitivity; recall; power; 1 − type II error rate |
| Positive predictive value | \(\frac{TP}{TP + FP}\) | Precision; 1 − false discovery rate |
| Negative predictive value | \(\frac{TN}{TN + FN}\) | |
| Accuracy | \(\frac{TP + TN}{TP + FN + TN + FP}\) | |
| F-measure | \(\frac{2}{1/\text{Precision} + 1/\text{Recall}}\) | Harmonic mean of precision and recall |

Table 3 – Important measures of classification.
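The definitions in Table 3 translate directly into code. The counts below are arbitrary toy values, used only to show the arithmetic:

```python
# the measures from Table 3, computed from arbitrary toy counts
tp, fp, tn, fn = 4, 1, 3, 2

fpr = fp / (tn + fp)                   # false positive rate = 1 - specificity
tpr = tp / (tp + fn)                   # true positive rate = sensitivity / recall
precision = tp / (tp + fp)             # positive predictive value
npv = tn / (tn + fn)                   # negative predictive value
accuracy = (tp + tn) / (tp + fn + tn + fp)
f1 = 2 / (1 / precision + 1 / tpr)     # harmonic mean of precision and recall

print(fpr, tpr, precision, npv, accuracy, f1)
```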

Area under the curve

The receiver operating characteristic (ROC) curve is an important technique in machine learning for selecting classifiers based on their performance. It has useful properties for situations with skewed class distributions and unbalanced classification error costs. One fundamental summary of a ROC curve is the area lying beneath it, appropriately called the area under the curve (AUC) or, more precisely, the area under the receiver operating characteristic (AUROC) curve. The distinction matters because there are other curves, such as precision–recall curves, with their own AUC values.

Building the ROC curve

A binary classifier model outputs a probability \(\hat{p} \in [0,1]\) for each observation. To classify, we pick a threshold \(t\) and predict “positive” when \(\hat{p} \geq t\). In order to build a ROC curve, one sweeps \(t\) over a grid, and at each \(t\) one computes the false positive rate (FPR) and the true positive rate (TPR) from the confusion matrix. The AUROC is the area beneath the ROC curve.

$$\mathrm{AUROC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$$

The dashed diagonal line \(\mathrm{FPR} = \mathrm{TPR}\) represents a random predictor and serves as the baseline.
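The threshold sweep described above can be written out by hand without any library support. The sketch below uses synthetic Gaussian scores (the means, spreads, and seed are arbitrary choices) and computes one (FPR, TPR) point per threshold:

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic scores: positives tend to score higher than negatives
y_true = np.array([0] * 50 + [1] * 50)
y_score = np.concatenate([rng.normal(0.3, 0.15, 50),
                          rng.normal(0.6, 0.15, 50)])

# sweep the threshold t over a grid; each t yields one (FPR, TPR) point
fpr_pts, tpr_pts = [], []
for t in np.linspace(y_score.max() + 1e-6, y_score.min(), 101):
    y_pred = (y_score >= t).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fpr_pts.append(fp / 50)
    tpr_pts.append(tp / 50)

# the curve starts at (0, 0) for the highest t and ends at (1, 1) for the lowest
print(fpr_pts[0], tpr_pts[0], fpr_pts[-1], tpr_pts[-1])  # 0.0 0.0 1.0 1.0
```

Because lowering the threshold can only add predicted positives, both FPR and TPR increase monotonically along the sweep, which is what gives the ROC curve its characteristic shape.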

The probabilistic interpretation

One interpretation of the AUROC is the probability that a uniformly drawn random positive example is ranked above a uniformly drawn random negative example. This offers a purely probabilistic way to understand the AUROC. Formally, let \(X_+\) be the score assigned to a random positive and \(X_-\) the score assigned to a random negative. Then the AUROC is given by the formula below.

$$\mathrm{AUROC} = P(X_+ > X_-)$$
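For a finite sample this probability can be computed exactly by comparing every positive score with every negative score, counting ties as one half. The scores below are made-up toy values:

```python
import numpy as np

pos = np.array([0.9, 0.8, 0.6, 0.4])   # scores of positive examples
neg = np.array([0.7, 0.5, 0.4, 0.2])   # scores of negative examples

# all pairwise score differences between positives and negatives
diff = pos[:, None] - neg[None, :]
auroc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(auroc)   # 0.78125
```

This pairwise count coincides with the trapezoidal area under the ROC curve built from the same scores, which is why the two views of the AUROC agree.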

In practice, the AUC can be approximated numerically using the trapezoid rule: the area under the ROC curve is divided into trapezoids defined by consecutive (FPR, TPR) points, and their areas are summed. To a first approximation, one can therefore calculate the area under the ROC curve with the following formula.

$$\text{AUROC} \approx \sum_{i=1}^{n-1} \frac{(\text{FPR}_{i+1} - \text{FPR}_i)(\text{TPR}_i + \text{TPR}_{i+1})}{2}$$
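In NumPy the sum above is a one-liner over the curve points. The (FPR, TPR) pairs below are invented purely to illustrate the arithmetic:

```python
import numpy as np

# hypothetical (FPR, TPR) points along a ROC curve, in increasing FPR order
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])

# trapezoid rule: width of each segment times the average height
auc_manual = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)
print(auc_manual)
```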

To give a better sense of AUROC values and their implications, the table below lists a few examples.

| AUROC | Interpretation |
| --- | --- |
| 1.0 | The model always ranks positives above negatives. |
| 0.5 | The model ranks positives and negatives no better than random. |
| 0.8 | There is an 80% chance the model will rank a random positive above a random negative. |

Table 4 – Some example values of AUROC and their probabilistic interpretation.

The AUROC is therefore purely a ranking metric: it does not depend on the absolute calibration of the predicted probabilities, only on their relative ordering.
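This ranking-only behavior is easy to verify: applying any strictly increasing transform to the scores leaves the AUROC unchanged, even when it destroys calibration by squashing all scores into a narrow band. The data below is synthetic (random labels and scores with an arbitrary seed):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=200)
scores = rng.random(200) + 0.5 * y      # informative but uncalibrated scores

# a strictly increasing transform preserves the ordering of the scores,
# squeezing them into a band near 1.0 and ruining calibration
squashed = 0.9 + 0.1 / (1 + np.exp(-scores))

print(roc_auc_score(y, scores) == roc_auc_score(y, squashed))  # True
```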

Computing the AUROC in Scikit-Learn

Minimal example

Using sklearn.metrics.auc, documented on the Scikit-Learn website, together with some toy data, one can easily calculate this value, as shown in the minimal example below.

import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array(
    ['P', 'P', 'N', 'P', 'P',
     'P', 'N', 'N', 'P', 'N',
     'P', 'N', 'P', 'N', 'N',
     'N', 'P', 'N', 'P', 'N']
)
y_score = np.array(
    [0.9, 0.8, 0.7, 0.6, 0.55,
     0.51, 0.49, 0.43, 0.42, 0.39,
     0.33, 0.31, 0.23, 0.22, 0.19,
     0.15, 0.12, 0.11, 0.04, 0.01]
)

fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label='P')
print(auc(fpr, tpr))   # 0.67999

Probabilistic example

This returns 0.67999, but one can also arrive at roughly the same value by simulation. Using the positive and negative scores from the example above, one can estimate the proportion of cases in which a positive example receives a higher score than a negative one, as shown in the code below.

pos = y_score[y_true == 'P']
neg = y_score[y_true == 'N']

random_seed, sample_size = 33, 50000
rng = np.random.default_rng(random_seed)
p = rng.choice(pos, size=sample_size) > rng.choice(neg, size=sample_size)
print(p.mean())   # 0.67916

And you get 0.67916, which is quite close to the previous value from Scikit-Learn.

Full end-to-end example with logistic regression

In Scikit-Learn there are plenty of data sets that one can use to train a binary classifier and calculate its ROC curve and AUC. The example below uses a logistic regression on data from make_classification, which generates a random \(n\)-class classification problem.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# generate synthetic binary dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_classes=2, random_state=0)

# train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
    )

# fit logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# predicted probabilities for the positive class
preds = model.predict_proba(X_test)[:, 1]

# compute AUROC
print("AUROC:", roc_auc_score(y_test, preds))

# plot ROC curves
plt.figure(figsize=(7, 6))
fpr, tpr, _ = roc_curve(y_test, preds)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, preds):.2f}")
plt.plot([0,1],[0,1], 'k--', label="Random (AUC=0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

In Figure 2 one can see the ROC curve of the logistic regression classifier and the area beneath it. The dashed diagonal line represents the ROC curve of a predictor that selects no better than random and therefore has an AUROC of 0.5; it is commonly used as a baseline to judge whether the model is useful at all.

Figure 2 – Receiver operating characteristic (ROC) curve for a logistic regression classifier.

More information about the Scikit-Learn functions used can be found under sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score in the Scikit-Learn documentation.

Why not just use accuracy?

Consider a heavily imbalanced dataset where 95% of samples belong to class 1. A naive model that always predicts class 1 achieves 95% accuracy yet has zero ability to distinguish between classes. The AUROC exposes this failure.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

n, ratio = 10000, 0.95
y = np.array([0] * int((1 - ratio)*n) + [1] * int(ratio*n))

# model that always predicts class 1
y_proba_naive = np.ones(n)
print("Accuracy:", accuracy_score(y, y_proba_naive > 0.5))  # 0.95
print("AUROC:", roc_auc_score(y, y_proba_naive))            # 0.5 (random!)

As the AUROC is 0.5, it correctly flags the naive model as no better than chance.

Choosing a threshold from the ROC curve

In general we would like a classifier that is equally good at detecting true positives and true negatives, while making as few false negative and false positive errors as possible. Unfortunately, in the majority of cases this can’t be achieved, so one compromise is to select a threshold that achieves either a high recall (few false negatives) or a high precision (few false positives). To this end, the ROC curve can help us make a proper threshold selection.

| Priority | Guidance |
| --- | --- |
| Low FPR priority (e.g. legal systems, spam filters) | Pick a point far left on the curve (high threshold \(t\)). |
| High TPR priority (e.g. cancer screening) | Pick a point near the top of the curve (low threshold \(t\), accepting more false alarms). |
| Balanced trade-off | Pick the point on the curve closest to \((0, 1)\). |

Table 5 – Guidelines for choosing a decision threshold from the ROC curve.

The balanced trade-off in the last row of the table is related to Youden’s J statistic.

$$J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1$$
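Maximizing \(J\) over the points returned by roc_curve reduces to a single argmax. A sketch on synthetic Gaussian scores (the means, spread, and seed are arbitrary choices):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
y = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.4, 0.1, 100),
                         rng.normal(0.6, 0.1, 100)])

fpr, tpr, thresholds = roc_curve(y, scores)
j = tpr - fpr                        # Youden's J at every candidate threshold
best = thresholds[np.argmax(j)]
print(f"best threshold: {best:.3f}, J = {j.max():.3f}")
```

With positives centered at 0.6 and negatives at 0.4, the selected threshold lands near the midpoint of 0.5, as one would expect for symmetric classes.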

Important considerations on the AUROC

The area under the curve is a helpful tool for understanding the performance of a classifier, but there are some important considerations that need to be mentioned. These are summarized in the table below.

| Limitation | Detail |
| --- | --- |
| Calibration-insensitive | The AUROC is identical whether the probabilities span 0.9–1.0 or 0–1, as long as the ranking order is preserved. |
| Blind to absolute performance under imbalance | A high AUROC may still correspond to poor precision or negative predictive value. |
| Single-number oversimplification | Collapsing the full curve into one scalar ignores threshold-specific trade-offs. |
| Multiclass extension | For \(c\) classes, one common approach averages the pairwise AUC over all \(\frac{c(c-1)}{2}\) pairs. |

Table 6 – Some important considerations when interpreting the AUROC.
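For the multiclass case, scikit-learn implements the pairwise averaging directly through the multi_class="ovo" option of roc_auc_score. A sketch on synthetic three-class data (all dataset parameters below are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic three-class problem
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# 'ovo' averages the AUC over all pairwise class combinations
auc_ovo = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")
print(auc_ovo)
```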

Angelo Varlotta