
Understanding the receiver operating characteristic curve


Introduction
#

The Receiver Operating Characteristic (ROC) curve is an important technique in machine learning for selecting classifiers based on their performance. It has properties that make it particularly useful in domains with skewed class distributions and unequal classification error costs.

AUC and AUROC
#

One fundamental summary of a ROC curve is the area lying underneath it, appropriately called the Area Under the Curve (AUC) or, more precisely, the Area Under the Receiver Operating Characteristic (AUROC). Using "AUC" alone is technically ambiguous, while AUROC is the precise term. This distinction matters because there are other curves, such as precision-recall curves, with their own AUC values.

There are several equivalent interpretations of the AUROC.

  • The probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
  • The expected proportion of positives ranked before a uniformly drawn random negative.
  • The expected true positive rate if the ranking is split just before a uniformly drawn random negative.
  • The expected proportion of negatives ranked after a uniformly drawn random positive.
  • The expected false positive rate if the ranking is split just after a uniformly drawn random positive.

The confusion matrix
#

Let us assume a probabilistic binary classifier such as a logistic regression. For a given decision threshold \(t\), every prediction from the classifier falls into one of four categories. These values can be collected inside a table which shows the performance of the algorithm. This table is commonly called a confusion matrix, and is illustrated in Table 1 below.

|                   | Predicted Negative      | Predicted Positive      | Total |
| ----------------- | ----------------------- | ----------------------- | ----- |
| Actually Negative | True Negative \((TN)\)  | False Positive \((FP)\) | \(N\) |
| Actually Positive | False Negative \((FN)\) | True Positive \((TP)\)  | \(P\) |
Table 1 – The confusion matrix shows all possible outcomes when applying a classifier to a population.

From these four cells we derive several important rates.

| Name                      | Definition             | Synonyms |
| ------------------------- | ---------------------- | -------- |
| False Positive Rate       | \(\frac{FP}{TN + FP}\) | 1 − Specificity; Type I error rate |
| True Positive Rate        | \(\frac{TP}{TP + FN}\) | Sensitivity; Recall; Power; 1 − Type II error rate |
| Positive Predictive Value | \(\frac{TP}{TP + FP}\) | Precision; 1 − False Discovery Rate |
| Negative Predictive Value | \(\frac{TN}{TN + FN}\) |          |
Table 2 – Important measures of classification.

There are also other important measures for classification, such as the accuracy and the F-measure, which are defined below.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$$

$$\text{F-measure} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$
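As a quick check, these definitions can be evaluated on hypothetical confusion-matrix counts (all numbers below are invented for illustration):

```python
# Hypothetical confusion-matrix counts (invented for illustration).
TP, FN = 80, 20    # actual positives: P = TP + FN = 100
TN, FP = 150, 50   # actual negatives: N = TN + FP = 200

tpr = TP / (TP + FN)                        # true positive rate (recall)
fpr = FP / (TN + FP)                        # false positive rate
precision = TP / (TP + FP)                  # positive predictive value
npv = TN / (TN + FN)                        # negative predictive value
accuracy = (TP + TN) / (TP + FN + TN + FP)
f_measure = 2 / (1 / precision + 1 / tpr)   # harmonic mean of precision and recall

print(f"TPR={tpr:.2f} FPR={fpr:.2f} accuracy={accuracy:.2f} F={f_measure:.2f}")
# TPR=0.80 FPR=0.25 accuracy=0.77 F=0.70
```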

Building the ROC curve
#

A logistic regression model outputs a probability \(\hat{p} \in [0,1]\) for each observation. To classify, we pick a threshold \(t\) and predict “positive” when \(\hat{p} \geq t\).

  1. Sweep \(t\) over a grid, e.g. \(t \in \{0.00, 0.01, 0.02, \ldots, 1.00\}\).
  2. At each \(t\), compute FPR and TPR from the confusion matrix.
  3. Plot TPR (y-axis) vs. FPR (x-axis).

The resulting curve is the ROC curve. The AUROC is the area enclosed between this curve and the x-axis:

$$\text{AUROC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
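The three steps above can be sketched directly in NumPy (the scores below are invented for illustration):

```python
import numpy as np

# Hypothetical labels and scores; positives happen to score above negatives.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.5, 0.7, 0.3])

thresholds = np.linspace(0, 1, 101)   # t in {0.00, 0.01, ..., 1.00}
P = (y_true == 1).sum()
N = (y_true == 0).sum()

# At each threshold, count predicted positives among actual positives/negatives.
tpr = [((y_score >= t) & (y_true == 1)).sum() / P for t in thresholds]
fpr = [((y_score >= t) & (y_true == 0)).sum() / N for t in thresholds]
# The (fpr, tpr) pairs trace the ROC curve, from (1, 1) at t=0 to (0, 0) at t=1.
```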

Key reference points
#

| AUROC   | Interpretation |
| ------- | -------------- |
| \(1.0\) | Perfect classifier |
| \(0.5\) | Random / no-skill classifier (the diagonal line) |
| \(0.0\) | Perfectly wrong classifier (all predictions flipped) |
Table 3 – Some example values of AUROC and their interpretation.

The dashed diagonal line \((\text{FPR} = \text{TPR})\) represents a random predictor and serves as the baseline.

The probabilistic interpretation
#

This is the most powerful way to understand the AUROC: it equals the probability that a randomly drawn positive example is ranked higher (i.e. assigned a higher predicted score) than a randomly drawn negative example.

Formally, let \(X_+\) be the score assigned to a random positive and \(X_-\) the score for a random negative:

$$\text{AUROC} = P(X_+ > X_-)$$

This means:

| AUROC | Interpretation |
| ----- | -------------- |
| 1.0   | The model always ranks positives above negatives. |
| 0.5   | The model ranks them at random. |
| 0.8   | There is an 80% chance the model will rank a random positive above a random negative. |
Table 4 – Some example values of AUROC and their probabilistic interpretation.

AUROC is therefore purely a ranking metric: it does not depend on the absolute calibration of the predicted probabilities, only on their relative ordering.
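A small synthetic check of this ranking property: squashing the scores through any strictly increasing transform leaves the AUROC unchanged (the data below is randomly generated, not from the examples later in the post).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = rng.normal(size=500) + y          # informative but uncalibrated scores
squashed = 1 / (1 + np.exp(-scores))       # strictly monotone map into (0, 1)

# The ranking is preserved, so both AUROC values are identical.
print(roc_auc_score(y, scores), roc_auc_score(y, squashed))
```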

Computing the AUROC in Scikit-Learn
#

Minimal example
#

import numpy as np
from sklearn import metrics

y_true = np.array(
    ['P', 'P', 'N', 'P', 'P',
     'P', 'N', 'N', 'P', 'N',
     'P', 'N', 'P', 'N', 'N',
     'N', 'P', 'N', 'P', 'N']
)
y_score = np.array(
    [0.9, 0.8, 0.7, 0.6, 0.55,
     0.51, 0.49, 0.43, 0.42, 0.39,
     0.33, 0.31, 0.23, 0.22, 0.19,
     0.15, 0.12, 0.11, 0.04, 0.01]
)

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label='P')
print(metrics.auc(fpr, tpr))   # 0.67999

This returns 0.67999, but one could also reach the same conclusion by simulation. Using the same positive and negative scores as above, one can estimate the proportion of cases in which a randomly drawn positive has a greater score than a randomly drawn negative.

pos = y_score[y_true == 'P']
neg = y_score[y_true == 'N']

rng = np.random.default_rng(33)
p = rng.choice(pos, size=50000) > rng.choice(neg, size=50000)
print(p.mean())   # 0.67916

And you get 0.67916, which is quite close. More information about the AUC can be found at sklearn.metrics.auc in the Scikit-Learn documentation website.
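Rather than sampling, the same expectation can also be computed exactly by comparing every positive score against every negative score (a sketch using the same data as above):

```python
import numpy as np

y_true = np.array(['P', 'P', 'N', 'P', 'P', 'P', 'N', 'N', 'P', 'N',
                   'P', 'N', 'P', 'N', 'N', 'N', 'P', 'N', 'P', 'N'])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.51, 0.49, 0.43, 0.42, 0.39,
                    0.33, 0.31, 0.23, 0.22, 0.19, 0.15, 0.12, 0.11, 0.04, 0.01])

pos = y_score[y_true == 'P']
neg = y_score[y_true == 'N']

# Exact P(X+ > X-): broadcast compares all 10 x 10 positive-negative pairs.
auc = (pos[:, None] > neg[None, :]).mean()
print(auc)   # 0.68
```

This makes it clear that the 0.67999 above is just floating-point imprecision around the exact value of 0.68.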

Full end-to-end example with logistic regression
#

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# generate synthetic binary dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_classes=2, random_state=0)

# train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fit logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# predicted probabilities for the positive class
preds = model.predict_proba(X_test)[:, 1]

# compute AUROC
print("AUROC:", roc_auc_score(y_test, preds))

# plot ROC curves
plt.figure(figsize=(7, 6))
fpr, tpr, _ = roc_curve(y_test, preds)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, preds):.2f}")
plt.plot([0,1],[0,1], 'k--', label="Random (AUC=0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

In Figure 1 one can see the ROC curve and the area underneath it. The dashed diagonal line represents the ROC curve of a random predictor, which has an AUROC of 0.5 and is commonly used as a baseline to judge whether the model is useful.

Figure 1 – Receiver operating characteristic (ROC) curve for a logistic regression classifier.

More information can be found at sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score at the Scikit-Learn documentation website.

Calculating the AUC directly on the breast cancer dataset
#

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="newton-cholesky", random_state=0)
clf.fit(X, y)

roc_auc_score(y, clf.predict_proba(X)[:, 1])    # 0.99
roc_auc_score(y, clf.decision_function(X))      # 0.99

Trapezoidal integration for AUC
#

In practice, AUC is approximated numerically using the trapezoid rule: the area under the ROC curve is divided into trapezoids defined by consecutive (FPR, TPR) points, and their areas are summed:

$$\text{AUROC} \approx \sum_{i=1}^{n-1} \frac{(\text{FPR}_{i+1} - \text{FPR}_i)(\text{TPR}_i + \text{TPR}_{i+1})}{2}$$
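This sum can be implemented directly; the (FPR, TPR) points below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

# Hypothetical ROC points; FPR must be sorted in ascending order.
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 1.0])

# Trapezoid rule: width of each segment times the mean of its two heights.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 4))   # 0.76
```

This is the same computation that `sklearn.metrics.auc` performs on the output of `roc_curve`.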

Why not just use accuracy?
#

Consider a heavily imbalanced dataset where 95% of samples belong to class 1. A naive model that always predicts class 1 achieves 95% accuracy yet has zero ability to distinguish between classes. AUROC exposes this failure:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

n, ratio = 10000, 0.95
y = np.array([0] * int((1 - ratio)*n) + [1] * int(ratio*n))

# model that always predicts class 1
y_proba_naive = np.ones(n)
print("Accuracy:", accuracy_score(y, y_proba_naive > 0.5))  # 0.95
print("AUROC:", roc_auc_score(y, y_proba_naive))            # 0.5 (random!)

AUROC = 0.5 correctly flags the naive model as no better than chance.

Choosing a threshold from the ROC curve
#

In general we would like a classifier that is equally good at detecting true positives and true negatives, while making few (if any) false negative and false positive errors. Unfortunately, in most cases this cannot be achieved, so a common compromise is to select a threshold that achieves either high recall (few false negatives) or high precision (few false positives). The ROC curve can help us make the proper threshold selection.

| Priority | Guidance |
| -------- | -------- |
| Low FPR priority (e.g. legal systems, spam filters) | Pick a point far to the left on the curve (high threshold \(t\)). |
| High TPR priority (e.g. cancer screening) | Pick a point near the top of the curve (lower \(t\), accepting more false alarms). |
| Balanced trade-off | The point on the curve closest to \((0, 1)\). |
Table 5 – Guidelines for choosing a decision threshold from the ROC curve.

The last item in the table, on the balanced trade-off, is related to Youden's J statistic.

$$J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1$$
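Since `roc_curve` already returns the candidate thresholds, maximizing \(J\) is a one-liner over its output. A minimal sketch, with scores invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)   # index of the point maximizing Youden's J
print(thresholds[best])       # 0.65
```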

Important caveats
#

| Limitation | Detail |
| ---------- | ------ |
| Calibration-insensitive | AUROC is identical whether the probabilities span 0.9–1.0 or 0–1, as long as the ranking order is preserved. |
| No view of absolute performance under imbalance | A high AUROC may still correspond to poor precision or negative predictive value. |
| Single-number oversimplification | Collapsing the full curve to one scalar ignores threshold-specific trade-offs. |
| Multiclass extension | For \(c\) classes, one common approach averages pairwise AUC over all \(\frac{c(c-1)}{2}\) pairs. |
Table 6 – Some important caveats when dealing with the AUROC interpretation.
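On the multiclass point: Scikit-Learn exposes this pairwise averaging through the `multi_class='ovo'` option of `roc_auc_score`. A sketch on the three-class iris dataset (training-set AUROC only, for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# 'ovo' averages the AUC over all c(c-1)/2 class pairs.
auc = roc_auc_score(y, proba, multi_class='ovo')
print(round(auc, 3))
```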

Angelo Varlotta