Machine Learning: A Concise Tutorial
Machine Learning - AUC-ROC Curve
The AUC-ROC curve is a widely used tool for evaluating binary classification models. The ROC (receiver operating characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, and the AUC (area under the curve) summarizes that plot as a single number.
What is the AUC-ROC Curve?
The ROC curve is a graphical representation of a binary classifier's performance as the decision threshold varies. It plots the TPR on the y-axis and the FPR on the x-axis. The TPR, TP / (TP + FN), is the proportion of actual positive cases that the model correctly identifies. The FPR, FP / (FP + TN), is the proportion of actual negative cases that the model incorrectly classifies as positive.
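To make the two rates concrete, here is a minimal sketch that computes TPR and FPR at a few thresholds. The function name `tpr_fpr` and the toy labels and scores are hypothetical and chosen only for illustration; a sample is predicted positive when its score is at or above the threshold.

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1  # actual positive, predicted positive
        elif label == 1:
            fn += 1  # actual positive, predicted negative
        elif predicted_positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: three positives and three negatives
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

for t in (0.25, 0.5, 0.75):
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold lowers both rates in this toy example. Each threshold contributes one (FPR, TPR) point, and the ROC curve is the path through those points.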
The curve is useful for judging a model's overall performance because it shows the trade-off between TPR and FPR at every threshold rather than at a single one. The area under the curve (AUC) condenses this into one number: it equals the probability that the model scores a randomly chosen positive case higher than a randomly chosen negative one. A perfect classifier has an AUC of 1.0, while a classifier that guesses at random has an AUC of about 0.5.
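The AUC can also be read as a ranking statistic: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below computes it that way on small made-up inputs. The helper `pairwise_auc` is a hypothetical name, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Every positive outranks every negative: a perfect ranking
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0

# Identical scores carry no ranking information
print(pairwise_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))  # 0.5

# One positive (0.4) is outranked by one negative (0.6): 8 of 9 pairs are correct
mixed = pairwise_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(round(mixed, 3))  # 0.889
```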
Why is the AUC-ROC Curve Important?
The AUC-ROC curve matters because it measures how well a model separates positive cases from negative ones, independent of any single threshold.

It is particularly useful when the data is imbalanced, meaning that one class is much more prevalent than the other. In such cases accuracy alone can be misleading: a model that always predicts the majority class scores high accuracy without identifying a single minority case.

Because TPR is computed only over the actual positives and FPR only over the actual negatives, the AUC-ROC curve gives a more balanced view of the model's performance on both classes.
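A small sketch of the imbalance problem, using made-up labels (95 negatives and 5 positives) and a hypothetical "model" that always predicts the negative class. It relies only on `accuracy_score` and `roc_auc_score` from scikit-learn.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A degenerate model: always predicts class 0 and gives every sample the same score
always_negative_labels = [0] * 100
constant_scores = [0.0] * 100

accuracy = accuracy_score(y_true, always_negative_labels)
auc = roc_auc_score(y_true, constant_scores)

print("Accuracy:", accuracy)  # 0.95, despite finding no positive case
print("AUC-ROC:", auc)        # 0.5, no better than random ranking
```

Accuracy rewards the model for the prevalence of the majority class, while the AUC shows that its scores cannot separate the two classes at all.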
Implementing the AUC-ROC Curve in Python
Now that we understand what the AUC-ROC curve is and why it matters, let's implement it in Python. We will use the scikit-learn library to build a binary classification model and plot its ROC curve.

First, we import the necessary libraries and load the dataset. This example uses the breast cancer dataset that ships with scikit-learn.
Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# load the dataset (features in X, binary labels in y)
data = load_breast_cancer()
X = data.data
y = data.target
# hold out 20% of the samples as a test set; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Next, we fit a logistic regression model to the training set and predict class probabilities for the test set.
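# fit a logistic regression model
# (on these unscaled features the default solver may emit a ConvergenceWarning;
# raising max_iter or standardizing the features avoids it)
lr = LogisticRegression()
lr.fit(X_train, y_train)
# predicted probability of the positive class (label 1, "benign" in this dataset)
y_pred = lr.predict_proba(X_test)[:, 1]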
With the predicted probabilities in hand, we can calculate the AUC-ROC score using scikit-learn's roc_auc_score() function.
# calculate the AUC-ROC score
auc_roc = roc_auc_score(y_test, y_pred)
print("AUC-ROC Score:", auc_roc)
This prints the AUC-ROC score of the logistic regression model on the test set.
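The score is computed from predicted probabilities rather than hard class labels, and that choice matters. The sketch below compares the two on the same dataset and split. It is self-contained, and `max_iter=10000` is an assumption added here only to avoid a convergence warning, so its numbers may differ slightly from the tutorial's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: max_iter is raised only to silence the convergence warning
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]  # continuous scores in [0, 1]
labels = model.predict(X_test)              # hard 0/1 predictions

auc_from_scores = roc_auc_score(y_test, scores)
auc_from_labels = roc_auc_score(y_test, labels)

# Hard labels take only two distinct values, so their ROC curve collapses
# to at most three points; probabilities trace the curve across many thresholds.
fpr_s, tpr_s, _ = roc_curve(y_test, scores)
fpr_l, tpr_l, _ = roc_curve(y_test, labels)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
print("ROC points (probabilities / labels):", len(fpr_s), "/", len(fpr_l))
```

Passing hard labels evaluates the model at a single operating point, so the result no longer reflects its ranking quality across thresholds.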
Finally, we can plot the ROC curve using the roc_curve() function and the matplotlib library.
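# compute the (FPR, TPR) points of the ROC curve, one per threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr, tpr)
# dashed diagonal: the expected curve of a random classifier, for reference
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()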
When you run this code, it plots the ROC curve for the logistic regression model.

It also prints the AUC-ROC score to the terminal (the exact value may vary slightly with your scikit-learn version):
AUC-ROC Score: 0.9967245332459875
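The thresholds array returned by roc_curve() is not used in the plot, but it can help choose an operating point. One common heuristic, which this tutorial does not prescribe, is Youden's J statistic: pick the threshold that maximizes TPR - FPR. A minimal sketch on made-up scores follows.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores: four positives, three negatives
y_true = [1, 1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic for each candidate threshold
j_scores = tpr - fpr
best = int(np.argmax(j_scores))

best_threshold = float(thresholds[best])
best_j = float(j_scores[best])

print("Best threshold:", best_threshold)  # 0.7
print("TPR:", tpr[best], "FPR:", fpr[best])  # TPR: 0.75 FPR: 0.0
```

At a threshold of 0.7 the toy model catches three of the four positives without flagging any negative. The right operating point in practice depends on the relative cost of false positives and false negatives.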