Machine Learning: A Concise Tutorial

Machine Learning - Logistic Regression

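Logistic regression is a popular algorithm for binary classification problems, in which the target variable is categorical with two classes. It models the probability of the target variable given the input features and predicts the class with the highest probability.

Logistic regression is a type of generalized linear model in which the target variable follows a Bernoulli distribution. The model consists of a linear function of the input features, which is transformed by the logistic function to produce a probability between 0 and 1.

The linear function serves as the input to another function, such as g in the following relation: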

h_{\theta}(x) = g(\theta^{T} x), \quad \text{where } 0 \leq h_{\theta}(x) \leq 1

Here, g is the logistic function, also called the sigmoid function, which can be written as follows:

g(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \theta^{T} x

The sigmoid curve is S-shaped. Its values on the y-axis lie between 0 and 1, and it crosses the y-axis at 0.5.
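The behavior of the sigmoid can be checked numerically. The following is a minimal sketch using NumPy; the function name `sigmoid` and the sample inputs are chosen here for illustration and do not come from the original tutorial.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs (an assumption, not from the tutorial)
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
g = sigmoid(z)

print(g)             # values rise monotonically and stay strictly between 0 and 1
print(sigmoid(0.0))  # 0.5, the point where the curve crosses the y-axis
```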

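The two classes can be labeled positive and negative. The output, which lies between 0 and 1, is the probability that the input belongs to the positive class. In our implementation, we interpret the output of the hypothesis function as positive if it is ≥ 0.5 and negative otherwise.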
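The decision rule can be sketched in a few lines. This is only an illustration: the helper `predict_class`, the parameter vector `theta`, and the two inputs are hypothetical values made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(theta, x, threshold=0.5):
    """Return (label, probability); label is 1 (positive) if h_theta(x) >= threshold, else 0."""
    prob = sigmoid(np.dot(theta, x))
    return int(prob >= threshold), prob

# Hypothetical parameters and inputs, for illustration only
theta = np.array([-1.0, 2.0])   # theta[0] plays the role of the intercept
x_pos = np.array([1.0, 1.5])    # theta^T x = 2.0
x_neg = np.array([1.0, 0.0])    # theta^T x = -1.0

label_pos, p_pos = predict_class(theta, x_pos)
label_neg, p_neg = predict_class(theta, x_neg)
print(label_pos, round(p_pos, 3))
print(label_neg, round(p_neg, 3))
```

Because g(0) = 0.5 and g is increasing, the condition h ≥ 0.5 is equivalent to θᵀx ≥ 0, so the threshold defines a linear decision boundary in the feature space.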

Implementation in Python

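Now we will implement the above concepts of logistic regression in Python. For this purpose, we use the multivariate flower dataset named "iris". The iris dataset is a well-known dataset in machine learning; it contains measurements of sepal length, sepal width, petal length, and petal width for three different species of iris flowers. We will use logistic regression to predict the species of an iris flower from its measurements. Because the dataset has three classes rather than two, this is a multiclass problem; scikit-learn's LogisticRegression handles this case without extra configuration.

Let us now go through the steps of implementing logistic regression in Python with the iris dataset: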

Load the Dataset

First, we need to load the iris dataset into our Python environment. We can load it with the scikit-learn library as follows:

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # input features
y = iris.target # target variable
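To see what was loaded, it can help to inspect the returned object. The sketch below only uses attributes that `load_iris()` provides (`data`, `target`, `feature_names`, `target_names`).

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```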

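Split the Dataset

Next, we need to split the dataset into a training set and a test set. We will use 70% of the data for training and 30% for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Plot the Training Data

This step is optional, but to get a better feel for the dataset, we plot the first two features of the training data (sepal length and sepal width), colored by species, as follows:

import matplotlib.pyplot as plt

# plot the first two features of the training data, colored by class label
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris Training Data')
plt.show()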
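As a quick sanity check on the 70/30 split, the sketch below repeats the split with the same arguments used in this tutorial and prints the resulting shapes. The per-class count is an extra check added here, not part of the original steps.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(np.bincount(y_train))         # number of training samples per class
```

A plain random split does not guarantee equal class proportions in both sets. If that matters, `train_test_split` also accepts `stratify=y`.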

Create the Logistic Regression Model

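We can create a logistic regression model with the LogisticRegression class from scikit-learn. We use L2 regularization and set C, the inverse of the regularization strength, to 1.0 (smaller values of C mean stronger regularization).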

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2', C=1.0, random_state=42)
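To make the role of `C` concrete, the sketch below fits two models with very different values and compares the size of the learned weights. The two values of `C` and `max_iter=1000` are assumptions chosen for this illustration; L2 is the default penalty, so it is not passed explicitly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Illustrative C values, chosen to be far apart (not from the tutorial)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

norm_strong = np.linalg.norm(strong_reg.coef_)
norm_weak = np.linalg.norm(weak_reg.coef_)
print(norm_strong, norm_weak)  # the strongly regularized weights are smaller
```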

Train the Model

We can train the model on the training set with the fit() method.

clf.fit(X_train, y_train)
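After fitting, the learned parameters (the θ of the formulas earlier in this chapter) are available on the model. The following self-contained sketch repeats the earlier steps and inspects them; `max_iter=1000` is an assumption added here to avoid possible convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print(clf.classes_)          # [0 1 2]
print(clf.coef_.shape)       # (3, 4): one weight vector per class
print(clf.intercept_.shape)  # (3,): one intercept per class
```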

Make Predictions

Once the model is trained, we can make predictions on the test set with the predict() method.

y_pred = clf.predict(X_test)
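Since logistic regression models probabilities, it can be useful to look at them directly with `predict_proba()`. The sketch below is self-contained and checks that the predicted class is the one with the highest probability; `max_iter=1000` is again an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # one row per sample, one column per class
y_pred = clf.predict(X_test)

print(proba[:3].round(3))          # class probabilities for the first 3 test samples
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
argmax_matches = np.array_equal(clf.classes_[proba.argmax(axis=1)], y_pred)
print(rows_sum_to_one, argmax_matches)
```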

Evaluate the Model

Finally, we can evaluate the model's performance with metrics such as accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print('F1-score:', f1_score(y_test, y_pred, average='macro'))

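Here, we pass "macro" for the average parameter so that each metric is computed separately for every class and the per-class values are then averaged with equal weight.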
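Macro averaging can be reproduced by hand. In the sketch below the labels are hypothetical, chosen so that the per-class precision differs; `average=None` returns one score per class, and their unweighted mean matches `average='macro'`.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

per_class = precision_score(y_true, y_hat, average=None)
macro = precision_score(y_true, y_hat, average='macro')

print(per_class)                  # precision for classes 0, 1, 2: 1/2, 2/3, 1
print(macro, np.mean(per_class))  # both are the unweighted mean of the per-class values
```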

Complete Implementation Example

Given below is a complete example of implementing logistic regression in Python with the iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# load the iris dataset
iris = load_iris()
X = iris.data # input features
y = iris.target # target variable

# split the dataset into training and test sets
# (done before plotting, because the plot uses X_train and y_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# plot the first two features of the training data, colored by class label
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris Training Data')
plt.show()

# create the logistic regression model
clf = LogisticRegression(penalty='l2', C=1.0, random_state=42)

# train the model on the training set
clf.fit(X_train, y_train)

# make predictions on the test set
y_pred = clf.predict(X_test)

# evaluate the performance of the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print('F1-score:', f1_score(y_test, y_pred, average='macro'))

When you run this code, it displays a scatter plot of the training data and prints the following output:

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0