Machine Learning - Logistic Regression
Logistic regression is a popular algorithm for binary classification problems, in which the target variable is categorical with two classes. It models the probability of each class given the input features and predicts the class with the highest probability.
Logistic regression is a type of generalized linear model in which the target variable follows a Bernoulli distribution. The model consists of a linear function of the input features, which is passed through the logistic function to produce a probability value between 0 and 1.
The linear function serves as the input to another function, such as g in the following relation:
h_{\theta}(x) = g(\theta^{T}x), \quad \text{where } 0 \leq h_{\theta}(x) \leq 1
Here, g is the logistic (sigmoid) function, which is defined as follows:
g(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \theta^{T}x
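The definition of g can be checked numerically. The following is a minimal sketch using only the standard library; the function name sigmoid and the sample points are illustrative choices, not part of the original tutorial.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate g at a few illustrative points to see the S-shaped behaviour:
# values approach 0 for large negative z and 1 for large positive z.
values = {z: sigmoid(z) for z in (-6, -2, 0, 2, 6)}
for z, g in values.items():
    print(f"g({z:+d}) = {g:.4f}")
```

This direct form is fine for moderate inputs; for very large negative z, math.exp(-z) can overflow, so production code usually relies on a numerically stable library routine instead.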
The sigmoid curve can be plotted as an S-shaped graph. Its values on the y-axis lie between 0 and 1, and the curve crosses the y-axis at 0.5, that is, g(0) = 0.5.
The two classes are labelled positive and negative. The output of the hypothesis function lies between 0 and 1 and is interpreted as the probability of the positive class. For our implementation, we classify an input as positive if the output is ≥ 0.5, and as negative otherwise.
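The decision rule can be sketched in a few lines of plain Python. Since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding the output at 0.5 is the same as checking the sign of θᵀx. The names hypothesis and classify and the parameter values below are hypothetical, chosen only for illustration.

```python
import math

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for plain Python sequences."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(theta, x, threshold=0.5):
    """Return 1 (positive) if h_theta(x) >= threshold, else 0 (negative)."""
    return 1 if hypothesis(theta, x) >= threshold else 0

# Hypothetical parameters; x[0] = 1 plays the role of the intercept term.
theta = [-1.0, 2.0]
positive_example = [1.0, 1.5]    # theta^T x = 2.0  -> probability above 0.5
negative_example = [1.0, 0.25]   # theta^T x = -0.5 -> probability below 0.5

label_pos = classify(theta, positive_example)
label_neg = classify(theta, negative_example)
print(label_pos, label_neg)
```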
Implementation in Python
Now we will implement logistic regression in Python. For this purpose, we use the multivariate flower dataset named 'iris'. The iris dataset is a well-known dataset in machine learning; it consists of measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers. Because there are three species, this is a multiclass problem rather than a binary one; scikit-learn's LogisticRegression handles the multiclass case automatically. We will use logistic regression to predict the species of an iris flower given its measurements.
The steps to implement logistic regression in Python using the iris dataset are as follows:
Load the Dataset
First, we need to load the iris dataset into our Python environment. We can use the scikit-learn library to load the dataset, as follows:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # input features
y = iris.target # target variable
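Before going further, it can help to inspect what was loaded. The sketch below prints the shape of the feature matrix along with the feature and class names that load_iris() exposes.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                    # (n_samples, n_features)
print(iris.feature_names)         # names of the four measurements
print(iris.target_names)          # names of the three species
print(sorted(set(y.tolist())))    # integer labels used in y
```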
Plot the Training Data
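This step is optional, but plotting the training data gives a better feel for the dataset. The plot uses X_train and y_train, so the train/test split (explained in the next step) has to run first.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# X_train and y_train come from the train/test split (explained in the next step)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# plot the first two features of the training data, colored by species
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris Training Data')
plt.show()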
Split the Dataset
Next, we need to split the dataset into a training set and a test set. We will use 70% of the data for training and 30% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
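As a quick sanity check on the split, the sketch below repeats the same call and prints the resulting shapes. With 150 samples and test_size=0.3, the training set holds 105 rows and the test set holds 45.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Confirm the 70/30 proportions of the split.
print(X_train.shape, X_test.shape)
```

If the class proportions should be identical in both sets, train_test_split also accepts stratify=y; the tutorial's split does not use it.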
Create the Logistic Regression Model
We can use the LogisticRegression class from scikit-learn to create a logistic regression model. We will use L2 regularization and set C to 1.0. Note that C is the inverse of the regularization strength, so smaller values of C mean stronger regularization.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2', C=1.0, random_state=42)
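To illustrate how C behaves, the following sketch fits the model with three values of C and compares the size of the learned coefficients. The chosen values of C and the raised max_iter (used only so the solver converges) are assumptions for this illustration; the default L2 penalty is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means a stronger L2 penalty on the coefficients.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

The coefficient norm grows as C increases, which reflects the weaker penalty.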
Train the Model
We can train the model on the training set using the fit() method.
clf.fit(X_train, y_train)
Make Predictions
Once the model is trained, we can make predictions on the test set using the predict() method.
y_pred = clf.predict(X_test)
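The predicted label is the class with the highest estimated probability. The sketch below makes this visible by comparing predict() with the argmax of predict_proba(). It repeats the earlier steps so that it runs on its own; max_iter=200 is an added assumption to help the solver converge and is not part of the tutorial's model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
clf = LogisticRegression(C=1.0, max_iter=200, random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)    # one row per sample, one column per class
y_pred = clf.predict(X_test)
most_likely = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3].round(3))            # class probabilities for the first 3 samples
print(y_pred[:3], most_likely[:3])   # the two label arrays agree
```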
Evaluate the Model
Finally, we can evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1-score.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print('F1-score:', f1_score(y_test, y_pred, average='macro'))
Here, we set the average parameter to 'macro', which calculates each metric for every class separately and then takes the unweighted mean.
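The macro average can be reproduced by hand. In the sketch below, the small label arrays are made-up values for illustration only; the per-class precision comes from average=None, and its plain mean matches the 'macro' result.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

per_class = precision_score(y_true, y_pred, average=None)   # one value per class
macro = precision_score(y_true, y_pred, average='macro')    # unweighted mean

print(per_class, macro)
```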
Complete Implementation Example
Given below is the complete implementation example of logistic regression in Python using the iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# load the iris dataset
iris = load_iris()
X = iris.data # input features
y = iris.target # target variable
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# plot the training data (first two features, colored by species)
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris Training Data')
plt.show()
# create the logistic regression model
clf = LogisticRegression(penalty='l2', C=1.0, random_state=42)
# train the model on the training set
clf.fit(X_train, y_train)
# make predictions on the test set
y_pred = clf.predict(X_test)
# evaluate the performance of the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print('F1-score:', f1_score(y_test, y_pred, average='macro'))
When you execute this code, it displays a scatter plot of the training data and prints the following output:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0