Machine Learning With Python: A Concise Tutorial

Classification Algorithms - Logistic Regression

Introduction to Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The target, or dependent variable, is dichotomous, which means there are only two possible classes.

In simple words, the dependent variable is binary, with data coded as either 1 (success/yes) or 0 (failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, and cancer detection.

Types of Logistic Regression

Generally, logistic regression means binary logistic regression with a binary target variable, but it can also predict two other categories of target variable. Based on the number of categories, logistic regression can be divided into the following types −

Binary or Binomial

In this kind of classification, the dependent variable has only two possible types, 1 or 0. For example, these values may represent success or failure, yes or no, win or loss, and so on.

Multinomial

In this kind of classification, the dependent variable can have 3 or more possible unordered types, that is, types with no quantitative significance. For example, these values may represent "Type A", "Type B", or "Type C".

Ordinal

In this kind of classification, the dependent variable can have 3 or more possible ordered types, that is, types with a quantitative significance. For example, these values may represent "poor", "good", "very good", and "excellent", and each category can be given a score such as 0, 1, 2, or 3.
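The difference between unordered and ordered targets shows up in how the labels are encoded. The sketch below is only an illustration: the sample labels and the variable names (`ordinal_levels`, `ratings`, `types`) are made up for the example.

```python
import numpy as np

# Ordinal target: the order of the levels carries meaning, so the scores do too.
ordinal_levels = ["poor", "good", "very good", "excellent"]
score = {level: i for i, level in enumerate(ordinal_levels)}
ratings = ["good", "excellent", "poor", "very good"]
y_ordinal = np.array([score[r] for r in ratings])
print(y_ordinal)  # → [1 3 0 2]

# Multinomial (unordered) target: integer codes are arbitrary labels only.
types = np.array(["Type B", "Type A", "Type C", "Type A"])
classes, y_nominal = np.unique(types, return_inverse=True)
print(classes)    # sorted unique labels
print(y_nominal)  # → [1 0 2 0]
```

For the unordered case, the codes 0, 1, 2 only identify the classes; comparing them numerically has no meaning.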

Logistic Regression Assumptions

Before diving into the implementation of logistic regression, we must be aware of the following assumptions about the algorithm −

  1. In the case of binary logistic regression, the target variable must always be binary, and the desired outcome is represented by factor level 1.

  2. There should be no multicollinearity in the model, which means the independent variables must be independent of each other.

  3. We must include meaningful variables in our model.

  4. We should choose a large sample size for logistic regression.
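The multicollinearity assumption can be checked roughly before fitting. The sketch below is one simple approach, not a complete diagnostic: it computes pairwise Pearson correlations between feature columns with NumPy. The synthetic data and the helper name `max_abs_correlation` are assumptions made for the illustration.

```python
import numpy as np

def max_abs_correlation(X):
    """Return the largest absolute pairwise correlation between columns of X."""
    corr = np.corrcoef(X, rowvar=False)      # treat columns as features
    off_diag = corr - np.eye(corr.shape[0])  # remove the diagonal of ones
    return np.abs(off_diag).max()

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

X_independent = np.column_stack([a, b])             # unrelated features
X_collinear = np.column_stack([a, 2 * a + 0.01 * b])  # second is nearly 2 * first

print(max_abs_correlation(X_independent))  # close to 0
print(max_abs_correlation(X_collinear))    # close to 1
```

A value near 1 suggests that two predictors carry almost the same information, in which case dropping or combining one of them is worth considering.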

Binary Logistic Regression model

The simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible types, 1 or 0. It allows us to model a relationship between multiple predictor variables and a binary/binomial target variable. In logistic regression, the linear function Xθ is used as the input to another function, g, in the following relation −

$$h = g(X\theta)$$

Here, g is the logistic or sigmoid function, which can be given as follows −

$$g(z) = \frac{1}{1 + e^{-z}}, \quad z = X\theta$$
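As a quick numeric check of the sigmoid function, the following minimal sketch evaluates it at a few points. The standalone function name `sigmoid` is chosen for this illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # approximately [0.0067, 0.5, 0.9933]
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.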

The sigmoid curve can be represented by the following graph. The values on the y-axis lie between 0 and 1, and the curve crosses the y-axis at 0.5.

[Figure: sigmoid curve]

The classes can be divided into positive and negative. The output of the hypothesis function lies between 0 and 1 and is interpreted as the probability of the positive class. For our implementation, we interpret the output as positive if it is ≥0.5, and as negative otherwise.
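The decision rule can be sketched as an explicit comparison against the threshold. The weights `theta`, the inputs `X`, and the helper name `classify` below are hypothetical values chosen for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(X, theta, threshold=0.5):
    """Label a row 1 when its predicted probability reaches the threshold."""
    probs = sigmoid(X @ theta)
    return (probs >= threshold).astype(int)

# Hypothetical weights; the first column of X is the intercept term.
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],   # z = -1 → probability below 0.5
              [1.0, 0.5],   # z =  0 → probability exactly 0.5
              [1.0, 1.0]])  # z =  1 → probability above 0.5
print(classify(X, theta))  # → [0 1 1]
```

One detail to keep in mind: NumPy's `round` rounds halves to the nearest even value, so a probability of exactly 0.5 rounds to 0. An explicit `>=` comparison follows the "≥0.5 is positive" rule more literally in that edge case.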

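We also need to define a loss function to measure how well the algorithm performs with a given set of weights, represented by theta, as follows −

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i \log(h_i) - (1 - y_i) \log(1 - h_i) \right], \quad h = g(X\theta)$$

Here m is the number of training examples.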
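The loss used in the implementation later in this chapter is the average cross-entropy (log loss). The sketch below evaluates it on made-up predictions to show that confident wrong answers are penalized much more than confident right ones. The function name `log_loss` and the sample arrays are assumptions for this example.

```python
import numpy as np

def log_loss(h, y):
    """Average cross-entropy between predicted probabilities h and labels y."""
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

y = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # probabilities agree with y
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # probabilities contradict y

print(log_loss(mostly_right, y))  # about 0.164
print(log_loss(mostly_wrong, y))  # about 1.956
```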

Now that the loss function is defined, our prime goal is to minimize it. This is done by fitting the weights, that is, by increasing or decreasing them. The derivative of the loss function with respect to each weight tells us which weights should be increased and which should be decreased, and by how much.

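The following gradient descent equations tell us how the loss changes as we modify the parameters, and how the weights are updated with a learning rate α −

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T \left( g(X\theta) - y \right)$$

$$\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$$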
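One way to gain confidence in a gradient formula is to compare it with a finite-difference approximation. The sketch below does that for the gradient used in the implementation later in this chapter, X^T (g(Xθ) − y) / m, and then takes a single descent step. The random data, the seed, and the step size of 0.1 are arbitrary choices for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    h = sigmoid(X @ theta)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def gradient(theta, X, y):
    # Analytic gradient of the average cross-entropy loss
    return X.T @ (sigmoid(X @ theta) - y) / y.size

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # intercept + 2 features
y = rng.integers(0, 2, size=20)
theta = rng.normal(size=3) * 0.1

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(theta, X, y)
print(np.max(np.abs(numeric - analytic)))  # should be very small

# A single gradient descent step should lower the loss
new_theta = theta - 0.1 * analytic
print(loss(theta, X, y), loss(new_theta, X, y))
```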

Implementation in Python

Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we use a multivariate flower dataset named 'iris', which has 3 classes of 50 instances each; every class represents a type of iris flower. We will use only the first two feature columns, and because binomial logistic regression needs a binary target, the last two classes are merged into a single class 1.

First, we need to import the necessary libraries as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

Next, load the iris dataset as follows −

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
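It can help to inspect what the loading step produces. The following sketch, which assumes scikit-learn is installed, prints the names of the two selected features, the shape of X, and the class counts after the target is binarized.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]          # first two feature columns only
y = (iris.target != 0) * 1    # class 0 stays 0; classes 1 and 2 become 1

print(iris.feature_names[:2])  # names of the two columns kept
print(X.shape)                 # → (150, 2)
print(np.bincount(y))          # → [ 50 100]
```

The binarized target is imbalanced: 50 examples of class 0 against 100 examples of class 1.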

We can plot our training data as follows −

plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend();

Next, we will define the sigmoid function, the loss function, and gradient descent as follows −

class LogisticRegression:
   def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
      self.lr = lr
      self.num_iter = num_iter
      self.fit_intercept = fit_intercept
      self.verbose = verbose
   def __add_intercept(self, X):
      intercept = np.ones((X.shape[0], 1))
      return np.concatenate((intercept, X), axis=1)
   def __sigmoid(self, z):
      return 1 / (1 + np.exp(-z))
   def __loss(self, h, y):
      return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
   def fit(self, X, y):
      if self.fit_intercept:
         X = self.__add_intercept(X)

Now, still inside the fit method, initialize the weights and run gradient descent as follows −

      # Continuation of fit(): start from all-zero weights
      self.theta = np.zeros(X.shape[1])
      for i in range(self.num_iter):
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         gradient = np.dot(X.T, (h - y)) / y.size
         self.theta -= self.lr * gradient
         # Recompute the loss with the updated weights
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         loss = self.__loss(h, y)
         if self.verbose and i % 10000 == 0:
            print(f'loss: {loss} \t')

With the help of the following methods, which also belong to the class, we can predict the output probabilities and the class labels −

   # Methods of the LogisticRegression class (same indentation as fit)
   def predict_prob(self, X):
      if self.fit_intercept:
         X = self.__add_intercept(X)
      return self.__sigmoid(np.dot(X, self.theta))
   def predict(self, X):
      return self.predict_prob(X).round()

Next, we can train and evaluate the model and plot the decision boundary as follows −

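model = LogisticRegression(lr=0.1, num_iter=300000)
model.fit(X, y)  # train first; model.theta does not exist before fitting
preds = model.predict(X)
print((preds == y).mean())  # fraction of training points classified correctly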

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_prob(grid).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');
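As a cross-check on the hand-written class, the same data can be fit with scikit-learn's own `LogisticRegression`. This is only a sanity check under the library's default settings, which include L2 regularization (C=1.0), so the coefficients are not expected to match the unregularized class above. The alias `SkLogisticRegression` is used here only to avoid a name clash with that class.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

# Library defaults: L2 regularization (C=1.0) and a fitted intercept
sk_model = SkLogisticRegression()
sk_model.fit(X, y)

sk_accuracy = (sk_model.predict(X) == y).mean()
print(sk_accuracy)                           # training accuracy
print(sk_model.intercept_, sk_model.coef_)   # learned intercept and weights
```

If the two models disagree widely on training accuracy, that is a hint to re-check the learning rate or the number of iterations in the hand-written version.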

Multinomial Logistic Regression Model

Another useful form of logistic regression is multinomial logistic regression, in which the target or dependent variable can have 3 or more possible unordered types, that is, types with no quantitative significance.
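A common way to extend the binary model to several unordered classes is to replace the sigmoid with the softmax function, which turns one score per class into probabilities that sum to 1. The text above does not spell out a formula, so the sketch below is an illustrative assumption with made-up scores, not a description of any particular library's internals.

```python
import numpy as np

def softmax(scores):
    """Convert each row of class scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two hypothetical samples with scores for three classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
print(probs)                 # second row is uniform: one third per class
print(probs.sum(axis=1))     # each row sums to 1
print(probs.argmax(axis=1))  # predicted class per sample
```

The predicted class is the one with the highest probability, which plays the role that the 0.5 threshold plays in the binary case.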

Implementation in Python

Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we use the digits dataset from sklearn.

First, we need to import the necessary libraries as follows −

import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

Next, we need to load the digits dataset −

digits = datasets.load_digits()

Now, define the feature matrix (X) and response vector (y) as follows −

X = digits.data
y = digits.target

With the next line of code, we can split X and y into training and testing sets −

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
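To see what the split produces, the shapes of the resulting arrays can be printed. This sketch repeats the loading and splitting steps so that it runs on its own, and assumes scikit-learn is installed.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# 40% of the samples are held out for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

print(X.shape)                      # → (1797, 64)
print(X_train.shape, X_test.shape)  # → (1078, 64) (719, 64)
```

Each sample is an 8x8 image flattened into 64 pixel features, and the test set holds 719 of the 1797 samples.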

Now create a logistic regression object as follows −

digreg = linear_model.LogisticRegression()

Now, we need to train the model using the training set as follows −

digreg.fit(X_train, y_train)

Next, make predictions on the testing set as follows −

y_pred = digreg.predict(X_test)

Next, print the accuracy of the model as follows −

print("Accuracy of Logistic Regression model is:",
      metrics.accuracy_score(y_test, y_pred)*100)

Output

Accuracy of Logistic Regression model is: 95.6884561891516

From the above output we can see that the accuracy of our model is around 96 percent.
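A single accuracy number hides which digits get confused with each other. The sketch below is a possible follow-up, not part of the original walkthrough: it rebuilds the same pipeline and prints a confusion matrix and per-class recall. The `max_iter=5000` argument is an assumption added here to give the solver room to converge on unscaled pixel values, so the accuracy may differ slightly from the figure above.

```python
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=1)

# max_iter is raised here (an assumption, not in the original code)
digreg = linear_model.LogisticRegression(max_iter=5000)
digreg.fit(X_train, y_train)
y_pred = digreg.predict(X_test)

# Rows are true digits, columns are predicted digits
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Fraction of each true digit that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall.round(3))
```

Off-diagonal entries show which pairs of digits the model mixes up, which is often more useful for debugging than the overall accuracy.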