Machine Learning With Python Tutorial

Regression Algorithms - Linear Regression

Introduction to Linear Regression

Linear regression may be defined as a statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable changes accordingly (increases or decreases).

Mathematically, the relationship can be represented with the help of the following equation −

Y = mX + b

Here, Y is the dependent variable we are trying to predict.

X is the independent variable we are using to make predictions.

m is the slope of the regression line, which represents the effect X has on Y.

b is a constant, known as the Y-intercept. If X = 0, Y would be equal to b.
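
As a quick numeric illustration (with made-up values, not taken from any dataset in this tutorial), let m = 2 and b = 5 −

# Illustrative values only: slope m = 2, intercept b = 5
m, b = 2, 5
for X in (0, 3):
   print("X =", X, "-> Y =", m*X + b)
# Prints Y = 5 when X = 0 (the intercept) and Y = 11 when X = 3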

Furthermore, a linear relationship can be positive or negative in nature, as explained below −

Positive Linear Relationship

A linear relationship is called positive if both the independent and dependent variables increase. It can be understood with the help of the following graph −

[Graph: positive linear relationship]

Negative Linear Relationship

A linear relationship is called negative if the independent variable increases and the dependent variable decreases. It can be understood with the help of the following graph −

[Graph: negative linear relationship]

Types of Linear Regression

Linear regression is of the following two types −

  1. Simple Linear Regression

  2. Multiple Linear Regression

Simple Linear Regression (SLR)

It is the most basic version of linear regression, which predicts a response using a single feature. The assumption in SLR is that the two variables are linearly related.

Python Implementation

We can implement SLR in Python in two ways: one is to provide your own dataset, and the other is to use a dataset from the scikit-learn Python library.

Example 1 − In the following Python implementation example, we are using our own dataset.

First, we will start by importing the necessary packages as follows −

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Next, define a function that will calculate the important values for SLR −

def coef_estimation(x, y):

The following script line will give the number of observations, n −

   n = np.size(x)

The mean of the x and y vectors can be calculated as follows −

   m_x, m_y = np.mean(x), np.mean(y)

We can find the cross-deviation and the deviation about x as follows −

   SS_xy = np.sum(y*x) - n*m_y*m_x
   SS_xx = np.sum(x*x) - n*m_x*m_x

Next, the regression coefficients, i.e. b_0 and b_1, can be calculated as follows −

   b_1 = SS_xy / SS_xx
   b_0 = m_y - b_1*m_x
   return (b_0, b_1)
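
For reference, these are the standard least-squares estimates: b_1 = SS_xy / SS_xx minimizes the sum of squared residuals, and b_0 = m_y - b_1*m_x ensures that the fitted line passes through the point of means (m_x, m_y).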

Next, we need to define a function that will plot the regression line as well as predict the response vector −

def plot_regression_line(x, y, b):

The following script line will plot the actual points as a scatter plot −

   plt.scatter(x, y, color = "m", marker = "o", s = 30)

The following script line will predict the response vector −

   y_pred = b[0] + b[1]*x

The following script lines will plot the regression line and put labels on the plot −

   plt.plot(x, y_pred, color = "g")
   plt.xlabel('x')
   plt.ylabel('y')
   plt.show()

At last, we need to define the main() function for providing the dataset and calling the functions we defined above −

def main():
   x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
   y = np.array([100, 300, 350, 500, 750, 800, 850, 900, 1050, 1250])
   b = coef_estimation(x, y)
   print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
   plot_regression_line(x, y, b)

if __name__ == "__main__":
   main()

Output

Estimated coefficients:
b_0 = 154.5454545454545
b_1 = 117.87878787878788
[Plot: data points with the fitted regression line]

Example 2 − In the following Python implementation example, we are using the diabetes dataset from scikit-learn.

First, we will start by importing the necessary packages as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

Next, we will load the diabetes dataset and create its object −

diabetes = datasets.load_diabetes()

As we are implementing SLR, we will be using only one feature as follows −

X = diabetes.data[:, np.newaxis, 2]   # third feature (BMI), kept as a 2-D column

Next, we need to split the data into training and testing sets as follows −

X_train = X[:-30]   # all but the last 30 observations for training
X_test = X[-30:]    # the last 30 observations for testing

Next, we need to split the target into training and testing sets as follows −

y_train = diabetes.target[:-30]
y_test = diabetes.target[-30:]

Now, to train the model, we need to create a linear regression object as follows −

regr = linear_model.LinearRegression()

Next, train the model using the training sets as follows −

regr.fit(X_train, y_train)

Next, make predictions using the testing set as follows −

y_pred = regr.predict(X_test)

Next, we will print the coefficients along with metrics such as MSE and the variance score, as follows −

print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))
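
Note − The variance score printed here is the coefficient of determination R², as computed by r2_score; a score of 1.0 would indicate perfect prediction.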

Now, plot the outputs as follows −

plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Output

Coefficients:
   [941.43097333]
Mean squared error: 3035.06
Variance score: 0.41
[Plot: test data points (blue) with the fitted regression line (red)]

Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features. Mathematically, we can explain it as follows −

Consider a dataset having n observations, p features (i.e. independent variables), and y as one response (i.e. the dependent variable). The regression line for p features can be calculated as follows −

h(xi) = b0 + b1*xi1 + b2*xi2 + ... + bp*xip

Here, h(xi) is the predicted response value and b0, b1, b2, ..., bp are the regression coefficients.

Multiple Linear Regression models always include an error in the data, known as the residual error, which changes the calculation of the ith response as follows −

yi = b0 + b1*xi1 + b2*xi2 + ... + bp*xip + ei

We can also write the above equation as follows −

yi = h(xi) + ei, where ei = yi - h(xi) is the residual error of the ith observation.
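
To make the math concrete, here is a minimal NumPy sketch (using a tiny made-up dataset, not part of this tutorial) that estimates b0, b1, ..., bp by least squares, which is equivalent to solving the normal equation b = (X^T X)^(-1) X^T y −

import numpy as np

# Tiny made-up dataset: n = 5 observations, p = 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Prepend a column of ones so that b[0] plays the role of the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X1 @ b = y
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("Estimated coefficients b0, b1, b2:", b)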

Python Implementation

In this example, we will be using the Boston housing dataset from scikit-learn −

First, we will start by importing the necessary packages as follows −

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

Next, load the dataset as follows −

boston = datasets.load_boston(return_X_y=False)
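
Note − load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2. If you are running a newer version, you will need to obtain the Boston data from an external source or substitute another regression dataset, for example datasets.fetch_california_housing().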

The following script lines will define the feature matrix, X, and the response vector, y −

X = boston.data
y = boston.target

Next, split the dataset into training and testing sets as follows −

from sklearn.model_selection import train_test_split
# Note: test_size=0.7 holds out 70% of the data for testing, an unusually large
# split, kept here so the output below matches
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)


Now, create a linear regression object and train the model as follows −

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print('Coefficients: \n', reg.coef_)
print('Variance score: {}'.format(reg.score(X_test, y_test)))
plt.style.use('fivethirtyeight')
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
   color = "green", s = 10, label = 'Train data')
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
   color = "blue", s = 10, label = 'Test data')
plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)
plt.legend(loc = 'upper right')
plt.title("Residual errors")
plt.show()

Output

Coefficients:
[
   -1.16358797e-01  6.44549228e-02  1.65416147e-01  1.45101654e+00
   -1.77862563e+01  2.80392779e+00  4.61905315e-02 -1.13518865e+00
    3.31725870e-01 -1.01196059e-02 -9.94812678e-01  9.18522056e-03
   -7.92395217e-01
]
Variance score: 0.709454060230326
[Plot: residual errors for training and test data]

Assumptions

The following are some assumptions about the dataset that are made by the Linear Regression model −

Multi-collinearity − The Linear Regression model assumes that there is very little or no multi-collinearity in the data. Basically, multi-collinearity occurs when the independent variables or features have some dependency among them. A quick check is sketched below −
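
As a minimal sketch (reusing the diabetes feature matrix from Example 2 purely for illustration), we can inspect the pairwise correlations between features; absolute values close to 1 suggest multi-collinearity −

import numpy as np
from sklearn import datasets

# Feature matrix to inspect (diabetes data, as in Example 2)
X = datasets.load_diabetes().data

# Correlation matrix of the features (rowvar=False treats each column as a variable)
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds an arbitrary 0.7 threshold
n_features = corr.shape[0]
for i in range(n_features):
   for j in range(i + 1, n_features):
      if abs(corr[i, j]) > 0.7:
         print("Features {} and {}: correlation {:.2f}".format(i, j, corr[i, j]))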

Auto-correlation − Another assumption the Linear Regression model makes is that there is very little or no auto-correlation in the data. Basically, auto-correlation occurs when there is dependency between the residual errors, as checked in the sketch below −
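
One common way to check this assumption is the Durbin-Watson statistic, shown in the following minimal sketch (it assumes the statsmodels library is installed; values near 2 indicate little or no auto-correlation of the residuals) −

from sklearn import datasets, linear_model
from statsmodels.stats.stattools import durbin_watson

# Fit a quick model on the diabetes data and compute its residuals
X, y = datasets.load_diabetes(return_X_y=True)
model = linear_model.LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print("Durbin-Watson statistic: %.2f" % durbin_watson(residuals))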

Relationship between variables − The Linear Regression model assumes that the relationship between the response and feature variables must be linear.
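
A simple visual check (a minimal sketch reusing the single-feature setup from Example 2) is to scatter the feature against the response; a roughly straight-line pattern supports this assumption −

import matplotlib.pyplot as plt
from sklearn import datasets

# Single feature (BMI, column 2) against the response, as in Example 2
X, y = datasets.load_diabetes(return_X_y=True)
plt.scatter(X[:, 2], y, s = 10)
plt.xlabel('feature (BMI)')
plt.ylabel('target')
plt.title('Linearity check')
plt.show()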