Machine Learning: A Concise Tutorial

Machine Learning - Multiple Linear Regression

Multiple linear regression extends simple linear regression to predict a response using two or more features. Mathematically, it can be described as follows −

Consider a dataset with n observations, p features (the independent variables), and one response y (the dependent variable). The regression line for p features can be calculated as follows −

$$h\left ( x_{i} \right )=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\cdots +b_{p}x_{ip}$$

Here, $h\left ( x_{i} \right )$ is the predicted response value and $b_{0},b_{1},b_{2},\ldots ,b_{p}$ are the regression coefficients.
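To make the formula concrete, the following minimal sketch evaluates $h\left ( x_{i} \right )$ for a single observation. The function name `predict_one` and the coefficient values are hypothetical, chosen only for illustration.

```python
def predict_one(b0, coefs, x):
    """Return h(x) = b0 + b1*x1 + ... + bp*xp for one observation."""
    if len(coefs) != len(x):
        raise ValueError("need exactly one coefficient per feature")
    return b0 + sum(b * xj for b, xj in zip(coefs, x))


# Hypothetical coefficients for p = 3 features
b0 = 1.0
coefs = [2.0, -1.0, 0.5]

# One observation x_i = (x_i1, x_i2, x_i3)
x_i = [3.0, 4.0, 2.0]

h_xi = predict_one(b0, coefs, x_i)
print(h_xi)  # 1 + 6 - 4 + 1 = 4.0
```

The intercept $b_{0}$ is added once, while each remaining coefficient multiplies its own feature value.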

In practice, the observed responses do not lie exactly on the regression line. Each observation deviates from it by a residual error $e_{i}$, which changes the equation as follows −

$$y_{i}=b_{0}+b_{1}x_{i1}+b_{2}x_{i2}+\cdots +b_{p}x_{ip}+e_{i}$$

We can also write the above equation as follows −

$$y_{i}=h\left ( x_{i} \right )+e_{i}\quad \text{or}\quad e_{i}=y_{i}-h\left ( x_{i} \right )$$
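As a rough illustration of these two forms, the sketch below estimates the coefficients by ordinary least squares with NumPy and then computes the residuals $e_{i}=y_{i}-h\left ( x_{i} \right )$. The data values are made up, and using `np.linalg.lstsq` is an assumption made here; it is only one of several ways to obtain the coefficients.

```python
import numpy as np

# Made-up data: n = 5 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.1, 5.9, 12.2, 11.8, 16.0])

# Prepend a column of ones so that b[0] plays the role of the intercept b_0
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares estimate of (b_0, b_1, b_2)
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

h = X_design @ b   # predicted responses h(x_i)
e = y - h          # residuals e_i = y_i - h(x_i)

print("coefficients:", b)
print("residuals:", e)

# y_i = h(x_i) + e_i holds for every observation
assert np.allclose(y, h + e)
```

Because the model includes an intercept, the least squares residuals sum to zero up to floating-point error.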

Python Implementation

To implement multiple linear regression in Python using Scikit-Learn, we can use the same LinearRegression class as in simple linear regression, but this time we need to provide multiple independent variables as input.
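A minimal sketch of what providing multiple independent variables means in practice: `fit()` takes a two-dimensional array with one row per observation and one column per feature. The data here are synthetic and noise-free, generated from hypothetical coefficients so that the fitted values can be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 observations, 2 features (values are illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))

# Noise-free response built from hypothetical coefficients b0=1, b1=2, b2=3
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)  # X has shape (n_observations, n_features)

print("intercept b0:", model.intercept_)
print("coefficients b1..bp:", model.coef_)

# Predict the response for one new observation with two feature values
print(model.predict(np.array([[4.0, 5.0]])))
```

Since the response contains no noise, the fitted intercept and coefficients should come out very close to 1, 2 and 3.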

Let’s consider the Boston Housing dataset from Scikit-Learn and implement multiple linear regression using it. Note that load_boston() was removed in Scikit-Learn 1.2, so this example requires an earlier version.

Example

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Load the Boston Housing dataset
boston = load_boston()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=0)

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

# Plot the predicted values against the actual values
plt.figure(figsize=(7.5, 3.5))
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Add the reference line y = x (perfect predictions fall on this line)
x = np.linspace(0, 50, 100)
y = x
plt.plot(x, y, color='red')

# Show the plot
plt.show()

In this code, we first load the Boston Housing dataset using the load_boston() function from Scikit-Learn. We then split the dataset into training and testing sets using the train_test_split() function.

Next, we create a LinearRegression object and fit it on the training data using the fit() method. We then make predictions on the test data using the predict() method and calculate the mean squared error and coefficient of determination using the mean_squared_error() and r2_score() functions, respectively.
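For reference, both metrics follow from simple formulas: the mean squared error is the average of the squared residuals, and the coefficient of determination is $R^{2}=1-SS_{res}/SS_{tot}$. The sketch below computes them in plain Python on made-up numbers; the helper names `mean_squared_error_manual` and `r2_manual` are hypothetical and are not part of Scikit-Learn.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(mean_squared_error_manual(y_true, y_pred))  # (1 + 0 + 1 + 0) / 4 = 0.5
print(r2_manual(y_true, y_pred))                  # 1 - 2 / 20 = 0.9
```

A lower mean squared error and a coefficient of determination closer to 1 both indicate predictions that track the actual values more closely.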

Finally, we plot the predicted values against the actual values using the scatter() function and add the reference line y = x to the plot using the plot() function. Points that fall on this line correspond to perfect predictions. We label the x-axis and y-axis using the xlabel() and ylabel() functions and display the plot using the show() function.

When you execute the program, it prints the Mean Squared Error and the Coefficient of Determination on the terminal and produces the following plot as the output −

Mean Squared Error: 33.44897999767653
Coefficient of Determination: 0.5892223849182507
[Figure: scatter plot of predicted values (y-axis) against actual values (x-axis)]
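On Scikit-Learn installations where `load_boston()` is unavailable (it was removed in version 1.2), the same workflow can be tried on another bundled dataset. The sketch below uses `load_diabetes()`. The dataset swap is an assumption made here for runnability, so the resulting scores will differ from the output shown above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load a bundled regression dataset (442 observations, 10 features)
X, y = load_diabetes(return_X_y=True)

# Same split settings as in the Boston Housing example
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training data and predict on the held-out test data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)
```

The plotting code from the original example can be reused unchanged, although the range passed to np.linspace() would need adjusting to the scale of the new target values.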