Machine Learning: A Concise Tutorial

Machine Learning - Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.
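To make the definition concrete, the following is a minimal sketch of the least-squares fit for a single predictor, written in plain Python. The helper name `fit_simple_linear_regression` and the toy data are assumptions made for illustration; they are not part of the tutorial's scikit-learn workflow. The slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative toy data lying exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)  # prints: 1.0 2.0
```

With real, noisy data the points do not lie exactly on a line, and the same formulas give the line that minimizes the sum of squared vertical distances to the points.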

Python Implementation

Given below is an example that shows how to implement simple linear regression in Python using scikit-learn's built-in Diabetes dataset. We will also plot the regression line.

Data Preparation

First, we need to import the Diabetes dataset from scikit-learn and split it into training and testing sets. We will use 80% of the data for training the model and the remaining 20% for testing.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the input data
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

Here, we use the third feature (column index 2) of the dataset, which is the body mass index (BMI), as our independent variable (predictor variable), and the target, a quantitative measure of disease progression one year after baseline, as our dependent variable (response variable).
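To check which measurement a given column index refers to, the object returned by `load_diabetes()` can be inspected directly. The short sketch below assumes only that scikit-learn is installed. It prints the column names in the order used by `diabetes.data`, so the name at index 2 identifies the feature selected by `diabetes.data[:, 2]`.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# feature_names lists the columns in the same order as diabetes.data
print(diabetes.feature_names)

# Name of the column selected by diabetes.data[:, 2]
print(diabetes.feature_names[2])

# Shape of the full feature matrix: (n_samples, n_features)
print(diabetes.data.shape)
```

In this dataset, index 2 is `'bmi'` and index 3 is `'bp'` (average blood pressure).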

Model Training

We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows:

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

Here, X_train represents the input feature (body mass index) of the training data and y_train represents the output variable (target variable).

Model Testing

Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows:

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).
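For a single predictor, a prediction is the fitted intercept plus the fitted slope times the input value. The sketch below illustrates this on small synthetic data rather than on the Diabetes dataset; the arrays and the names `model`, `X_new`, and `y_manual` are assumptions chosen for the example. It compares `predict()` with the same calculation done by hand from the `intercept_` and `coef_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data on the line y = 1 + 2x (not from the Diabetes dataset)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

# Inputs the model has not seen during fitting
X_new = np.array([[4.0], [5.0]])
y_pred = model.predict(X_new)

# The same predictions computed by hand from the fitted parameters
y_manual = model.intercept_ + model.coef_[0] * X_new[:, 0]

print(model.intercept_, model.coef_[0])
print(np.allclose(y_pred, y_manual))
```

The same two attributes are available on `lr_model` after fitting, which makes it possible to report the slope and intercept of the regression line alongside the predictions.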

Model Evaluation

We need to evaluate the model to see how well it predicts data it was not trained on. We will use the mean squared error (MSE) and the coefficient of determination (R^2) as evaluation metrics. The code for this is as follows:

from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

Here, y_test represents the actual output variable of the test data.
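To show what the two metrics measure, here is a plain-Python sketch of their standard definitions. The function names and the sample values are made up for illustration; in practice the scikit-learn functions used above are the ones to call. MSE is the average squared difference between actual and predicted values, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values (not taken from the Diabetes dataset)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]

mse_manual = mean_squared_error_manual(y_true, y_hat)
r2_manual = r2_score_manual(y_true, y_hat)
print(mse_manual, r2_manual)  # prints: 0.5 0.9
```

An R^2 of 1 means the predictions match the actual values exactly, while a value near 0 means the model does little better than always predicting the mean of the actual values.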

Plotting the Regression Line

We can also visualize the regression line to see how well it fits the data. The code for this is as follows:

import matplotlib.pyplot as plt

# Plot the training data
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Here, we are using the scatter() function from the matplotlib library to plot the training data points and the plot() function to plot the regression line. The xlabel() and ylabel() functions are used to label the x-axis and y-axis of the plot, respectively. Finally, we use the show() function to display the plot.

Complete Implementation Example

The complete code for implementing simple linear regression in Python is as follows:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the Diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, 2], diabetes.target, test_size=0.2, random_state=0)

# Reshape the input data
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

# Create a linear regression object
lr_model = LinearRegression()

# Fit the model on the training data
lr_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the coefficient of determination
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)

# Plot the training data
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X_train, y_train, color='gray')

# Plot the regression line
plt.plot(X_train, lr_model.predict(X_train), color='red', linewidth=2)

# Add axis labels
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Disease Progression')

# Show the plot
plt.show()

Executing this code prints the Mean Squared Error and the Coefficient of Determination on the terminal and displays a plot of the training data with the fitted regression line:

Mean Squared Error: 4150.680189329983
Coefficient of Determination: 0.19057346847560164
[Figure: scatter plot of the training data with the fitted regression line]