Machine Learning Tutorial

Machine Learning - Backward Elimination


Backward Elimination is a feature selection technique used in machine learning to select the most significant features for a predictive model. In this technique, we start with all of the features and iteratively remove the least significant one until we are left with the subset of features that gives the best performance.

Implementation in Python


To implement Backward Elimination in Python, you can follow these steps −


Import the necessary libraries: pandas, numpy, and statsmodels.api.

import pandas as pd
import numpy as np
import statsmodels.api as sm


Load your dataset into a Pandas DataFrame. We will use the Pima Indians Diabetes dataset.

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')


Define the predictor variables (X) and the target variable (y).

X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values


Add a column of ones to the predictor variables to represent the intercept.

X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)


Use the Ordinary Least Squares (OLS) method from the statsmodels library to fit the multiple linear regression model with all the predictor variables.

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()


Check the p-values of each predictor variable and remove the one with the highest p-value (i.e., the least significant).

regressor_OLS.summary()


Repeat the previous two steps until all the remaining predictor variables have p-values below the significance level (e.g., 0.05).

X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()


The final subset of predictor variables with p-values below the significance level is the optimal set of features for the model.
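For comparison, scikit-learn offers a related backward-selection utility, `SequentialFeatureSelector` with `direction='backward'`. Note it differs from the procedure above: it drops features based on cross-validated score rather than p-values. A sketch on a synthetic problem:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 8 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)

# Backward selection driven by cross-validated score, not p-values
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction='backward')
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the kept features
```

This is convenient when the final model is a scikit-learn estimator and you care about predictive performance rather than statistical significance.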

Example


Here is the complete implementation of Backward Elimination in Python −

# Importing the necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the diabetes dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Add a column of ones to the predictor variables to represent the intercept
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

# Fit the multiple linear regression model with all the predictor variables
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

# Check the p-values of each predictor variable and remove the one
# with the highest p-value (i.e., the least significant)
regressor_OLS.summary()

# Repeat the above step until all the remaining predictor variables
# have a p-value below the significance level (e.g., 0.05)
X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

When you execute this program, it will produce the OLS regression summary for each fitted model, showing the coefficients and p-values of the remaining predictor variables at every elimination step.