Machine Learning: A Concise Tutorial

Machine Learning - Forward Feature Construction

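Forward feature construction (more commonly called forward feature selection) is a feature selection method in machine learning. We start with an empty feature set and, at each step, add the feature that performs best, until the desired number of features is reached.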

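The goal of feature selection is to identify the features most relevant to predicting the target variable, while discarding less important features that add noise to the model and may lead to overfitting.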

Forward feature construction involves the following steps:

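  1. Initialize an empty set of selected features.

  2. Set the maximum number of features to select.

  3. Iterate until the desired number of features is reached: for each remaining feature that is not yet in the selected set, fit a model on the selected features plus that candidate, and evaluate its performance on a validation set. Pick the candidate that gives the best performance and add it to the selected set.

  4. Return the selected set as the best feature set for the model.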
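The four steps above can be sketched as a small reusable function. This is only an illustration under stated assumptions: the helper names `fit_and_score` and `forward_select` are hypothetical, the model is ordinary least squares solved with NumPy, the score is R² on a separate validation split, and the data is synthetic.

```python
import numpy as np

def fit_and_score(X_tr, y_tr, X_val, y_val):
    """Fit least squares on the training split; return R^2 on the validation split."""
    # Prepend a column of ones so the model has an intercept
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    residual = y_val - A_val @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y_val - y_val.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X_tr, y_tr, X_val, y_val, max_features):
    selected = []                                   # step 1: empty feature set
    remaining = list(range(X_tr.shape[1]))
    k = min(max_features, len(remaining))           # step 2: cap on features
    while len(selected) < k:                        # step 3: greedy loop
        scores = {
            j: fit_and_score(X_tr[:, selected + [j]], y_tr,
                             X_val[:, selected + [j]], y_val)
            for j in remaining
        }
        best = max(scores, key=scores.get)          # best candidate this round
        selected.append(best)
        remaining.remove(best)
    return selected                                 # step 4: selected features

# Synthetic data (an assumption for illustration): only columns 2 and 4 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X[:150], y[:150], X[150:], y[150:], max_features=2)
print(chosen)
```

Keeping the selected features in a list rather than a set preserves the order in which they were added, which is often useful when inspecting the result.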

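The main advantage of forward feature construction is that it is computationally cheap compared with searching over all feature subsets, so it can be used on high-dimensional datasets. However, because each step is greedy, it does not always yield the optimal feature set, especially when features are highly correlated or related nonlinearly.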
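To make the cost argument concrete, the sketch below counts model fits. It assumes one fit per candidate per round for the greedy procedure, and compares that with trying every subset of the same size; the function names are hypothetical.

```python
from math import comb

def greedy_fits(n_features, k):
    """Model fits needed by forward selection to pick k of n_features."""
    # Round i (0-based) evaluates the n_features - i candidates still unselected
    return sum(n_features - i for i in range(k))

def exhaustive_fits(n_features, k):
    """Model fits needed to try every subset of exactly k features."""
    return comb(n_features, k)

for n, k in [(8, 4), (50, 10)]:
    print(n, k, greedy_fits(n, k), exhaustive_fits(n, k))
```

For 8 features and 4 to select, this is 26 fits against 70. For 50 features and 10 to select, it is 455 fits against more than 10 billion, which is why the greedy search remains practical as the number of features grows.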

Example

Below is an example implementation of forward feature construction in Python:

# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset (adjust the path to where your copy of diabetes.csv is stored)
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create an empty set of features
selected_features = set()

# Set the maximum number of features to be selected
max_features = 8

# Iterate until the desired number of features is reached
while len(selected_features) < max_features:

   # Reset the best feature and best score for this round.
   # R^2 can be negative, so start from -inf rather than 0.
   best_feature = None
   best_score = -np.inf

   # Iterate over all the remaining features
   for i in range(X_train.shape[1]):

      # Skip the feature if it's already selected
      if i in selected_features:
         continue

      # Select the current feature and fit a linear regression model
      X_train_selected = X_train[:, list(selected_features) + [i]]
      regressor = LinearRegression()
      regressor.fit(X_train_selected, y_train)

      # Compute the score (R^2) on the held-out split, which serves as the validation set here
      X_test_selected = X_test[:, list(selected_features) + [i]]
      score = regressor.score(X_test_selected, y_test)

      # Update the best feature and score if the current feature performs better
      if score > best_score:
         best_feature = i
         best_score = score

   # Stop if no candidate was found; otherwise add the best feature
   if best_feature is None:
      break
   selected_features.add(best_feature)

   # Print the selected features and the score
   print('Selected Features:', list(selected_features))
   print('Score:', best_score)

Output

When executed, it produces the following output:

Selected Features: [1]
Score: 0.23530716168783583
Selected Features: [0, 1]
Score: 0.2923143573608237
Selected Features: [0, 1, 5]
Score: 0.3164103491569179
Selected Features: [0, 1, 5, 6]
Score: 0.3287368302427327
Selected Features: [0, 1, 2, 5, 6]
Score: 0.334586804842275
Selected Features: [0, 1, 2, 3, 5, 6]
Score: 0.3356264736550455
Selected Features: [0, 1, 2, 3, 4, 5, 6]
Score: 0.3313166516703744
Selected Features: [0, 1, 2, 3, 4, 5, 6, 7]
Score: 0.32230203252064216
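In the output above the score peaks at six features (0.3356) and then falls as the last two are added, so selecting every feature is not the best choice. The example also picks features using the same held-out split that reports the score, which tends to make that score optimistic. As an alternative sketch, scikit-learn's `SequentialFeatureSelector` runs the same greedy forward search but scores candidates with cross-validation. The snippet assumes scikit-learn 0.24 or later and uses the bundled `load_diabetes` dataset, which is not necessarily the same data as the `diabetes.csv` file above; selecting 4 features is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Bundled regression dataset (assumption: stands in for the CSV used earlier)
X, y = load_diabetes(return_X_y=True)

# Greedy forward search; each candidate is scored by 5-fold cross-validated R^2
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X, y)

mask = selector.get_support()        # boolean mask over the original columns
X_reduced = selector.transform(X)    # keeps only the selected columns

print(mask)
print(X_reduced.shape)
```

Because every candidate is scored across several folds, the selection depends less on one particular train/test split, at the cost of more model fits per round.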