Machine Learning - Feature Selection

Feature selection is an important step in machine learning that involves selecting a subset of the available features to improve the performance of the model. The following are some commonly used feature selection techniques −

Filter Methods

This method involves evaluating the relevance of each feature by calculating a statistical measure (e.g., correlation, mutual information, chi-square, etc.) and ranking the features based on their scores. Features that have low scores are then removed from the model.

To implement filter methods in Python, you can use the SelectKBest or SelectPercentile classes from the sklearn.feature_selection module. Below is a small code snippet that implements feature selection with SelectPercentile.

from sklearn.feature_selection import SelectPercentile, chi2

# Keep the top 10% of features ranked by the chi-square statistic
# (X is the feature matrix with non-negative values, y is the target)
selector = SelectPercentile(chi2, percentile=10)
X_new = selector.fit_transform(X, y)
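
Once fitted, the selector can also tell you which columns were kept. Below is a minimal sketch, assuming X is a pandas DataFrame with named columns −

# Boolean mask of the retained features and their chi-square scores
mask = selector.get_support()
print(X.columns[mask])
print(selector.scores_[mask])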

Wrapper Methods

This method involves evaluating the model’s performance while adding or removing features and selecting the subset that yields the best performance. This approach is computationally expensive, but it is generally more accurate than filter methods because candidate subsets are scored with the actual model.

To implement wrapper methods in Python, you can use the RFE (Recursive Feature Elimination) class from the sklearn.feature_selection module. Below is a small code snippet that implements a wrapper method.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until only 5 remain
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)
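
After fitting, RFE records which features survived and the order in which the rest were eliminated −

# support_ flags the selected features; ranking_ assigns 1 to kept
# features and higher numbers to those eliminated earlier
print(selector.support_)
print(selector.ranking_)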

Embedded Methods

This method involves incorporating feature selection into the model building process itself. This can be done using techniques such as Lasso regression, Ridge regression, or decision trees. These methods assign a weight to each feature, and features with low weights are removed from the model.

To implement embedded methods in Python, you can use the Lasso or Ridge regression classes from the sklearn.linear_model module. Below is a small code snippet for implementing embedded methods −

import pandas as pd
from sklearn.linear_model import Lasso

# A higher alpha drives more coefficients to exactly zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Keep only the features whose coefficients were not shrunk to zero
coef = pd.Series(lasso.coef_, index=X.columns)
important_features = coef[coef != 0]
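
scikit-learn also packages this pattern in a helper. As an alternative sketch, SelectFromModel fits the Lasso internally and drops the features whose coefficients are (near) zero in a single step −

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Fit the Lasso inside the selector and keep only the surviving features
sfm = SelectFromModel(Lasso(alpha=0.1))
X_new = sfm.fit_transform(X, y)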

Principal Component Analysis (PCA)

This is a type of unsupervised learning method that involves transforming the original features into a set of uncorrelated principal components that explain the maximum variance in the data. The number of principal components can be selected based on a threshold value, which can reduce the dimensionality of the dataset.

To implement PCA in Python, you can use the PCA class from the sklearn.decomposition module. For example, to reduce the number of features you can use PCA as shown in the following code −

from sklearn.decomposition import PCA

# Project the features onto the 3 directions of maximum variance
pca = PCA(n_components=3)
X_new = pca.fit_transform(X)
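
If you prefer to pick the number of components by a variance threshold, PCA also accepts a float between 0 and 1 and keeps just enough components to explain that fraction of the variance −

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_new = pca.fit_transform(X)
print(pca.explained_variance_ratio_)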

Recursive Feature Elimination (RFE)

This method involves recursively eliminating the least significant features until a subset of the most important features is identified. It uses a model-based approach and can be computationally expensive, but it can yield good results in high-dimensional datasets.

To implement RFE in Python, you can use the RFECV (Recursive Feature Elimination with Cross-Validation) class from the sklearn.feature_selection module. For example, the following small code snippet shows how to implement recursive feature elimination with cross-validation −

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Eliminate one feature per iteration; 5-fold CV picks the best subset size
estimator = DecisionTreeClassifier()
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
X_new = selector.transform(X)
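
The fitted selector reports how many features cross-validation judged optimal −

# Number of features selected by cross-validation
print(selector.n_features_)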

These feature selection techniques can be used alone or in combination to improve the performance of machine learning models. It is important to choose the appropriate technique based on the size of the dataset, the nature of the features, and the type of model being used.
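
As one illustration of combining a selection step with a model, a scikit-learn Pipeline keeps the two coupled so that selection is re-learned from the training data on every fit. This is a minimal sketch; the step names and the value of k are arbitrary −

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Chain a filter step and a classifier; the filter is refit on every call
# to fit, which avoids leaking test information into feature selection
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)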

Example

In the example below, we will implement three feature selection methods − univariate feature selection using the chi-square test, recursive feature elimination with cross-validation (RFECV), and principal component analysis (PCA).

We will use the Pima Indians Diabetes dataset (diabetes.csv), which the code below loads from a local CSV file. This dataset contains 768 samples with 8 features, and the task is to classify whether a patient has diabetes (the Outcome column) based on these features.

Here is the Python code to implement these feature selection methods on the diabetes dataset −

# Import the necessary libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the dataset (adjust the path to wherever diabetes.csv is stored)
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Split the dataset into features and target variable
X = diabetes.drop('Outcome', axis=1)
y = diabetes['Outcome']

# Apply univariate feature selection using the chi-square test
selector = SelectKBest(chi2, k=4)
X_new = selector.fit_transform(X, y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)

# Fit a logistic regression model on the selected features
# (max_iter raised so the solver converges on unscaled data)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = clf.score(X_test, y_test)
print("Accuracy using univariate feature selection: {:.2f}".format(accuracy))

# Recursive feature elimination with cross-validation (RFECV)
estimator = LogisticRegression(max_iter=1000)
selector = RFECV(estimator, step=1, cv=5)
selector.fit(X, y)
X_new = selector.transform(X)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_new, y, cv=5)
print("Accuracy using RFECV feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# PCA implementation
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_new, y, cv=5)
print("Accuracy using PCA feature selection: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

When you execute this code, it will produce output similar to the following on the terminal (exact values may vary slightly with your environment) −

Accuracy using univariate feature selection: 0.74
Accuracy using RFECV feature selection: 0.77 (+/- 0.03)
Accuracy using PCA feature selection: 0.75 (+/- 0.07)