Machine Learning: A Concise Tutorial

Machine Learning - Categorical Data

What is Categorical Data?

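In machine learning, categorical data is data made up of categories or labels rather than numerical values. The categories may be nominal, meaning they have no inherent order or ranking (for example, color or gender), or ordinal, meaning there is a natural ordering among them (for example, education level or income range).

Categorical data is usually represented with discrete values such as integers or strings, and it is often one-hot encoded before being used as input to a machine learning model. One-hot encoding creates a binary vector for each category, with a 1 in the position corresponding to that category and a 0 in every other position.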

Techniques for Handling Categorical Data

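Handling categorical data is an important part of machine learning preprocessing because many algorithms require numerical input. Depending on the algorithm and the nature of the categorical data, different encoding techniques can be used, such as label encoding, ordinal encoding, and binary encoding.

In the rest of this chapter, we discuss different techniques for handling categorical data in machine learning, along with their implementations in Python.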

One-Hot Encoding

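One-hot encoding is a popular technique for handling categorical data in machine learning. It creates a binary vector for each category, where each element of the vector indicates whether the corresponding category is present. For example, if we have a categorical variable color with the values red, blue, and green, one-hot encoding represents them as the binary vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.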

Example

Below is an example of one-hot encoding in Python using the Pandas library:

import pandas as pd

# Creating a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# Performing one-hot encoding (dtype=int yields 0/1 columns; recent pandas versions default to True/False)
one_hot_encoded = pd.get_dummies(df['color'], prefix='color', dtype=int)

# Combining the encoded data with the original data
df = pd.concat([df, one_hot_encoded], axis=1)

# Drop the original categorical variable
df = df.drop('color', axis=1)

# Print the encoded data
print(df)

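This creates a one-hot encoded DataFrame with three binary variables ("color_blue", "color_green", and "color_red") that take the value 1 if the corresponding color is present and 0 otherwise. The encoded data, shown in the output below, can then be used for machine learning tasks such as classification and regression.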

      color_blue    color_green    color_red
0        0              0              1
1        0              1              0
2        1              0              0
3        0              0              1
4        0              1              0

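One-hot encoding works well for categorical variables with a small, finite number of categories, but it can be problematic for variables with many categories because it produces one input feature per category.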
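To make the feature growth concrete, here is a minimal sketch that one-hot encodes a column with many distinct values. The `user_id` column and its 1,000 values are hypothetical and are not part of the example above.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 distinct user IDs
ids = pd.Series([f"user_{i}" for i in range(1000)], name="user_id")

# One-hot encoding produces one column per distinct category
encoded = pd.get_dummies(ids, prefix="user", dtype=int)

# Number of rows and number of generated feature columns
print(encoded.shape)
```

Each row contains a single 1 and 999 zeros, so the number of columns grows with the number of categories while adding no extra information per row.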

Label Encoding

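Label encoding is another technique for handling categorical data in machine learning. It assigns a unique integer to each category of a categorical variable; which integer each category receives depends on the order in which the encoder arranges the categories.

For example, suppose we have a categorical variable "size" with three categories: "small", "medium", and "large". With label encoding, we could assign these categories the values 0, 1, and 2, respectively.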

Example

Below is an example of label encoding in Python using the scikit-learn library:

from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable
data = ['small', 'medium', 'large', 'small', 'large']

# create a label encoder object
label_encoder = LabelEncoder()

# fit and transform the data using the label encoder
encoded_data = label_encoder.fit_transform(data)

# print the encoded data
print(encoded_data)

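This produces the encoded array shown below. Note that LabelEncoder sorts the categories alphabetically, so "large" is encoded as 0, "medium" as 1, and "small" as 2, which gives [2, 1, 0, 2, 0] rather than values that follow the size order small, medium, large. LabelEncoder does not accept a custom category order; when a specific order is needed, an explicit mapping or scikit-learn's OrdinalEncoder with its categories parameter can be used instead.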

[2 1 0 2 0]

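Label encoding is useful when the categories have a natural ordering, as with ordinal categorical variables, provided the assigned integers actually follow that ordering. It should be used with caution for nominal categorical variables, because the numeric values may imply an order that does not exist. In those cases, one-hot encoding is a safer choice.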
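The "size" variable in this section has a natural order (small, medium, large). The following is a minimal sketch of encoding it in that order, assuming scikit-learn's OrdinalEncoder and its categories parameter; the DataFrame and its column name are introduced here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same sample values as the label encoding example, as a one-column DataFrame
# (OrdinalEncoder expects 2-D input)
sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'large']})

# State the intended order explicitly: small < medium < large
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)

# Flatten the (5, 1) result for printing
print(encoded.ravel())
```

With the order stated explicitly, "small" maps to 0, "medium" to 1, and "large" to 2, so the integers follow the real-world ordering instead of the alphabetical one.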

Frequency Encoding

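Frequency encoding is another technique for handling categorical data in machine learning. It replaces each category of a categorical variable with its frequency (or count) in the dataset. The idea behind frequency encoding is that categories that occur more often may be more important or more informative to the learning algorithm.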

Example

Below is an example of frequency encoding in Python:

import pandas as pd

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# calculate the frequency of each category in the categorical variable
freq = df['color'].value_counts(normalize=True)

# replace each category with its frequency
df['color_freq'] = df['color'].map(freq)

# drop the original categorical variable
df = df.drop('color', axis=1)

# print the encoded data
print(df)

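This creates an encoded DataFrame with a single variable ("color_freq") that holds the relative frequency of each row's category in the original variable. Here, "red" and "green" each appear in two of the five rows, so both are encoded as 0.4, while "blue" appears once and is encoded as 0.2.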

      color_freq
0        0.4
1        0.4
2        0.2
3        0.4
4        0.4

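Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially for high-cardinality categorical variables (that is, variables with a large number of categories). However, it is not always effective: categories that occur equally often, such as "red" and "green" above, receive the same value and become indistinguishable, and its performance depends on the particular dataset and machine learning algorithm.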
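In practice the frequencies are usually learned from training data and then applied to new data. Below is a minimal sketch of that pattern. The train/test split, the unseen "purple" category, and the choice to encode unseen categories as 0.0 are assumptions made for illustration.

```python
import pandas as pd

# Training data: the same colors used in the example above
train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# New data; 'purple' never appears in the training data (hypothetical)
test = pd.DataFrame({'color': ['blue', 'red', 'purple']})

# Learn relative frequencies from the training data only
freq = train['color'].value_counts(normalize=True)

# Apply the learned mapping; unseen categories map to NaN, filled here with 0.0
test['color_freq'] = test['color'].map(freq).fillna(0.0)

print(test)
```

Filling unseen categories with 0.0 is only one option; another reasonable choice is to leave them as missing values and let the model or an imputer handle them.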

Target Encoding

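Target encoding is another technique for handling categorical data in machine learning. It replaces each category of a categorical variable with the mean (or another aggregate) of the target variable, that is, the variable you are trying to predict, for that category. The idea behind target encoding is that it captures the relationship between the categorical variable and the target variable, which can improve the predictive performance of a machine learning model.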

Example

Below is an example of target encoding in Python. It combines scikit-learn's LabelEncoder with a mean encoding computed using Pandas:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
   'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# create a label encoder object and fit it to the data
label_encoder = LabelEncoder()
label_encoder.fit(df['color'])

# transform the categorical variable using the label encoder
df['color_encoded'] = label_encoder.transform(df['color'])

# compute the mean target for each encoded category and store it as a dictionary
mean_encoder = df.groupby('color_encoded')['target'].mean().to_dict()

# replace each encoded label with the mean target of its category
df['color_encoded'] = df['color_encoded'].map(mean_encoder)

# print the encoded data
print(df)

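In this example, we first create a Pandas DataFrame df with a categorical variable 'color' and a target variable 'target'. We then create a LabelEncoder object from scikit-learn and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' by calling the transform method of the label encoder and assigning the resulting encoded values to a new column 'color_encoded' in df.

Finally, we build the mean encoding by grouping df by the 'color_encoded' column and computing the mean of the 'target' column for each group. We convert the result to a dictionary and map it onto the 'color_encoded' column, replacing each integer label with the mean target value of its category.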

   color     target     color_encoded
0  red        1           0.5
1  green      0           0.5
2  blue       1           1.0
3  red        0           0.5
4  green      1           0.5

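Target encoding can be a powerful technique for improving the predictive performance of machine learning models, especially on datasets with high-cardinality categorical variables. However, because the encoding is computed from the target itself, it is important to guard against overfitting by using cross-validation and regularization techniques.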
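One common form of regularization is to shrink each category's mean toward the global mean, so that categories with few rows are trusted less. The sketch below illustrates that idea on the same data. The smoothing formula and the strength m = 2 are assumptions chosen for illustration, not a method prescribed by this chapter.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                   'target': [1, 0, 1, 0, 1]})

m = 2  # hypothetical smoothing strength: larger values pull harder toward the global mean
global_mean = df['target'].mean()

# Per-category mean and row count
stats = df.groupby('color')['target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['color_encoded'] = df['color'].map(smoothed)
print(df)
```

Compared with the plain means, "blue" (a single row) moves from 1.0 toward the global mean of 0.6, while "red" and "green" move only slightly. In a real pipeline the statistics would be computed on training folds only, so that a row's own target does not leak into its encoding.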

Binary Encoding

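Binary encoding is another technique for encoding categorical variables in machine learning. Each category is first assigned an integer, typically based on its position in an ordered list of all categories, and that integer is then written as a binary code. Each binary digit becomes its own column with the value 1 or 0.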

Example

Below is a Python implementation of binary encoding using the category_encoders library:

import pandas as pd
import category_encoders as ce

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# create a binary encoder object and fit it to the data
# (the column is passed as a one-column DataFrame so its name is preserved)
binary_encoder = ce.BinaryEncoder(cols=['color'])
binary_encoder.fit(df[['color']])

# transform the categorical variable using the binary encoder
encoded_data = binary_encoder.transform(df[['color']])

# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)

# print the encoded data
print(df)

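In this example, we first create a Pandas DataFrame df with a categorical variable 'color'. We then create a BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' by calling the transform method of the binary encoder and assigning the resulting encoded values to a new DataFrame, encoded_data.

Finally, we use the concat function to merge the encoded variable with the original DataFrame df along the column axis (axis=1). The resulting DataFrame contains the original 'color' column together with the binary encoded columns.

When you run this code, it produces the following output: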

   color    color_0    color_1
0   red       0           1
1   green     1           0
2   blue      1           1
3   red       0           1
4   green     1           0

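Binary encoding is well suited to categorical variables with many categories because it needs only about log2(n) columns for n categories, whereas one-hot encoding needs n columns and quickly becomes inefficient as the number of categories grows.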
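To show the mechanism without the library, here is a minimal sketch that builds binary codes by hand with Pandas. It assumes categories are numbered from 1 in order of first appearance, which reproduces the output shown above. This is an illustration of the idea, not a description of how category_encoders is implemented internally.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# Step 1: assign each category an integer, starting at 1, in order of first appearance
# (assumption: red=1, green=2, blue=3)
codes = pd.factorize(df['color'])[0] + 1

# Step 2: number of binary digits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    df[f'color_{bit}'] = (codes >> shift) & 1

print(df)
```

Three categories need only two columns here, and in general the number of columns grows with the logarithm of the number of categories rather than linearly.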