Machine Learning: A Concise Tutorial
Machine Learning - Categorical Data
What is Categorical Data?
Categorical data in Machine Learning refers to data that consists of categories or labels rather than numerical values. These categories may be nominal, meaning that there is no inherent order or ranking between them (e.g., color, gender), or ordinal, meaning that there is a natural ordering between the categories (e.g., education level, income bracket).
Categorical data is often represented using discrete values, such as integers or strings, and is frequently encoded as one-hot vectors before being used as input to machine learning models. One-hot encoding creates a binary vector for each category, where the vector has a 1 in the position corresponding to the category and 0s in all other positions.
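The nominal/ordinal distinction can be made explicit in code. The following is a minimal sketch using pandas' Categorical type; the variable names and the education levels are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Nominal: no inherent order between the categories
colors = pd.Categorical(["red", "green", "blue", "red"])

# Ordinal: the order is declared explicitly via `categories` and `ordered=True`
# (the education levels below are made up for illustration)
education = pd.Categorical(
    ["bachelor", "high school", "master", "bachelor"],
    categories=["high school", "bachelor", "master"],
    ordered=True,
)

print(colors.ordered)     # whether pandas treats the categories as ordered
print(education.ordered)
print(education.codes)    # integer codes follow the declared category order
```

Declaring the order up front means that comparisons and integer codes reflect the intended ranking rather than alphabetical order.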
Techniques for Handling Categorical Data
Handling categorical data is an important part of machine learning preprocessing, as many algorithms require numerical input. Depending on the algorithm and the nature of the categorical data, different encoding techniques may be used, such as one-hot encoding, label encoding, frequency encoding, target encoding, and binary encoding.
In the subsequent sections of this chapter, we will discuss the different techniques for handling categorical data in machine learning, along with their implementations in Python.
One-Hot Encoding
One-hot encoding is a popular technique for handling categorical data in machine learning. It creates a binary vector for each category, where each element of the vector represents the presence or absence of one category. For example, if we have a categorical variable for color with the values red, blue, and green, one-hot encoding would represent them as the three binary vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.
Example
Below is an example of how to perform one-hot encoding in Python using the Pandas library −
import pandas as pd
# Creating a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
# Performing one-hot encoding (dtype=int yields 0/1 columns; recent pandas versions default to booleans)
one_hot_encoded = pd.get_dummies(df['color'], prefix='color', dtype=int)
# Combining the encoded data with the original data
df = pd.concat([df, one_hot_encoded], axis=1)
# Drop the original categorical variable
df = df.drop('color', axis=1)
# Print the encoded data
print(df)
This will create a one-hot encoded dataframe with three binary columns ("color_blue", "color_green", and "color_red") that take the value 1 if the corresponding color is present and 0 if it is not. The encoded data, shown in the output below, can then be used for machine learning tasks such as classification and regression.
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
One-hot encoding works well for categorical variables with a small number of categories, but it can be problematic for high-cardinality variables, because it adds one input feature per category and can therefore lead to a very large, sparse feature set.
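In a train/test workflow, the encoder is usually fitted on the training data and then reused on new data. The sketch below uses scikit-learn's OneHotEncoder for this; the `train` and `test` arrays and the unseen value "purple" are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test samples; "purple" never appears during fit
train = np.array([["red"], ["green"], ["blue"], ["red"]])
test = np.array([["green"], ["purple"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it for printing
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_)  # categories learned from the training data
print(encoded_test)
```

Fitting once and reusing the encoder keeps the column layout identical between training and prediction, which a fresh call to `pd.get_dummies` on each dataset does not guarantee.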
Label Encoding
Label Encoding is another technique for handling categorical data in machine learning. It assigns a unique integer to each category in a categorical variable. For an ordinal variable, the integers should follow the natural order of the categories; note, however, that some implementations, such as scikit-learn's LabelEncoder, number the categories in sorted (alphabetical) order instead.
For example, suppose we have a categorical variable "Size" with three categories: "small", "medium", and "large". An order-preserving label encoding would assign the values 0, 1, and 2 to these categories, respectively.
Example
Below is an example of how to perform label encoding in Python using the scikit-learn library −
from sklearn.preprocessing import LabelEncoder
# create a sample dataset with a categorical variable
data = ['small', 'medium', 'large', 'small', 'large']
# create a label encoder object
label_encoder = LabelEncoder()
# fit and transform the data using the label encoder
encoded_data = label_encoder.fit_transform(data)
# print the encoded data
print(encoded_data)
This will create an encoded array with the values [2, 1, 0, 2, 0], as shown in the output below. LabelEncoder numbers the classes in sorted (alphabetical) order, so "large" becomes 0, "medium" becomes 1, and "small" becomes 2, which does not match the natural order of the sizes. LabelEncoder does not accept a custom order (it is intended mainly for encoding target labels); to control the order, use scikit-learn's OrdinalEncoder with its categories parameter, or an explicit mapping.
[2 1 0 2 0]
Label encoding can be useful when there is a natural ordering between the categories, as with ordinal categorical variables, provided that the assigned integers respect that ordering. However, it should be used with caution for nominal categorical variables, because the numerical values may imply an order that does not actually exist. In these cases, one-hot encoding is a safer option.
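When the integers need to follow the natural order of the categories, the order can be stated explicitly. The following is a minimal sketch using scikit-learn's OrdinalEncoder with its `categories` parameter; the variable names are illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects 2D input: one row per sample, one column per feature
sizes = [["small"], ["medium"], ["large"], ["small"], ["large"]]

# The explicit list fixes the mapping: small -> 0, medium -> 1, large -> 2
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded_sizes = encoder.fit_transform(sizes)

print(encoded_sizes.ravel())
```

Unlike the alphabetical numbering produced by LabelEncoder, the codes here increase with size, so a model that treats the feature as numeric sees the intended ranking.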
Frequency Encoding
Frequency Encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with its frequency (or count) in the dataset. The idea behind frequency encoding is that how often a category appears may itself be informative for the machine learning algorithm.
Example
Below is an example of how to perform frequency encoding in Python −
import pandas as pd
# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
# calculate the frequency of each category in the categorical variable
freq = df['color'].value_counts(normalize=True)
# replace each category with its frequency
df['color_freq'] = df['color'].map(freq)
# drop the original categorical variable
df = df.drop('color', axis=1)
# print the encoded data
print(df)
This will create an encoded dataframe with one variable ("color_freq") that holds the relative frequency of each category in the original categorical variable. In this dataset, "red" and "green" each occur twice in five rows and "blue" occurs once, so the corresponding frequencies are 0.4, 0.4, and 0.2. Note that categories with the same frequency, such as "red" and "green" here, receive the same encoded value and can no longer be told apart.
   color_freq
0         0.4
1         0.4
2         0.2
3         0.4
4         0.4
Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially when dealing with high-cardinality categorical variables (i.e., variables with a large number of categories), because it always produces a single column. However, it may not always be effective, and its performance depends on the particular dataset and the machine learning algorithm being used.
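In practice, the frequencies are usually learned from the training data and then applied to new data. The sketch below shows one way to do this; the `train` and `test` frames are hypothetical, and mapping unseen categories to 0.0 is an assumption rather than a fixed rule.

```python
import pandas as pd

# Hypothetical training and test sets; "purple" does not occur in training
train = pd.DataFrame({"color": ["red", "green", "blue", "red", "green"]})
test = pd.DataFrame({"color": ["blue", "red", "purple"]})

# Learn relative frequencies from the training data only
freq = train["color"].value_counts(normalize=True)

train["color_freq"] = train["color"].map(freq)
# Unseen categories map to NaN; here they are filled with 0.0 (an assumption)
test["color_freq"] = test["color"].map(freq).fillna(0.0)

print(test)
```

Computing the frequencies on the training data alone avoids using information from the test set during preprocessing.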
Target Encoding
Target Encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with the mean (or another aggregate) of the target variable (i.e., the variable you want to predict) for that category. The idea behind target encoding is that it captures the relationship between the categorical variable and the target variable, which can improve the predictive performance of the machine learning model.
Example
Below is an example of how to perform target encoding in Python by combining scikit-learn's LabelEncoder with a Pandas group-by mean −
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
        'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
# create a label encoder object and fit it to the data
label_encoder = LabelEncoder()
label_encoder.fit(df['color'])
# transform the categorical variable using the label encoder
df['color_encoded'] = label_encoder.transform(df['color'])
# create a mean encoder object and fit it to the transformed data
mean_encoder = df.groupby('color_encoded')['target'].mean().to_dict()
# map the mean encoded values to the categorical variable
df['color_encoded'] = df['color_encoded'].map(mean_encoder)
# print the encoded data
print(df)
In this example, we first create a Pandas DataFrame df with a categorical variable 'color' and a target variable 'target'. We then create a LabelEncoder object from scikit-learn and fit it to the 'color' column of df.
Next, we transform the categorical variable 'color' by calling the transform method on the label encoder object and assigning the resulting integer labels to a new column 'color_encoded' in df.
Finally, we group df by the 'color_encoded' column, calculate the mean of the 'target' column for each group, and convert the result to a dictionary (mean_encoder). Mapping this dictionary over the 'color_encoded' column replaces each integer label with the mean target value of its category; the original 'color' column is left unchanged.
   color  target  color_encoded
0    red       1            0.5
1  green       0            0.5
2   blue       1            1.0
3    red       0            0.5
4  green       1            0.5
Target encoding can be a powerful technique for improving the predictive performance of machine learning models, especially for datasets with high-cardinality categorical variables. However, because the encoding is computed from the target itself, it can leak target information and overfit, particularly for rare categories. It is therefore important to guard against overfitting with cross-validation and regularization techniques, such as computing the encoding out-of-fold and smoothing it toward the global mean.
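One common way to combine cross-validation and regularization is out-of-fold target encoding with smoothing. The following is a rough sketch under stated assumptions, not a definitive implementation: the dataset, the `smoothing` value, the number of folds, and the column name `color_te` are all hypothetical choices.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical dataset with a categorical feature and a binary target
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "green", "blue", "red", "green"],
    "target": [1, 0, 1, 0, 1, 1, 1, 0],
})

smoothing = 2.0        # assumed strength of the pull toward the global mean
df["color_te"] = 0.0   # placeholder column for the encoded values

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(df):
    fit_part = df.iloc[fit_idx]
    prior = fit_part["target"].mean()  # global mean of the fitting folds
    stats = fit_part.groupby("color")["target"].agg(["sum", "count"])
    # Smoothed mean: rare categories are shrunk toward the prior
    smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
    # Encode the held-out rows using statistics from the other folds only;
    # categories absent from the fitting folds fall back to the prior
    encoded = df["color"].iloc[enc_idx].map(smoothed).fillna(prior)
    df.loc[df.index[enc_idx], "color_te"] = encoded.to_numpy()

print(df)
```

Because each row is encoded without using its own target value, the encoded feature is less likely to leak the target into the model.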
Binary Encoding
Binary encoding is another technique used for encoding categorical variables in machine learning. In binary encoding, each category is first assigned an integer, and that integer is then written in binary, with each binary digit stored in its own column. The integers are typically based on the position of the category in a list of all categories (sorted, or in order of first appearance, depending on the implementation). As a result, n categories need only about log2(n) columns, rather than the n columns required by one-hot encoding.
Example
Here is an example Python implementation of binary encoding using the category_encoders library −
import pandas as pd
import category_encoders as ce
# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
# create a binary encoder object and fit it to the data
binary_encoder = ce.BinaryEncoder(cols=['color'])
binary_encoder.fit(df['color'])
# transform the categorical variable using the binary encoder
encoded_data = binary_encoder.transform(df['color'])
# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)
# print the encoded data
print(df)
In this example, we first create a Pandas DataFrame df with a categorical variable 'color'. We then create a BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.
Next, we transform the categorical variable 'color' by calling the transform method on the binary encoder object and assigning the resulting encoded values to a new DataFrame, encoded_data.
Finally, we merge the encoded columns with the original DataFrame df using the concat function along the column axis (axis=1). The resulting DataFrame contains the original 'color' column along with the encoded binary columns.
When you run the code, it will produce the following output −
   color  color_0  color_1
0    red        0        1
1  green        1        0
2   blue        1        1
3    red        0        1
4  green        1        0
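Binary encoding is most useful for categorical variables with a moderate to large number of categories, where one-hot encoding would create too many columns: the number of binary columns grows only logarithmically with the number of categories. The trade-off is that the individual columns are harder to interpret, because each column is shared by several categories.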
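To make the mechanics concrete, the output above can be reproduced by hand with pandas alone. This is only an illustrative sketch: it assumes categories are numbered 1, 2, 3, ... in order of first appearance, and it is not claimed to be how category_encoders is implemented internally.

```python
import math
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red", "green"], name="color")

# Assumption: number the categories 1, 2, 3, ... in order of first appearance
codes = pd.factorize(colors)[0] + 1

# Number of binary digits needed to write the largest code
n_bits = max(1, math.ceil(math.log2(codes.max() + 1)))

# One column per binary digit, most significant digit first
binary_columns = pd.DataFrame({
    f"color_{bit}": (codes >> (n_bits - 1 - bit)) & 1
    for bit in range(n_bits)
})

print(binary_columns)
```

With three categories, two columns are enough; a variable with 1,000 categories would need only 10 binary columns, compared with 1,000 one-hot columns.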