Machine Learning - Entropy

Entropy is a concept that originates in thermodynamics and has since been applied in various fields, including information theory, statistics, and machine learning. In machine learning, entropy is used as a measure of the impurity or randomness of a set of data. Specifically, entropy is used in decision tree algorithms to decide how to split the data so as to create more homogeneous subsets. In this article, we will discuss entropy in machine learning, its properties, and its implementation in Python.

Entropy is defined as a measure of disorder or randomness in a system. In the context of decision trees, entropy is used as a measure of the impurity of a node. A node is considered pure if all the examples in it belong to the same class. In contrast, a node is impure if it contains examples from multiple classes.

To calculate entropy, we first need to define the probability of each class in the dataset. Let p(i) be the probability of an example belonging to class i. If we have k classes, then the total entropy of the system, denoted by H(S), is calculated as follows:

H\left ( S \right )=-\sum_{i=1}^{k} p\left ( i \right )\times \log_{2}\left ( p\left ( i \right ) \right )

where the sum is taken over all k classes. This equation is called the Shannon entropy.
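
To see the formula in action, here is a minimal sketch of a helper that computes the Shannon entropy from a list of class probabilities. The function name shannon_entropy is purely illustrative, and the sketch assumes NumPy is available.

import numpy as np

# Illustrative helper: Shannon entropy of a probability distribution
def shannon_entropy(probs):
    probs = np.asarray(probs, dtype=float)
    # Drop zero probabilities, since 0 * log2(0) is taken to be 0
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# A uniform distribution over two classes carries exactly 1 bit of entropy
print(shannon_entropy([0.5, 0.5]))   # 1.0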

For example, suppose we have a dataset with 100 examples, of which 60 belong to class A and 40 belong to class B. Then the probability of class A is 0.6 and the probability of class B is 0.4. The entropy of the dataset is then:

H\left ( S \right )=-\left ( 0.6\times \log_{2}\left ( 0.6 \right )+0.4\times \log_{2}\left ( 0.4 \right ) \right )=0.971
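
As a quick sanity check, this arithmetic can be reproduced in a couple of lines of Python:

import math

# Entropy of a 60/40 class split
h = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))
print(round(h, 3))   # 0.971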

If all the examples in the dataset belong to the same class, then the entropy is 0, indicating a pure node. On the other hand, if the examples are evenly distributed across all classes, then the entropy reaches its maximum value of log2(k) bits, indicating a maximally impure node.
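
These two extremes are easy to verify numerically. The sketch below uses scipy.stats.entropy with base=2 as a convenient library routine, assuming SciPy is installed; the shannon_entropy helper sketched above would work just as well.

from scipy.stats import entropy

# A pure node: every example belongs to one class -> entropy 0
print(entropy([1.0, 0.0], base=2))        # 0.0
# An even 50/50 split between two classes -> maximum entropy of 1 bit
print(entropy([0.5, 0.5], base=2))        # 1.0
# An even split between three classes -> log2(3) bits
print(entropy([1/3, 1/3, 1/3], base=2))   # ~1.585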

In decision tree algorithms, entropy is used to determine the best split at each node. The goal is to create a split that results in the most homogeneous subsets. This is done by calculating the entropy of each possible split and selecting the split that results in the lowest weighted entropy (equivalently, the highest information gain).

For example, suppose we have a dataset with two features, X1 and X2, and the goal is to predict the class label, Y. We start by calculating the entropy of the entire dataset, H(S). Next, we calculate the entropy of each possible split based on each feature. For example, we could split the data based on the value of X1 or the value of X2. The entropy of each split is the weighted average of the entropies of the subsets it produces, calculated as follows:

H\left ( X_{1} \right )=p_{1}\times H\left ( S_{1} \right )+p_{2}\times H\left ( S_{2} \right )

H\left ( X_{2} \right )=p_{3}\times H\left ( S_{3} \right )+p_{4}\times H\left ( S_{4} \right )

where p1 and p2 are the proportions of examples that fall into the subsets S1 and S2 produced by splitting on X1 (and likewise p3 and p4 for the subsets S3 and S4 produced by splitting on X2), and H(S1), H(S2), H(S3), and H(S4) are the entropies of those subsets.
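
To make the weighted average concrete, here is a small sketch that scores a split from the label arrays of the subsets it produces. The helper names and toy labels are made up purely for illustration.

import numpy as np

def entropy(y):
    # Shannon entropy of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs))

def split_entropy(subsets):
    # Weighted average of subset entropies; weights are the subset proportions
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * entropy(s) for s in subsets)

# Suppose splitting on X1 produces these two subsets of labels
S1 = np.array(["A", "A", "A", "B"])
S2 = np.array(["B", "B", "B", "A"])
print(f"H(X1) = {split_entropy([S1, S2]):.3f}")   # 0.811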

We then select the split that results in the lower weighted entropy, which is given by:

H_{split}=\begin{cases} H\left ( X_{1} \right ) & \text{if } H\left ( X_{1} \right )\leq H\left ( X_{2} \right )\\ H\left ( X_{2} \right ) & \text{otherwise} \end{cases}
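
The selection step itself is just a comparison. A minimal sketch, using made-up values for the two candidate split entropies:

# Hypothetical weighted entropies of the two candidate splits
H_X1 = 0.811
H_X2 = 0.954

# Keep the candidate with the lower weighted entropy
H_split = min(H_X1, H_X2)
best_feature = "X1" if H_X1 <= H_X2 else "X2"
print(f"Best split: {best_feature} (H_split = {H_split:.3f})")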

This split is then used to create the child nodes of the decision tree, and the process is repeated recursively until all nodes are pure or a stopping criterion is met.
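
In practice this recursion is rarely coded by hand. For example, scikit-learn's DecisionTreeClassifier can be told to use entropy as its splitting criterion; the toy feature values and labels below are made up for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two features (X1, X2) and a binary label Y
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# Grow a tree that selects splits using the entropy criterion
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 1], [2, 0]]))   # training points keep their labels: [0 1]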

Example

Let’s take an example to understand how it can be implemented in Python. Here we will use the "iris" dataset:

from sklearn.datasets import load_iris
import numpy as np

# Load iris dataset
iris = load_iris()

# Extract features and target
X = iris.data
y = iris.target

# Define a function to calculate entropy
def entropy(y):
    """Compute the Shannon entropy of an array of class labels."""
    n = len(y)
    # Count how many examples belong to each class
    _, counts = np.unique(y, return_counts=True)
    # Convert the counts into class probabilities
    probs = counts / n
    # Apply the Shannon entropy formula
    return -np.sum(probs * np.log2(probs))

# Calculate the entropy of the target variable
target_entropy = entropy(y)
print(f"Target entropy: {target_entropy:.3f}")

The above code loads the iris dataset, extracts the features and target, and defines a function to calculate entropy. The entropy() function takes a vector of target values and returns the entropy of the set.

The function first calculates the number of examples in the set and the count of each class. It then converts these counts into class proportions and plugs them into the Shannon entropy formula. Finally, the code calculates the entropy of the target variable in the iris dataset and prints it to the console. Since the iris target contains three classes with 50 examples each, the result is log2(3) ≈ 1.585 bits.

Output

When you execute this code, it will produce the following output:

Target entropy: 1.585
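
As a follow-up, the same entropy() function can be used to score a candidate split on the iris data, tying the implementation back to the split-selection procedure described earlier. The feature (petal length) and the threshold of 2.5 cm below are chosen purely for illustration.

from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs))

# Candidate split: petal length (feature index 2) below vs. above 2.5 cm
mask = X[:, 2] < 2.5
left, right = y[mask], y[~mask]

# Weighted entropy of the two subsets produced by the split
H_split = len(left) / len(y) * entropy(left) + len(right) / len(y) * entropy(right)
print(f"Entropy before split: {entropy(y):.3f}")        # 1.585
print(f"Weighted entropy after split: {H_split:.3f}")   # 0.667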