Machine Learning - Decision Tree Algorithm

The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain or the most reduction in entropy. Information gain is a measure of the amount of information gained by splitting the data at a particular feature, while entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node.
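
To make these measures concrete, here is a minimal, self-contained sketch of entropy and information gain that shows how the gain varies with the split threshold on a single feature. It uses the standard textbook definitions and is purely illustrative; it is not how scikit-learn implements the algorithm internally.

import numpy as np

def entropy(y):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    # Entropy of the parent minus the size-weighted entropy of the
    # two children produced by the boolean split `mask`
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child

# One feature, four samples: the threshold 1.5 separates the two
# classes perfectly, so it yields the maximum gain of 1 bit
x = np.array([1.0, 1.2, 1.9, 2.3])
y = np.array([0, 0, 1, 1])
for threshold in [1.1, 1.5, 2.1]:
    print(threshold, information_gain(y, x <= threshold))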

An example of a binary decision tree for predicting whether a person is fit or unfit, given various information such as age, eating habits, and exercise habits, is shown below −

[Figure: a decision tree predicting whether a person is fit or unfit]

In the above decision tree, the questions are decision nodes and the final outcomes are leaf nodes.

Types of Decision Tree Algorithm

There are two main types of Decision Tree algorithms −

  1. Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.

  2. Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value. A minimal sketch contrasting the two types is given after this list.
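
As a minimal sketch of the difference, consider toy one-feature data with default hyperparameters; DecisionTreeRegressor is scikit-learn's regression counterpart to the classifier used later in this chapter −

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]   # a single input feature

# Classification tree: the target is a discrete class label
clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])
print(clf.predict([[0.5]]))   # predicts a class label (here class 0)

# Regression tree: the target is a continuous value
reg = DecisionTreeRegressor().fit(X, [0.0, 0.5, 1.0, 1.5])
print(reg.predict([[0.5]]))   # predicts a numeric value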

Implementation in Python

Let’s implement the Decision Tree algorithm in Python using the Iris dataset, a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we will import the necessary libraries and load the dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

We then create an instance of the Decision Tree classifier and train it on the training set −

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

We can evaluate the performance of the classifier by calculating its accuracy −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

We can visualize the Decision Tree using the Matplotlib library −

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()

The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. The size of the figure is set with the figsize argument to plt.figure, and plt.show displays the plot.
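
As an aside, when a graphical plot is not needed, the same decision rules can be printed as indented text with scikit-learn's export_text utility −

from sklearn.tree import export_text

# Print the learned decision rules in a plain-text, if/else-style layout
print(export_text(dtc, feature_names=list(iris.feature_names)))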

Complete Implementation Example

Given below is the complete implementation example of the Decision Tree classification algorithm in Python using the Iris dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()

This will create a plot of the Decision Tree that looks like this −

[Figure: plot of the trained Decision Tree]
Accuracy: 0.9777777777777777

As you can see, the plot shows the structure of the Decision Tree, with each internal node representing a decision based on the value of a feature and each leaf node representing a final class prediction. The color of each node indicates the majority class of the samples in that node, and the samples count displayed in each node shows how many training samples reach it.