Scikit Learn Tutorial

Scikit Learn - Decision Trees

In this chapter, we will learn about the learning method in Sklearn known as decision trees.

Decision trees (DTs) are among the most powerful non-parametric supervised learning methods. They can be used for classification and regression tasks. The main goal of DTs is to create a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. Decision trees have two main entities: one is the root node, where the data splits, and the other is the decision nodes or leaves, where we get the final output.

Decision Tree Algorithms

The different Decision Tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find, for every node, the categorical feature that will yield the largest information gain for categorical targets.
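
To make "information gain" concrete, here is a minimal, self-contained Python sketch (an illustration written for this chapter, not ID3's actual implementation) that computes the entropy-based gain of splitting a toy categorical target on a categorical feature −

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    # Entropy reduction achieved by splitting the labels on a categorical feature
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum((len(g) / total) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'overcast']
play = ['no', 'no', 'yes', 'yes', 'no', 'yes']
print(information_gain(outlook, play))   # ID3 splits on the feature that maximizes this value

ID3 evaluates this gain for every candidate feature at every node and splits on the winner.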

It lets the tree grow to its maximum size and then, to improve the tree's performance on unseen data, applies a pruning step. The output of this algorithm is a multiway tree.

C4.5

It is the successor to ID3 and dynamically defines a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. That is the reason it removes the restriction to categorical features. It converts the ID3-trained tree into sets of ‘IF-THEN’ rules.

In order to determine the sequence in which these rules should be applied, the accuracy of each rule is evaluated first.
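
scikit-learn does not implement C4.5 itself, but its export_text utility prints a fitted tree in a readable, rule-like form, which gives a similar view of the learned decision logic. Here is a small sketch on the built-in iris dataset −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each indented branch reads like an 'IF feature <= threshold THEN ...' rule
print(export_text(clf, feature_names=list(iris.feature_names)))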

C5.0

It works similarly to C4.5 but uses less memory and builds smaller rulesets. It is more accurate than C4.5.

CART

It is called the Classification and Regression Trees (CART) algorithm. It generates binary splits by choosing, at each node, the feature and threshold that yield the largest information gain, measured by the Gini index.

Homogeneity is measured with the Gini index: the lower the value of the Gini index, the higher the homogeneity of a node. It is like the C4.5 algorithm, but it differs in that it supports numerical target variables (regression) and does not compute rule sets.
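
As a quick illustration of the Gini measure (a hand-rolled sketch, not scikit-learn's internal code), the Gini impurity of a node is 1 − Σ p_k², so a perfectly homogeneous node scores 0 −

from collections import Counter

def gini_impurity(labels):
    # 1 minus the sum of squared class proportions; 0 means a pure node
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(['Man', 'Man', 'Man']))             # 0.0 -> perfectly homogeneous
print(gini_impurity(['Man', 'Woman', 'Man', 'Woman']))  # 0.5 -> maximally mixed for two classes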

Classification with decision trees

In this case, the decision variables are categorical.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeClassifier for performing multiclass classification on a dataset.

Parameters

The following table lists the parameters used by the sklearn.tree.DecisionTreeClassifier module; a short usage sketch follows the list −

1. criterion − string, optional, default=“gini”. It represents the function used to measure the quality of a split. The supported criteria are “gini” and “entropy”. The default is “gini”, which stands for Gini impurity, while “entropy” stands for information gain.

2. splitter − string, optional, default=“best”. It tells the model which strategy, “best” or “random”, to use to choose the split at each node.

3. max_depth − int or None, optional, default=None. This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

4. min_samples_split − int or float, optional, default=2. This parameter provides the minimum number of samples required to split an internal node.

5. min_samples_leaf − int or float, optional, default=1. This parameter provides the minimum number of samples required to be at a leaf node.

6. min_weight_fraction_leaf − float, optional, default=0. With this parameter, the model gets the minimum weighted fraction of the sum of weights required to be at a leaf node.

7. max_features − int, float, string or None, optional, default=None. It gives the model the number of features to be considered when looking for the best split.

8. random_state − int, RandomState instance or None, optional, default=None. This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are as follows −
   int − In this case, random_state is the seed used by the random number generator.
   RandomState instance − In this case, random_state is the random number generator.
   None − In this case, the random number generator is the RandomState instance used by np.random.

9. max_leaf_nodes − int or None, optional, default=None. This parameter lets a tree grow with max_leaf_nodes in best-first fashion. The default is None, which means there would be an unlimited number of leaf nodes.

10. min_impurity_decrease − float, optional, default=0. This value works as a criterion for a node to split, because the model will split a node if the split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value.

11. min_impurity_split − float, default=1e-7. It represents the threshold for early stopping in tree growth.

12. class_weight − dict, list of dicts, “balanced” or None, default=None. It represents the weights associated with classes, in the form {class_label: weight}. If we use the default option, all the classes are supposed to have weight one. On the other hand, if you choose class_weight=“balanced”, it will use the values of y to automatically adjust the weights.

13. presort − bool, optional, default=False. It tells the model whether to presort the data to speed up the finding of the best splits during fitting. The default is False, but if set to True, it may slow down the training process.
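
As the promised usage sketch (the specific values here are hypothetical, chosen only for illustration), several of the parameters above can be combined when constructing the classifier −

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
   criterion='entropy',    # use information gain instead of the default 'gini'
   splitter='best',        # evaluate all candidate splits at each node
   max_depth=3,            # never grow beyond three levels
   min_samples_split=4,    # a node needs at least 4 samples to be split
   min_samples_leaf=2,     # every leaf must keep at least 2 samples
   random_state=0          # fixed seed for reproducible results
)
print(clf)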

Attributes

The following table lists the attributes used by the sklearn.tree.DecisionTreeClassifier module; a short sketch reading them follows the list −

1. feature_importances_ − array of shape = [n_features]. This attribute returns the feature importances.

2. classes_ − array of shape = [n_classes] or a list of such arrays. It represents the class labels, i.e. the single-output problem, or a list of arrays of class labels, i.e. the multi-output problem.

3. max_features_ − int. It represents the deduced value of the max_features parameter.

4. n_classes_ − int or list. It represents the number of classes, i.e. the single-output problem, or a list of the number of classes for every output, i.e. the multi-output problem.

5. n_features_ − int. It gives the number of features when the fit() method is performed.

6. n_outputs_ − int. It gives the number of outputs when the fit() method is performed.
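
As the sketch promised above (the toy data is made up for illustration), these attributes become available once fit() has been called −

from sklearn.tree import DecisionTreeClassifier

X = [[165, 19], [175, 32], [136, 35], [174, 65]]
Y = ['Man', 'Woman', 'Woman', 'Man']
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

print(clf.feature_importances_)   # importance of each of the two features
print(clf.classes_)               # the class labels, e.g. ['Man' 'Woman']
print(clf.max_features_)          # deduced value of the max_features parameter
print(clf.n_classes_)             # 2
print(clf.n_outputs_)             # 1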

Methods

The following table lists the methods used by the sklearn.tree.DecisionTreeClassifier module; a short sketch follows the list −

1. apply(self, X[, check_input]) − This method returns the index of the leaf each sample is predicted as.

2. decision_path(self, X[, check_input]) − As the name suggests, this method returns the decision path in the tree.

3. fit(self, X, y[, sample_weight, …]) − The fit() method builds a decision tree classifier from the given training set (X, y).

4. get_depth(self) − As the name suggests, this method returns the depth of the decision tree.

5. get_n_leaves(self) − As the name suggests, this method returns the number of leaves of the decision tree.

6. get_params(self[, deep]) − We can use this method to get the parameters of the estimator.

7. predict(self, X[, check_input]) − It predicts the class value for X.

8. predict_log_proba(self, X) − It predicts the class log-probabilities of the input samples, X.

9. predict_proba(self, X[, check_input]) − It predicts the class probabilities of the input samples, X.

10. score(self, X, y[, sample_weight]) − As the name implies, the score() method returns the mean accuracy on the given test data and labels.

11. set_params(self, **params) − We can set the parameters of the estimator with this method.
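
And the promised sketch exercising a few of these methods on toy data (the expected outputs noted in the comments assume an unconstrained tree that fits the four samples perfectly) −

from sklearn.tree import DecisionTreeClassifier

X = [[165, 19], [175, 32], [136, 35], [174, 65]]
Y = ['Man', 'Woman', 'Woman', 'Man']
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

print(clf.get_depth())      # depth of the fitted tree
print(clf.get_n_leaves())   # number of leaves
print(clf.apply(X))         # index of the leaf each sample ends up in
print(clf.score(X, Y))      # 1.0 here, since the tree memorizes all 4 samples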

Implementation Example

The Python script below uses the sklearn.tree.DecisionTreeClassifier module to construct a classifier for predicting male or female from our dataset, which has 25 samples and two features, namely ‘height’ and ‘length of hair’ −

from sklearn import tree
from sklearn.model_selection import train_test_split

# 25 samples with two features: height and length of hair
X = [[165, 19], [175, 32], [136, 35], [174, 65], [141, 28], [176, 15],
     [131, 32], [166, 6], [128, 32], [179, 10], [136, 34], [186, 2],
     [126, 25], [176, 28], [112, 38], [169, 9], [171, 36], [116, 25],
     [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32],
     [136, 35]]
Y = ['Man', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Man',
     'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man',
     'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Man', 'Woman',
     'Woman']
data_feature_names = ['height', 'length of hair']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)
prediction = DTclf.predict([[135, 29]])
print(prediction)

Output

['Woman']

We can also predict the probability of each class by using the predict_proba() method, as follows −

Example

prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output

[[0. 1.]]

Regression with decision trees

In this case, the decision variables are continuous.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeRegressor for applying decision trees to regression problems.

Parameters

The parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the ‘criterion’ parameter. For the DecisionTreeRegressor module, the ‘criterion: string, optional default= “mse”’ parameter has the following values (a short sketch comparing them follows the list) −

  1. mse − It stands for the mean squared error. It is equal to variance reduction as a feature selection criterion. It minimises the L2 loss using the mean of each terminal node.

  2. friedman_mse − It also uses mean squared error but with Friedman’s improvement score.

  3. mae − It stands for the mean absolute error. It minimizes the L1 loss using the median of each terminal node.
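
The promised sketch comparing two of these criteria on toy data; note that recent scikit-learn releases have renamed ‘mse’ to ‘squared_error’ and ‘mae’ to ‘absolute_error’, so the strings below assume an older version matching this text −

from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5]]
y = [0.5, 1.1, 1.9, 3.2, 5.0]

# 'mse' minimizes the L2 loss using leaf means, 'mae' the L1 loss using leaf medians
reg_mse = DecisionTreeRegressor(criterion='mse', max_depth=2).fit(X, y)
reg_mae = DecisionTreeRegressor(criterion='mae', max_depth=2).fit(X, y)
print(reg_mse.predict([[2.5]]), reg_mae.predict([[2.5]]))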

Another difference is that it does not have the ‘class_weight’ parameter.

Attributes

The attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘classes_’ and ‘n_classes_’ attributes.

Methods

The methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘predict_log_proba()’ and ‘predict_proba()’ methods.

Implementation Example

The fit() method in the decision tree regression model takes floating-point values of y. Let’s see a simple implementation example using Sklearn.tree.DecisionTreeRegressor −

from sklearn import tree
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)

Once fitted, we can use this regression model to make predictions as follows −

DTreg.predict([[4, 5]])

Output

array([1.5])
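
Continuing the same session, the score() method returns the coefficient of determination R² on the given data; since this tiny tree reproduces both training targets exactly, scoring on the training set gives 1.0 −

DTreg.score(X, y)

Output

1.0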