Scikit Learn Tutorial

Scikit Learn - Decision Trees

In this chapter, we will learn about the learning method in Sklearn which is called decision trees.

Decision Trees (DTs) are the most powerful non-parametric supervised learning method. They can be used for both classification and regression tasks. The main goal of a DT is to create a model predicting target variable values by learning simple decision rules deduced from the data features. Decision trees have two main entities: one is the root node, where the data splits, and the other is the decision nodes or leaf nodes, where we get the final output.
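
To make the idea of simple decision rules concrete, here is a minimal sketch (the toy samples and labels are invented for illustration) that fits a small tree and prints the rules it learned using sklearn.tree.export_text −

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: two features per sample, two classes
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = ['no', 'no', 'no', 'yes']

clf = DecisionTreeClassifier().fit(X, y)

# Render the learned decision rules as plain text
print(export_text(clf, feature_names=['feature_a', 'feature_b']))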

Decision Tree Algorithms

The different decision tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find, for every node, the categorical feature that will yield the largest information gain for the categorical targets.

It lets the tree grow to its maximum size and then, to improve the tree's ability on unseen data, applies a pruning step. The output of this algorithm is a multiway tree.
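
Since ID3 ranks candidate splits by information gain, a minimal self-contained sketch of that computation (plain Python, with invented labels) may help −

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent minus the weighted entropy of the child nodes
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

parent = ['yes', 'yes', 'no', 'no']
# A split separating the classes perfectly achieves the maximum gain of 1.0
print(information_gain(parent, [['yes', 'yes'], ['no', 'no']]))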

C4.5

It is the successor to ID3 and dynamically defines a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. That is the reason it removes the restriction to categorical features. It converts the ID3-trained tree into sets of 'IF-THEN' rules.

In order to determine the sequence in which these rules should be applied, the accuracy of each rule is evaluated first.
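
C4.5 is commonly described as ranking candidate splits by gain ratio, which normalises information gain by the entropy of the split sizes so that splits into many tiny branches are penalised. A rough self-contained sketch (invented labels) −

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(parent, children):
    # Information gain divided by the 'split information' of the partition
    total = len(parent)
    gain = entropy(parent) - sum(len(c) / total * entropy(c) for c in children)
    split_info = -sum(len(c) / total * log2(len(c) / total) for c in children)
    return gain / split_info

parent = ['yes', 'yes', 'no', 'no']
print(gain_ratio(parent, [['yes', 'yes'], ['no', 'no']]))  # 1.0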

C5.0

It works similarly to C4.5 but uses less memory and builds smaller rulesets. It is more accurate than C4.5.

CART

It is called the Classification and Regression Trees algorithm. It basically generates binary splits by using, at each node, the feature and threshold that yield the purest child nodes, as measured by the Gini index.

Homogeneity is measured by the Gini index: the lower the value of the Gini index, the higher the homogeneity (a value of 0 means the node contains a single class). CART is similar to the C4.5 algorithm, but it differs in that it supports numerical target variables (regression) and does not compute rule sets.
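
To see how the Gini index scores node homogeneity, here is a small self-contained sketch (the labels are invented for illustration) −

from collections import Counter

def gini(labels):
    # Gini impurity: 0 for a perfectly homogeneous node, larger when mixed
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(['a', 'a', 'a', 'a']))  # 0.0 -> pure node
print(gini(['a', 'a', 'b', 'b']))  # 0.5 -> maximally mixed for two classes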

Classification with decision trees

In this case, the decision variables are categorical.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeClassifier for performing multiclass classification on a dataset.

Parameters

The following list describes the parameters used by the sklearn.tree.DecisionTreeClassifier module −

1. criterion − string, optional, default="gini"
It represents the function to measure the quality of a split. Supported criteria are "gini" and "entropy". The default "gini" stands for Gini impurity while "entropy" stands for information gain.

2. splitter − string, optional, default="best"
It tells the model which strategy, "best" or "random", to use to choose the split at each node.

3. max_depth − int or None, optional, default=None
This parameter decides the maximum depth of the tree. The default value None means the nodes will expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

4. min_samples_split − int or float, optional, default=2
This parameter provides the minimum number of samples required to split an internal node.

5. min_samples_leaf − int or float, optional, default=1
This parameter provides the minimum number of samples required to be at a leaf node.

6. min_weight_fraction_leaf − float, optional, default=0.
With this parameter, the model gets the minimum weighted fraction of the sum of weights required to be at a leaf node.

7. max_features − int, float, string or None, optional, default=None
It gives the model the number of features to be considered when looking for the best split.

8. random_state − int, RandomState instance or None, optional, default=None
This parameter represents the seed of the pseudo-random number generator used while shuffling the data. The options are −
int − In this case, random_state is the seed used by the random number generator.
RandomState instance − In this case, random_state is the random number generator.
None − In this case, the random number generator is the RandomState instance used by np.random.

9. max_leaf_nodes − int or None, optional, default=None
This parameter lets a tree grow with max_leaf_nodes in best-first fashion. The default None means there would be an unlimited number of leaf nodes.

10. min_impurity_decrease − float, optional, default=0.
This value works as a criterion for a node to split, because the model will split a node if the split induces a decrease of the impurity greater than or equal to min_impurity_decrease.

11. min_impurity_split − float, default=1e-7
It represents the threshold for early stopping in tree growth.

12. class_weight − dict, list of dicts, "balanced" or None, default=None
It represents the weights associated with classes, in the form {class_label: weight}. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight: "balanced", it will use the values of y to automatically adjust the weights.

13. presort − bool, optional, default=False
It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False, and setting it to True may slow down the training process.
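
As a quick illustration of some of these parameters in use (the values below are arbitrary choices for demonstration, not recommendations) −

from sklearn.tree import DecisionTreeClassifier

# Arbitrary settings chosen only to show the parameters above in context
clf = DecisionTreeClassifier(
   criterion = 'entropy',     # use information gain instead of Gini impurity
   splitter = 'best',         # evaluate all candidate splits at each node
   max_depth = 4,             # cap the depth of the tree
   min_samples_split = 4,     # need at least 4 samples to split a node
   min_samples_leaf = 2,      # every leaf must hold at least 2 samples
   random_state = 0           # make the results reproducible
)
print(clf)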

Attributes

The following list describes the attributes provided by the sklearn.tree.DecisionTreeClassifier module −

1. feature_importances_ − array of shape [n_features]
This attribute will return the feature importances.

2. classes_ − array of shape [n_classes] or a list of such arrays
It represents the class labels, i.e. the single output problem, or a list of arrays of class labels, i.e. the multi-output problem.

3. max_features_ − int
It represents the inferred value of the max_features parameter.

4. n_classes_ − int or list
It represents the number of classes, i.e. the single output problem, or a list of the number of classes for every output, i.e. the multi-output problem.

5. n_features_ − int
It gives the number of features when the fit() method is performed.

6. n_outputs_ − int
It gives the number of outputs when the fit() method is performed.
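
For instance, after fitting a classifier on a small invented dataset, these attributes can be inspected as follows (a sketch; note that n_features_ has been renamed n_features_in_ in recent scikit-learn releases) −

from sklearn.tree import DecisionTreeClassifier

# Invented toy samples: [height, length of hair]
X = [[165, 19], [175, 32], [136, 35], [174, 65]]
y = ['Man', 'Woman', 'Woman', 'Man']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.classes_)              # ['Man' 'Woman']
print(clf.n_classes_)            # 2
print(clf.n_features_)           # 2
print(clf.feature_importances_)  # relative importance of each feature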

Methods

The following list describes the methods provided by the sklearn.tree.DecisionTreeClassifier module −

1. apply(self, X[, check_input])
This method will return the index of the leaf reached by each sample.

2. decision_path(self, X[, check_input])
As the name suggests, this method will return the decision path in the tree.

3. fit(self, X, y[, sample_weight, …])
The fit() method will build a decision tree classifier from the given training set (X, y).

4. get_depth(self)
As the name suggests, this method will return the depth of the decision tree.

5. get_n_leaves(self)
As the name suggests, this method will return the number of leaves of the decision tree.

6. get_params(self[, deep])
We can use this method to get the parameters of the estimator.

7. predict(self, X[, check_input])
It will predict the class value for X.

8. predict_log_proba(self, X)
It will predict the class log-probabilities of the input samples provided by X.

9. predict_proba(self, X[, check_input])
It will predict the class probabilities of the input samples provided by X.

10. score(self, X, y[, sample_weight])
As the name suggests, the score() method will return the mean accuracy on the given test data and labels.

11. set_params(self, **params)
We can use this method to set the parameters of the estimator.
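
A short sketch showing a few of these methods on invented toy data −

from sklearn.tree import DecisionTreeClassifier

X = [[165, 19], [175, 32], [136, 35], [174, 65]]
y = ['Man', 'Woman', 'Woman', 'Man']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.get_depth())     # depth of the fitted tree
print(clf.get_n_leaves())  # number of leaves
print(clf.apply(X))        # leaf index reached by each training sample
print(clf.score(X, y))     # mean accuracy on (X, y); 1.0 on the training data here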

Implementation Example

The Python script below will use the sklearn.tree.DecisionTreeClassifier module to construct a classifier for predicting male or female from our dataset having 25 samples and two features, namely 'height' and 'length of hair' −

from sklearn import tree

# 25 samples with two features each: height and length of hair
X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],
     [131,32],[166,6],[128,32],[179,10],[136,34],[186,2],
     [126,25],[176,28],[112,38],[169,9],[171,36],[116,25],
     [196,25],[196,38],[126,40],[197,20],[150,25],[140,32],[136,35]]
Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man',
     'Woman','Man','Woman','Man','Woman','Woman','Woman','Man',
     'Woman','Woman','Man','Woman','Woman','Man','Man','Woman','Woman']
data_feature_names = ['height','length of hair']

# Build the classifier and fit it on the dataset
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)

# Predict the class of a new, unseen sample
prediction = DTclf.predict([[135,29]])
print(prediction)

Output

['Woman']

We can also predict the probability of each class by using the predict_proba() method as follows −

Example

prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output

[[0. 1.]]
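
Similarly, predict_log_proba() returns the natural logarithm of these probabilities (a sketch; note that the log of a zero probability is -inf, so NumPy will emit a divide-by-zero warning here) −

prediction = DTclf.predict_log_proba([[135,29]])
print(prediction)  # [[-inf   0.]]  i.e. the log of [0. 1.]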

Regression with decision trees

In this case, the decision variables are continuous.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeRegressor for applying decision trees on regression problems.

Parameters

The parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the 'criterion' parameter. For the DecisionTreeRegressor module, criterion − string, optional, default="mse" − can take the following values (a short sketch follows the list) −

  1. mse − It stands for the mean squared error. It is equal to variance reduction as a feature selection criterion. It minimises the L2 loss using the mean of each terminal node.

  2. friedman_mse − It also uses mean squared error but with Friedman's improvement score.

  3. mae − It stands for the mean absolute error. It minimises the L1 loss using the median of each terminal node.
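
A minimal sketch of the criterion parameter in use, with invented data (note that in recent scikit-learn releases 'mse' and 'mae' were renamed 'squared_error' and 'absolute_error') −

from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6]]      # invented feature values
y = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]      # invented targets

# A depth-1 tree splits once; 'mse' chooses the split that minimises variance
reg_mse = DecisionTreeRegressor(criterion='mse', max_depth=1).fit(X, y)
reg_mae = DecisionTreeRegressor(criterion='mae', max_depth=1).fit(X, y)

print(reg_mse.predict([[2.5]]))  # mean of the left leaf
print(reg_mae.predict([[2.5]]))  # median of the left leaf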

Another difference is that it does not have the 'class_weight' parameter.

Attributes

The attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'classes_' and 'n_classes_' attributes.

Methods

The methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'predict_log_proba()' and 'predict_proba()' methods.

Implementation Example

The fit() method in a decision tree regression model will take floating point values of y. Let's see a simple implementation example by using Sklearn.tree.DecisionTreeRegressor −

from sklearn import tree

# Two training samples with floating point targets
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)

Once fitted, we can use this regression model to make predictions as follows −

DTreg.predict([[4, 5]])

Output

array([1.5])
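
Unlike the classifier, the regressor's score() method returns the coefficient of determination R² rather than accuracy. A quick check on the training data, which an unrestricted tree fits exactly −

print(DTreg.score(X, y))  # 1.0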