Scikit Learn Tutorial

Scikit Learn - Randomized Decision Trees

This chapter will help you understand randomized decision trees in Sklearn.

Randomized Decision Tree algorithms

As we know, a decision tree (DT) is usually trained by recursively splitting the data. Because single trees are prone to overfitting, they have been transformed into random forests by training many trees over various subsamples of the data. The sklearn.ensemble module provides the following two algorithms based on randomized decision trees −

The Random Forest algorithm

For each feature under consideration, it computes the locally optimal feature/split combination. In a random forest, each decision tree in the ensemble is built from a sample drawn with replacement (a bootstrap sample) from the training set. The forest then collects a prediction from each tree and selects the final answer by means of voting. It can be used for both classification and regression tasks.
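The following sketch illustrates the voting idea on a toy dataset by inspecting the fitted trees through the estimators_ attribute. Note that scikit-learn's RandomForestClassifier actually averages the class probabilities of its trees rather than taking a hard majority vote, so this is an illustration of the classic description, not the exact library mechanics.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples = 200, n_features = 4, centers = 3, random_state = 0)
clf = RandomForestClassifier(n_estimators = 5, random_state = 0).fit(X, y)

# Each fitted tree is available in clf.estimators_; collect their votes.
per_tree = np.stack([tree.predict(X[:5]) for tree in clf.estimators_])
votes = np.apply_along_axis(
   lambda col: np.bincount(col.astype(int)).argmax(), 0, per_tree)
print(votes)               # majority vote of the individual trees
print(clf.predict(X[:5]))  # the ensemble's own prediction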

Classification with Random Forest

For creating a random forest classifier, the Scikit-learn module provides sklearn.ensemble.RandomForestClassifier. While building a random forest classifier, the main parameters this module uses are ‘max_features’ and ‘n_estimators’.

Here, ‘max_features’ is the size of the random subset of features to consider when splitting a node. If we set this parameter’s value to None, it will consider all the features rather than a random subset. On the other hand, ‘n_estimators’ is the number of trees in the forest. The higher the number of trees, the better the result will generally be, but the computation will also take longer.
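As a quick illustration (with arbitrarily chosen settings, not part of the original example), the following sketch varies these two parameters on a synthetic dataset and compares cross-validated accuracy.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples = 500, n_features = 20, random_state = 0)
# max_features = None uses all features at every split; 'sqrt' uses a
# random subset. More trees usually help, at a higher computational cost.
for max_features, n_estimators in [(None, 10), ('sqrt', 10), ('sqrt', 100)]:
   clf = RandomForestClassifier(n_estimators = n_estimators,
      max_features = max_features, random_state = 0)
   score = cross_val_score(clf, X, y, cv = 5).mean()
   print(max_features, n_estimators, round(score, 3))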

Implementation example

In the following example, we are building a random forest classifier by using sklearn.ensemble.RandomForestClassifier and also checking its accuracy by using the cross_val_score function.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
RFclf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, random_state = 0)
scores = cross_val_score(RFclf, X, y, cv = 5)
scores.mean()

Output

0.9997

Example

We can also build a Random Forest classifier on a well-known dataset. In the following example we are using the iris dataset, loaded from the UCI repository. We will also find its accuracy score and confusion matrix.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names = headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
RFclf = RandomForestClassifier(n_estimators = 50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Output

Confusion Matrix:
[[14 0 0]
[ 0 18 1]
[ 0 0 12]]
Classification Report:
                  precision recall f1-score support
Iris-setosa       1.00        1.00  1.00     14
Iris-versicolor   1.00        0.95  0.97     19
Iris-virginica    0.92        1.00  0.96     12

micro avg         0.98        0.98  0.98     45
macro avg         0.97        0.98  0.98     45
weighted avg      0.98        0.98  0.98     45

Accuracy: 0.9777777777777777

Regression with Random Forest

For creating a random forest regression, the Scikit-learn module provides sklearn.ensemble.RandomForestRegressor. While building a random forest regressor, it will use the same parameters as used by sklearn.ensemble.RandomForestClassifier.

Implementation example

In the following example, we are building a random forest regressor by using sklearn.ensemble.RandomForestRegressor and also predicting for new values by using the predict() method.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
RFregr = RandomForestRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
RFregr.fit(X, y)

Output

RandomForestRegressor(
   bootstrap = True, criterion = 'mse', max_depth = 10,
   max_features = 'auto', max_leaf_nodes = None,
   min_impurity_decrease = 0.0, min_impurity_split = None,
   min_samples_leaf = 1, min_samples_split = 2,
   min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
   oob_score = False, random_state = 0, verbose = 0, warm_start = False
)

Once fitted, we can predict from the regression model as follows −

print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[98.47729198]
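As an optional follow-up (an added sanity check, not part of the original example), we can confirm that the forest relies mainly on the two informative features that make_regression generated, via the fitted model's feature_importances_ attribute. This continues the session fitted above.

import numpy as np
# The first two features were informative (n_informative = 2), so their
# importances should dominate the rest.
print(np.round(RFregr.feature_importances_, 3))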

Extra-Tree Methods

For each feature under consideration, it selects a random value for the split. The benefit of using extra-tree methods is that they reduce the variance of the model a bit more. The disadvantage of using these methods is that they slightly increase the bias.
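To make the contrast concrete, the following sketch (illustrative settings only) fits both ensembles on the same synthetic regression data and compares their cross-validated scores.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0)
# Extra trees draw split thresholds at random instead of searching for
# the locally optimal threshold, which is the key training difference.
for Model in (RandomForestRegressor, ExtraTreesRegressor):
   reg = Model(n_estimators = 100, random_state = 0)
   print(Model.__name__, round(cross_val_score(reg, X, y, cv = 5).mean(), 3))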

Classification with Extra-Tree Method

For creating a classifier using the Extra-tree method, the Scikit-learn module provides sklearn.ensemble.ExtraTreesClassifier. It uses the same parameters as used by sklearn.ensemble.RandomForestClassifier. The only difference is in the way, discussed above, they build the trees.

Implementation example

In the following example, we are building an extra-trees classifier by using sklearn.ensemble.ExtraTreesClassifier and also checking its accuracy by using the cross_val_score function.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
X, y = make_blobs(n_samples = 10000, n_features = 10, centers = 100, random_state = 0)
ETclf = ExtraTreesClassifier(n_estimators = 10, max_depth = None, min_samples_split = 10, random_state = 0)
scores = cross_val_score(ETclf, X, y, cv = 5)
scores.mean()

Output

1.0

Example

We can also build a classifier using the Extra-Tree method on another dataset. In the following example we are using the Pima Indians Diabetes dataset, loaded from a local CSV file.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 150
max_features = 5
ETclf = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(ETclf, X, Y, cv=kfold)
print(results.mean())

Output

0.7551435406698566

Regression with Extra-Tree Method

For creating an Extra-Tree regression, the Scikit-learn module provides sklearn.ensemble.ExtraTreesRegressor. While building an extra-tree regressor, it will use the same parameters as used by sklearn.ensemble.ExtraTreesClassifier.

Implementation example

In the following example, we are applying sklearn.ensemble.ExtraTreesRegressor on the same data as we used while creating the random forest regressor. Let’s see the difference in the output.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
ETregr = ExtraTreesRegressor(max_depth = 10, random_state = 0, n_estimators = 100)
ETregr.fit(X, y)

Output

ExtraTreesRegressor(bootstrap = False, criterion = 'mse', max_depth = 10,
   max_features = 'auto', max_leaf_nodes = None,
   min_impurity_decrease = 0.0, min_impurity_split = None,
   min_samples_leaf = 1, min_samples_split = 2,
   min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None,
   oob_score = False, random_state = 0, verbose = 0, warm_start = False)

Example

Once fitted, we can predict from the regression model as follows −

print(ETregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

Output

[85.50955817]