Machine Learning With Python 简明教程

Classification Algorithms - Random Forest

Introduction

随机森林是一种监督学习算法,用于分类和回归。但它主要用于分类问题。我们知道森林是由树木组成的,树越多,森林就越健壮。同样,随机森林算法针对数据样本创建决策树,然后从每个决策树中获取预测,最后通过投票方式选择最佳解决方案。它是一种集成方法,优于单个决策树,因为它通过平均结果来减少过度拟合。

Random forest is a supervised learning algorithm which is used for both classification as well as regression. But however, it is mainly used for classification problems. As we know that a forest is made up of trees and more trees means more robust forest. Similarly, random forest algorithm creates decision trees on data samples and then gets the prediction from each of them and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces the over-fitting by averaging the result.

Working of Random Forest Algorithm

借助以下步骤,我们可以了解随机森林算法的工作原理:

We can understand the working of Random Forest algorithm with the help of following steps −

  1. Step 1 − First, start with the selection of random samples from a given dataset.

  2. Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.

  3. Step 3 − In this step, voting will be performed for every predicted result.

  4. Step 4 − At last, select the most voted prediction result as the final prediction result.

以下图表将说明其工作原理——

The following diagram will illustrate its working −

test set

Implementation in Python

首先,从导入必要的 Python 包开始——

First, start with importing necessary Python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

接下来,从其 Web 链接下载 iris 数据集,如下所示——

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

接下来,我们需要按照以下方式为数据集分配列名称 −

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

现在,我们需要按照以下方式将数据集读入 Pandas 数据框 −

Now, we need to read dataset to pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()

sepal-length

sepal-width

petal-length

petal-width

Class

0

5.1

3.5

1.4

0.2

Iris-setosa

1

4.9

3.0

1.4

0.2

Iris-setosa

2

4.7

3.2

1.3

0.2

Iris-setosa

3

4.6

3.1

1.5

0.2

Iris-setosa

4

5.0

3.6

1.4

0.2

Iris-setosa

数据预处理将借助以下脚本行执行 −

Data Preprocessing will be done with the help of following script lines −

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

接下来,我们将数据分为训练和测试拆分。以下代码将数据集拆分为 70% 的训练数据和 30% 的测试数据——

Next, we will divide the data into train and test split. The following code will split the dataset into 70% training data and 30% of testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

接下来,使用 sklearn 的 RandomForestClassifier 类按照以下方式训练模型 −

Next, train the model with the help of RandomForestClassifier class of sklearn as follows −

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)

最后,我们需要进行预测。可以借助以下脚本执行此操作 −

At last, we need to make prediction. It can be done with the help of following script −

y_pred = classifier.predict(X_test)

接下来,按照以下方式打印结果 −

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Output

Confusion Matrix:
[
   [14 0 0]
   [ 0 18 1]
   [ 0 0 12]
]
Classification Report:
               precision       recall     f1-score       support
Iris-setosa        1.00         1.00        1.00         14
Iris-versicolor    1.00         0.95        0.97         19
Iris-virginica     0.92         1.00        0.96         12
micro avg          0.98         0.98        0.98         45
macro avg          0.97         0.98        0.98         45
weighted avg       0.98         0.98        0.98         45

Accuracy: 0.9777777777777777

Pros and Cons of Random Forest

Pros

以下是 Random Forest 算法的优点 −

The following are the advantages of Random Forest algorithm −

  1. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.

  2. Random forests work well for a large range of data items than a single decision tree does.

  3. Random forest has less variance then single decision tree.

  4. Random forests are very flexible and possess very high accuracy.

  5. Scaling of data does not require in random forest algorithm. It maintains good accuracy even after providing data without scaling.

  6. Scaling of data does not require in random forest algorithm. It maintains good accuracy even after providing data without scaling.

Cons

以下是 Random Forest 算法的缺点 −

The following are the disadvantages of Random Forest algorithm −

  1. Complexity is the main disadvantage of Random forest algorithms.

  2. Construction of Random forests are much harder and time-consuming than decision trees.

  3. More computational resources are required to implement Random Forest algorithm.

  4. It is less intuitive in case when we have a large collection of decision trees .

  5. The prediction process using random forests is very time-consuming in comparison with other algorithms.