R Concise Tutorial

R - Random Forest

In the random forest approach, a large number of decision trees are created. To classify an observation, it is fed into every tree, and each tree votes for a class. The class that receives the most votes (the majority vote) is used as the final output for that observation.
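The majority vote described above can be sketched in a few lines of base R. The vote vector below is hypothetical (it is not produced by a real forest); it only illustrates how individual tree predictions are combined into one answer.

```r
# Hypothetical class votes from five trees for a single new observation.
tree_votes <- c("yes", "no", "yes", "yes", "no")

# Count how many trees voted for each class.
vote_counts <- table(tree_votes)

# The forest's prediction is the class with the most votes.
forest_prediction <- names(vote_counts)[which.max(vote_counts)]
print(forest_prediction)
```

In a real forest the same counting happens over hundreds of trees rather than five.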

An error estimate is computed from the cases that were not used while building each tree. This is called the OOB (out-of-bag) error estimate and is reported as a percentage.
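To see where out-of-bag cases come from, the following base R sketch draws one bootstrap sample, as is typically done for each tree, and lists the cases that sample leaves out. The sample size and seed are arbitrary choices for illustration; randomForest() performs this bookkeeping internally.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

n <- 200  # illustrative number of training cases

# Bootstrap sample for one tree: n draws with replacement.
in_bag <- sample(n, size = n, replace = TRUE)

# Cases never drawn are "out of bag" for this tree.
out_of_bag <- setdiff(seq_len(n), in_bag)

# Typically around a third of the cases end up out of bag.
oob_fraction <- length(out_of_bag) / n
print(oob_fraction)
```

Because each tree never saw its out-of-bag cases, predicting them gives an error estimate without a separate test set.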

The R package "randomForest" is used to create random forests.

Install R Package

Use the following command in the R console to install the package. By default, install.packages() also installs the packages it depends on.

install.packages("randomForest")

The package "randomForest" provides the function randomForest(), which is used to create and analyze random forests.

Syntax

The basic syntax for creating a random forest in R is:

randomForest(formula, data)

The parameters are described below:

  1. formula is a formula describing the predictor and response variables.

  2. data is the name of the data set used.
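As a small illustration of the formula argument, the base R sketch below builds the formula used later in this tutorial and splits it into its response and predictor names. The variable names come from the readingSkills example; the helper variables (vars, response, predictors) are introduced here only for illustration.

```r
# In a formula, the response goes on the left of ~ and the predictors on the right.
f <- nativeSpeaker ~ age + shoeSize + score

# all.vars() lists the variable names in order of appearance.
vars <- all.vars(f)
response <- vars[1]
predictors <- vars[-1]

print(response)
print(predictors)
```

randomForest() also accepts optional arguments after data, such as ntree (number of trees) and mtry (number of variables tried at each split).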

Input Data

We will use the data set named readingSkills, which ships with the party package, to create a random forest. Each row records a person's reading-skills score ("score") together with their "age", their "shoeSize", and whether they are a native speaker ("nativeSpeaker").

Here is the sample data.

# Load the party package. It will automatically load other
# required packages.
library(party)

# Print some records from data set readingSkills.
print(head(readingSkills))

When we execute the above code, it produces the following result:

  nativeSpeaker   age   shoeSize      score
1           yes     5   24.83189   32.29385
2           yes     6   25.95238   36.63105
3            no    11   30.42170   49.60593
4           yes     7   28.66450   40.28456
5           yes    11   31.88207   55.46085
6           yes    10   30.07843   52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................

Example

We will use the randomForest() function to create the random forest and inspect its results.

# Load the party package. It will automatically load other
# required packages.
library(party)
library(randomForest)

# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
           data = readingSkills)

# View the forest results.
print(output.forest)

# Importance of each predictor.
print(importance(output.forest, type = 2))

When we execute the above code, it produces the following result:

Call:
 randomForest(formula = nativeSpeaker ~ age + shoeSize + score,
                 data = readingSkills)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 1%
Confusion matrix:
    no yes class.error
no  99   1        0.01
yes  1  99        0.01
         MeanDecreaseGini
age              13.95406
shoeSize         18.91006
score            56.73051
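Once a forest is fitted, it can classify new observations with predict(). The sketch below assumes the randomForest and party packages are installed; the new observation and the seed are made up for illustration, and because forests are random the exact vote fractions will vary between runs and package versions.

```r
library(randomForest)

# Load only the readingSkills data set from the party package.
data("readingSkills", package = "party")

set.seed(1)  # arbitrary seed so the forest is reproducible
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                              data = readingSkills)

# A made-up new person (values are illustrative, not taken from the data set).
new_person <- data.frame(age = 8L, shoeSize = 27.5, score = 45)

# Predicted class, and the fraction of trees voting for each class.
predicted_class <- predict(output.forest, newdata = new_person)
vote_fractions <- predict(output.forest, newdata = new_person, type = "prob")

print(predicted_class)
print(vote_fractions)
```

The vote fractions show how decisive the majority vote was for this observation.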

Conclusion

From the random forest shown above, we can conclude that score is by far the most important predictor of whether someone is a native speaker (it has the largest MeanDecreaseGini), followed by shoeSize and age. The OOB error estimate is only 1%, which suggests the model classifies about 99% of cases correctly.
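The accuracy figure can be checked by hand from the confusion matrix printed in the example output. The sketch below re-enters those four counts manually (it does not read them from a fitted model), with rows as actual classes and columns as predicted classes.

```r
# Counts copied from the confusion matrix in the example output.
confusion <- matrix(c(99,  1,
                       1, 99),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("no", "yes"),
                                    predicted = c("no", "yes")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
error_rate <- 1 - accuracy

print(accuracy)
print(error_rate)
```

Here 198 of the 200 cases are on the diagonal, which gives the 99% accuracy and 1% error quoted above.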