Decision tree: We are going to build a decision tree to predict the species a given iris plant (represented by the four measurements Sepal.Width, Sepal.Length, Petal.Length, Petal.Width) belongs to. Try to go through all the steps and to understand what each step is doing. You will need the following packages:

install.packages("rpart")
install.packages("rpart.plot")

Load the iris dataset and split it into a training set and test set:

data(iris)

set.seed(2)
ind=sample(2,nrow(iris),replace=TRUE,prob=c(0.80,0.20))

iris.training=iris[ind==1,]
iris.test=iris[ind==2,]
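
You can check how many plants ended up in each set (the split is random, so the counts only approximately follow the 80/20 proportions):

table(ind)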

Build the decision tree model, using entropy as the impurity measure:

library(rpart)
library(rpart.plot)
tree=rpart(data=iris.training,Species~Sepal.Width+Sepal.Length+Petal.Length+Petal.Width,method="class",control=rpart.control(minsplit=10,minbucket=5),parms=list(split="information"))
rpart.plot(tree,main="Classification tree for the iris data (using 80% of data as training set)",extra=101)

We see that the first two splits of the decision tree use Petal.Length and Petal.Width. We may want to look again at the exploratory analysis we did before building the neural network; it may help you better understand the structure of the iris tree. In particular, let us plot Petal.Width versus Petal.Length:

#install.packages("ggplot2")
library(ggplot2)
qplot(Petal.Length, Petal.Width, data=iris, colour=Species, size=I(4))

We clearly see that the three groups (setosa, versicolor, virginica) are well separated using only Petal.Length and Petal.Width.

Predict the species for the “iris.training” dataset:

predictions=predict(tree,newdata=iris.training,type="class")
actuals=iris.training$Species
table(actuals,predictions)
##             predictions
## actuals      setosa versicolor virginica
##   setosa         36          0         0
##   versicolor      0         40         1
##   virginica       0          3        35
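
For comparison with the test accuracy computed below, you can also compute the training accuracy from this confusion matrix (a quick check):

train.cm=table(actuals,predictions)
print(sum(diag(train.cm))/sum(train.cm))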

Predict the species for the “iris.test” dataset:

predictions=predict(tree,newdata=iris.test,type="class")
actuals=iris.test$Species
confusion.matrix=table(actuals,predictions)
print(confusion.matrix)
##             predictions
## actuals      setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0          9         0
##   virginica       0          2        10

This confusion matrix shows that the decision tree has made 2 mistakes in 35 predictions. The test sets used for the neural network and the decision tree are the same, so we may compare their performances.
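
You can count these mistakes directly from the off-diagonal entries of the confusion matrix:

print(sum(confusion.matrix)-sum(diag(confusion.matrix)))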

Let us finally compute the accuracy of the tree:

accuracy=sum(diag(confusion.matrix))/sum(confusion.matrix)
print(accuracy)
## [1] 0.9428571

Above we used (minsplit=10, minbucket=5) as stopping rules. If you want to use pruning instead to find the best tree, first grow the largest possible tree and then inspect its complexity-parameter (cp) table:

tree=rpart(data=iris,Species~Sepal.Width+Sepal.Length+Petal.Length+Petal.Width,method="class",control=rpart.control(minsplit=1,minbucket=1,cp=0.000001),parms=list(split="information"))
rpart.plot(tree,main="Bigest Tree",extra=101)

printcp(tree)
## 
## Classification tree:
## rpart(formula = Species ~ Sepal.Width + Sepal.Length + Petal.Length + 
##     Petal.Width, data = iris, method = "class", parms = list(split = "information"), 
##     control = rpart.control(minsplit = 1, minbucket = 1, cp = 1e-06))
## 
## Variables actually used in tree construction:
## [1] Petal.Length Petal.Width  Sepal.Length Sepal.Width 
## 
## Root node error: 100/150 = 0.66667
## 
## n= 150 
## 
##        CP nsplit rel error xerror     xstd
## 1 5.0e-01      0      1.00   1.20 0.048990
## 2 4.4e-01      1      0.50   0.82 0.060970
## 3 2.0e-02      2      0.06   0.10 0.030551
## 4 1.0e-02      3      0.04   0.10 0.030551
## 5 5.0e-03      6      0.01   0.09 0.029086
## 6 1.0e-06      8      0.00   0.10 0.030551
plotcp(tree)

ptree=prune(tree,cp=2.0e-02)
rpart.plot(ptree,main="Pruned Tree",extra=101)
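
Here cp=0.02 was presumably read off the printcp/plotcp output above by hand. As a minimal sketch (one common but not unique choice), you could instead select the cp value with the smallest cross-validated error xerror programmatically; note that this may give a different, larger tree than the one pruned above:

best.cp=tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]   # cptable is the table shown by printcp
ptree.min=prune(tree,cp=best.cp)
rpart.plot(ptree.min,main="Tree pruned at minimal xerror",extra=101)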

Questions: Play around with different hyperparameter values [split (“gini”), minsplit, minbucket, a reduced formula such as Species~Petal.Length+Petal.Width] in the hope that your model will perform better. Use the test set as a validation set (see the sketch below).
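
For example, one variant you might try (a sketch only, with illustrative settings that are not necessarily better) is a Gini-based tree using only the two petal measurements, evaluated on the test set:

tree2=rpart(data=iris.training,Species~Petal.Length+Petal.Width,method="class",control=rpart.control(minsplit=20,minbucket=7),parms=list(split="gini"))
predictions2=predict(tree2,newdata=iris.test,type="class")
cm2=table(iris.test$Species,predictions2)
print(sum(diag(cm2))/sum(cm2))   # test-set accuracy of this variant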

Random forest: We are going to build a random forest to predict the species a given iris plant belongs to. Try to go through all the steps and to understand what each step is doing. You will need the following package:

install.packages("randomForest")

Load the package:

library(randomForest)

Build the random forest model (using the Gini index):

random_forest=randomForest(data=iris.training,Species~Sepal.Width+Sepal.Length+Petal.Length+Petal.Width,impurity='gini',ntree=200,replace=TRUE)
print(random_forest)
## 
## Call:
##  randomForest(formula = Species ~ Sepal.Width + Sepal.Length +      Petal.Length + Petal.Width, data = iris.training, impurity = "gini",      ntree = 200, replace = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5.22%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         36          0         0  0.00000000
## versicolor      0         39         2  0.04878049
## virginica       0          4        34  0.10526316

The above confusion matrix and the out-of-bag (OOB) error rate [6/115*100 = 5.22%] are obtained as follows: for each plant in the training dataset, make a random forest prediction using only the trees that did not use that particular plant in their bootstrap training subset. Each tree is thus evaluated on the data NOT drawn into its bootstrap sample (its “out-of-bag” sample).
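
You can retrieve these OOB predictions yourself: calling predict() on a randomForest object without a newdata argument returns the OOB predictions, so the counts in the confusion matrix above can be reproduced directly:

oob.predictions=predict(random_forest)   # no newdata: OOB predictions
table(iris.training$Species,oob.predictions)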

Let us plot the misclassification error rate as a function of the number of trees used in the random forest:

plot(random_forest)
legend("top",cex=0.8,legend=colnames(random_forest$err.rate),lty=c(1,2,3),col=c(1,2,3),horiz=T)

The x-axis is the number of trees and the y-axis is the misclassification error rate. The black solid line represents the overall OOB error. The coloured lines are the per-class errors (one for each species: red = setosa, green = versicolor, blue = virginica). We see that about 25 trees are enough:

random_forest=randomForest(data=iris.training,Species~Sepal.Width+Sepal.Length+Petal.Length+Petal.Width,impurity='gini',ntree=25,replace=TRUE)

Predict the species for the “iris.training” dataset (using all trees in the random forest):

predictions=predict(random_forest,newdata=iris.training,type="class")
actuals=iris.training$Species
table(actuals,predictions)
##             predictions
## actuals      setosa versicolor virginica
##   setosa         36          0         0
##   versicolor      0         41         0
##   virginica       0          0        38

We see 100% accuracy: the random forest has perfectly learned the training data.

Predict the species for the “iris.test” dataset:

predictions=predict(random_forest,newdata=iris.test,type="class")
actuals=iris.test$Species
confusion.matrix=table(actuals,predictions)
print(confusion.matrix)
##             predictions
## actuals      setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0          8         1
##   virginica       0          2        10

Finally the accuracy:

accuracy=sum(diag(confusion.matrix))/sum(confusion.matrix)
print(accuracy)
## [1] 0.9142857

We see that the accuracy of the random forest is slightly lower than that of the decision tree. This may happen when the number of feature variables (4 for the iris data) is small.

Compute and plot the importance (of all the variables):

importance(random_forest)
##              MeanDecreaseGini
## Sepal.Width          1.522782
## Sepal.Length         9.836576
## Petal.Length        27.427145
## Petal.Width         36.896801
varImpPlot(random_forest)

As in the decision tree, we see that Petal.Length and Petal.Width are the most important predictors.

Questions: Play around with different parameter values [impurity (‘entropy’), ntree, replace (FALSE), sampsize (size of sample to draw; default: N if replace=TRUE, 0.632*N otherwise), mtry (the number of variables randomly sampled as candidates at each split; default: sqrt(4)=2)] in the hope that your model will perform better. Use the test set as a validation set (see the sketch below).
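
For example, one variant you might try (a sketch only, with illustrative settings that are not necessarily better) uses more trees, all four variables as split candidates, and sampling without replacement, evaluated on the test set:

rf2=randomForest(data=iris.training,Species~Sepal.Width+Sepal.Length+Petal.Length+Petal.Width,ntree=500,mtry=4,replace=FALSE,sampsize=floor(0.632*nrow(iris.training)))
predictions2=predict(rf2,newdata=iris.test,type="class")
cm2=table(iris.test$Species,predictions2)
print(sum(diag(cm2))/sum(cm2))   # test-set accuracy of this variant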