Updated 17th September 2014 to reflect changes in the R packages.
Source: http://mkseo.pe.kr/stats/?p=220.
Using Random Forests to predict which of three different cultivars a wine is derived from. Download the wine data set from the UCI Machine Learning Repository.
# install if necessary
install.packages("randomForest")
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

# load the wine dataset
data <- read.table("https://davetang.org/file/wine.csv", header=TRUE, sep=",")

# check out the data
str(data)
'data.frame': 178 obs. of 14 variables:
 $ class               : Factor w/ 3 levels "class_1","class_2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ alcohol             : num 14.2 13.2 13.2 14.4 13.2 ...
 $ malic_acid          : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ ash                 : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ alcalinity_of_ash   : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ magnesium           : int 127 100 101 113 118 112 96 121 97 98 ...
 $ total_phenols       : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ flavanoids          : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ nonflavanoid_phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ proanthocyanins     : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ colour_intensity    : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ hue                 : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ od280_od315         : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ proline             : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

# table of wine classes
table(data$class)
class_1 class_2 class_3
     59      71      48

# prepare test and training set
test <- data[ c(1:10, 60:69, 131:140), ]
train <- data[ c(11:59, 70:130, 141:178), ]

# run the Random Forest classifier
r <- randomForest(class ~ ., data=train, importance=TRUE, do.trace=100)
ntree      OOB      1      2      3
  100:   1.35%  0.00%  3.28%  0.00%
  200:   1.35%  0.00%  3.28%  0.00%
  300:   1.35%  0.00%  3.28%  0.00%
  400:   1.35%  0.00%  3.28%  0.00%
  500:   1.35%  0.00%  3.28%  0.00%

# check out the results
r

Call:
 randomForest(formula = class ~ ., data = train, importance = TRUE, do.trace = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 1.35%
Confusion matrix:
        class_1 class_2 class_3 class.error
class_1      49       0       0  0.00000000
class_2       1      59       1  0.03278689
class_3       0       0      38  0.00000000

names(r)
 [1] "call"            "type"            "predicted"       "err.rate"        "confusion"
 [6] "votes"           "oob.times"       "classes"         "importance"      "importanceSD"
[11] "localImportance" "proximity"       "ntree"           "mtry"            "forest"
[16] "y"               "test"            "inbag"           "terms"

r$call
randomForest(formula = class ~ ., data = train, importance = TRUE, do.trace = 100)

r$type
[1] "classification"

head(r$predicted)
     11      12      13      14      15      16
class_1 class_1 class_1 class_1 class_1 class_1
Levels: class_1 class_2 class_3

# two incorrect predictions, as seen above
table(r$predicted == train$class)
FALSE  TRUE
    2   146

head(r$err.rate)
            OOB    class_1   class_2    class_3
[1,] 0.12500000 0.17647059 0.1481481 0.00000000
[2,] 0.11458333 0.06451613 0.1666667 0.08695652
[3,] 0.15517241 0.14285714 0.2200000 0.06451613
[4,] 0.12195122 0.12820513 0.1538462 0.06250000
[5,] 0.11029412 0.11111111 0.1090909 0.11111111
[6,] 0.09558824 0.11111111 0.1090909 0.05555556

r$confusion
        class_1 class_2 class_3 class.error
class_1      49       0       0  0.00000000
class_2       1      59       1  0.03278689
class_3       0       0      38  0.00000000

head(r$votes)
     class_1     class_2     class_3
11 0.9892473 0.010752688 0.000000000
12 0.7752809 0.151685393 0.073033708
13 0.8901734 0.086705202 0.023121387
14 0.9768786 0.011560694 0.011560694
15 0.9900990 0.004950495 0.004950495
16 0.9890110 0.010989011 0.000000000

head(r$oob.times)
[1] 186 178 173 173 202 182

r$classes
[1] "class_1" "class_2" "class_3"

head(r$importance)
                        class_1      class_2     class_3 MeanDecreaseAccuracy MeanDecreaseGini
alcohol            0.1525347592 0.0928248591 0.023430629          0.093836483       13.7934327
malic_acid         0.0021385789 0.0002331466 0.014002376          0.004613423        1.8313973
ash               -0.0002340754 0.0010472105 0.003502219          0.001340635        0.8750481
alcalinity_of_ash  0.0185050050 0.0023801169 0.017335616          0.011531540        3.0038507
magnesium          0.0410753409 0.0055964424 0.006044563          0.017276572        3.1034223
total_phenols      0.0883397880 0.0026153845 0.043181163          0.041287427        5.7167077

head(r$importanceSD)
                        class_1      class_2     class_3 MeanDecreaseAccuracy
alcohol            0.0084789054 0.0059100068 0.003047730         0.0046735969
malic_acid         0.0011525786 0.0008391581 0.002548371         0.0008575549
ash                0.0007456213 0.0005977076 0.001228513         0.0004172596
alcalinity_of_ash  0.0026827364 0.0012745702 0.002734288         0.0014002414
magnesium          0.0042727455 0.0012352037 0.001371906         0.0017315456
total_phenols      0.0072416021 0.0012143907 0.004509257         0.0031221191

r$localImportance
NULL

r$proximity
NULL

r$ntree
[1] 500

r$mtry
[1] 3

names(r$forest)
 [1] "ndbigtree"  "nodestatus" "bestvar"    "treemap"    "nodepred"   "xbestsplit" "pid"        "cutoff"
 [9] "ncat"       "maxcat"     "nrnodes"    "ntree"      "nclass"     "xlevels"

head(r$y)
     11      12      13      14      15      16
class_1 class_1 class_1 class_1 class_1 class_1
Levels: class_1 class_2 class_3

r$test
NULL

r$inbag
NULL
Analysis of the Area Under the Curve (AUC)
The rattle package provides a very nice GUI to the R workspace for conducting data analysis. It includes a GUI for random forests, and below I follow the steps that rattle performs when analysing the area under the curve (AUC).
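For reference, the GUI itself can be launched as follows; this is a minimal sketch, and rattle may prompt you to install additional GUI toolkit dependencies on first run.

# install if necessary
install.packages('rattle')
library(rattle)

# launch the rattle GUI; the wine data can be loaded via the Data tab
# and a random forest built under the Model tab
rattle()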
# install if necessary
install.packages('pROC')
library(pROC)

# there is a test dataset within the pROC package,
# containing subarachnoid hemorrhage data,
# which can be loaded by running
# data(aSAH)

# calculate the area under the curve (AUC)
## Default S3 method:
# roc(response, predictor, controls, cases, ...)
# for the test dataset it would be
# roc(aSAH$outcome, aSAH$s100b)

# using flavanoids as my predictor, for class 1 and 2;
# only the first two levels are used, the third is ignored
roc(data$class, data$flavanoids)
# Call:
# roc.default(response = data$class, predictor = data$flavanoids)
#
# Data: data$flavanoids in 59 controls (data$class class_1) > 71 cases (data$class class_2).
# Area under the curve: 0.8855

# calculate the AUC confidence interval for flavanoids
# for the test data set it would be
# ci.auc(aSAH$outcome, aSAH$s100b)
ci.auc(data$class, data$flavanoids)
# 95% CI: 0.8266-0.9444 (DeLong)

# list the importance of the variables
rn <- round(importance(r), 2)

# sort by mean decrease accuracy
rn[order(rn[,4], decreasing=TRUE),]
                     class_1 class_2 class_3 MeanDecreaseAccuracy MeanDecreaseGini
colour_intensity       17.12   18.52   16.99                24.64            17.09
flavanoids             16.94    9.98   19.09                22.77            15.43
alcohol                18.95   16.91    7.55                21.12            14.77
proline                19.40   12.02    7.80                21.00            13.36
od280_od315            14.44    9.14   16.30                18.14            12.74
hue                    10.30    6.60   12.38                13.76             6.79
total_phenols          10.74   -0.80    8.57                11.31             4.67
magnesium               9.90    5.95    4.42                10.63             2.86
alcalinity_of_ash       7.51    1.10    5.79                 8.77             2.62
proanthocyanins         4.76    1.23    6.93                 7.83             1.88
malic_acid              2.94    0.29    6.22                 6.26             2.03
nonflavanoid_phenols    4.08   -0.72    2.80                 4.57             0.92
ash                     0.91    2.03    1.48                 2.64             0.96

# plot the OOB ROC curve
# install if necessary
install.packages('verification')
library(verification)

# how good at predicting class_2?
aucc <- roc.area(as.integer(train$class=='class_2'), r$votes[,2])$A
roc.plot(as.integer(train$class=='class_2'), r$votes[,2], main="")
legend("bottomright", bty="n", sprintf("Area Under the Curve (AUC) = %1.3f", aucc))
title(main="OOB ROC Curve Random Forest for class_2 wine.csv")
Prediction
data.predict <- predict(r, test)
t <- table(observed=test[,'class'], predict=data.predict)
prop.table(t,1)
         predict
observed  class_1 class_2 class_3
  class_1     1.0     0.0     0.0
  class_2     0.0     0.9     0.1
  class_3     0.0     0.0     1.0
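The confusion table can also be collapsed into a single overall accuracy figure, which is easier to track across runs; a minimal sketch using the t object created above:

# overall test-set accuracy: correct predictions (the diagonal)
# divided by the total number of test cases
sum(diag(t)) / sum(t)
# [1] 0.9666667, i.e. 29 of the 30 test wines correctly classified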
Does increasing the number of trees decrease the error rate?
r_1000 <- randomForest(class ~ ., data=train, importance=TRUE, ntree=1000)
r_1000

Call:
 randomForest(formula = class ~ ., data = train, importance = TRUE, ntree = 1000)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 1.35%
Confusion matrix:
        class_1 class_2 class_3 class.error
class_1      49       0       0  0.00000000
class_2       1      59       1  0.03278689
class_3       0       0      38  0.00000000

predict_1000 <- predict(r_1000, test)
t_1000 <- table(observed=test[,'class'], predict=predict_1000)
prop.table(t_1000,1)
         predict
observed  class_1 class_2 class_3
  class_1     1.0     0.0     0.0
  class_2     0.0     0.9     0.1
  class_3     0.0     0.0     1.0

plot(r_1000, type="l")
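The per-tree error rates behind that plot are stored in the err.rate matrix, so the question can also be answered numerically rather than by eye; a minimal sketch:

# OOB and per-class error rates after the final (1000th) tree
tail(r_1000$err.rate, 1)

# the number of trees at which the OOB error was lowest
which.min(r_1000$err.rate[, "OOB"])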
Compared to a classification tree
library(tree)
tree1 <- tree(class ~ alcohol + malic_acid + ash + alcalinity_of_ash + magnesium +
              total_phenols + flavanoids + nonflavanoid_phenols + proanthocyanins +
              colour_intensity + hue, data = data)
summary(tree1)

Classification tree:
tree(formula = class ~ alcohol + malic_acid + ash + alcalinity_of_ash +
    magnesium + total_phenols + flavanoids + nonflavanoid_phenols +
    proanthocyanins + colour_intensity + hue, data = data)
Variables actually used in tree construction:
[1] "flavanoids"       "colour_intensity" "malic_acid"       "alcohol"          "magnesium"
Number of terminal nodes:  7
Residual mean deviance:  0.07393 = 12.64 / 171
Misclassification error rate: 0.01685 = 3 / 178

plot(tree1)
text(tree1)
Only five characteristics were actually used to classify the three classes of wine, despite eleven being supplied to the model, and the misclassification error rate was very low (3 out of 178).
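To check whether seven terminal nodes overfit, the tree package offers cross-validation and pruning. Here is a minimal sketch; the choice of prune.misclass as the cost function is my assumption, matching the classification setting.

# cross-validate the tree, using misclassification as the pruning criterion
set.seed(31)
cv1 <- cv.tree(tree1, FUN = prune.misclass)

# misclassification count against tree size
plot(cv1$size, cv1$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "Misclassifications")

# prune to the best size, e.g. if five nodes turn out to suffice
# pruned <- prune.misclass(tree1, best = 5)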
This work is licensed under a Creative Commons Attribution 4.0 International License.
Is there any way of getting confidence scores or votes for each record in the test set in R using Random Forest?
The votes are stored in the randomForest object. If your object is called r, then you can get the votes for each record via r$votes.
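For the test set specifically, predict() can return vote fractions rather than class labels; a minimal sketch using the r and test objects from the post:

# vote fractions for each test record
# (type="vote" with norm.votes=TRUE returns proportions, not raw counts)
test_votes <- predict(r, test, type = "vote", norm.votes = TRUE)
head(test_votes)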
I would like to generate a ROC curve from the test data in your wine example. I believe ROC curves from training data are too optimistic. Can you provide the correct syntax in the following commands for doing this?
aucc <- roc.area(as.integer(train$class=='class_2'), r$votes[,2])$A
roc.plot(as.integer(train$class=='class_2'), r$votes[,2], main="")
Thank you for your help,
David
I learned about the ROCR package in R and recommend that for creating ROC curves. I have an example in my GitHub repository using votes as the prediction: https://github.com/davetang/learning_random_forest#random-forest-on-breast-cancer-data.
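A minimal sketch of that approach with the wine test set; the test_votes object from the reply above is an assumption, and any per-record score for class_2 would do:

# install if necessary
install.packages('ROCR')
library(ROCR)

# ROC curve for class_2 on the held-out test set
pred <- prediction(test_votes[, "class_2"], test$class == "class_2")
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "Test-set ROC curve for class_2")

# extract the AUC
performance(pred, "auc")@y.values[[1]]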
Dave, when I run data.predict <- predict(r, test), R never finishes running. Do you know what's going on?
I just re-ran the code in this post and data.predict works for me.