Supervised Classification: an Exploration with R

We will use the twoClass dataset from Applied Predictive Modeling, the book by M. Kuhn and K. Johnson, to illustrate the most classical supervised classification algorithms. We will rely on some advanced R packages: the ggplot2 package for the figures and the caret package for the learning part. caret provides a unified interface to many other packages. We will also use h2o, a package dedicated to in-memory learning on large datasets, for its deep learning algorithm.

library("plyr")
library("dplyr")
library("ggplot2")
library("grid")
library("gridExtra")
library("caret")
library("h2o")
library("doFuture")
library("patchwork")
registerDoFuture()
plan(multisession)  # "multisession" replaces the now-deprecated "multiprocess"

The twoClass dataset

We first read the dataset and use ggplot2 to display it.

library(AppliedPredictiveModeling)
library(RColorBrewer)
data(twoClassData)
twoClass <- cbind(as.data.frame(predictors), classes)
twoClassColor <- brewer.pal(3,'Set1')[1:2]
names(twoClassColor) <- c('Class1','Class2')
ggplot(data = twoClass,aes(x = PredictorA, y = PredictorB)) + 
  geom_point(aes(color = classes), size = 6, alpha = .5) +
  scale_colour_manual(name = 'classes', values = twoClassColor) +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  coord_equal()

We create a few functions that will be useful to display our classifiers.

nbp <- 250
PredA <- seq(min(twoClass$PredictorA), max(twoClass$PredictorA), length = nbp)
PredB <- seq(min(twoClass$PredictorB), max(twoClass$PredictorB), length = nbp)
Grid <- expand.grid(PredictorA = PredA, PredictorB = PredB)

PlotGrid <- function(pred, title) {
  surf <- ggplot(data = twoClass, aes(x = PredictorA, y = PredictorB,
                                      color = classes)) +
    geom_tile(data = cbind(Grid, classes = pred), aes(fill = classes)) +
    scale_fill_manual(name = 'classes', values = twoClassColor) +
    ggtitle("Decision region") +
    theme(legend.text = element_text(size = 10)) +
    scale_colour_manual(name = 'classes', values = twoClassColor) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0))
  pts <- ggplot(data = twoClass, aes(x = PredictorA, y = PredictorB,
                                     color = classes)) +
    geom_contour(data = cbind(Grid, classes = pred), aes(z = as.numeric(classes)),
                 color = "red", breaks = c(1.5)) +
    geom_point(size = 4, alpha = .5) +
    ggtitle("Decision boundary") +
    theme(legend.text = element_text(size = 10)) +
    scale_colour_manual(name = 'classes', values = twoClassColor) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    coord_equal()
  surf + pts +
    plot_annotation(title = title)
}

As explained in the introduction, we will use caret for the learning part. This package provides a unified interface to a huge number of classifiers available in R, which makes it a very powerful tool for exploring different models. In particular, it can compute a resampling-based accuracy estimate, with the user choosing the specific resampling methodology. We will use a repeated V-fold strategy with \(10\) folds and \(4\) repetitions, and we will reuse the same seed for each model in order to be sure that the same folds are used.

library("caret")
V <- 10
T <- 4
TrControl <- trainControl(method = "repeatedcv",
                          number = V,
                          repeats = T)

Seed <- 345

Finally, we provide a function that will store the accuracies for every resample and every model. We will also compute an accuracy on the same data as the one used for training, in order to illustrate the over-fitting phenomenon.

ErrsCaret <- function(Model, Name) {
  Errs <- data.frame(t(postResample(predict(Model, newdata = twoClass), twoClass[["classes"]])),
                     Resample = "None", model = Name)
  rbind(Errs, data.frame(Model$resample, model = Name))
}

Errs <- data.frame()

We are now ready to define a function that takes as input the current collection of accuracies, the name of a model, the corresponding formula and method, as well as any further parameters used to specify the model; it trains the model, displays its predictions in a figure, and returns the collection augmented with the new errors.

CaretLearnAndDisplay <- function (Errs, Name, Formula, Method, ...) {
  set.seed(Seed)
  Model <- train(as.formula(Formula), data = twoClass, method = Method, trControl = TrControl, ...)
  Pred <- predict(Model, newdata = Grid)
  print(PlotGrid(Pred, Name))
  Errs <- rbind(Errs, ErrsCaret(Model, Name))
}

We can apply this function to any model available in caret. We will pick a few models and organize them according to the heuristic used to define them. We will distinguish models coming from a statistical point of view, in which one tries to estimate the conditional law and plug it into the Bayes classifier, from those coming from an optimization point of view, in which one tries to enforce a small training error by minimizing a relaxed criterion.
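
As a quick check of what caret offers, we can list the available models and the tuning parameters of a given method. A minimal illustration (the exact list depends on the installed caret version):

length(caret::getModelInfo())  # number of models known to caret (several hundred)
caret::modelLookup("knn")      # tuning parameters of a given method, here k-NN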

Statistical point of view

Generative Modeling

In the generative modeling approach, one proposes to estimate the joint law of the covariates and the label, and to derive the conditional law using the Bayes formula. Generative models differ in the choice of the density estimator (often specified by an intra-class model). We consider here the LDA and QDA methods, in which a Gaussian model is used, and two variants of the Naive Bayes method, in which all the features are assumed to be independent.
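
To make the plug-in step explicit: if \(\pi_k\) denotes the prior probability of class \(k\) and \(f_k\) the intra-class density of the covariates, the Bayes formula gives

\[
\mathbb{P}(Y = k \mid X = x) \;=\; \frac{\pi_k\, f_k(x)}{\sum_{k'} \pi_{k'}\, f_{k'}(x)},
\]

and the plug-in classifier predicts the class maximizing \(\widehat{\pi}_k \widehat{f}_k(x)\). LDA uses Gaussian densities with a common covariance matrix (linear boundary), QDA allows a class-specific covariance (quadratic boundary), and Naive Bayes factorizes each \(f_k\) as a product over the features.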

Errs <- CaretLearnAndDisplay(Errs, "Linear Discrimant Analysis", "classes ~ .", "lda")

Errs <- CaretLearnAndDisplay(Errs, "Quadratic Discrimant Analysis", "classes ~ . ", "qda")

Errs <- CaretLearnAndDisplay(Errs, "Naive Bayes with Gaussian model", "classes ~ .", "nb",
                         tuneGrid = data.frame(usekernel = c(FALSE), fL = c(0), adjust = c(1)))

Errs <- CaretLearnAndDisplay(Errs, "Naive Bayes with kernel density estimates", "classes ~ .", "nb",
                         tuneGrid = data.frame(usekernel = c(TRUE), fL = c(0), adjust = c(1)))

Logistic regression

The most classical model is probably the logistic model, which is the canonical example of a parametric conditional regression estimate.
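
In our two-class setting, the logistic model assumes a linear form for the log-odds,

\[
\log \frac{\mathbb{P}(Y = \text{Class2} \mid X = x)}{\mathbb{P}(Y = \text{Class1} \mid X = x)} \;=\; \beta_0 + \beta_A\, \text{PredictorA} + \beta_B\, \text{PredictorB},
\]

whose coefficients are estimated by maximum likelihood; the decision boundary is the line where the log-odds vanish.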

Errs <- CaretLearnAndDisplay(Errs, "Logistic", "classes ~ .", "glm")

We are not restricted to the two original features but may use any transformation of them. For instance, we may use all the possible monomials up to degree \(2\).

Errs <- CaretLearnAndDisplay(Errs, "Quadratic Logistic", "classes ~ PredictorA + PredictorB + 
                           + I(PredictorA^2) + I(PredictorB^2) 
                           + I(PredictorA*PredictorB)", "glm")

Kernel Methods

In the nearest neighbor method, we need to supply a parameter: \(k\), the number of neighbors used to define the kernel. We will compare visually the solutions obtained for a few values of \(k\).
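
The rule can be read as a plug-in classifier based on a local estimate of the conditional probabilities: writing \(N_k(x)\) for the set of the \(k\) training points closest to \(x\),

\[
\widehat{\mathbb{P}}(Y = c \mid X = x) \;=\; \frac{1}{k} \sum_{i \,:\, X_i \in N_k(x)} \mathbf{1}_{\{Y_i = c\}},
\]

so that small values of \(k\) yield flexible but noisy boundaries, while large values yield smoother but potentially biased ones.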

ErrsKNN <- data.frame()
KNNKS <- c(seq(1, 33, by = 4),seq(37,85, by = 8), seq(101, 200, by = 8))
for (k in KNNKS) {
  ErrsKNN <- CaretLearnAndDisplay(ErrsKNN, sprintf("k-NN with k=%i", k), 
                           "classes ~ .","knn", tuneGrid = data.frame(k = c(k)))
}

Errs <- rbind(Errs, ErrsKNN)

caret provides an automatic way to select the best \(k\), or any other tuning parameter, using the chosen resampling scheme.

Errs <- CaretLearnAndDisplay(Errs, "k-NN with CV choice", "classes ~ .", "knn",
                             tuneGrid = data.frame(k = KNNKS))

We will look more closely at this choice at the end of this document.

Optimization point of view

SVM

Here is first the optimal hyperplane obtained with the plain vanilla (linear) SVM.
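
For reference, with labels recoded as \(y_i \in \{-1, +1\}\), the soft-margin linear SVM solves

\[
\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i
\qquad \text{subject to} \qquad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,
\]

where the cost \(C\) is the tuning parameter exposed by the svmLinear method.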

Errs <- CaretLearnAndDisplay(Errs, "Support Vector Machine", "classes ~ .", "svmLinear")

If we use a polynomial kernel (with automatic degree selection), we can obtain a more complex classifier.

Errs <- CaretLearnAndDisplay(Errs, "Support Vector Machine with polynomial kernel", 
                         "classes ~ .", "svmPoly")

We can also try a Gaussian kernel.
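
In kernlab's parameterization (used under the hood by caret), svmPoly uses the polynomial kernel \(k(x, x') = (\text{scale}\; x^\top x' + \text{offset})^{\text{degree}}\), while svmRadial uses the Gaussian kernel

\[
k(x, x') \;=\; \exp\left(-\sigma \|x - x'\|^2\right),
\]

whose \(\sigma\) is typically fixed via kernlab's sigest heuristic while \(C\) is tuned.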

Errs <- CaretLearnAndDisplay(Errs, "Support Vector Machine with Gaussian kernel", 
                         "classes ~ .", "svmRadial")

NN

Neural networks are also present in caret.

Errs <- CaretLearnAndDisplay(Errs, "Neural Network", "classes ~ .", "mlp")

A promising alternative, although not well suited to such a small dataset, is the h2o package, an interface to the highly scalable H2O in-memory learning library.

Name <- "H2O NN"

h2o.init()
twoClass.h2o <- as.h2o(twoClass)
Model.h2o <- h2o.deeplearning(x = 1:2, y = 3,
                              training_frame = twoClass.h2o,
                              activation = "Tanh",
                              hidden = c(4,3),
                              loss = 'CrossEntropy')
Grid.h2o <- as.h2o(Grid)
Pred.h2o <- h2o.predict(Model.h2o, newdata = Grid.h2o)
Pred <- factor(as.matrix(Pred.h2o$predict), 
               levels = c('Class1','Class2'))
PlotGrid(Pred, Name)

Pred.h2o <- h2o.predict(Model.h2o, newdata = twoClass.h2o)
Pred <- factor(as.matrix(Pred.h2o$predict), 
               levels = c('Class1','Class2'))
Errs <- rbind(Errs, data.frame(t(postResample(Pred, twoClass[["classes"]])),
                               Resample = "None", model = Name))

Since h2o is not interfaced with caret, we compute the resampled accuracies by hand, reproducing the same repeated V-fold scheme on the same folds.

set.seed(Seed)
Folds <- caret::createMultiFolds(twoClass[["classes"]], k = V, times = T)

for (t in 1:T) {
  for (v in 1:V) {
    twoClassTrain <- slice(twoClass, Folds[[(t-1)*V+v]])
    twoClassTest <- slice(twoClass, -Folds[[(t-1)*V+v]])
    twoClassTrain.h2o <- as.h2o(twoClassTrain)
    Model.h2o <- h2o.deeplearning(x = 1:2, y = 3,
                                  training_frame =  twoClassTrain.h2o,
                                  activation = "Tanh",
                                  hidden = c(4,3),
                              loss = 'CrossEntropy')
    twoClassTest.h2o <- as.h2o(twoClassTest)
    Pred.h2o <- h2o.predict(Model.h2o, newdata = twoClassTest.h2o)
    Pred <- factor(as.matrix(Pred.h2o$predict), 
                   levels = c('Class1','Class2'))
    Errs <- rbind(Errs, data.frame(t(postResample(Pred, twoClassTest[["classes"]])),
                                    Resample = sprintf("Fold%d.Rep%d",v,t), model = Name))
    }
}

A new flexible package, mxnet, allows one to define custom deep neural networks.

Name <- "mxnet NN"

library("mxnet")

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=10)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=2)
softmax <- mx.symbol.SoftmaxOutput(fc2, name="sm")

mx.ctx.default(new = mx.cpu())

set.seed(Seed)
Folds <- caret::createMultiFolds(twoClass[["classes"]], k = V, times = T)

X.train <- data.matrix(twoClass[,1:2])
y.train <- as.vector(as.numeric(twoClass[,3])-1)

X.test <- X.train
y.test <- y.train

capture.output(model <- mx.model.FeedForward.create(softmax, X.train,
                                     y.train , eval.metric=mx.metric.accuracy,
                                     learning.rate=0.07, momentum=0.9,
                                     num.round=200,
                                     ctx=mx.cpu()),
                file = "/dev/null")

y.pred <- predict(model, X.test)
y.label <- max.col(t(y.pred)) - 1
Errs <- rbind(Errs,data.frame(t(postResample(factor(y.label, levels=c(0,1)), factor(y.test, levels=c(0,1)))),
                   Resample = "None", model = Name))

grid.pred <- predict(model, as.matrix(Grid))
grid.label <- max.col(t(grid.pred))-1
Pred <- factor(as.matrix(grid.label), levels = c(0,1),
               labels = c('Class1','Class2'))
PlotGrid(Pred, Name)

We again compute the resampled accuracies by hand, on the same folds.

for (t in 1:T) {
  for (v in 1:V) {
    X.train <- data.matrix(twoClass[Folds[[(t-1)*V+v]],1:2])
    y.train <- as.vector(as.numeric(twoClass[Folds[[(t-1)*V+v]],3])-1)
    
    X.test <- data.matrix(twoClass[-Folds[[(t-1)*V+v]],1:2])
    y.test <- as.vector(as.numeric(twoClass[-Folds[[(t-1)*V+v]],3])-1)
    
    capture.output( model <- mx.model.FeedForward.create(softmax, X.train,
                                         y.train , eval.metric=mx.metric.accuracy,
                                         learning.rate=0.07, momentum=0.9,
                                         num.round=100,
                                         ctx=mx.cpu()),
                    file = "/dev/null")
    y.pred <- predict(model, X.test)
    y.label <- max.col(t(y.pred)) - 1
    Errs <- rbind(Errs,data.frame(t(postResample(factor(y.label, levels=c(0,1)), factor(y.test, levels=c(0,1)))),
                                  Resample = sprintf("Fold%d.Rep%d",v,t), model = Name))
    
  }
}

Trees

We conclude this model exploration with some tree-based methods: plain tree (CART), bagging, random forest, AdaBoost, C5.0…
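
As a reminder of how a single tree is grown: rpart (the CART implementation used here) selects at each node the split that most decreases an impurity measure, by default the Gini index computed from the class proportions \(p_k\) within the node,

\[
G \;=\; \sum_{k} p_k\,(1 - p_k),
\]

while the cp parameter discards splits whose relative improvement of the fit falls below this complexity threshold.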

Errs <- CaretLearnAndDisplay(Errs, "CART", "classes ~ .", "rpart", 
                                 control = rpart::rpart.control(minsplit = 5, cp = 0), tuneGrid = data.frame(cp = .02))

library(rpart.plot)
Tree <- train(classes ~ ., data = twoClass, method = "rpart", control = rpart::rpart.control(minsplit = 5, cp = 0),
              tuneGrid = data.frame(cp = .02), trControl = TrControl)
Tree$finalModel
## n= 208 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 208 97 Class1 (0.5336538 0.4663462)  
##    2) PredictorB>=0.19795 121 30 Class1 (0.7520661 0.2479339)  
##      4) PredictorA>=0.12705 114 25 Class1 (0.7807018 0.2192982)  
##        8) PredictorA< 0.31135 56  6 Class1 (0.8928571 0.1071429) *
##        9) PredictorA>=0.31135 58 19 Class1 (0.6724138 0.3275862)  
##         18) PredictorB>=0.28795 43  9 Class1 (0.7906977 0.2093023)  
##           36) PredictorA< 0.61615 41  7 Class1 (0.8292683 0.1707317) *
##           37) PredictorA>=0.61615 2  0 Class2 (0.0000000 1.0000000) *
##         19) PredictorB< 0.28795 15  5 Class2 (0.3333333 0.6666667) *
##      5) PredictorA< 0.12705 7  2 Class2 (0.2857143 0.7142857)  
##       10) PredictorB>=0.3196 2  0 Class1 (1.0000000 0.0000000) *
##       11) PredictorB< 0.3196 5  0 Class2 (0.0000000 1.0000000) *
##    3) PredictorB< 0.19795 87 20 Class2 (0.2298851 0.7701149) *
rpart.plot(Tree$finalModel)

prp(Tree$finalModel, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, 
    box.col = twoClassColor[Tree$finalModel$frame$yval])

Errs <- CaretLearnAndDisplay(Errs, "Bagging", "classes ~ .", "treebag", 
                         control = rpart.control(minsplit = 5))

Errs <- CaretLearnAndDisplay(Errs, "Random Forest", "classes ~ .", "rf",
                         tuneLength = 1, 
                         control = rpart.control(minsplit = 5))

Errs <- CaretLearnAndDisplay(Errs, "AdaBoost", "classes ~ .", "ada",
                        control = rpart.control(minsplit = 5))

Errs <- CaretLearnAndDisplay(Errs, "C5.0", "classes ~ .", "C5.0")

Model comparison

We will now compare our models according to four criteria:

- Accuracy: the naive empirical accuracy, computed on the same dataset as the one used to learn the classifier,
- AccuracyCV: the mean of the accuracies obtained by the resampling scheme,
- AccuracyCVInf: a lower bound on the mean accuracy, obtained by subtracting from the mean twice the standard deviation divided by the square root of the number of resamples,
- AccuracyCVPAC: a highly probable (PAC-like) lower bound on the accuracy, obtained by subtracting the standard deviation; the corresponding formulas are given below.
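
Using \(\bar{A}\) for the mean of the resampled accuracies, \(s\) for their standard deviation, and \(V \times T\) for the number of resamples, the last two criteria read

\[
\text{AccuracyCVInf} \;=\; \bar{A} - \frac{2\,s}{\sqrt{V\,T}},
\qquad
\text{AccuracyCVPAC} \;=\; \bar{A} - s.
\]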

Global comparison

ErrCaretAccuracy <- function(Errs) {
  Errs <- group_by(Errs, model)
  # Return a plain data.frame so that downstream column extraction yields vectors
  as.data.frame(dplyr::summarize(Errs, mAccuracy = mean(Accuracy, na.rm = TRUE), mKappa = mean(Kappa, na.rm = TRUE),
                                 sdAccuracy = sd(Accuracy, na.rm = TRUE), sdKappa = sd(Kappa, na.rm = TRUE)))
}

ErrAndPlotErrs <- function(Errs, simple = FALSE) {
  ErrCV <- ErrCaretAccuracy(dplyr::filter(Errs, !(Resample == "None")))
  ErrCVtmp <- transmute(ErrCV, AccuracyCV = mAccuracy,
                        AccuracyCVInf = mAccuracy - 2 * sdAccuracy/sqrt(T * V),
                        AccuracyCVPAC = mAccuracy - sdAccuracy,
                        model = model)
  ErrEmp <- dplyr::select(dplyr::filter(Errs, (Resample == "None")), Accuracy, model)
  Err <- dplyr::left_join(ErrCVtmp, ErrEmp)
  if (simple) {
    print(ggplot(data = reshape2::melt(dplyr::select(Err, model, Accuracy, AccuracyCV),
                                       "model"),
                 aes(x = model, y = value, color = variable)) +
            geom_point(size = 5) +
            theme(axis.text.x = element_text(angle = 45, hjust = 1),
                  plot.margin = grid::unit(c(1, 1, .5, 1.5), "lines")))
  } else {
    print(ggplot(data = reshape2::melt(Err, "model"),
                 aes(x = model, y = value, color = variable)) +
            geom_errorbar(data = ErrCV,
                          aes(x = model,
                              ymin = mAccuracy - 2 * sdAccuracy/sqrt(T * V),
                              ymax = mAccuracy + 2 * sdAccuracy/sqrt(T * V),
                              y = mAccuracy), color = "black") +
            geom_point(size = 5) +
            theme(axis.text.x = element_text(angle = 45, hjust = 1),
                  plot.margin = grid::unit(c(1, 1, .5, 1.5), "lines")))
  }
  Err
}

Err <- ErrAndPlotErrs(Errs)

FindBestErr <- function(Err) {
  for (name in names(Err)[!(names(Err) == "model")]) {
    ind <- which.max(Err[, name])
    writeLines(strwrap(paste("Best method according to", name, ":", Err[ind, "model"])))
  }
}

FindBestErr(Err)
## Best method according to AccuracyCV : k-NN with k=61
## Best method according to AccuracyCVInf : k-NN with k=61
## Best method according to AccuracyCVPAC : Support Vector Machine with
## Gaussian kernel
## Best method according to Accuracy : k-NN with k=1

An interesting plot to assess the variability of the cross-validation results is the following parallel plot, in which the accuracy is shown for each resample in addition to its mean.

PlotParallels <- function(Errs) {
  pl <- ggplot(data = dplyr::filter(Errs, Resample != "None"), 
               aes(x = model, y = Accuracy, color = Resample)) + 
    geom_line(aes(x = as.numeric(model))) +
    scale_x_discrete(labels = unique(Errs$model)) +
    scale_color_grey(guide = FALSE)
  Err <- dplyr::summarize(group_by(dplyr::filter(Errs, Resample != "None"), model),
                          Accuracy = mean(Accuracy), Resample = "CV")
  pl <- pl + geom_line(data = Err, aes(x = as.numeric(model)), size = 3) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          plot.margin = grid::unit(c(1, 1, .5, 1.5), "lines"))
  print(pl)
}

PlotParallels(Errs)

K-NN Comparison

We focus here on the choice of the neighborhood size \(k\) in the \(k\)-Nearest Neighbor method.

ErrKNN <- ErrAndPlotErrs(ErrsKNN)

FindBestErr(ErrKNN)
## Best method according to AccuracyCV : k-NN with k=61
## Best method according to AccuracyCVInf : k-NN with k=61
## Best method according to AccuracyCVPAC : k-NN with k=69
## Best method according to Accuracy : k-NN with k=1
ErrKNN <- ErrAndPlotErrs(ErrsKNN, simple = TRUE)

FindBestErr(ErrKNN)
## Best method according to AccuracyCV : k-NN with k=61
## Best method according to AccuracyCVInf : k-NN with k=61
## Best method according to AccuracyCVPAC : k-NN with k=69
## Best method according to Accuracy : k-NN with k=1