Executive Summary

In this project, we analyze data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Our goal is to predict the manner in which they did the exercise based on this data, and at the end we use our prediction model to predict 20 different test cases. Our analysis and modeling work led us to identify 52 predictors that allow us to make the required prediction with a very high degree of accuracy using a C5.0 model, the best performing of the 4 model types we tried. We used K-Fold Cross-Validation during both the feature selection and the model training and evaluation stages.

Data Processing

Let’s start by loading the libraries we’ll need. Then we’ll read in the original training and original testing data sets.

## Load the libraries we'll use throughout the analysis
library(caret); library(plyr); library(dplyr); library(C50); library(mlbench)
## Read in the original training and original testing data sets
trainingOrigin<-read.csv("pml-training.csv")
testingOrigin<-read.csv("pml-testing.csv")
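As an optional sanity check, we can confirm that both data sets loaded and that they share the same number of columns:

## Optional sanity check: both data sets should have the same number of columns
dim(trainingOrigin)
dim(testingOrigin)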

Next, let’s split the original training data set into training and testing data sets.

set.seed(13579)
inTrain<-createDataPartition(y=trainingOrigin$classe,p=0.70,list=FALSE)
training<-trainingOrigin[inTrain,]
testing<-trainingOrigin[-inTrain,]
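As a quick check, roughly 70% of the rows should have ended up in training and the rest in testing:

## The partition should put about 70% of the rows in training and 30% in testing
round(c(training=nrow(training), testing=nrow(testing))/nrow(trainingOrigin), 2)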

Exploratory Data Analysis & Feature Selection

Let’s have a look at the dimensions of our training data set.

dim(training)
## [1] 13737   160

Let’s have a look at the first few variables in the training data set.

head(names(training),n=10)
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"

Let’s have a look at the last few variables in the training data set.

tail(names(training),n=10)
##  [1] "gyros_forearm_x"  "gyros_forearm_y"  "gyros_forearm_z" 
##  [4] "accel_forearm_x"  "accel_forearm_y"  "accel_forearm_z" 
##  [7] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
## [10] "classe"

Notice the variable X, which could be just an index variable. Let’s see whether it actually is an index and whether it has any relationship with classe. In other words, is the data in the training set ordered by classe?

plot(training$X,col=training$classe)
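As a complement to the plot, here is a quick numeric check: if X is simply the row number carried over from the original data set, it should match the row names that read.csv assigned, and the expression below should return TRUE.

## X should equal the original row number preserved in the row names
all(training$X == as.integer(rownames(training)))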

It appears that X was indeed equal to the row index in the original data set. Because of the split, it no longer matches the row index of this training set exactly, but it remains linear in that index. We also see that the classe values are grouped together, changing in blocks as the row number (and X) increases. This X variable could therefore confuse any machine learning algorithm we use, tricking it into believing it can predict the classe of a testing set from the X value alone (i.e., the row index in the original training data set). In addition, the 6 variables that follow X are specific to this experiment and data set and would therefore not be good predictors in general. Let’s have a look at these variable names again.

names(training)[2:7]
## [1] "user_name"            "raw_timestamp_part_1" "raw_timestamp_part_2"
## [4] "cvtd_timestamp"       "new_window"           "num_window"

Let’s make sure X and those next 6 variables are not included as predictors. Going forward, we’ll also apply the same removals to copies of trainingOrigin, so that at the end we can train the final model on the full original training data using the predictors we end up selecting.

## Drop the first 7 columns: X plus the 6 experiment-specific variables
remove<-1:7
training2<-select(training,-remove)
trainingOrigin2<-select(trainingOrigin,-remove)

A closer look at the training set yields an interesting observation: 67 variables have at least 97.5% of their values as NA. With such a large proportion missing, imputation would not make sense here, so we won’t use these variables as predictors.

## Identify which columns in training2 have at least 97.5% of their values as NA so we can remove them
indexer<-integer()
for (i in 1:ncol(training2)){
        if((sum(is.na(training2[,i]))/nrow(training2))>=0.975){
                indexer<-append(indexer,i)
        }
}
length(indexer)
## [1] 67
training3<-select(training2,-indexer)
trainingOrigin3<-select(trainingOrigin2,-indexer)
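For reference, the same columns can be identified with a more concise, vectorized expression equivalent to the loop above:

## Proportion of NAs per column; count the columns at or above the 97.5% threshold
naFraction <- colMeans(is.na(training2))
length(which(naFraction >= 0.975))   ## should also be 67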

Next, let’s look at the remaining variables in our training3 data set to identify those with near zero variability, since these won’t help our machine learning algorithms make predictions, and remove them as well. We know from the plot above that the classe variable doesn’t have near zero variability, so it can safely pass through this step and will remain in the data set. Below we find 33 near zero variability variables to remove.

## Identify Variables with Near Zero Variability and Remove them from the Training Set
nsv<-nearZeroVar(training3,saveMetrics = TRUE)
indexer2<-integer()
for (i in 1:nrow(nsv)){
        if (nsv$nzv[i]==TRUE){
                indexer2<-append(indexer2,i)
        }
}
length(indexer2)
## [1] 33
training4<-select(training3,-indexer2)
trainingOrigin4<-select(trainingOrigin3,-indexer2)
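As an aside, nearZeroVar() can also return the offending column positions directly, which is a shorter equivalent of the loop above:

## Shortcut: without saveMetrics, nearZeroVar() returns the column indices directly
nzvIdx <- nearZeroVar(training3)
length(nzvIdx)   ## should also be 33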

The variables remaining now are those that should be most useful for making predictions. Let’s see how many there are.

## We subtract 1 as classe (the outcome) is counted as one of the variables and we want to know the number of predictors.
ncol(training4)-1
## [1] 52

That’s still a lot of variables, and training our algorithms on all of them carries a risk of overfitting. Perhaps we can identify how many features we actually need to make predictions, and which ones they are, using Recursive Feature Elimination (RFE). We’ll use Random Forests in each iteration for evaluation, along with K-Fold Cross-Validation with K set to 10.

## forSizes sets the various numbers of variables the rfe function will try out and evaluate
forSizes <- c(1:10, 15, 20, 25, 30, 35, 40, 50, 52)
control <- rfeControl(functions=rfFuncs, method="cv", number=10)

set.seed(54321)
theresults <- rfe(training4[,1:52], training4[,53], sizes=forSizes, rfeControl=control)

Now let’s check at what number of predictors we get the highest accuracy, and what those predictors are.

plot(theresults, type=c("g", "o"))

theresults
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD  KappaSD Selected
##          1   0.5102 0.3658   0.025028 0.036349         
##          2   0.7299 0.6577   0.007458 0.009082         
##          3   0.8893 0.8599   0.008113 0.010271         
##          4   0.9415 0.9260   0.008300 0.010504         
##          5   0.9643 0.9549   0.005891 0.007449         
##          6   0.9774 0.9714   0.003632 0.004597         
##          7   0.9785 0.9728   0.006142 0.007769         
##          8   0.9827 0.9782   0.004402 0.005565         
##          9   0.9854 0.9815   0.003835 0.004849         
##         10   0.9850 0.9810   0.003910 0.004946         
##         15   0.9878 0.9846   0.004118 0.005209         
##         20   0.9911 0.9888   0.002945 0.003726         
##         25   0.9921 0.9900   0.001891 0.002392         
##         30   0.9933 0.9915   0.001902 0.002407         
##         35   0.9933 0.9915   0.002135 0.002702         
##         40   0.9936 0.9919   0.002191 0.002772         
##         50   0.9944 0.9929   0.001783 0.002256         
##         52   0.9947 0.9933   0.001284 0.001625        *
## 
## The top 5 variables (out of 52):
##    roll_belt, yaw_belt, magnet_dumbbell_z, pitch_belt, magnet_dumbbell_y

We can see that the highest accuracy comes from using all 52 remaining predictors, so we don’t actually gain accuracy by using fewer predictors here.
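For reference, the full set of predictors selected by RFE is also available from the rfe object, for example:

## Names of the predictors selected by RFE (first 10 shown)
head(predictors(theresults), 10)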

Model Training & Selection

Let’s think about how we’re going to train and evaluate various models on our training4 set. I have decided to use K-Fold Cross-Validation with K set to 10. The data set will therefore be split into 10 parts, and each algorithm will be run 10 times, each time using 9 of the 10 subsamples as a training set and the remaining subsample as a testing set. Each subsample serves as the testing set once and as part of the training set 9 times. The mean error rate an algorithm achieves across these 10 runs is a good estimate of its mean out of sample error rate.
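To make this concrete, here is a small illustration, separate from the modelling pipeline below, of what a 10-fold split of training4 looks like using caret’s createFolds:

## Illustration only: 10 folds, each holding out roughly a tenth of the rows
set.seed(12345)
exampleFolds <- createFolds(training4$classe, k=10)
sapply(exampleFolds, length)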

I will try 4 different classification model types to predict classe from the 52 predictor variables in the training4 data set.

I’ll evaluate the following algorithms:

  1. Recursive Partitioning (rpart)
  2. Random Forests (rf)
  3. Gradient Boosting Machine (gbm)
  4. C5.0 Algorithm

Notice that I set the seed every time before training the algorithms below to ensure that we’re comparing results on exactly the same subsamples.

control <- trainControl(method="cv", number=10)

set.seed(12345)
ctmodFit<-train(classe~.,data=training4,method="rpart",trControl=control)

set.seed(12345)
rfmodFit<-train(classe~.,data=training4,method="rf", trControl=control)

set.seed(12345)
bomodFit<-train(classe~.,data=training4,method="gbm", trControl=control,verbose=FALSE)

set.seed(12345)
c5modFit<-train(classe~.,data=training4,method="C5.0", trControl=control,verbose=FALSE)

Now let’s evaluate the results from each of the models.

results<-resamples(list(RPART=ctmodFit, RF=rfmodFit, GBM=bomodFit, C5=c5modFit))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RPART, RF, GBM, C5 
## Number of resamples: 10 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPART 0.4799709 0.5023671 0.5101882 0.5108059 0.5205350 0.5411508    0
## RF    0.9876184 0.9914406 0.9919883 0.9919186 0.9934486 0.9956395    0
## GBM   0.9548763 0.9566404 0.9581670 0.9596701 0.9623088 0.9672965    0
## C5    0.9898108 0.9923656 0.9948998 0.9938127 0.9954507 0.9963610    0
## 
## Kappa 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPART 0.3194661 0.3493124 0.3598700 0.3611182 0.3742436 0.4024562    0
## RF    0.9843320 0.9891722 0.9898655 0.9897762 0.9917125 0.9944828    0
## GBM   0.9429028 0.9451336 0.9470893 0.9489761 0.9523130 0.9586115    0
## C5    0.9871096 0.9903444 0.9935482 0.9921736 0.9942448 0.9953966    0

We can see from the summary that the mean accuracies of the C5.0 and Random Forests algorithms are over 0.99, and the mean accuracy of the GBM algorithm is over 0.96; all three are much higher than that of the RPART algorithm. Our estimate of the average out of sample error rate for each algorithm is 1 minus that algorithm’s mean accuracy.
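As a rough sketch, these out of sample error estimates can be computed directly as 1 minus the best mean cross-validated accuracy stored in each fitted model object; the values should line up closely with the summary above.

## Estimated out of sample error per model: 1 minus the best mean CV accuracy
meanAcc <- c(RPART=max(ctmodFit$results$Accuracy),
             RF=max(rfmodFit$results$Accuracy),
             GBM=max(bomodFit$results$Accuracy),
             C5=max(c5modFit$results$Accuracy))
round(1 - meanAcc, 4)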

Let’s see a boxplot of the top 3 algorithm results.

topresults<-resamples(list(RF=rfmodFit, GBM=bomodFit, C5=c5modFit))
bwplot(topresults)

As our top 2 algorithms have mean accuracies over 0.99, we don’t really need ensembling methods; we can simply use the top-performing algorithm.

The top algorithm in accuracy is C5.0. Based on its K-Fold Cross-Validation mean accuracy (0.9938), we can estimate its average out of sample error rate at approximately 0.0062, i.e. 0.62%.

Next, let’s test it on the testing data set split we created and check its accuracy, which should be close to the mean accuracy we saw earlier with cross-validation.

ourC5TestPredict<-predict(c5modFit,testing)
confusionMatrix(ourC5TestPredict,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670   13    0    0    0
##          B    4 1119    2    2    2
##          C    0    6 1021    1    0
##          D    0    0    3  957    4
##          E    0    1    0    4 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9929          
##                  95% CI : (0.9904, 0.9949)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.991           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9824   0.9951   0.9927   0.9945
## Specificity            0.9969   0.9979   0.9986   0.9986   0.9990
## Pos Pred Value         0.9923   0.9911   0.9932   0.9927   0.9954
## Neg Pred Value         0.9990   0.9958   0.9990   0.9986   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1901   0.1735   0.1626   0.1828
## Detection Prevalence   0.2860   0.1918   0.1747   0.1638   0.1837
## Balanced Accuracy      0.9973   0.9902   0.9968   0.9957   0.9967

Indeed, the accuracy of our model on our testing data set comes out almost exactly as expected, at 0.9929. This makes us even more confident in our expected out of sample error rate.
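For completeness, the out of sample error on our own testing split can be read straight off the confusion matrix object:

## Out of sample error on our testing split: 1 minus the observed accuracy (about 0.0071)
cm <- confusionMatrix(ourC5TestPredict, testing$classe)
round(1 - as.numeric(cm$overall["Accuracy"]), 4)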

Next, let’s train our model again, this time on the entire original training set (without our training/testing split), in preparation for predicting on the original testing set given in the project assignment.

We’ll use the best tuning parameters we got when training the c5modFit model. We can see those parameters here:

c5modFit$bestTune
##   trials model winnow
## 6     20  tree   TRUE
## Refit on the full original training data using the best tuning parameters found above
c5Grid<-expand.grid(trials=20, model="tree", winnow=TRUE)
c5modFitFull<-train(classe ~ ., data=trainingOrigin4, method="C5.0", verbose=FALSE, tuneGrid=c5Grid)

Predicting For The Project Test

Now let’s use our c5modFitFull C5.0 model to predict the classe variable in the original testing data set.

c5predictOriginTest<-predict(c5modFitFull,testingOrigin)
c5predictOriginTest
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Submitting those classe predictions into the Course Project Prediction Quiz yielded a score of 20/20.

Data Source

The data used in this project is the Weight Lifting Exercises Dataset made available by Groupware@LES.
The research paper this data is based on is: Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., and Fuks, H., “Qualitative Activity Recognition of Weight Lifting Exercises,” Proceedings of the 4th Augmented Human International Conference (AH ’13), ACM, 2013.