Machine Learning Classification: An Example with Barbell Lifts

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). The goal of the project is to predict the manner in which they did the exercise (the classe variable), using any of the other variables as predictors. This was the course project for the Practical Machine Learning course, part of the Data Science Specialization by Johns Hopkins University on Coursera.

1. Exploratory Data Analysis and selection of variables

First, we read the data. We hide the output of these exploratory commands due to space constraints in the report:

# Initial load: inspect dimensions, summaries, and structure of both datasets
train.data <- read.csv("pml-training.csv")
test.data <- read.csv("pml-testing.csv")
dim(train.data)
dim(test.data)
summary(train.data)
summary(test.data)
str(train.data)
str(test.data)

The summaries show several columns filled with NA values, blank cells, and "#DIV/0!" strings, so we reload the data telling read.csv to treat all three as NA:

# Reload, mapping "NA", "#DIV/0!", and empty strings to NA
train.data <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test.data <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
dim(train.data)
## [1] 19622   160
dim(test.data)
## [1]  20 160

As noted above, many variables (columns) consist mostly of NA values. We exclude every variable whose proportion of NA values exceeds 95%.

# Training data: drop columns that are more than 95% NA
train.data.na <- is.na(train.data)
train.na.columns <- which(colSums(train.data.na) > nrow(train.data) * 0.95)
train.data.clean <- train.data[, -train.na.columns]
dim(train.data.clean)
## [1] 19622    60
# Test data: apply the same rule
test.data.na <- is.na(test.data)
test.na.columns <- which(colSums(test.data.na) > nrow(test.data) * 0.95)
test.data.clean <- test.data[, -test.na.columns]
dim(test.data.clean)
## [1] 20 60

Now we select the sensor-related columns, plus the classe variable in the training dataset (classe is absent from the testing dataset):

# Training data: keep the sensor columns plus classe (column 60)
sensor.train.columns <- grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(train.data.clean))
length(sensor.train.columns)
## [1] 52
final.train.data <- train.data.clean[, c(sensor.train.columns, 60)]
dim(final.train.data)
## [1] 19622    53
table(complete.cases(final.train.data))
## 
##  TRUE 
## 19622
# Test data: keep the same sensor columns (no classe here)
sensor.test.columns <- grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(test.data.clean))
length(sensor.test.columns)
## [1] 52
final.test.data <- test.data.clean[, sensor.test.columns]
dim(final.test.data)
## [1] 20 52
table(complete.cases(final.test.data))
## 
## TRUE 
##   20

So we have 52 predictor variables with no NA values, and we can move on to the machine learning.
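As an optional sanity check (not part of the original analysis), caret's nearZeroVar function can confirm that none of the remaining predictors is near-constant; we would expect it to flag zero columns:

# Hypothetical extra check: flag predictors with near-zero variance
suppressMessages(library(caret))
nzv <- nearZeroVar(final.train.data, saveMetrics = TRUE)
sum(nzv$nzv)  # expected to be 0: every sensor predictor carries variation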

2. Machine Learning

2.1. Training data for the model

Although the datasets originally provided are called training and testing, the testing set does not contain the classe variable, so we cannot validate the model against it. Following the usual procedure from the course, we treat the provided training dataset as the complete dataset and split it into our own training and testing sets:

suppressMessages(library(caret))
set.seed(888)

# Split the cleaned data: 75% for model building, 25% held out for validation
data <- final.train.data
inTrain <- createDataPartition(y=data$classe, p=0.75, list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]

2.2. Model selection

We compare Random Forest (rf) and Gradient Boosting (gbm) to find the best model. Ten-fold cross-validation is employed during model building to reduce overfitting:

# 10-fold cross-validation during training
modelControl <- trainControl(method="cv", number = 10)

First, we use random forest. For model selection we use the Kappa metric rather than raw accuracy: since we are predicting a factor (categorical) variable, Kappa is generally considered more robust than the simple percent-agreement calculation (accuracy), because it corrects for agreement expected by chance.
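To make that concrete, here is a minimal sketch (not part of the original analysis; the helper name kappa.from.table is made up for illustration) of how Cohen's Kappa is derived from a confusion matrix of counts:

# Sketch: Cohen's Kappa from a confusion matrix of counts
kappa.from.table <- function(tab) {
  n <- sum(tab)
  p.observed <- sum(diag(tab)) / n                      # raw accuracy
  p.expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p.observed - p.expected) / (1 - p.expected)
}

With the metric settled, we fit the random forest: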

suppressMessages(library(randomForest))

set.seed(999)
model.rf <- train(classe~., data=training, method="rf", metric="Kappa", trControl=modelControl)
model.rf
## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13245, 13246, 13246, 13246, 13246, 13247, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9921863  0.9901164  0.001929336  0.002439643
##   27    0.9930021  0.9911481  0.001783628  0.002255139
##   52    0.9850534  0.9810924  0.004387838  0.005547967
## 
## Kappa was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

The Kappa and accuracy values differ only slightly across the tuning grid, and the estimated out-of-sample error is minimal. Cross-validation selected mtry = 27, i.e. 27 randomly sampled candidate variables at each split, to build the optimal model.
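As a side note (not required for the predictions below), caret's varImp function shows which sensor variables drive the fitted random forest; output is omitted here for space:

# Inspect the most influential predictors of the fitted random forest
varImp(model.rf)                  # scaled importance of the 52 predictors
plot(varImp(model.rf), top = 10)  # dotplot of the top 10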

Now we use Gradient Boosting (gbm):

suppressMessages(library(gbm))
set.seed(999)
model.gbm <- train(classe~., data=training, method="gbm", metric="Kappa", trControl=modelControl,verbose=FALSE)
## Loading required package: plyr
model.gbm
## Stochastic Gradient Boosting 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13245, 13246, 13246, 13246, 13246, 13247, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##   1                   50      0.7520016  0.6854161  0.010324455  0.013067440
##   1                  100      0.8203543  0.7725959  0.012781208  0.016132642
##   1                  150      0.8525592  0.8134708  0.012360661  0.015629499
##   2                   50      0.8546646  0.8158417  0.014288428  0.018120972
##   2                  100      0.9063709  0.8815200  0.012625104  0.015930541
##   2                  150      0.9310360  0.9127263  0.006932967  0.008761702
##   3                   50      0.8965889  0.8690751  0.012012422  0.015177789
##   3                  100      0.9426560  0.9274482  0.007787846  0.009839790
##   3                  150      0.9620202  0.9519559  0.005980831  0.007562657
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Kappa was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

The parameters selected by cross-validation appear at the end of the output above. Here, in contrast with random forest, there are substantial differences in the Kappa and accuracy values across the tuning grid.
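Because performance varies so much across the gbm grid, the tuning profile is worth visualizing; caret's plot method for train objects draws Kappa against the number of boosting iterations, one curve per interaction depth (plot omitted here):

# Resampling profile: Kappa vs. boosting iterations, by tree depth
plot(model.gbm, metric = "Kappa")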

Finally, we compare the two models with the resamples function to select the better one:

comparison <- resamples(list(rf=model.rf, gbm=model.gbm))
summary(comparison)
## 
## Call:
## summary.resamples(object = comparison)
## 
## Models: rf, gbm 
## Number of resamples: 10 
## 
## Accuracy 
##       Min. 1st Qu. Median  Mean 3rd Qu.   Max. NA's
## rf  0.9905  0.9913 0.9932 0.993  0.9944 0.9959    0
## gbm 0.9531  0.9579 0.9616 0.962  0.9671 0.9708    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## rf  0.9880  0.9890 0.9914 0.9911  0.9929 0.9948    0
## gbm 0.9407  0.9467 0.9514 0.9520  0.9583 0.9630    0
library(lattice)
bwplot(comparison, metric="Kappa", main= "RF vs GBM Kappa parameters")

[Figure: "RF vs GBM Kappa parameters", box-and-whisker plot of Kappa across the 10 cross-validation resamples for each model]

As the boxplot and the summary show, random forest achieves a higher Kappa than gradient boosting, so we choose the random forest model.
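caret can also check that this gap is larger than the resampling noise: diff on the resamples object performs paired comparisons of the two models across the 10 folds (output omitted):

# Paired comparison of rf and gbm over the same 10 cross-validation folds
model.diffs <- diff(comparison)
summary(model.diffs)  # estimates the Accuracy/Kappa differences with p-values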

2.3. Validation of the random forest model and prediction

We apply the selected random forest model to the held-out testing data and validate it with the confusionMatrix function:

suppressMessages(library(randomForest))
prediction.rf <- predict(model.rf, newdata=testing)
confusionMatrix(prediction.rf,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    6    0    0    0
##          B    0  943    4    0    1
##          C    0    0  849    5    4
##          D    0    0    2  799    1
##          E    0    0    0    0  895
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9953        
##                  95% CI : (0.993, 0.997)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9941        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9937   0.9930   0.9938   0.9933
## Specificity            0.9983   0.9987   0.9978   0.9993   1.0000
## Pos Pred Value         0.9957   0.9947   0.9895   0.9963   1.0000
## Neg Pred Value         1.0000   0.9985   0.9985   0.9988   0.9985
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1923   0.1731   0.1629   0.1825
## Detection Prevalence   0.2857   0.1933   0.1750   0.1635   0.1825
## Balanced Accuracy      0.9991   0.9962   0.9954   0.9965   0.9967

Accuracy on the held-out testing data is 0.9953, with a Kappa of 0.9941, so the model validates well. The estimated out-of-sample error (1 - accuracy) is 0.0047.
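The same error estimate can be cross-checked directly from the predictions:

# Out-of-sample error as the misclassification rate on the held-out set
mean(prediction.rf != testing$classe)  # approximately 0.0047, i.e. 1 - 0.9953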

3. Results

Now we can use this model to predict classe for the original 20-case testing dataset prepared during the exploratory data analysis:

prediction.testing <- predict(model.rf, newdata=final.test.data)
as.data.frame(prediction.testing)
##    prediction.testing
## 1                   B
## 2                   A
## 3                   B
## 4                   A
## 5                   A
## 6                   E
## 7                   D
## 8                   B
## 9                   A
## 10                  A
## 11                  B
## 12                  C
## 13                  B
## 14                  A
## 15                  E
## 16                  E
## 17                  A
## 18                  B
## 19                  B
## 20                  B
Written on March 25, 2016