Samsung Phone Data Analysis Project
Below are my findings from the second data analysis project in Dr. Jeffrey Leek’s Johns Hopkins Coursera class.
Introduction
I used the “Human Activity Recognition Using Smartphones Dataset” (UCI, 2013) to build a model. This data was recorded from a Samsung prototype smartphone with a built-in accelerometer. The purpose of my model was to recognize the type of activity (walking, walking upstairs, walking downstairs, sitting, standing, laying) the wearer of the device was performing, based on their 3-axial linear acceleration and 3-axial angular velocity sampled at a constant rate of 50 Hz, as recorded by the accelerometer and the additional gyroscope of the device worn on the waist.
Methods
The “Human Activity Recognition Using Smartphones Dataset” consists of recordings from 30 participants who wore a Samsung device on their waist over a specified period. The assignment specified that I use only a subset consisting of data from 21 participants. The data was already normalized and bounded within [-1, 1], which meant extreme values were not to be expected. The dataset included 7352 observations with 563 variables. These variables can be summarized as:
 Accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix ‘t’ to denote time) were captured at a constant rate of 50 Hz.
 The acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ).
 Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also, the magnitude of these three-dimensional signals was calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
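To make the last item concrete, here is a minimal Python sketch (not the code used to build the dataset) of how a jerk signal and a magnitude signal could be derived from 3-axial samples. The 50 Hz rate and the Euclidean norm come from the description above; the simple finite difference used for the time derivative and the function names are my own assumptions.

```python
import math

SAMPLE_RATE_HZ = 50  # sampling rate stated in the dataset description


def magnitude(samples):
    """Euclidean norm of each (x, y, z) sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def jerk(samples, rate_hz=SAMPLE_RATE_HZ):
    """Approximate the time derivative with a first-order finite difference.

    Assumption: the dataset does not spell out its differentiation method
    here; (current - previous) * rate is just the simplest estimate.
    """
    return [
        tuple((cur - prev) * rate_hz for prev, cur in zip(before, after))
        for before, after in zip(samples, samples[1:])
    ]


# Hypothetical body-acceleration samples, one (x, y, z) tuple per 1/50 s.
body_acc = [(0.0, 0.0, 0.0), (0.03, 0.04, 0.0), (0.06, 0.08, 0.0)]
print(magnitude(body_acc))
print(magnitude(jerk(body_acc)))
```

A window of n samples yields n - 1 jerk values, which is why the jerk list is one element shorter than its input.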
The following features were estimated from the aforementioned signals.
 max(): Largest value in array
 min(): Smallest value in array
 sma(): Signal magnitude area
 energy(): Energy measure. Sum of the squares divided by the number of values.
 iqr(): Interquartile range
 entropy(): Signal entropy
 arCoeff(): Autoregression coefficients with Burg order equal to 4
 correlation(): correlation coefficient between two signals
 maxInds(): index of the frequency component with largest magnitude
 meanFreq(): Weighted average of the frequency components to obtain a mean frequency
 skewness(): skewness of the frequency domain signal
 kurtosis(): kurtosis of the frequency domain signal
 bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
 angle(): Angle between two vectors.
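A few of the features in this list are simple enough to sketch in Python. The energy definition follows the list above (sum of squares divided by the number of values). The signal magnitude area formula (mean of the summed absolute axis values) and the quartile method are assumptions on my part, since the list does not define them.

```python
import statistics


def energy(values):
    """Sum of the squares divided by the number of values."""
    return sum(v * v for v in values) / len(values)


def sma(samples):
    """Signal magnitude area of (x, y, z) samples.

    Assumption: mean over the window of |x| + |y| + |z|, a common definition.
    """
    return sum(abs(x) + abs(y) + abs(z) for x, y, z in samples) / len(samples)


def iqr(values):
    """Interquartile range: third quartile minus first quartile.

    Assumption: the 'inclusive' quartile method; other methods give
    slightly different numbers on small windows.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


window = [0.1, -0.2, 0.4, -0.1, 0.3, 0.0, -0.3, 0.2]  # hypothetical signal window
print(energy(window), iqr(window))
```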
Exploratory Analysis
I started the exploration process by examining structures and summaries as well as distribution plots of the variables. I searched for missing data (the data was supposedly already normalized) and for naming convention or factor level issues, and eventually determined the variables to be used in a classification or regression model. Special characters and spaces were converted or removed. In order to utilize Random Forests, the “activity” and “subject” columns needed to be converted to factors. Additionally, character vectors needed to be made syntactically valid (i.e. “a and b” and “a‐and‐b” become “a.and.b” and “a.and.b.1”) (gsub, make.names).
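For readers who do not use R, the renaming step can be approximated in a few lines of Python. This is only a rough sketch of what make.names with unique names does, not a faithful port: it ignores reserved words, and the helper name make_names is my own.

```python
import re


def make_names(names):
    """Rough approximation of R's make.names(names, unique = TRUE)."""
    seen = {}
    result = []
    for name in names:
        # Replace every character that is not a letter, digit, dot or
        # underscore with a dot.
        clean = re.sub(r"[^A-Za-z0-9._]", ".", name)
        # Prefix "X" when the name does not start with a letter, or starts
        # with a dot followed by a digit.
        if not re.match(r"[A-Za-z.]", clean) or re.match(r"\.[0-9]", clean):
            clean = "X" + clean
        # Disambiguate duplicates with a numeric suffix.
        count = seen.get(clean, 0)
        seen[clean] = count + 1
        result.append(clean if count == 0 else f"{clean}.{count}")
    return result


print(make_names(["a and b", "a-and-b"]))
```

A column name of the form tGravityAcc-min()-Y comes out with three consecutive dots before the axis letter, which is the shape of the variable names that appear in the importance table later on.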
The dataset needed to be divided into a training set that, per assignment instructions, included at minimum participants 1, 3, 5, and 6. The test data needed to include at minimum participants 27, 28, 29, and 30. Some participants might be responsible for more of the observations in the dataset than others, thus skewing the data. As such, I took equal-sized random samples of observations, without replacement, for each participant.
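The split and the balanced sampling can be sketched as follows in Python (again, not the R code used for the analysis). The required participant IDs come from the assignment as described above. Everything else is an assumption: the row layout, the function names, and in particular the choice to sample as many rows per participant as the smallest participant has.

```python
import random
from collections import defaultdict

TRAIN_REQUIRED = {1, 3, 5, 6}      # minimum training participants (from the text)
TEST_REQUIRED = {27, 28, 29, 30}   # minimum test participants (from the text)


def split_by_subject(rows, train_subjects, test_subjects):
    """Assign whole participants to either the training or the test set."""
    train = [r for r in rows if r["subject"] in train_subjects]
    test = [r for r in rows if r["subject"] in test_subjects]
    return train, test


def equal_sample_per_subject(rows, seed=42):
    """Draw the same number of rows, without replacement, from each participant.

    Assumption: the per-participant sample size is the size of the
    smallest participant's group.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append(row)
    size = min(len(group) for group in by_subject.values())
    sample = []
    for subject in sorted(by_subject):
        sample.extend(rng.sample(by_subject[subject], size))
    return sample
```

Splitting by participant rather than by row keeps every person's data on one side of the split, so the test error reflects performance on people the model has never seen.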
I wanted to check the distributions of the training and test sets to see if any differences were egregious enough to warrant transformation. The power of most modeling methods, such as decision trees and kNN, depends on homogeneity of the data. I used Principal Components Analysis to assess the distributions. The image below shows each data set broken into 2 principal components that account for 90% of the variance in the data. There is definite homogeneity between the sets. (I’m still exploring ways to plot this using ggplot2 instead of base graphics.)
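To show what "principal components that account for a share of the variance" means numerically, here is a small dependency-free Python sketch. It is not how the plot was produced: it estimates the leading eigenvalues of the covariance matrix by power iteration with deflation, which is a technique I swapped in to keep the example self-contained, and all function names are my own.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Largest eigenvalue and a matching eigenvector, by power iteration."""
    rng = random.Random(seed)
    d = len(matrix)
    vector = [rng.random() + 0.1 for _ in range(d)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d))
                   for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm < 1e-12:  # no variance left in the matrix
            return 0.0, vector
        vector = [x / norm for x in product]
        eigenvalue = norm
    return eigenvalue, vector


def explained_variance_ratio(rows, components=2):
    """Share of total variance captured by each of the leading components."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))  # assumes the data is not constant
    ratios = []
    for _ in range(components):
        value, vec = top_eigenpair(cov)
        ratios.append(value / total)
        # Deflation: remove the component just found before finding the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return ratios
```

Running explained_variance_ratio on the training rows and on the test rows separately, and comparing the two results, is one rough numeric counterpart to comparing the two plots by eye.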
Statistical Modeling
I decided to employ random forests modeling (an excellent layman’s explanation of the algorithm can be found here). If you are already familiar with classification trees then Dr. Breiman’s explanation should make sense:
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
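The voting step in that description is easy to sketch in Python. The "trees" below are hypothetical stand-ins (plain functions with made-up thresholds and feature names), not trained classifiers. The point is only to show how the forest's answer is the most common vote.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical hand-written "trees"; each maps a feature dict to a class label.
toy_trees = [
    lambda x: "laying" if x["gravity_y"] > 0.5 else "standing",
    lambda x: "laying" if x["angle_y"] < -0.3 else "sitting",
    lambda x: "sitting" if x["gravity_x"] < 0.8 else "standing",
]

print(forest_predict(toy_trees, {"gravity_y": 0.9, "angle_y": -0.6, "gravity_x": 0.2}))
```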
I chose random forests after I considered the nature of the activities I was trying to predict. I intuitively thought that some of the activities would be hard to distinguish, such as walking up versus walking down or standing versus sitting. I thought random forests would be robust enough: single decision trees are likely to suffer from high variance or high bias, but random forests can use averages to strike a natural balance between the two extremes. I also appreciate the principles of Occam’s razor and wanted to employ an algorithm with the fewest assumptions regarding distribution.
This model was successful, as the OOB estimate of the error rate was 1.4%. The Random Forests importance() output column MeanDecreaseAccuracy describes how much each variable contributes to the model’s ability to predict activity. Table 1 below displays the variables with the top ten MeanDecreaseAccuracy scores.
Table 1
Variable                MeanDecreaseAccuracy  MeanDecreaseGini
angle.Y.gravityMean.    17.2437045            57.2346914
tGravityAcc.min…Y       16.8232063            54.0606836
tGravityAcc.mean…Y      16.7407855            57.1805001
tGravityAcc.max…Y       16.592065             48.0024562
tGravityAcc.energy…Y    14.1988198            33.2963559
tGravityAcc.min…X       13.0430767            63.426501
tGravityAcc.energy…X    12.8807273            58.1172211
angle.X.gravityMean.    12.7779327            54.9267196
tGravityAcc.mean…X      12.6165942            48.6299317
This dotchart displays variable importance as measured by a Random Forest (varImpPlot) similar to the table above.
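The idea behind an accuracy-based importance score can be sketched in Python: shuffle one variable, and see how much accuracy drops. This is a simplified illustration only. The randomForest package computes its measure per tree on out-of-bag samples and scales it, which this sketch does not attempt, and the function names here are my own.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling the values of one column.

    rows is a list of lists; predict is any function from a row to a label.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)
```

A variable the model never uses scores exactly zero under this scheme, because shuffling it cannot change any prediction.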
Examining a confusion matrix (Table 2) of predictions on the test set, we see that the highest prediction success rate was for laying at 100% (intuitive considering the importance of the angle variables above) and the lowest was for walkdown at 81.77%.
Table 2
observed  laying   sitting  standing  walk    walkdown  walkup
laying    100.00%  0.00%    0.00%     0.00%   0.00%     0.00%
sitting   0.00%    86.53%   13.47%    0.00%   0.00%     0.00%
standing  0.00%    10.02%   89.98%    0.00%   0.00%     0.00%
walk      0.00%    0.00%    0.24%     97.62%  1.43%     0.71%
walkdown  0.00%    0.00%    0.00%     1.93%   81.77%    16.30%
walkup    0.00%    0.00%    7.24%     0.26%   6.72%     85.79%
There were 245 misclassification errors on a test set size of 2658 for an Error Rate of 9.21 %.
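The arithmetic behind these figures is simple enough to sketch in Python. The error rate uses the two counts stated above (245 of 2658, roughly 9.2%). The confusion counts in the example are hypothetical, since only the row percentages are reported here, not the raw counts.

```python
def error_rate(misclassified, total):
    """Misclassified observations as a percentage of the test set."""
    return 100.0 * misclassified / total


def row_percentages(counts):
    """Convert confusion counts to percentages of each observed class.

    counts maps observed class -> {predicted class: number of observations}.
    """
    return {
        observed: {pred: 100.0 * n / sum(row.values()) for pred, n in row.items()}
        for observed, row in counts.items()
    }


print(round(error_rate(245, 2658), 1))

# Hypothetical counts, for illustration only.
toy_counts = {
    "sitting": {"sitting": 9, "standing": 1},
    "standing": {"sitting": 2, "standing": 8},
}
print(row_percentages(toy_counts))
```

Each row of the resulting table sums to 100%, which is how Table 2 is laid out: the diagonal holds the per-activity success rate.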
Conclusions
The confusion matrix (Table 2) shows that misclassifications, or false positives, were mostly in transitional stages such as sitting to standing and vice versa. This makes me curious to know what the orientation of these devices was on the waist. Differentiating walkdown from walkup proved to be difficult as well. This is interesting, as I would expect the speed and acceleration of walking down to be significantly higher in magnitude than walking up, simply because of physical exertion and gravity. These variations depend on the individual, though, as they are based on the health and ability of each user.