In this post, we will once again return to the concept of convergent validity, and examine data from two fitness trackers to determine the extent to which their measurements agree. We will also examine the impact of the position of the tracker on the wrist, and see whether this makes a difference in the number of steps the devices record.
This post happened sort of by accident. I bought a Mi Band 5 in the summer of 2020, and had been quite happy with it – one of the things that’s great about using the Mi Band with Gadgetbridge is that it’s so easy to access the data. Unfortunately, the band got badly damaged at the beginning of December 2020 and so I ordered a replacement. I thought this would provide an interesting opportunity to test the convergent validity of both devices’ readings.
So, for 16 days in December of 2020 I wore both the old and the new bands simultaneously. Furthermore, as a small experiment, I switched the location of the devices every morning. On half of the days the new Mi Band sat closer to the wrist and the old one further from it; on the other half the arrangement was reversed. In the discussion that follows, I will refer to the position closer to the wrist as the “lower” position, and the position further from the wrist as the “upper” position. The picture below illustrates how I wore both fitness trackers, along with the positions.
You can find the data and all the code from this blog post on GitHub here.
I extracted the data from both devices using the method outlined in a previous post, keeping a final level of granularity of day/hour, i.e., one row of data for each hour of the day from 6 AM to midnight (I’m not including measurements from the middle of the night because there’s basically no movement then). I merged the data from both devices and created a dummy variable indicating the position of the bands on my arm.
In total, the dataset contains data from 16 days, with one observation per device for each hour from 6 AM to midnight on each day, for a total of 288 rows of data (16 days × 18 hourly observations).
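As a rough sketch of that layout (in Python, and not the R code behind this post), the day/hour grid and the position dummy could be built as follows. The column names and the choice of which band starts in the lower position are assumptions made purely for illustration.

```python
from itertools import product

DAYS = range(1, 17)    # 16 days of data collection
HOURS = range(6, 24)   # hourly bins from 6 AM up to midnight (18 per day)

def build_grid():
    """Return one row per day/hour combination with a position dummy."""
    rows = []
    for day, hour in product(DAYS, HOURS):
        # Assumption: the bands swap places every morning, and the new band
        # happens to sit in the lower position on odd-numbered days.
        new_band_lower = 1 if day % 2 == 1 else 0
        rows.append({"day": day, "hour": hour, "new_band_lower": new_band_lower})
    return rows

grid = build_grid()
print(len(grid))  # 288
```

Each row would then carry the hourly and cumulative step counts from both devices.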
The head of the dataset (named merged_data) looks like this:
Let’s first make a scatterplot showing the correspondence between the measurements for the old and new Mi Bands, while also calculating the pairwise correlation between the values. We will make one plot for the hourly steps, and one for the cumulative steps.
On each plot, I draw a red dashed line as the identity line. If the measurements were exactly the same, all points would lie on this line. I’m also drawing regression lines separately for both tracker locations on my arm.
The colors in all the plots in this blog post are from the Economist palette in the excellent ggthemes package.
Hourly Step Counts
We can produce the hourly plot and calculate the correlation between the measurements with the following code:
Which returns the following plot:
The hourly step count measurements are essentially identical. The separate regression lines indicating device position lie nearly directly on the identity line, and the correlation between the two measurements is .998 – incredibly high. In comparison, when I examined the correlation between hourly steps as measured by Fitbit vs. Accupedo, the correlation was only .52. We’re at a whole different level of agreement here.
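To make the two quantities in this comparison concrete, here is a minimal Python sketch (not the R code used for the plot) that computes a Pearson correlation and a least-squares regression line. The step counts in the example are invented; perfect agreement would give a slope of 1 and an intercept of 0, which is exactly the identity line.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ols_line(x, y):
    """Slope and intercept of the least-squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical hourly step counts, for illustration only
old_band = [120, 950, 2300, 40, 1800]
new_band = [130, 940, 2320, 45, 1815]

r = pearson_r(old_band, new_band)
slope, intercept = ols_line(old_band, new_band)
print(round(r, 3), round(slope, 3), round(intercept, 1))
```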
Cumulative Step Counts
We can produce the cumulative step count plot and compute the correlation between the measurements with the following code:
Once again, there is a near-perfect relationship between the measurements from the two devices. The regression lines basically lie on the identity line, and the correlation between the two measurements is .999. In comparison, when I examined the correlation of cumulative steps for Fitbit vs. Accupedo, the correlation was .97.
There is a small deviation in the correspondence of the measurements at the highest levels of cumulative step counts. Specifically, it looks like there’s a slight tendency for the new Mi Band to register higher step counts when it is in the lower position (closer to the wrist).
Bland-Altman Plots
Hourly Step Counts
Another way of examining the correspondence between two measurements is the Bland-Altman plot. The Bland-Altman plot displays the mean of the measurements on the x-axis, and the difference between the measurements on the y-axis. A horizontal line (in red in the plot below) is drawn on the plot to indicate the mean difference between the measurements. In addition, two lines (in blue in the plot below) are drawn at 1.96 standard deviations above and below the mean difference.
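The quantities behind such a plot are simple to compute by hand. The following Python sketch (an illustration, not the BlandAltmanLeh implementation) returns the per-observation means and differences, the mean difference, and the two limits. The function name and the toy numbers are my own; the difference is taken as new minus old, matching the convention used in this post.

```python
from statistics import mean, stdev

def bland_altman(new, old):
    """Compute the ingredients of a Bland-Altman plot for paired readings."""
    means = [(a + b) / 2 for a, b in zip(new, old)]   # x-axis values
    diffs = [a - b for a, b in zip(new, old)]         # y-axis values
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)                            # sample standard deviation
    return {
        "means": means,
        "diffs": diffs,
        "mean_diff": mean_diff,
        "lower": mean_diff - 1.96 * sd_diff,
        "upper": mean_diff + 1.96 * sd_diff,
    }

# Toy readings, for illustration only
stats = bland_altman(new=[11, 22, 33], old=[10, 20, 30])
print(stats["mean_diff"], stats["lower"], stats["upper"])
```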
We will use the excellent BlandAltmanLeh package in R to make the Bland-Altman plot. Note that it takes some additional work to get the plot to have the same color scheme as our above correspondence plots, with separate colors for device position.
The mean difference between the counts (the red horizontal dashed line) is close to zero relative to the scale of the data (the exact value is 8.64, as we’ll see below). Because the difference score represents the new minus the old Mi Band, a positive average difference indicates that the new Mi Band gives directionally higher readings than the old Mi Band. The horizontal blue lines sit 1.96 standard deviations above and below the mean difference, so roughly 95% of the difference scores are expected to fall within this range (assuming the differences are approximately normally distributed). Neither limit exceeds 200 steps in absolute value. There is a very slight tendency for there to be more positive difference scores (indicating the new Mi Band counts more steps than the old one) when the new Mi Band is in the lower position.
We can test the statistical significance of the mean difference of the hourly readings (M = 8.64, SD = 88.82), and we find that the difference is not statistically significant, t(287) = 1.65, p = .10. In any event, the difference is easy to interpret in practical terms, and 8.6 steps an hour is not a large difference. Given that the average hourly step count for both devices exceeds 950, a difference of 8.6 steps represents a difference of less than 1%.
Cumulative Step Counts
Let’s make the Bland-Altman plot for the cumulative step counts:
This plot looks very similar to that from the hourly step counts, though the mean difference between the measurements is larger for the cumulative step counts. Once again, the mean difference score sits above zero, indicating that the new Mi Band gives higher cumulative step count readings than the old Mi Band.
We can test the statistical significance of the mean difference of the cumulative readings (M = 77.04, SD = 275.92), and we find that the difference is statistically significant, t(287) = 4.74, p < .001. Despite the low p-value, the effect size is small in practical terms: 77 steps is not a large difference, especially given that the average cumulative step count for both devices is around 9,900. Our difference of 77 steps represents a difference of less than 1%!
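Both test statistics can be checked from the reported summary statistics alone. The Python sketch below assumes a one-sample t-test of the difference scores against zero (equivalent to a paired t-test), which is what the 287 degrees of freedom suggest; it is not the code used in the original analysis.

```python
from math import sqrt

def one_sample_t(mean_diff, sd_diff, n):
    """t statistic for testing whether the mean difference is zero."""
    standard_error = sd_diff / sqrt(n)
    return mean_diff / standard_error

N = 288  # rows in the dataset, so df = N - 1 = 287

t_hourly = one_sample_t(8.64, 88.82, N)
t_cumulative = one_sample_t(77.04, 275.92, N)
print(round(t_hourly, 2), round(t_cumulative, 2))  # 1.65 4.74

# Differences relative to the approximate average counts quoted in the text
pct_hourly = 8.64 / 950 * 100
pct_cumulative = 77.04 / 9900 * 100
print(round(pct_hourly, 2), round(pct_cumulative, 2))  # 0.91 0.78
```

The same average difference would yield a larger t value with a smaller standard deviation or more observations, which is why a difference of under 1% can still be statistically significant.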
Differences Across the Day
Finally, we’ll take a look at the differences between the device readings across the course of the day. In the previous posts on convergent validity, I used the ggridges package to show the densities of the distribution of step differences. However, that technique works less well in the current case, where the number of observations at each hour is only equal to the number of days in our data, i.e., 16.
Therefore, we will use bar charts to examine the average differences between the device readings across the hours of the day. We will use separate panels to display the results for the different positions of the devices on my wrist.
In order to give some sense of the uncertainty surrounding the averages displayed in the bar chart, I’m including 95% confidence intervals in the plots. Confidence intervals have an intuitive-sounding name, but the definition is somewhat convoluted. I won’t get into the details here (see the Wikipedia link above if you’re interested), but in essence the logic goes like this: if we were to repeat the data collection many times and compute a 95% confidence interval each time, about 95% of those intervals would contain the true population parameter. Not so intuitive, but if we simply think of the confidence intervals as giving us a sense of the uncertainty surrounding our estimates (with wider bars indicating more uncertainty), then we’ll be OK!
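The repeated-sampling logic is easier to see in a small simulation. The Python sketch below draws many samples from a known population, builds a t-based 95% interval for the mean each time, and counts how often the interval contains the true mean. The population values, the sample size of 16 and the use of a t-based interval are assumptions for illustration; the post does not state how the intervals in the plots were computed.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF15 = 2.131  # two-sided 95% critical value of Student's t, 15 df

def mean_ci(sample, t_crit):
    """Confidence interval for the mean: mean +/- t_crit * SD / sqrt(n)."""
    m = mean(sample)
    half_width = t_crit * stdev(sample) / sqrt(len(sample))
    return m - half_width, m + half_width

def coverage(true_mean, true_sd, n, t_crit, reps, seed=1):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_ci(sample, t_crit)
        if low <= true_mean <= high:
            hits += 1
    return hits / reps

# Hypothetical population: mean difference of 10 steps, SD of 90 steps
rate = coverage(true_mean=10.0, true_sd=90.0, n=16,
                t_crit=T_CRIT_DF15, reps=5000)
print(rate)  # close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the long-run behaviour of the procedure.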
Hourly Step Counts
We can make the plot for hourly steps like so:
The average differences per hour between the devices are all small and the 95% confidence intervals are wide, indicating large uncertainty about the average size of our differences. When looking at the direction of the differences, however, there does seem to be a slight trend. On the left-hand side of the plot we see the differences between the devices when the new Mi Band is in the lower position. In this graph, nearly all of the differences are positive, indicating that the new Mi Band records more steps than the old one when it is in the lower position. There does not seem to be any systematic difference between the devices when the new Mi Band is in the upper position (the chart on the right-hand side of the figure). In all cases, however, the differences in hourly step counts are small compared to the variation in the measurements.
Cumulative Step Counts
We can make the plot for cumulative steps like so:
The plot looks quite different from the one for the hourly step counts! The directional pattern we saw on the left-hand side of the graph for the hourly steps is much more pronounced here. Specifically, from around 10 AM, when the new Mi Band is in the lower position, it counts slightly more steps than the old Mi Band every hour. Across the hours of the day, these small differences accumulate such that by the end of the day the new Mi Band has a step count of around 400 more steps than the old Mi Band. The 95% confidence intervals exclude zero from 11 AM onwards, indicating a systematic difference that is distinguishable from zero. However, the size of these differences is small (fewer than 400 steps by 11 PM, where the average step count at 11 PM exceeds 17,000, for a difference of around 2%).
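The accumulation effect can be illustrated with a toy calculation in Python. The constant surplus of 20 steps per hour is a made-up figure, chosen only to show how a small hourly difference grows into a few hundred steps by the end of the day; the denominator of 17,000 is the approximate 11 PM average mentioned above.

```python
from itertools import accumulate

HOURS = list(range(6, 24))  # hourly bins from 6 AM up to midnight

def cumulative_difference(hourly_diffs):
    """Running total of the hourly differences between two devices."""
    return list(accumulate(hourly_diffs))

# Hypothetical: the new band counts 20 extra steps in every hour
hourly_surplus = [20] * len(HOURS)
running_total = cumulative_difference(hourly_surplus)

end_of_day = running_total[-1]
print(end_of_day)  # 360

# Relative to an end-of-day count of roughly 17,000 steps
print(round(end_of_day / 17000 * 100, 1))  # 2.1
```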
When the old Mi Band is in the lower position (the right-hand side of the graph), the pattern is reversed, indicating that the old Mi Band counts more steps than the new Mi Band. However, the 95% confidence intervals all include zero, indicating that the variation in the differences overwhelms the average size of the differences.
Summary and Conclusion
In this post, we examined data from two Mi Band 5 fitness trackers, comparing the step counts between the two devices across 16 days.
In comparison to my previous examinations of convergent validity among step count trackers, the convergence between the two Mi Band devices was nearly perfect. Specifically, the scatterplots of the hourly and cumulative step count measurements indicated nearly perfect correspondence between the two, and the correlations for both sets of measurements were greater than .99.
However, the Bland-Altman analysis suggested small directional differences between the devices, with the new Mi Band giving slightly higher readings than the old Mi Band. The analysis of the difference scores across the day revealed why. While both devices seemed to register more steps when they sat lower on my arm, this pattern was more systematic for the new Mi Band, resulting in slightly higher step counts for the new Mi Band across the days of data collection. However, in all cases the size of the differences was very small, never exceeding 2% of the total number of steps recorded.
Coming Up Next
In the next post, we will explore Bayesian regression analysis with the rstanarm package.