Abstract
As wearable technologies are being increasingly used for clinical research and healthcare, it is critical to understand their accuracy and determine how measurement errors may affect research conclusions and impact healthcare decision-making. Accuracy of wearable technologies has been a hotly debated topic in both the research and popular science literature. Currently, wearable technology companies are responsible for assessing and reporting the accuracy of their products, but little information about the evaluation method is made publicly available. Heart rate measurements from wearables are derived from photoplethysmography (PPG), an optical method for measuring changes in blood volume under the skin. Potential inaccuracies in PPG stem from three major areas: (1) diverse skin types, (2) motion artifacts, and (3) signal crossover. To date, no study has systematically explored the accuracy of wearables across the full range of skin tones. Here, we explored heart rate and PPG data from consumer- and research-grade wearables under multiple circumstances to test whether and to what extent these inaccuracies exist. We saw no statistically significant difference in accuracy across skin tones, but we saw significant differences between devices and between activity types; notably, absolute error during activity was, on average, 30% higher than during rest. Our conclusions indicate that different wearables are all reasonably accurate at resting and prolonged elevated heart rate, but that differences exist between devices in responding to changes in activity. This has implications for researchers, clinicians, and consumers in drawing study conclusions, combining study results, and making health-related decisions using these devices.
Introduction
Wearable technology has the potential to transform healthcare and healthcare research by enabling accessible, continuous, and longitudinal health monitoring. With the number of chronically ill patients and health system utilization in the US at an all-time high,1,2 the development of low-cost, convenient, and accurate health technologies is increasingly sought after to promote health as well as improve research and healthcare capabilities. It is expected that 121 million Americans will use wearable devices by 2021.3 The ubiquity of wearable technology provides an opportunity to revolutionize health care, particularly in communities with traditionally limited healthcare access.
The growing interest in using wearable technologies for clinical and research applications has accelerated the development of research-grade wearables to meet the needs of biomedical researchers for clinical research and digital biomarker development.4 Consumer-grade wearables, in contrast to research-grade wearables, are designed, developed, and marketed to consumers for personal use. While research- and consumer-grade wearables often contain the same sensors and are quite similar functionally, their markets and use cases are different, which may influence accuracy (Supplementary Table 1). Digital biomarkers are digitally collected data that are transformed into indicators of health outcomes. Digital biomarkers are expected to enable actionable health insights in real time and outside of the clinic. Both consumer- and research-grade wearables are frequently being used in research, with the most common brands being Fitbit (PubMed: 476 studies, ClinicalTrials.gov: 449 studies) for consumer-grade wearables and Empatica (PubMed: 22 studies, ClinicalTrials.gov: 22 studies) for research-grade wearables (Supplementary Table 2).
It is, therefore, of critical importance to evaluate the accuracy of the wearable technologies that are being used in clinical research, digital biomarker development, and personal health. The lack of clarity surrounding the verification and validation procedures and the unknown reliability of the data generated by these wearable technologies poses significant challenges for their adoption in research and healthcare applications.4,5,6
Recently, the accuracy of wearable optical heart rate (HR) measurements using photoplethysmography (PPG) has been questioned extensively.7,8,9,10,11,12,13 Wearables manufacturers sometimes report some expected sources of error, but the reporting and evaluation methods are inconsistent14,15,16,17,18,19,20,21,22 (Table 1). Of particular interest, previous research demonstrated that inaccurate PPG HR measurements occur up to 15% more frequently in dark skin as compared to light skin, likely because darker skin contains more melanin which absorbs more green light than lighter skin.23,24,25,26,27,28,29,30,31 Interestingly, some manufacturers of wearable devices recommend using their device only in light skin tones and/or at rest.17,32
Another suspected measurement error in wrist-worn devices is motion artifact, which is typically caused by displacement of the PPG sensor over the skin, changes in skin deformation, blood flow dynamics, and ambient temperature.33,34 Motion artifacts may manifest as missing or false beats which result in incorrect HR calculations.35,36,37 Several studies have demonstrated that HR measurements from wearable devices are often less accurate during physical activity or cyclic wrist motions.8,11,35,38,39 Several research groups and manufacturers have identified that cyclical motion can affect accuracy of HR in wearable sensors.9,10,15 The cyclical motion challenge has been described as a “signal crossover” effect wherein the optical HR sensors on wearables tend to lock on to the periodic signal stemming from the repetitive motion (e.g., walking and jogging) and mistake that motion as the cardiovascular cycle.40
To date, no studies have systematically validated wearables under various movement conditions across the complete range of skin tones, and particularly on skin tones at the darkest end of the spectrum. Here, we present a comprehensive analysis of wearables HR measurement accuracy during various activities in a group of 53 individuals equally representing all skin tones. To our knowledge, this is the first reported characterization of wearable sensors across the complete range of skin tones. Validation of wearable devices during activity and across all skin tones is critical to enabling their equitable use in clinical and research applications.
Results
Study summary
A group of 53 individuals successfully completed the entire study protocol (32 females, 21 males; ages 18–54; equal distribution across the Fitzpatrick (FP) skin tone scale). This protocol was designed to assess error and reliability in a total of six wearable devices (four consumer-grade and two research-grade models) over the course of approximately 1 h (Fig. 1). Each round of the study protocol included (1) seated rest to measure baseline (4 min), (2) paced deep breathing41 (1 min), (3) physical activity (walking to increase HR up to 50% of the recommended maximum;42 5 min), (4) seated rest (washout from physical activity) (~2 min), and (5) a typing task (1 min). This protocol was performed three times per study participant in order to test all devices. In each round, the participant wore multiple devices according to the following: Round 1: Empatica E4 + Apple Watch 4; Round 2: Fitbit Charge 2; Round 3: Garmin Vivosmart 3, ** for six wearable devices representing both consumer wearables and research-grade wearables. HR metrics are compared to the clinical-grade electrocardiogram (ECG) as the standard for heart rate measurement.
Potential relationships between error in HR measurements and (1) skin tone, (2) activity condition, (3) wearable device, and (4) wearable device category were examined using mixed effects statistical models. We developed comprehensive, individual, and interaction mixed effects models for the independent variables using mean HR measurement error as the dependent variable (Table 2). We found that wearable device, wearable device category, and activity condition all significantly correlated with HR measurement error, but changes in skin tone did not impact measurement error or wearable device accuracy.
Wearables accuracy across skin tones
Anecdotal evidence and incidental study findings supported the hypothesis that PPG measurements may be less accurate on darker skin tones than on lighter skin tones.8,9,10,11,12,13 To systematically explore this hypothesis, we examined the mean directional error (MDE) and the mean absolute error (MAE) of HR measurements within each FP skin tone group at rest and during physical activity.
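As a minimal sketch of how these two metrics can be computed (not the authors' analysis code), the Python below takes paired, simultaneous HR samples from a wearable and from ECG. It assumes the error is taken as wearable minus ECG, so that a positive MDE means the wearable over-reports; this sign convention is an assumption chosen for illustration. The function names and the sample readings are hypothetical.

```python
from statistics import mean


def directional_errors(wearable_hr, ecg_hr):
    """Per-sample error in bpm (assumed sign: wearable minus ECG)."""
    if len(wearable_hr) != len(ecg_hr):
        raise ValueError("series must contain simultaneous, paired samples")
    return [w - e for w, e in zip(wearable_hr, ecg_hr)]


def mean_directional_error(wearable_hr, ecg_hr):
    """MDE: signed errors are averaged, so over- and under-estimates cancel."""
    return mean(directional_errors(wearable_hr, ecg_hr))


def mean_absolute_error(wearable_hr, ecg_hr):
    """MAE: magnitudes are averaged, so the direction of the error is ignored."""
    return mean(abs(d) for d in directional_errors(wearable_hr, ecg_hr))


# Hypothetical readings in bpm, for illustration only.
ecg = [60.0, 62.0, 64.0, 66.0]
watch = [58.0, 65.0, 64.0, 62.0]

mde = mean_directional_error(watch, ecg)  # errors are -2, 3, 0, -4
mae = mean_absolute_error(watch, ecg)
print(f"MDE = {mde:.2f} bpm, MAE = {mae:.2f} bpm")
```

Because signed errors cancel in the MDE but not in the MAE, a device can show an MDE near zero while still having a large MAE, which is why both metrics are examined.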
Among skin tone groups at rest, FP5 had the largest MDE across all devices and FP1 had the lowest MDE (−4.25 bpm and −0.53 bpm, respectively) (Supplementary Figs 1a, 2a, Supplementary Table 7a). In absolute error terms, the darkest skin tone (FP6) had the highest MAE and the second darkest skin tone (FP5) had the lowest MAE at rest (10.6 bpm and 8.6 bpm, respectively) (Fig. 2c, e, Supplementary Table 6a). The average MDE and MAE across all skin tone groups at rest were −2.99 bpm and 9.5 bpm, respectively. Among skin tone groups during activity, FP5 had the highest MDE and FP3 had the lowest MDE (9.21 bpm and 7.21 bpm, respectively; Fig. 2b, Supplementary Table 7b). FP4 had the highest MAE and FP3 had the lowest MAE (14.8 bpm and 10.1 bpm, respectively; Fig. 2d, f, Supplementary Table 6b). Skin tone appears to not be the driver of MAE or MDE.
Mean error in heart rate (bpm) across skin tones and devices at a rest and b during physical activity. The green horizontal line represents no error (no difference from the true measurement of HR from ECG). Mean absolute error in heart rate (bpm) across skin tones and devices at c rest and d during physical activity. Error is calculated as the difference between the ECG and wearable reported heart rate at every simultaneous measurement. Fitzpatrick skin tones 1–6 are represented with an approximately equal number of participants in each skin tone. Error bars represent the 95% confidence interval. Mean absolute error across devices and across skin tones at rest (e) and during activity (f). Error bars represent the 95% confidence interval.
In the comprehensive and marginal mixed effects models, we found no significant correlation between skin tone and HR measurement error (Table 2). While we found no overall effect of skin tone, we tested whether the effect of skin tone differed based on individual devices. We did find a significant interaction between skin tone and device (Table 2). Upon further examination, this was shown to be based on the Biovotion device, which showed a decrease in resting HR and an increase in active HR (Fig. 2). During activity, the highest MDE occurs in FP5 and/or FP6 in all devices except for the ** (Supplementary Fig. 3) and found that MAE was higher during typing compared with rest in all devices, and often nearly as high as during walking, except for the Apple Watch and the Empatica E4 (Supplementary Fig. 3a). The MDE was higher during typing as compared with rest in the Miband, Empatica, and Biovotion. Interestingly, while both typing and walking had poor performance overall, walking tended to cause reported HR to be higher than true HR, whereas typing caused the reported HR to be lower than the true HR (Supplementary Fig. 3b). Surprisingly, the MAE and MDE were lower during deep breathing than at rest in all devices except for the Apple Watch, in which the deep breathing condition was the condition with the worst performance (Supplementary Fig. 3). During deep breathing, reported HR was generally lower than true HR (Supplementary Fig. 3).
Signal alignment
Lags between the ECG- and PPG-derived HR signals ranging between 0 and 43 s were discovered during our preliminary exploratory data analysis. These lags were inconsistent; in some cases, the lag was fixed and in other cases the lag was dynamic (Supplementary Fig. 5). The source of these lags could not be determined with certainty and may be attributed to (1) misaligned time stamps (highly unlikely due to our time synchronization protocol described in the methods as well as the sometimes dynamic time lags observed), (2) data processing artifacts (uneven or delayed sampling, compute, and/or data reporting), (3) missed heart beats due to low frequency measurements by the wearable, or (4) a delay between the actual heart beat and the change in blood volume at the wrist.
In order to remove lag as a factor that could contribute to error calculated in the previous sections, we performed signal alignment using two different approaches (cross-correlation and smoothing with a rolling window) and recalculated MAE and MDE on the newly aligned signals (Supplementary Fig. 6). Using the updated MAE and MDE at each window size from the smoothing, we reanalyzed the relationships in the previous sections and found no differences in conclusions from the previous sections. Our model did show that window length is related to HR measurement error (Supplementary Table 9). We performed a sensitivity analysis to determine how smoothing could affect improvements in accuracy, and we found that in most cases, smoothing reduced HR measurement error, as demonstrated by the fact that the median optimal window size was >0 (Supplementary Fig. 7). MAE and MDE were in general improved the most by smaller window sizes (less smoothing) during activity and wider window sizes (more smoothing) at rest, likely because changes in activity intensity would not be captured by wider smoothing windows (Supplementary Fig. 7b). This did not hold true for the Apple Watch 4 and Empatica E4 for MDE or the Biovotion Everion for MAE.
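The two alignment approaches named above can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes both HR series are already resampled to the same fixed rate, it searches only non-negative lags (wearable trailing ECG), and it uses a trailing rolling mean in which a window of 0 or 1 sample means no smoothing. All function names and the synthetic signal are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def estimate_lag(ecg_hr, wearable_hr, max_lag):
    """Lag in samples (0..max_lag) at which the wearable best correlates with ECG."""
    n = len(ecg_hr)
    scores = {
        lag: pearson(ecg_hr[: n - lag], wearable_hr[lag:])
        for lag in range(max_lag + 1)
    }
    return max(scores, key=scores.get)


def rolling_mean(series, window):
    """Trailing rolling mean; a window of 0 or 1 leaves the series unchanged."""
    if window <= 1:
        return list(series)
    return [
        mean(series[max(0, i - window + 1) : i + 1])
        for i in range(len(series))
    ]


def mae(a, b):
    """Mean absolute difference between two equal-length series."""
    return mean(abs(x - y) for x, y in zip(a, b))


# Synthetic example: the wearable trace is the ECG trace delayed by 3 samples.
ecg = [60, 60, 60, 70, 80, 90, 100, 90, 80, 70, 60, 60, 60, 60, 60]
wearable = [60, 60, 60] + ecg[:-3]

lag = estimate_lag(ecg, wearable, max_lag=5)  # 3 for this synthetic example
raw_mae = mae(wearable, ecg)
aligned_mae = mae(wearable[lag:], ecg[: len(ecg) - lag])

# Sensitivity sweep: recompute MAE after smoothing at several window sizes.
mae_by_window = {w: mae(rolling_mean(wearable, w), ecg) for w in (0, 3, 5)}
print(lag, raw_mae, aligned_mae, mae_by_window)
```

A single global lag, as estimated here, can only correct a fixed offset. The dynamic lags described above would need a windowed or time-varying estimate, which this sketch does not attempt.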
Potential relationship between wearable device cost, market size, release year, and error
Wearables vary widely in terms of release year, data accessibility, and cost (Supplementary Table 1). We used devices across a wide range of costs, market sizes, and release times at the time of this study (Apple Watch 4, Fitbit Charge 2, Garmin Vivosmart 3, and ** 47 Here, we explored one important aspect regarding the accuracy of wearables across the full range of skin tones. We found no statistically significant differences in wearable HR measurement accuracy across skin tones; however, we did find other sources of measurement inaccuracies, including activity type and type of device. Researchers, clinicians, and health consumers must recognize that the information derived from different wearables should not be weighted equally for drawing study conclusions, combining study results, and making health-related decisions. Algorithms that are used to calculate digital biomarkers should consider error and measurement quality under the various circumstances that we have shown in this study. Digital biomarker interpretation must take this data quality into account when making healthcare decisions.