Introduction

There is growing interest in using consumer wrist-worn wearable technology to collect personal health data, including heart rate and rhythm. Several companies including Apple, Samsung, and Fitbit have been approved by the Food and Drug Administration to market these sensors as a form of early detection of arrhythmias such as atrial fibrillation. The associated hardware in modern devices relies on two components: photoplethysmography (PPG), which passively monitors heart rate and rhythm, and electrical sensors, which require the user to remain still and touch the device with their opposite hand throughout the detection period [1].

Given that the electrical sensors require active user engagement to measure heart rate and rhythm, the majority of wearable heart rate and rhythm data is collected using PPG. This technology relies on LEDs projecting light onto the skin, and photodetectors measuring the quantity of reflected light [1, 2]. Hemoglobin in dermal blood vessels drives the bulk of light absorption, and thus changes in blood pressure and flow rate affect light absorption, which is then detected by PPG optical sensors [2, 3]. Proprietary algorithms then convert light absorption into heart rate and rhythm data.
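As a rough illustration of that final conversion step, the sketch below estimates heart rate from a synthetic signal by measuring the spacing between peaks. Commercial algorithms are proprietary and are not described in the cited sources, so this is not any manufacturer's method. The sinusoidal signal model, the 50 Hz sampling rate, and the function names `synthetic_ppg` and `estimate_bpm` are assumptions made for this example only.

```python
import math

def synthetic_ppg(bpm, seconds=10.0, fs=50):
    """Return a clean sinusoid as a stand-in for a PPG trace (assumed model)."""
    n = int(seconds * fs)
    freq_hz = bpm / 60.0
    return [math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]

def estimate_bpm(signal, fs=50):
    """Estimate heart rate from the mean interval between local maxima."""
    peaks = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > 0 and signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    if len(peaks) < 2:
        return None  # not enough beats to measure an interval
    mean_interval = (peaks[-1] - peaks[0]) / (len(peaks) - 1)  # in samples
    return 60.0 * fs / mean_interval

print(round(estimate_bpm(synthetic_ppg(72))))  # 72
```

A real wrist signal is far noisier than this sinusoid (motion, ambient light, variable skin contact), which is one reason any factor that weakens the reflected signal could plausibly degrade the estimate.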

While safe and inexpensive, the accuracy of PPG is impaired by factors that impede light transmission, such as elevated body mass index, tattoos, and higher hair follicle density [4]. As such, PPG has recently undergone closer scrutiny of its reliability in users with darker skin tones. Given that melanin is the main absorbent of light in the epidermis, it is theorized that this might similarly interfere with the PPG signal estimation process.

A racial discrepancy has been established in pulse oximeters, which rely on transmission PPG, a slightly different form of the technology. Instead of photodetectors measuring reflected light, they measure LED light that shines through the tissue to the opposite side, restricting it to locations that can be transilluminated, such as the earlobes and fingers [5]. A recent study by Sjoding et al. found that when compared to white patients, Black patients had a threefold increased frequency of undetected hypoxemia when using pulse oximeters [6]. Specifically, in the first cohort (n = 10,789), 11.4% of Black patients had undetected hypoxemia (pulse oximetry > 92% and arterial oxygen saturation < 88%), compared to 3.6% of white patients, and in the second cohort (n = 37,308), 17.0% of Black patients had undetected hypoxemia, compared to 6.2% of white patients. Similarly, a retrospective study of 7126 patients with COVID-19 found that pulse oximetry overestimated arterial oxygen saturation among Asian, Black, and Hispanic patients compared to white patients, leading to a delay in the initiation of guideline-based therapies [7].
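The "threefold" figure can be checked against the cohort percentages quoted above. The short sketch below simply divides the reported percentages; the dictionary layout and the function name `relative_frequency` are illustrative, and the result is an unadjusted ratio rather than the statistic reported by the original authors.

```python
# Percentages of undetected hypoxemia quoted in the text from Sjoding et al. [6].
cohorts = {
    "cohort 1": {"black": 11.4, "white": 3.6},
    "cohort 2": {"black": 17.0, "white": 6.2},
}

def relative_frequency(black_pct, white_pct):
    """Ratio of the two reported percentages (an unadjusted comparison)."""
    return black_pct / white_pct

for name, pct in cohorts.items():
    ratio = relative_frequency(pct["black"], pct["white"])
    print(f"{name}: {ratio:.1f}x")
```

The ratios come out at roughly 3.2 in the first cohort and 2.7 in the second, consistent with an approximately threefold difference.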

While the inaccuracy of transmission PPG used in pulse oximeters is well established in darker skin tones, it remains unclear whether this holds true for the reflectance PPG used in consumer wrist-worn devices. This systematic review aims to summarize the current literature on the accuracy of cardiac data from these devices in populations of various skin tones.

Methods

Search Strategy

The authors systematically searched four databases from database inception to Nov 5, 2021: (1) the Ovid versions of MEDLINE and MEDLINE Daily including e-publications, in progress, and non-indexed citations; (2) Embase Classic and Embase; (3) CINAHL; and (4) Cochrane CENTRAL. The search included all original studies that evaluated race in measuring cardiovascular health data by consumer wearable technologies (Table 1). No exclusions were applied on the basis of language or country of origin. The complete search strategy can be found in Table 2.

Table 1 Characteristics of included observational studies
Table 2 Search strategy (Nov 5, 2021)

Study Screening

This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Title and abstract screening, full-text review, and data extraction were performed independently by two investigators (SK and DK), with a third reviewer (AK) resolving discrepancies. Backwards searching was also performed by screening the reference lists of all included studies to identify relevant articles. Findings regarding race outcomes were extracted from included studies and qualitatively analyzed.

Terminology

In this review, we use the term “white” to refer to the classification of people of European descent with largely lighter skin tones. We use the term “Black” to refer to the classification of people of African and/or Caribbean descent with darker skin tones.

The race-related terms “white” and “Black” are contextual categories shaped by sociocultural forces and how one self-identifies, encompassing various skin tones and ethnic backgrounds. Consequently, several studies use the Fitzpatrick phototype scale to describe skin tone. The Fitzpatrick scale is a semi-quantitative means of describing skin tone, ranging from types 1 to 6, where type 1 skin refers to skin that always burns and does not tan, and type 6 refers to skin that never burns and always tans darkly [8].

Assessing Methodological Quality

The Risk of Bias Assessment Tool for Nonrandomized Studies (RoBANS) was used to assess the methodological quality of included studies. Studies were appraised independently by two reviewers (SK, DK).

Results

The search strategy yielded 58 results from MEDLINE, 147 from Embase, 65 from CINAHL, and 386 from Cochrane CENTRAL, totaling 656 articles. After removing duplicate articles, 581 records remained. There were 25 records after title and abstract screening; the full texts of these papers were reviewed, and six articles met the inclusion criteria. An additional two studies were found through hand searching the literature and two through backwards searching, yielding a total of 10 included studies. The results of the systematic search are presented in a PRISMA flowchart in Fig. 1 and the search strategy is presented in Table 2.
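The screening arithmetic can be reproduced from the counts given in the paragraph above. The sketch below derives the number of records removed or excluded at each stage; the variable names are illustrative, and the derived exclusion counts are simple subtractions rather than figures stated in the text.

```python
# Record counts as reported in the Results text.
database_hits = {"MEDLINE": 58, "Embase": 147, "CINAHL": 65, "Cochrane CENTRAL": 386}
after_deduplication = 581
after_title_abstract = 25
met_inclusion = 6
hand_search = 2
backward_search = 2

# Derived values (subtractions and sums of the reported counts).
identified = sum(database_hits.values())
duplicates_removed = identified - after_deduplication
excluded_title_abstract = after_deduplication - after_title_abstract
excluded_full_text = after_title_abstract - met_inclusion
included = met_inclusion + hand_search + backward_search

print(identified, duplicates_removed, excluded_title_abstract, excluded_full_text, included)
# 656 75 556 19 10
```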

Fig. 1 PRISMA flowchart

Table 1 outlines the characteristics of included observational studies. The 10 studies included a total of 469 participants, with a mean age of 34 years (range 20–62); a total of 38% (range 0–67%) of participants were women. All studies were observational cohort studies. The majority of studies were conducted in the USA (n = 5), with the remaining studies arising from Australia (n = 1), Canada (n = 1), France (n = 1), Israel (n = 1), and Spain (n = 1). The risk of bias assessment of included studies is shown in Fig. 2.

Fig. 2 Risk of Bias Assessment Tool for 10 included non-randomized studies (RoBANS)

The most common wearable device brands that were studied included the Apple Watch (n = 6), Fitbit (n = 5), Mio Alpha (n = 3), and Garmin (n = 3). Of note, since only a few studies provided information on which generation of a given device was used for testing, this review groups the devices by manufacturer. Of the ten studies, five compared the relative accuracy of different manufacturers. Of those, all five found the Apple Watch to be more accurate than the other commercial devices [2, 9, 10, 11, 12]. For instance, Etiwy et al. found the Apple Watch to follow ECG data with a correlation coefficient of 0.80, compared to 0.52 in Garmin devices [9].

As a gold standard comparison for accuracy, either an ECG (n = 7) or a chest strap (n = 3) was used, two modalities that are known to correlate well [13]. Similar to ECGs, chest straps use electrodes to measure electrical activity and determine heart rate. None of the included studies compared the detection of atrial fibrillation or other arrhythmias in participants with various skin tones. Similarly, none of the studies assessed oxygen saturation or other vital signs.

In three of the included studies, skin tone was the primary variable considered in the accuracy of wrist-worn heart rate monitors [1, 2, 14]. For the seven other studies, skin tone was one of several covariates considered in the accuracy analysis.

The Fitzpatrick scale was used to classify participant skin tones in eight of the ten studies. In the two studies that did not use the Fitzpatrick scale, one classified participants as white, Black, or other, and another classified participants based on ethnic background, including Asian, Black, and Hispanic, which precluded aggregate analysis of skin tone outcomes. Three studies outlined the proportion of participants of each Fitzpatrick skin type, three studies solely reported the mean Fitzpatrick score, one study grouped participants as Fitzpatrick < 4 and Fitzpatrick > 4, and one study only reported skin tone data for dark-skinned individuals (Fitzpatrick 5 or 6), which pertained to 4 of the 24 participants. Only three of the ten studies included participants with Fitzpatrick 6 skin tones.

Six studies reported an average Fitzpatrick score, and across these studies the frequency-weighted mean score was 3.5 (range 1–6) (n = 293).
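A frequency-weighted mean weights each study's mean score by its number of participants: the sum of (participants × mean score) divided by the total number of participants. The sketch below shows the calculation. The per-study values are not given in the text, so the pairs in `example_studies` are hypothetical and are not expected to reproduce the reported 3.5; the function name `weighted_mean_score` is likewise illustrative.

```python
def weighted_mean_score(studies):
    """Frequency-weighted mean: sum(n_i * mean_i) / sum(n_i)."""
    total_n = sum(n for _, n in studies)
    if total_n == 0:
        raise ValueError("at least one participant is required")
    return sum(mean * n for mean, n in studies) / total_n

# Hypothetical (mean Fitzpatrick score, participants) pairs, NOT the review's data.
example_studies = [(3.0, 10), (4.0, 30)]
print(weighted_mean_score(example_studies))  # 3.75
```

Larger studies pull the pooled value toward their own mean, which is why the result here sits closer to 4.0 than to 3.0.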

Four of the ten studies reported statistically significant reductions in accuracy of heart rate data in participants with darker skin tones [3, 10, 11, 15]. Four studies found no difference in accuracy between participants of different skin tones [1, 2, 9, 14]. The remaining two studies observed mixed results across different wearable devices [12, 16]. No studies reported a higher accuracy of heart rate measurement in participants with darker skin tones.

Hermand et al. reported lower heart rate accuracy in patients with darker skin (p < 0.001) [15]. Pasadyn et al. similarly found that the accuracy of heart rate detection was slightly lower in non-white participants (p = 0.01) [10]. However, the magnitude of measurement inaccuracy was not quantified in either study. Shcherbina et al. noted that smart watch device error was higher for darker skin tones, but the degree of this effect was not reported [11]. Hochstadt et al. noted a linear regression coefficient of 0.98 (p < 0.001) when comparing PPG to ECG data in patients with darker skin (Fitzpatrick 5 or 6), suggesting that darker skin tone reduced accuracy, albeit with a relatively small impact [3].

An equal number of studies did not support this relationship. Bent et al. found equivalent accuracy for all devices tested across various skin tones [2]. Etiwy et al. and Sanudo et al. similarly found that skin color did not influence heart rate accuracy [1, 9]. Ray et al. found that various WearOS watches systematically underestimate the reliability of HR readings taken from dark skin, despite no substantial differences in error, leading to significantly fewer recorded data points in patients with dark skin [14].

The remaining two studies had mixed findings. Wallen et al. noted lower heart rate accuracy in participants with a Fitzpatrick scale score > 4 with the Apple Watch, but not in the other studied smartwatches [12]. Spierer et al. found that one of the two devices tested (Mio Alpha) had a higher mean average error in Fitzpatrick 6 patients compared to Fitzpatrick 1 patients (16 beats/min compared to 3 beats/min, respectively) [16].
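For readers unfamiliar with this kind of error metric, the sketch below computes an average error in beats/min between paired device and reference readings. It assumes the reported error is the mean absolute difference between the two, which the text does not state explicitly. The readings are hypothetical and the function name `mean_absolute_error` is illustrative.

```python
def mean_absolute_error(device_bpm, reference_bpm):
    """Average absolute difference between paired readings, in beats/min."""
    if not device_bpm or len(device_bpm) != len(reference_bpm):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(d - r) for d, r in zip(device_bpm, reference_bpm))
    return total / len(device_bpm)

# Hypothetical paired readings (wearable vs. ECG or chest strap), NOT study data.
device = [72, 85, 98, 110]
reference = [70, 88, 97, 115]
print(mean_absolute_error(device, reference))  # 2.75
```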

Discussion

This systematic review of 10 studies and 469 participants summarizes the accuracy of heart rate measurement of wearable devices across diverse skin tones. This review identified a relative scarcity of studies that consider the interactions of skin tone when characterizing smart watch device accuracy in cardiac outcomes, resulting in inconclusive findings.

There has been increased interest in the use of wearable devices to measure heart rate and detect arrhythmias, such as atrial fibrillation. Their use is supported by a growing body of studies which have demonstrated accuracy in recording cardiac data. One study showed utility in consumer devices for post-discharge monitoring of tachycardia in ICU patients, noting high sensitivity (99%) with low-to-moderate specificity (70%) [17]. A study by Banerjee et al. found that the sensitivity and specificity of consumer wrist-worn devices were 92% and 88% respectively for atrial fibrillation detection [18].
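Sensitivity and specificity are both derived from a table of device results against the reference diagnosis. The sketch below shows the two formulas. The counts are hypothetical, chosen only so that the output matches the 92% and 88% quoted above; they are not the counts from Banerjee et al. [18].

```python
def sensitivity(true_pos, false_neg):
    """Proportion of people with the condition that the device flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of people without the condition that the device clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts for 100 people with and 100 without atrial fibrillation.
true_pos, false_neg = 92, 8
true_neg, false_pos = 88, 12

print(f"sensitivity: {sensitivity(true_pos, false_neg):.0%}")
print(f"specificity: {specificity(true_neg, false_pos):.0%}")
```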

While some studies have posited that artificial intelligence algorithms may play a role in optimizing data measurement in outpatient and emergency cardiology, including the detection of abnormal heart rates and rhythms, other studies have identified potential racial biases within machine learning [19]. As such, further scrutiny of the tools we use and the data from which they are derived is necessary in order to reduce bias in medicine. This review highlights the importance of enrolling diverse participants in research and development studies, and of ensuring that validation studies test devices across a range of skin types.

While the racial limitations of transmission PPG have been consistently demonstrated in pulse oximetry, this systematic review demonstrated that this relationship is less evident in studies performed to date on the reflectance PPG used in wearable devices. One potential reason for this difference in accuracy is the adoption of green light in newer wrist-worn devices, compared to the red and infrared light used in pulse oximetry. Fallow et al. identified that green wavelengths (520 nm) displayed greater accuracy in heart rate measurement regardless of skin type when compared to other wavelengths, both at rest and during exercise [20]. Further research may identify the utility of green wavelengths, instead of red and infrared wavelengths, in reducing racial bias in pulse oximetry.

Overall, this is the first systematic review of the accuracy of cardiac data by wearable devices based on race and/or skin tone. The study used rigorous research methodology, including the search of multiple research databases and independent screening of publications by two reviewers. However, the included studies described mixed results and as such, it remains unclear whether wearable devices have reduced accuracy in heart rate and arrhythmia detection in people with darker skin tones.

There are limitations to this systematic review, largely related to the preliminary nature of the evidence base. The identified studies in this review were not blinded, included small sample sizes, evaluated primarily young subjects, and used different wearable device brands and models. Many of the studies that reported a significant interaction between skin tone and cardiac data accuracy did not quantify the magnitude of error. Furthermore, discrepancies existed in how skin tone was reported, such as categorizations of skin tones as either light or dark, or by race, ethnicity, or Fitzpatrick scale subgroups. Some studies grouped participants by Fitzpatrick scale ranges given a shortage of participants at the extremes of skin tone (i.e., Fitzpatrick scale 1 and 6). This heterogeneity precluded meta-analysis and may contribute to the variability in reported results. Further, despite being used as a gold standard in three of the ten included studies, chest straps have never been validated across different skin tones. As a result, the authors recommend that future work report the accuracy of wearable device data stratified by race and/or skin tone and, whenever possible, use spectrocolorimeters to provide more objective skin color measurements than the Fitzpatrick scale [21, 22].

Conclusion

Early evidence of racial bias in wrist-worn wearables is mixed, but some studies demonstrate reduced accuracy of heart rate data in users with darker skin tones. Further, higher-quality evidence is needed, involving a greater proportion of participants with darker skin tones as well as objective measurements of pigmentation, to better characterize potential racial bias in the accuracy of heart rate measurements.