This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Multisensor fitness trackers offer the ability to longitudinally estimate sleep quality in a home environment with the potential to outperform traditional actigraphy. To benefit from these new tools for objectively assessing sleep for clinical and research purposes, multisensor wearable devices require careful validation against the gold standard of sleep polysomnography (PSG). Studies conducted under naturalistic conditions are particularly well suited for such validation.
This study aims to validate the Fitbit Charge 2 against portable home PSG in a population of 59 first responders (police officers and paramedics) undergoing shift work.
A reliable comparison between the two measurements was ensured through the data-driven alignment of the PSG and Fitbit time series recorded at night. Epoch-by-epoch analyses and Bland-Altman plots were used to assess sensitivity, specificity, accuracy, the Matthews correlation coefficient, bias, and limits of agreement.
Sleep onset and offset, total sleep time, and the durations of rapid eye movement (REM) sleep and non–rapid-eye movement sleep stages N1+N2 and N3 displayed unbiased estimates with nonnegligible limits of agreement. In contrast, the proprietary Fitbit algorithm overestimated REM sleep latency by 29.4 minutes and wakefulness after sleep onset (WASO) by 37.1 minutes. Epoch-by-epoch analyses indicated better specificity than sensitivity, with higher accuracies for WASO (0.82) and REM sleep (0.86) than those for N1+N2 (0.55) and N3 (0.78) sleep. Fitbit heart rate (HR) displayed a small underestimation of 0.9 beats per minute (bpm) and a limited capability to capture sudden HR changes because of the lower time resolution compared to that of PSG. The underestimation was smaller in N2, N3, and REM sleep (0.6-0.7 bpm) than in N1 sleep (1.2 bpm) and wakefulness (1.9 bpm), indicating a state-specific bias. Finally, Fitbit suggested a distribution of all sleep episode durations that was different from that derived from PSG and showed nonbiological discontinuities, indicating the potential limitations of the staging algorithm.
We conclude that, with careful data processing, the Fitbit Charge 2 can provide reasonably accurate mean values of sleep and HR estimates in shift workers under naturalistic conditions. Nevertheless, the generally wide limits of agreement hamper the precision of quantifying individual sleep episodes. The value of this consumer-grade multisensor wearable in terms of tackling clinical and research questions could be enhanced with open-source algorithms, raw data access, and the ability to blind participants to their own sleep data.
Highly sensitive and precise instruments are necessary for the accurate measurement of sleep in healthy and clinical populations. Polysomnography (PSG), the prevailing gold standard in clinical and research settings [
Currently, the only validated and United States Food and Drug Administration–approved alternative to PSG in ambulatory settings is actigraphy [
Recently, there has been greater acceptance, but also controversy, among the scientific community about using commercially available wearable devices such as fitness trackers in research [
A recent laboratory-based validation study suggested that the proprietary algorithm of Fitbit Charge 2 (Fitbit Inc) to estimate different sleep variables performed reasonably well [
Regarding HR, a study found a moderate underestimation of 5.9 beats per minute (bpm) with Fitbit Charge 2 compared with the electrocardiogram, whereas precision for individual measurements was poor as reflected by wide limits of agreement (LoA) [
Therefore, the findings of previous sleep and HR validation studies of Fitbit Charge 2 are rather inconsistent and warrant further research. It was previously concluded that apart from the sample population studied, inaccurate temporal synchronization between Fitbit wearables and PSG is an important challenge in some validation studies [
The participants of this study were recruited from July 2017 to November 2019 by various informational media, emails, and presentations at shift change as part of a larger study investigating sleep and resilience to psychological stress and trauma. They completed 1 month of monitoring of wrist-derived rest-activity behavior with a Fitbit Charge 2 that was worn continuously by all individuals on their nondominant wrist.
The Ethics Commission of the Canton of Zurich approved (2016-01357) all study protocols and experimental procedures, and written informed consent was obtained before participation. Participants invited to participate fulfilled all inclusion criteria: aged between 18 and 65 years, BMI ≤26 (or if exceeding a BMI of 26, which is typical of very athletic participants, an absence of sleep problems, such as sleep breathing disorders, was reported), current employment in 1 of 2 participating emergency rescue stations or a police station in the greater Zurich area of Switzerland, possession of a smartphone, and German language fluency. Exclusion criteria included the presence of a neurological disorder diagnosis or head injury with the potential to affect electroencephalogram variables, reported intake of >5 alcoholic beverages per week, or evidence of drug abuse in a urine drug screen (Drug Screen Multi 12-AE; Nal von Minden GmbH). All participants were shift workers, although specific shift schedules varied among individuals by occupation, such that emergency medical rescue workers and emergency doctors worked cycles of two 12-hour days followed by two 12-hour nights, terminating in 4 free days. Police officers worked four contiguous shifts with varying individual activities and bedrest times. Data on individual shifts were not collected or analyzed. Individuals received monetary compensation for participating in the study. Participants additionally received a report on their sleep based on their own Fitbit Charge 2 and PSG data. This report was explained to them by a study staff member.
Validated German translations of questionnaires administered at meetings at the start and upon completion of 1 month of monitoring were used to assess lifestyle and psychological and sleep variables. The Pittsburgh Sleep Quality Index (PSQI) [
A total of 62 individuals (43 emergency medical rescue workers, 16 police officers, and 3 emergency doctors), of whom 56% (35/62) were women, completed 2 nights of ambulatory PSG recordings in their homes. The PSG recordings were always made of nocturnal sleep following a day work shift and consisted of an adaptation night and then a baseline night the following evening. Individuals were free to determine their bedtime and sleep duration. The adaptation night served as a combined adaptation and screening night, whereas the baseline night provided the data analyzed in this report, with the exception of 8 individuals, whose data originated from the adaptation night because the PSG data of the baseline nights were of poor quality. The PSG data from one individual were excluded from the analyses because the data were of poor quality on both nights. Therefore, the total PSG sample consisted of 61 individuals. On 2 nights, the Fitbit Charge 2 data sets for 2 individuals were not obtained, reducing the sample to 59 individuals who had both PSG and Fitbit Charge 2 data for comparison. All PSG data were acquired using dedicated ambulatory polysomnographic amplifiers (SOMNOscreen Plus, SOMNOmedics GmbH). All electrodes and sensors for PSG recordings were applied by trained members of the research team. The overall PSG montage consisted of scalp electrode sites Fz, Cz, Pz, Oz, C3, C4, A1, and A2 applied according to the International 10-20 System [
The electrocardiogram trace in the PSG recordings was examined visually for one epoch at a time for all wake epochs before Son and all epochs of sleep and wake stages after Son (performed by an experienced individual). Artifacts and ectopic beats present in the electrocardiogram trace that had the potential to interfere with the quantification of interbeat intervals (IBIs), defined as the time interval between the normal R peaks of the QRS complex, were manually marked and removed before data processing and analysis.
All participants wore the Fitbit Charge 2 continuously during the PSG recorded nights. The device records wrist activity using accelerometry and pulses via photoplethysmography. It produces two types of sleep data depending on whether certain criteria are fulfilled during data collection. These criteria are sufficient battery charge, a sleep episode >3 hours in duration, and sufficient skin contact with the photoplethysmography sensor. If these criteria are not fulfilled, then
We manually omitted such bordering wake epochs and adjusted the Son, Soff, TST (ie, Soff – Son), and WASO values accordingly. Son, Soff, and REML are variables that are not provided directly by Fitbit; hence, we calculated them from the sleep staging information provided by Fitbit. All other variables were standard Fitbit variables. Adjustments only affected the Bland-Altman analyses. The results of the analyses without adjustment for the standard Fitbit variables can be found in Figure S1 and Table S1 in
All analyses and data processing steps were performed in the programming language
The internal clock times of the Fitbit and PSG systems were misaligned. This is a common problem in studies involving multiple measurement instruments, as they often do not share the same clock and thus require temporal alignment [
Bland-Altman plots were constructed with the
Epoch-by-epoch (EBE) analyses were performed through the following statistical measures:
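Written out under the standard confusion-matrix definitions (an assumption here, as is the inclusion of the two predictive values reported alongside the other measures), the measures for a given state take the following form, with TP, TN, FP, and FN denoting epoch counts:

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \\
\text{PPV} &= \frac{TP}{TP + FP} \\
\text{NPV} &= \frac{TN}{TN + FN}
\end{align}
```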
In these equations, TP represents true positives (the number of epochs that Fitbit assigned to a given stage and that PSG also scored as that stage), TN represents true negatives (the number of epochs that Fitbit did not assign to a given stage and that PSG also did not score as that stage), FP represents false positives (the number of epochs that Fitbit assigned to a given stage, whereas PSG did not score that stage), and FN represents false negatives (the number of epochs that PSG scored as a given stage, whereas Fitbit did not detect it). Sensitivity measures the proportion of epochs of a given PSG-derived sleep state that was correctly identified by Fitbit (eg, for REM sleep, it is the percentage of Fitbit
The demographics of the 59 individuals studied as well as their mean PSG- and Fitbit-derived sleep and HR measures are summarized in
Demographics of study sample (N=59).
Characteristic | Value
Female, n (%) | 33 (56) |
Police, n (%) | 15 (25) |
Age (years), mean (SD) | 33.5 (8.1) |
BMI, mean (SD) | 23.9 (2.9) |
PSQIa, mean (SD) | 5.8 (2.7) |
PCL-5b, mean (SD) | 6.2 (7.9) |
PSS-10c, mean (SD) | 12.2 (4.9) |
rMEQd, mean (SD) | 14.4 (3.5) |
aPSQI: Pittsburgh Sleep Quality Index.
bPCL-5: Posttraumatic Stress Disorder Checklist for Diagnostic and Statistical Manual of Mental Disorders Fifth Edition.
cPSS-10: Perceived Stress Scale 10.
drMEQ: Horne-Östberg Morningness-Eveningness Questionnaire-A Reduced Scale.
Sleep and heart rate variables (N=59).
Variable | Polysomnography, mean (SD) | Fitbit, mean (SD)
N1soa (clock time) | 23.4 (0.9) | 23.4 (2.4) |
TSTb (hours) | 8.0 (1.7) | 7.8 (2.6) |
REMdc (hours) | 1.7 (0.8) | 1.7 (0.7) |
lightdd (hours) | 4.2 (1.1) | 4.4 (1.3) |
deepde (hours) | 1.5 (0.6) | 1.3 (0.5) |
WASOf (hours) | 0.4 (0.5) | 1.0 (1.1) |
REMLg (minutes) | 76.3 (30.6) | 103.9 (59.7) |
REMh in the first cycle (%) | 11.6 (8.1) | 15 (8.7) |
HR10i REM (bpmj) | 60.9 (9.1) | 59.9 (8.2) |
HR10 N1k (bpm) | 61.8 (9.2) | 59.2 (7.5) |
HR10 N2l (bpm) | 56.6 (7.7) | 55.7 (7.0) |
HR10 N3m (bpm) | 58.8 (8.8) | 57.2 (7.2) |
HRvar10n REM (bpm) | 28.1 (90.8) | 6.4 (16.1) |
HRvar10 N1 (bpm) | 48.7 (110.1) | 6.8 (16.7) |
HRvar10 N2 (bpm) | 22.0 (76.7) | 4.7 (24.3) |
HRvar10 N3 (bpm) | 25.4 (111) | 2.9 (12.9) |
aN1so: sleep onset with non–rapid eye movement (NREM) sleep stage 1 criteria.
bTST: total sleep time.
cREMd: rapid eye movement sleep duration.
dlightd: light sleep duration (NREM sleep stage 1 + NREM sleep stage 2).
edeepd: deep sleep duration (NREM sleep stage 3).
fWASO: wakefulness after sleep onset.
gREML: rapid eye movement sleep latency.
hREM: rapid eye movement.
iHR10: 10%-trimmed heart rate average.
jbpm: beats per minute.
kN1: NREM sleep stage 1.
lN2: NREM sleep stage 2.
mN3: NREM sleep stage 3.
nHRvar10: 10%-trimmed heart rate variability.
Accurate temporal synchronization between the PSG system and the wearable Fitbit device often poses a methodological challenge in validation studies [
The consecutive study participant numbers (higher numbers indicate chronologically later entry into the study) from the entire study sample are shown on the x-axis; the data-driven timeshift between polysomnography and Fitbit is shown on the y-axis. There was a significant linear relationship between the identifier and the shift (
To align the time series, we computed the cross-correlation function for each participant and corrected the time shift by the emergent maximum. Our time alignment efforts produced good correspondence in our data between the two instruments, as evident in the simultaneous occurrences of HR bursts in the two time series (
Data on the validation night of the first participant in the study with identifying number 004 (left column) and the last participant in the study with number 104 (right column) are shown. Row A displays the cross-correlation function, which displays a large visible maximum at the orange vertical line representing the best alignment between the two devices (PSG and Fitbit). The dashed vertical reference line shows a lag of 0 minutes. Rows B-D share the same x-axis, which denotes hours after PSG-derived sleep onset with criteria. For each hour in the recording, a vertical dashed gray line was added. Row B shows the HR in bpm derived from PSG (red) and Fitbit (black) that were seen before any time alignment was applied, whereas row C presents the HR data after the data-driven shift from panel A was applied. The time-aligned time series visually shows good agreement after correcting for the time difference. Fitbit shows reduced variability in the signal but fairly good average correspondence. In panel D, the top row shows PSG-derived hypnograms for both participants, whereas in the bottom row, the Fitbit-derived hypnograms are displayed. All hypnograms have been time-corrected according to panel A. The overall sleep structure is captured reasonably well by Fitbit, but Fitbit detects more wake and REM episodes compared with PSG, and the distinction of light (N1+N2) and deep (N3) sleep often seems to be particularly challenging for Fitbit. bpm: beats per minute; HR: heart rate; PSG: polysomnography; REM: rapid eye movement; W: wake.
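The alignment idea can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `best_lag`, the use of per-minute HR series as input, the lag search window, and the toy data are all assumptions made for illustration. The sketch slides one series against the other and keeps the lag with the largest mean product of the mean-centered signals.

```python
from statistics import mean

def best_lag(reference, shifted, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means `shifted` runs late: shifted[i + lag] pairs with reference[i].
    """
    ref_mean, sh_mean = mean(reference), mean(shifted)
    ref = [x - ref_mean for x in reference]
    sh = [x - sh_mean for x in shifted]
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Indices i for which both ref[i] and sh[i + lag] exist
        start = max(0, -lag)
        stop = min(len(ref), len(sh) - lag)
        if stop - start < 2:
            continue
        # Mean product over the overlap, so short overlaps are not favored
        score = sum(ref[i] * sh[i + lag] for i in range(start, stop)) / (stop - start)
        if score > best_score:
            best, best_score = lag, score
    return best

# Toy data (hypothetical): a flat 60 bpm trace with three HR bursts
psg_hr = [60.0] * 40
for idx, value in ((5, 80.0), (17, 75.0), (30, 90.0)):
    psg_hr[idx] = value

# The second device records the same signal 3 samples late
fitbit_hr = [60.0] * 3 + psg_hr[:-3]

lag = best_lag(psg_hr, fitbit_hr, max_lag=10)
fitbit_aligned = fitbit_hr[lag:]  # valid for a nonnegative lag
```

In this toy case the detected lag is 3 samples, and dropping the first 3 samples of the delayed series makes the bursts coincide.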
The available data of all nights (n=59) were extracted and counted for the number of heart rate measures contained. A total of roughly 28,320 minutes (corresponding to 59 study participants who, on average, spent 8×60 minutes asleep) were expected. In fact, 28,601 individual minutes of data were recorded; this figure displays the distribution of all heart rate measures, yielding an average of 7.48 measures per minute. Count data for >12 measures per minute and <4 measures per minute are not displayed because their occurrences were so small that they are not visible on the plot.
Next, we compared the distribution of sleep stage durations between the Fitbit and PSG data (
The distribution of sleep stage durations for Fitbit (left panel) and PSG (right panel). Both were computed on the sample of the nights used for validation. Here, the plot has been cut off at 40 minutes for visual purposes; the tails continue to decrease as one would expect. The Fitbit sleep staging data types "classic" (red) and "stages" (blue) show large deviations compared with PSG sleep stages (black). Of note, deep and REM sleep show nonbiological discontinuity at around 4.5 minutes, and all Fitbit stages have larger tails. The stage "restless" has a peak at 11 minutes with unknown meaning. PSG: polysomnography. REM: rapid eye movement; WASO: wakefulness after sleep onset.
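A distribution of stage durations such as the one described in this caption can be obtained by collapsing runs of identical consecutive stage labels into episodes. The following Python sketch shows one way to do this. The function name, the 30-second epoch length, and the example hypnogram are assumptions for illustration and are not taken from the study.

```python
from itertools import groupby

def episode_durations(hypnogram, epoch_minutes=0.5):
    """Collapse consecutive identical stage labels into (stage, minutes) episodes."""
    return [
        (stage, sum(1 for _ in run) * epoch_minutes)
        for stage, run in groupby(hypnogram)
    ]

# Hypothetical hypnogram, one label per 30-second epoch
hypnogram = ["W", "N1", "N2", "N2", "N2", "N3", "N3", "N2",
             "REM", "REM", "REM", "REM", "W"]
episodes = episode_durations(hypnogram)

# Durations grouped by stage, ready for a histogram per stage
by_stage = {}
for stage, minutes in episodes:
    by_stage.setdefault(stage, []).append(minutes)
```

Applying the same function to both devices' stage sequences would give directly comparable duration distributions, provided both are expressed in the same epoch length.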
We split our validation into two analyses, one with the PSG-determined first occurrence of N1 sleep as the criterion for Son (N1 Son [N1on]) and the other with the first occurrence of N2 sleep as the criterion for Son (N2 Son [N2on]). This was done because it is unknown how Fitbit estimates Son. In
Bland-Altman plots for various sleep variables are shown with sleep onset defined as the first occurrence of N1. The dashed lines denote lower limits of agreement, bias, and upper limits of agreement. The dotted lines are the respective 95% CI of limits of agreement. On the top and right of each panel, the marginal densities are plotted. The x-axis displays the PSG variables, and the y-axis denotes the differences between the two devices (PSG-Fitbit). N1-derived sleep onset is unbiased. Sleep offset, total sleep time, light sleep or N1+N2 sleep duration, deep sleep or N3 sleep duration, and REMd do not have significant bias. WASO and REML display a significant deviation of the difference between the devices from 0. deepd: deep sleep duration; lightd: light sleep duration; PSG: polysomnography; REMd: rapid eye movement sleep duration; REML: rapid eye movement sleep latency; Soff: sleep offset; Son: sleep onset; TST: total sleep time; WASO: wake after sleep onset.
Bland-Altman statisticsa.
Variable | Bias (PSGb-Fitbit) | Lower LoAc | Upper LoA | P value
Sond (minutes) | –1.6 | –68.8 | 65.6 | .73
Soffe (minutes) | –5.6 | –189.3 | 178.2 | .66
TSTf (minutes) | –4.0 | –204.3 | 196.3 | .77
REMdg (minutes) | –2.7 | –87.8 | 82.4 | .67
lightdh (minutes) | –10.4 | –136.8 | 116.0 | .27
deepdi (minutes) | 11.2 | –72.9 | 95.2 | .08
WASOj (minutes) | –37.1 | –188.1 | 113.8 | .001
REMLk (minutes) | –29.4 | –145.4 | 86.6 | .001
HR10l (bpmm) | | | |
Overall | 0.9 | –6.9 | 8.6 | <.001
WASO | 1.9 | –5.4 | 9.2 | .03
N1n | 1.2 | –8.9 | 11.3 | .14
N2o | 0.6 | –4.7 | 6.0 | .001
N3p | 0.6 | –6.4 | 7.6 | .008
REMq | 0.7 | –4.7 | 6.0 | <.001
aStatistics accompanying the Bland-Altman plots (
bPSG: polysomnography.
cLoA: limit of agreement.
dSon: sleep onset.
eSoff: sleep offset.
fTST: total sleep time.
gREMd: REM sleep duration.
hlightd: light sleep duration.
ideepd: deep sleep duration.
jWASO: wakefulness after sleep onset.
kREML: REM sleep latency.
lHR10: 10%-trimmed heart rate average.
mbpm: beats per minute.
nN1: NREM stage 1 sleep.
oN2: NREM stage 2 sleep.
pN3: NREM stage 3 sleep.
qREM: rapid eye movement.
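For variables with one value per participant, the bias and LoA of a Bland-Altman analysis can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the conventional definition of the LoA as bias ± 1.96 SD of the paired differences, it does not include the mixed model used for the repeated HR measures, and the numbers are made up.

```python
from statistics import mean, stdev

def bland_altman(psg, fitbit, z=1.96):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Differences are taken as PSG minus Fitbit, so a negative bias means
    that Fitbit gives larger values than PSG.
    """
    diffs = [p - f for p, f in zip(psg, fitbit)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical total sleep times in minutes for 4 participants
psg_tst = [480, 450, 510, 420]
fitbit_tst = [470, 465, 500, 440]
bias, lower, upper = bland_altman(psg_tst, fitbit_tst)
```

Under this definition the bias always lies midway between the two LoA, which offers a quick consistency check on reported values. For Son, for example, (–68.8 + 65.6)/2 = –1.6.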
The Bland-Altman plots for the HR variables are shown in
When analyzing overall HR variance, Fitbit strongly underestimated HRvar10 with a bias of 20.3 bpm (
Bland-Altman plots for heart rate–derived variables. The dashed lines denote lower limits of agreement, bias, and upper limits of agreement for a mixed model dealing with the repeated measures. On the top and right of each panel are the marginal densities. The x-axis displays the means of both devices (ie, [polysomnography + Fitbit]/2), and the y-axis denotes the differences between the two devices (polysomnography-Fitbit). Overall average 10%-trimmed heart rate and 10%-trimmed heart rate variance values are calculated for 1-minute intervals between 30 minutes before sleep onset with N1 criteria and 30 minutes after sleep offset. All other variables are calculated between sleep onset and sleep offset, only extracting the designated variable, in 1-minute intervals. HR10: 10%-trimmed heart rate average; HRvar10: 10%-trimmed heart rate variance average; REM: rapid eye movement; WASO: wake after sleep onset.
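A 10%-trimmed average for a 1-minute interval can be sketched in Python as shown below. The sketch assumes that 10% of the sorted samples are dropped from each tail, with the number dropped rounded down; whether the study trimmed each tail or 10% in total is not stated here, so this is an assumption, as are the function name and the example values.

```python
import math

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping floor(n * proportion) samples from each tail."""
    ordered = sorted(values)
    k = math.floor(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical HR samples (bpm) within one minute, including one artifact
minute_hr = [58, 59, 60, 60, 61, 61, 62, 62, 63, 120]
hr10 = trimmed_mean(minute_hr)           # 61.0, artifact and minimum dropped
plain = sum(minute_hr) / len(minute_hr)  # 66.6, pulled up by the artifact
```

Under this assumption, an interval with fewer than 10 samples is not trimmed at all because floor(n × 0.1) is 0.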
The EBE analysis results are displayed in
Epoch-by-epoch analysisa.
State | Sensitivity | Specificity | Accuracy | MCCb | PPVc | NPVd |
WASOe | 0.428 | 0.898 | 0.824 | 0.329 | 0.438 | 0.894 |
Light sleep | 0.534 | 0.574 | 0.553 | 0.108 | 0.592 | 0.516 |
Deep sleep | 0.279 | 0.920 | 0.776 | 0.250 | 0.501 | 0.815 |
REMf sleep | 0.548 | 0.889 | 0.861 | 0.339 | 0.306 | 0.956 |
REM sleep <120 minutes | 0.432 | 0.963 | 0.934 | 0.383 | 0.403 | 0.967 |
REM sleep >120 minutes | 0.570 | 0.864 | 0.837 | 0.329 | 0.296 | 0.953 |
aEpoch-by-epoch comparison of Fitbit and polysomnography stages. The following stages were analyzed: wakefulness after sleep onset, light sleep (non–rapid eye movement [NREM] stage 1 [N1] sleep+NREM stage 2 sleep), deep sleep (NREM stage 3 sleep), and REM sleep. REM sleep was divided into analyses with REM sleep episodes occurring during the first 120 minutes after sleep onset with N1 sleep criteria (N1 sleep onset) and REM sleep episodes occurring later than 120 minutes after N1 sleep onset. Various performance measures were used, including sensitivity, specificity, accuracy, the Matthews correlation coefficient, the positive predictive value, and the negative predictive value. More information on these measures can be found in the
bMCC: Matthews correlation coefficient.
cPPV: positive predictive value.
dNPV: negative predictive value.
eWASO: wakefulness after sleep onset.
fREM: rapid eye movement.
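The one-vs-rest counting behind such a table can be sketched in Python as follows. This is an illustration under the standard confusion-matrix definitions, not the authors' code. The function name, stage labels, and example sequences are hypothetical, and the sketch assumes that both devices' sequences are already time-aligned and that the stage occurs in both.

```python
import math

def ebe_measures(psg, fitbit, stage):
    """One-vs-rest agreement measures for `stage`, with PSG as the reference."""
    pairs = list(zip(psg, fitbit))
    tp = sum(p == stage and f == stage for p, f in pairs)
    tn = sum(p != stage and f != stage for p, f in pairs)
    fp = sum(p != stage and f == stage for p, f in pairs)
    fn = sum(p == stage and f != stage for p, f in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The ratios below divide by zero if a count group is empty,
    # for example if the stage never occurs in one of the sequences
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical aligned epochs from both devices
psg_stages = ["W", "light", "light", "deep", "deep", "REM", "REM", "light"]
fitbit_stages = ["W", "light", "deep", "deep", "light", "REM", "light", "light"]
rem = ebe_measures(psg_stages, fitbit_stages, "REM")
```

In this toy example, Fitbit finds 1 of the 2 PSG REM epochs and never labels a non-REM epoch as REM, which gives a sensitivity of 0.5 and a specificity of 1.0. This mirrors the pattern of specificity exceeding sensitivity when a stage occupies a small share of the night.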
We evaluated the performance of the multisensor wearable Fitbit Charge 2 against PSG for sleep macrostructure and HR in a sample of first responder shift workers under naturalistic conditions. We observed that Son, Soff, TST, REMd, N1+N2 sleep duration, and N3 sleep duration showed unbiased estimates but nonnegligible LoA. Fitbit overestimated REML by 29.4 minutes, possibly because the proprietary algorithm failed to detect very short first REM sleep episodes. This hypothesis is supported by the right shift in the maximum duration of stages and larger tails (
One of our most striking and novel findings is that the distribution of all sleep episode durations differs between the Fitbit Charge 2 and PSG. Fitbit’s sleep staging algorithm probably treats
The Son measures from Fitbit were unbiased with respect to the N1on criterion, whereas there was a larger but nonsignificant underestimation for N2on. Thus, it is likely that Fitbit’s definition of Son time roughly corresponds to PSG-derived N1on. The Son criterion should be reported in future validation studies because whichever criterion is selected (eg, N1on, N2on, or alternatively any stage of sleep) affects many sleep variables, such as TST, REML, and WASO, whose operational definition and calculation depend on it. This may be one of the reasons for the discrepancies reported in the validation literature. A peculiarity of the staging information provided by Fitbit is that the first stage after the Son time and the last stage before the Soff time are sometimes staged as
Overall, EBE analyses revealed better specificity than sensitivity for all sleep states. This might have been expected. For example, there are far fewer
The information Fitbit provides on the sleep sensitivity setting, with options
Regarding the HR data, Fitbit slightly underestimated overall HR10 by 0.9 bpm with a limited capability to capture sudden HR changes. This underestimation was smaller in N2, N3, and REM sleep stages (0.6, 0.6, and 0.7 bpm, respectively) compared with N1 sleep and wake (1.2 and 1.9 bpm), thus indicating a sleep stage–specific bias. The bias was low and probably not biologically relevant. The low
Fitbit HR variance was reduced owing to the inaccessibility of raw data and showed higher LoA than the LoA for HR. The differences between the assessments are not surprising, as Fitbit only provided 7.4 measurements per minute on average (
The missing information regarding an objective marker of
In a study conducted at home in a relatively large sample validating Fitbit Charge 2 against PSG, compared with most previous validation studies (n=15 [
Supplemental figures and tables.
bpm: beats per minute
deepd: deep sleep duration
EBE: epoch-by-epoch
HR: heart rate
HR10: 10%-trimmed heart rate average
HRvar10: 10%-trimmed heart rate variance average
IBI: interbeat interval
lightd: light sleep duration
LoA: limits of agreement
MCC: Matthews correlation coefficient
N1on: N1 sleep onset
N2on: N2 sleep onset
NPV: negative predictive value
PPV: positive predictive value
PSG: polysomnography
PSQI: Pittsburgh Sleep Quality Index
PSS-10: Perceived Stress Scale 10
60% quantile of the absolute value of the second derivative
REM: rapid eye movement
REMd: rapid eye movement sleep duration
REML: rapid eye movement sleep latency
rMEQ: Horne-Östberg Morningness-Eveningness Questionnaire-A Reduced Scale
Soff: sleep offset
Son: sleep onset
TST: total sleep time
WASO: wakefulness after sleep onset
The authors thank all participants and staff members of Schutz & Rettung Zürich, Rettungsdienst Winterthur and Stadtpolizei Winterthur. The authors also thank Maria Dimitriu, Giulia Haller, Zilla Huber, Zora Kaiser, Nora Krucker, Josefine Meier, Gioia Peterhans, Daniel Prossnitz, Sinja Rosemann-Niedrig, Nora Werner, Rafael Wespi, Laura van Bommel, and Chantal Wey for assisting with data collection and data processing. This study was supported by the Clinical Research Priority Program Sleep and Health of the University of Zurich, the University of Zurich, and the Swiss National Science Foundation (grant 320030_163439 to HPL).
HPL, BK, YA, and IC conceived the study. YA and IC contributed to implementation and data acquisition. BS, HPL, IC, PA, and WK contributed to data acquisition, analysis, and interpretation. BS, IC, and HPL wrote the manuscript. All authors contributed to manuscript revisions. All authors approved the submitted version.
None declared.