Framework for the Design Engineering and Clinical Implementation and Evaluation of mHealth Apps for Sleep Disturbance: Systematic Review

Background: Mobile health (mHealth) apps offer a scalable option for treating sleep disturbances at a population level. However, there is a lack of clarity about the development and evaluation of evidence-based mHealth apps.
Objective: The aim of this systematic review was to provide evidence for the design engineering and clinical implementation and evaluation of mHealth apps for sleep disturbance.
Methods: A systematic search of studies published from the inception of databases through February 2020 was conducted using 5 databases (MEDLINE, Embase, Cochrane Library, PsycINFO, and CINAHL).
Results: A total of 6015 papers were identified using the search strategy. After screening, 15 papers were identified that examined the design engineering and clinical implementation and evaluation of 8 different mHealth apps for sleep disturbance. Most of these apps delivered cognitive behavioral therapy for insomnia (CBT-I, n=4) or modified CBT-I (n=2). Half of the apps (n=4) identified adopting user-centered design or multidisciplinary


Introduction
Sleep disturbance is extremely prevalent and affects 33%-45% of adults [1]. Insomnia is the most common sleep disorder, defined as a chronic and persistent difficulty falling asleep, maintaining sleep, or waking up too early [2]. When left untreated, insomnia significantly increases the risk of adverse health outcomes, including mental health disorders [3,4], cardiovascular disease [5], hypertension [6], and diabetes [7]. As insomnia poses serious risks to mental and physical health, exploring the efficacy of treatment is imperative. Cognitive behavioral therapy for insomnia (CBT-I) is an effective gold standard treatment that has consistently shown moderate to large treatment effects [8][9][10]. A limitation to the widespread use of CBT-I has been a shortage of adequately trained practitioners to treat the high volume of patients with insomnia.
One potential avenue for addressing these challenges is the use of digital therapy. The widespread adoption of mobile phones and apps can change the delivery of health care. This emerging field, known as mobile health (mHealth), refers to the provision of health care services and practice delivered using mobile technology. mHealth apps provide unique benefits in delivering health information and interventions, given the ubiquity, convenience, and affordability of mobile phones. Over 325,000 mHealth apps are available on app stores, and this number continues to grow rapidly [11]. Research supports the utility of mHealth apps for a range of health issues, including depression, anxiety, schizophrenia, cardiac disease, physical activity, and diabetes [12][13][14][15][16]. The potential of mHealth is particularly significant for sleep disturbances as it presents a promising method for addressing this public health burden.
Health outcomes depend not only on the intervention delivered but also on the design engineering process employed. Design engineering combines design thinking, participatory design practices, software engineering methods, and software quality assurance methods. Without analyzing this process, it is not possible to make inferences about the reasons for the failure of an app (eg, low engagement, lack of clinical efficacy). This process also involves incorporating clinically relevant content that provides therapy either as an adjunct to traditional medical and psychological practice or as a standalone intervention.
Clinical implementation requires consideration of end-user privacy and security, a major concern in mHealth. Recent work has demonstrated that mHealth apps routinely share and commercialize end-user data with third parties with very little transparency [17]. In addition, there are several important regulatory considerations required to protect the public and ensure that mHealth apps meet the minimum requirements of quality. Regulatory mechanisms for mHealth include HIPAA (Health Insurance Portability and Accountability Act), a US federal law mandating privacy and security standards, and oversight by the FDA (Food and Drug Administration), which evaluates the safety and marketing claims of mHealth apps. Without such regulatory oversight, health care-related data can be misused [18].
Despite the potential and ubiquity of mHealth apps, most apps lack evidence for their clinical efficacy among end users [15]. Compounding this problem is the lack of a framework to inform and standardize the process and reporting of the design, development, and evaluation of mHealth apps [19]. This may lead to clinical inefficacy, lack of medical condition-specific content, poor patient engagement, or even harmful apps [20,21].
Although various research groups have established frameworks for apps for posttraumatic stress disorder (PTSD) [22], bipolar disorder [23], and hypertension [24], there may be differences in the processes used according to the health condition, and there are no existing frameworks for sleep apps. This systematic review aims to assess the extent and nature of the peer-reviewed evidence and proposes a high-level framework for the design engineering and clinical implementation and evaluation of mHealth apps for sleep disturbance.

Search Strategy
This study uses PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) guidelines [25]. A systematic literature search was conducted using the electronic databases MEDLINE, Embase, Cochrane Library, PsycINFO, and CINAHL for relevant papers published from the inception of the databases through February 2020. Keywords related to sleep and mHealth were searched. To increase the coverage of relevant databases, we conducted a manual search of the Journal of Medical Internet Research and the Journal of Internet Interventions as well as the reference list of retrieved publications. See Multimedia Appendix 1 for an example of the search strategy.

Inclusion Criteria
The inclusion criteria consisted of peer-reviewed publications that (1) focused on mHealth apps that aimed to measure, track, or improve sleep; (2) described the design engineering, clinical implementation, or clinical evaluation of the apps; (3) focused on an app solely targeting sleep; and (4) targeted adults aged between 18 years and 60 years. The included studies were published in English.

Exclusion Criteria
Studies were excluded if they (1) focused on a sleep disorder other than insomnia; (2) focused on an app that provided multimodal interventions targeting health aspects other than sleep; (3) described internet, telephone, or text messaging interventions; or (4) were review papers. We aimed to exclude review papers at the early stages of screening; however, given the difficulty of discerning these papers, most were excluded at the full-text screening stage.

Screening
After duplicates were removed, a single author (MA) screened all the titles to identify potentially relevant studies. The abstracts and full texts of potentially relevant studies were then reviewed independently by 2 authors (MA and ES). Discrepancies were discussed, and consensus was reached with the senior author (NG).

Data Extraction and Coding
A data extraction template was constructed by one of the authors (MA) to summarize the following characteristics of the studies: first author name, publication date, country, objective, study design, and sample. Full papers were then imported into NVivo (QSR International Pty Ltd, version 11.4.3, 2017) for detailed data extraction and thematic coding. A coding framework was developed by the authors for each of the following categories: design engineering and clinical implementation and evaluation. Relevant segments of text were coded and extracted into a Microsoft Excel spreadsheet. Risk of bias was not assessed, as this was deemed irrelevant to the aims of the paper.
Data were extracted and coded into the following subgroups:
• Design engineering: This stage considers the therapeutic approach, design method, and features and functionalities of the apps. Although we appreciate the design engineering process as an ongoing and iterative process, for the purpose of this paper, we propose that this stage covers the point up until the prototype app is implemented among end users.
• Clinical implementation: This stage refers to the testing of the app among end users and includes an assessment of app maturity, implementation metrics (acceptability, engagement, usability, and adherence), and privacy (ie, data acquisition, use or disclosures of identifiable health data) and regulatory (eg, HIPAA) requirements. To categorize the implementation metrics used by the papers and uniformly compare evidence in the included studies, we developed definitions based on previous literature and adjusted for use with sleep apps [26][27][28][29]. Acceptability refers to how intended recipients react to the intervention, for example, interest, user satisfaction, and perceived appropriateness. Engagement refers to the usage metrics of the app. Usability refers to whether the app can be used as intended. Adherence refers to the degree to which the user follows the program and the prescribed recommendations of the therapy.
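The four definitions above lend themselves to a simple coding scheme. The following is a minimal sketch of how coded implementation metrics could be tallied per app; the app names and the coded sets are hypothetical placeholders and are not data from the included studies, and the helper name `apps_with_more_than` is introduced here only for illustration.

```python
from enum import Enum


class Metric(Enum):
    """The four implementation metrics defined in this review."""
    ACCEPTABILITY = "acceptability"  # how intended recipients react to the intervention
    ENGAGEMENT = "engagement"        # usage metrics of the app
    USABILITY = "usability"          # whether the app can be used as intended
    ADHERENCE = "adherence"          # degree to which the user follows the program


# Hypothetical coding results: app name -> set of metrics reported in its papers.
coded = {
    "App A": {Metric.ACCEPTABILITY, Metric.ENGAGEMENT, Metric.USABILITY},
    "App B": {Metric.ENGAGEMENT},
    "App C": {Metric.ACCEPTABILITY, Metric.ADHERENCE},
}


def apps_with_more_than(coded_metrics, threshold):
    """Return the sorted names of apps reporting more than `threshold` metrics."""
    return sorted(
        app for app, metrics in coded_metrics.items() if len(metrics) > threshold
    )


print(apps_with_more_than(coded, 2))
```

Using a set per app means that a metric reported in several papers about the same app is counted only once.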

Search Results
The search strategy identified 6015 papers (Figure 1). Following the removal of duplicates (n=1503), 4512 papers were screened by title. Of those, 4246 were excluded, leaving 266 potentially relevant papers. Following a duplicate independent abstract review, 166 papers were excluded. The full text of 100 papers was obtained and independently reviewed by 2 authors (MA and ES). Following this, 82 papers were excluded, primarily for being review papers (n=24) or abstracts only (n=34), leaving 18 full-text papers. A further 3 papers were excluded because they reported evidence solely for concurrent validity (ie, comparison of app sleep tracking against objective measurements of sleep) and provided no further information about design engineering and clinical implementation and evaluation, leaving 15 papers for inclusion in this systematic review.
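The screening counts reported above can be checked with simple arithmetic. The sketch below uses only the numbers stated in this section; the stage labels and the function name `prisma_flow` are illustrative and do not come from the PRISMA guidelines.

```python
# Counts taken from the Search Results paragraph; the stage labels are illustrative.
identified = 6015
exclusions = [
    ("duplicates removed", 1503),
    ("excluded at title screening", 4246),
    ("excluded at abstract review", 166),
    ("excluded at full-text review", 82),
    ("excluded as concurrent validity only", 3),
]


def prisma_flow(start, steps):
    """Return (label, papers remaining) after each exclusion step."""
    remaining = start
    flow = []
    for label, n_excluded in steps:
        remaining -= n_excluded
        flow.append((label, remaining))
    return flow


flow = prisma_flow(identified, exclusions)
for label, remaining in flow:
    print(f"{label}: {remaining} remaining")
```

The running totals reproduce the figures in the text: 4512 after duplicate removal, 266 after title screening, 100 after abstract review, 18 after full-text review, and 15 included papers.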

Therapeutic Approach
A total of 4 apps were identified to deliver a CBT-I intervention, 1 app used behavioral therapy for insomnia (BTi), 1 app delivered sleep restriction therapy (SRT), 1 app was a social alarm clock, and 1 app was a wallpaper display to promote healthy sleep behaviors. The average length of intervention (ie, use of app in the paper) was approximately 5 weeks (mean 4.5, SD 1.5; median 4.5; range 3-6.5 weeks).

Implementation Metrics
All 8 apps included in this review had at least one paper assessing an implementation metric. A total of 10 out of the 15 papers reported on one or more of 4 implementation metrics: acceptability, usability, adherence, and engagement (Table 4).
Table 4. Attributes related to implementation in included papers (N=15).
[Table 4 body not fully recovered. Surviving fragments: the value "(27)" and the row label "Usability"; one row listing the measures average number of days and time spent on homework [33], patient adherence form (completed by clinicians) [33], number of relaxation exercises performed [44], deviation between real and agreed-upon time in bed [44], number of coaching sessions completed [39], and qualitative interview [40], for the apps CBT-I Coach, ShutEye, Sleep Bunny, and Sleepcare.]
Acceptability among end users and engagement were the most frequently assessed implementation metrics. Most papers measured acceptability through an interview or survey with end users and engagement using app-measured or self-reported app usage metrics. Usability was measured by less than half of the papers, with most using the System Usability Scale. Most apps had evidence available for 1 to 2 implementation metrics, and only 3 out of 8 apps (CBT-I Coach, ShutEye, and SleepFix) had papers reporting more than 2 implementation metrics.

App Maturity
The maturity levels of the apps ranged on a continuum of 3 levels: prototype, matured, and released [48]. A prototype refers to a minimally viable product of the app with functionality that users can test. A matured version refers to an app that has undergone user testing and has been redesigned. A released version refers to an app that is available for download. Most apps were prototypes (n=5), and 1 app was a prototype-to-mature version (Table 5). Only 1 app was matured to release (CBT-I Coach). Furthermore, 1 study had no information regarding the maturity level of the app described. Only CBT-I Coach was available from the Google Play or Apple App Store at the time of writing the primary paper. Only 3 apps considered the privacy of data collected [35,40,45]. In total, 3 out of the 4 apps that enabled wearable synchronization reported data on the use of the wearable as an adjunct to the app [31,41,43]. No studies considered the regulatory requirements of the apps.

Clinical Evaluation
In total, 6 of the 8 apps had a quantitative evaluation of treatment outcomes. Of the 11 studies that evaluated clinical effectiveness, the most frequently used sleep outcomes were self-reported sleep questionnaires, followed by app sleep diary measures and actigraphy (objective). In total, 8 of the 11 papers used the Insomnia Severity Index, the most widely used insomnia treatment outcome questionnaire, to evaluate insomnia symptom severity [49] (Table 6). Only 2 papers focused solely on evaluating the effectiveness of the app [44,47].
Table 6. Clinical outcome measures used in included papers.
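For readers unfamiliar with the Insomnia Severity Index, the following is a minimal sketch of how a total score is commonly computed and interpreted. It assumes the standard 7-item version, with each item rated 0 to 4 and the commonly cited severity bands; these details come from the published instrument rather than from the papers in this review, and the function names are illustrative.

```python
def isi_total(item_scores):
    """Sum the 7 item ratings (each 0-4) into a total score from 0 to 28."""
    if len(item_scores) != 7:
        raise ValueError("The ISI has exactly 7 items")
    if any(not 0 <= score <= 4 for score in item_scores):
        raise ValueError("Each ISI item is rated from 0 to 4")
    return sum(item_scores)


def isi_category(total):
    """Map a total score to the commonly cited severity band."""
    if not 0 <= total <= 28:
        raise ValueError("ISI total must be between 0 and 28")
    if total <= 7:
        return "no clinically significant insomnia"
    if total <= 14:
        return "subthreshold insomnia"
    if total <= 21:
        return "clinical insomnia (moderate severity)"
    return "clinical insomnia (severe)"


# Hypothetical responses from a single participant.
responses = [2, 3, 1, 2, 3, 2, 2]
total = isi_total(responses)
print(total, isi_category(total))
```

A trial would typically compare totals before and after the intervention; the cutoff used to define treatment response varies between studies and is not specified here.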

Principal Findings
This study provides the first comprehensive review of published studies and a framework for the design engineering and clinical implementation and evaluation of mobile apps for sleep disturbance. Despite the availability of over 500 [50] sleep mobile apps in commercial app stores (Apple App and Google Play Store), our review identified 15 papers assessing the design, implementation, and evaluation of 8 apps, only one of which was available to download on commercial app stores [32,33,40,45-47]. This means that less than 1% of all commercially available sleep apps have any published data on these aspects. Of the 15 papers, implementation metrics were reported in 10 papers and treatment outcomes were evaluated in 11 papers. Despite the potential of mHealth, there has been a small number of studies, a lack of standardization in design engineering approaches and clinical implementation assessments, and few comprehensive clinical evaluations.

Design Engineering
For the increased utilization and adoption of mHealth apps, these technologies must be designed for the people who will use them, both end users and multidisciplinary stakeholders [51][52][53][54]. Our review indicates that although some apps utilized best practice design approaches, for example, user-centered and multidisciplinary methods, approximately half of the apps did not report their design approach. We have previously shown how people with insomnia have unique user needs and preferences for sleep mobile apps that can drive engagement [42]; however, only 2 of the 8 apps reported any end-user involvement. Multidisciplinary teams are particularly crucial in a domain such as sleep, where various stakeholders (eg, clinical psychologists, sleep clinicians, psychiatrists) tend to be involved in patient care; however, only 2 apps reported this approach. Although these best practice design approaches may be an unspoken rule in app development, transparent reporting of these approaches is important so that clinicians can recommend apps that are designed to sustain engagement.

Clinical Implementation
There were few comprehensive evaluations of implementation, with only 3 out of 8 apps reporting more than 2 implementation metrics. Poor implementation can limit adoption or engagement, particularly in an uncontrolled and real-world setting, which in turn limits effectiveness. Our results suggest that although most apps had papers reporting some implementation outcomes, very few conducted a comprehensive exploration. For instance, although there is support for Sleepcare's efficacy, there was no substantial assessment of its implementation. The National Health Service in the United Kingdom has highlighted the importance of implementation, acknowledging that its previous failure in deploying digital health technology was attributable to rushed and inadequate implementation [55]. Moreover, reporting of implementation processes may enable replication and reduce the gap between research and practice [56].
There was great heterogeneity among implementation studies in the conceptualization of implementation metrics. Given the inconsistency in terminology, we developed our own definitions to facilitate cross-study comparisons. This language incongruence highlights the need for a taxonomy of implementation metrics to clearly delineate key variables [57,58]. Similarly, there was a lack of standardized implementation measures, with methods ranging from qualitative approaches to nonstandardized quantitative surveys. This further blurs the boundaries between constructs. When standardized measures were used, they were not specifically designed for mobile apps or sleep disturbances. For instance, the System Usability Scale was one of the most frequently used implementation scales in the identified studies [59]. Although it is a well-researched measure with good psychometric properties [59], it was not designed for mobile apps or sleep interventions. A taxonomy of implementation outcomes with standardized tools for sleep mobile apps can advance the measurement and understanding of implementation processes.
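To illustrate what a standardized usability measure involves, the following is a minimal sketch of the conventional System Usability Scale scoring rule. It assumes the standard 10-item questionnaire with responses from 1 to 5, where odd-numbered items are positively worded and even-numbered items are negatively worded; the function name `sus_score` is illustrative and is not taken from any of the included papers.

```python
def sus_score(responses):
    """Convert 10 item responses (1-5, in questionnaire order) to a 0-100 score."""
    if len(responses) != 10:
        raise ValueError("The SUS has exactly 10 items")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("Each SUS response must be between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items: higher agreement is better
        else:
            total += 5 - response   # even items: lower agreement is better
    return total * 2.5              # rescale the 0-40 sum to 0-100


# Hypothetical responses from a single participant.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))
```

Because the items are generic statements about a "system", the resulting score says nothing specific about mobile or sleep-related features, which is the limitation noted above.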
Despite the importance of privacy and regulatory oversight in digital technology, only 3 papers mentioned privacy, and no papers considered mHealth regulations. Mobile devices collect a large amount of behavioral and health care data, raising concerns regarding mHealth app quality and safety [60][61][62]. Most mHealth apps do not have a transparent privacy policy, leaving end users unaware of what data are collected, how data are transferred, where data are stored, and with whom the data are shared [63]. Given that data security is a key concern among health care providers when recommending mHealth apps [64], transparent privacy policies and further regulatory oversight from bodies such as the FDA can combat this issue.

Clinical Evaluation
Of the evaluation studies, we identified 2 RCTs, only one of which was adequately powered according to the primary paper [44]. Most of the apps (6/8, 75%) delivered full or modified CBT-I (SRT or BTi), a well-established intervention where an evidence base already exists for face-to-face and internet-enabled delivery. CBT-I has been recommended as a first-line therapy for insomnia, given its substantial clinical base [8][9][10], and has shown comparable efficacy when delivered via the internet [65]. Although a previous systematic review demonstrated support for the efficacy of mobile phone interventions for sleep disorders and sleep quality, this review only included studies of 4 mobile apps [66]. In this review, only 2 of these 4 apps were included. One was excluded as it was a multimodal intervention and the other was targeted toward older adults. There is an evident need for methodologically robust and adequately powered studies assessing the effectiveness of mHealth apps for sleep disturbance. Nevertheless, mobile-delivered CBT-I has potential, given the therapy's existing evidence base across various modalities.
Most apps described in this review were prototype versions, with only one app matured to release (CBT-I Coach). This is consistent with the greater number of identified studies focusing primarily on earlier stages of development, for example, design or implementation, with some preliminary evaluations of efficacy. Of the apps identified in this review, only 1 app (Sleepcare) has efficacy supported by an adequately powered RCT [44]. Despite this, there are no implementation studies for Sleepcare, and although acceptance is included as a measure in the paper, the results are not reported. Conversely, several papers described the design and implementation of CBT-I Coach, but no full-scale efficacy evaluation was identified. However, CBT-I Coach is the only app available in commercial app stores.
Although it might be expected that a more mature app would have more cumulative evidence for its design, implementation, and evaluation, this was not the case. This mismatch between the maturity of sleep apps and the levels of available evidence ultimately reflects the unregulated nature of mHealth app development and deployment. Exacerbating the problem, the commercial app marketplace allows developers and researchers to freely release apps into these stores, which serve as the main app repository for consumers. Evidently, there is a need for a standardized set of evidence-based criteria for researchers to meet before making apps commercially available.
Ultimately, the lack of standardization in the evidence and reporting for the design engineering and clinical implementation and evaluation of mHealth apps for sleep disturbance stresses the need for a comprehensive framework to guide researchers and app developers. A recent systematic review highlighted the wide heterogeneity among the different published criteria for the assessment of mHealth apps, with 38 main classes of criteria [19]. Guidelines specific to the development and assessment of apps for sleep disturbance are particularly scarce. Establishing an extensive and standardized framework for mobile apps for sleep disturbance may lead to improved existing tools and the development of successful, high-quality, and effective apps.

mHealth App Framework for Sleep Disturbance
As a first step, we developed a high-level framework based on the findings of this study to guide the design engineering and clinical implementation and evaluation of apps for sleep disturbance. The findings are summarized in Figure 2. Although several frameworks exist for conditions such as PTSD [22], bipolar disorder [23], and hypertension [24], there are no frameworks for the development of sleep apps. Each of these frameworks similarly addresses design engineering, clinical implementation, and evaluation. For instance, the framework for bipolar disorder also notes the importance of designing with end users and multidisciplinary teams, addressing security and regulations with standards consistent with HIPAA, and evaluation with end users [23]. These frameworks were developed through a combination of lessons learned from firsthand app development, best practice principles, and theory-based design models. The framework in this study triangulates the findings from this review of digital sleep interventions with these previous frameworks, augmented by our firsthand experience with the development of an app for insomnia [43,67]. Future work may consider adapting theory-based design models for sleep disturbance and integrating them into this framework.
Several reviews of commercial sleep apps have demonstrated a lack of validated sleep measurement algorithms [68], evidence-based principles for insomnia management [69], and behavior change constructs [70], as well as overall low quality of functionality and content based on established app assessment criteria [71,72]. Evidently, commercial development of apps has severely outpaced academic research, putting their trustworthiness in question [73]. Our systematic search identified 13 clinical trial registrations, of which 6 were for mobile apps not included in our systematic review as there were no available publications. Although this is partly attributable to the infancy of the mHealth field, timely and increased efforts are still needed to progress mobile sleep apps to clinical evaluation. Collaboration between academia and industry may offer an opportunity to develop scientifically rigorous solutions while keeping pace with the rapidly evolving app market.
This review has several limitations. First, we included English language publications only, which may introduce bias, particularly given that these papers tended to originate from high-income countries such as the United States and Australia. Second, given that data extraction was based on the included studies only and that the mobile apps were not downloaded by the authors, some information such as app features and design approaches was not always clear or available. Third, given that our study focused on apps for sleep disturbance and did not include mHealth apps with multimodal interventions that include sleep, the inferences from this study may not extend to all sleep apps.

Conclusions
This is the first review to evaluate the design engineering and clinical implementation and evaluation of apps designed for sleep disturbance. It was found that despite a plethora of sleep apps available, there is limited research and a lack of standardization in the evidence base for the design, implementation, and evaluation of apps for sleep disturbance. Few apps had evidence for the use of best practice design approaches. Implementation assessments lacked standardization and consistency in implementation metrics used, and very few comprehensive efficacy evaluations were identified.
For the future development of engaging and evidence-based apps for sleep disturbance, we have developed a framework to guide the development and deployment process. The framework aims to address the need for (1) increased application and reporting of best practice design approaches, for example, user-centered design and multidisciplinary teams; (2) comprehensive implementation assessments involving multiple metrics, tools validated for sleep, and privacy and regulatory considerations; and (3) rigorous evaluations of clinical efficacy. Collaboration between academia and industry may facilitate the development of evidence-based apps in the fast-paced mHealth technology environment.

Conflicts of Interest
MA, CG, RC, RG, and NG are named on 2 provisional patents for the SleepFix app.