Introduction

Ageing populations, the rising prevalence of treatment-intensive and costly non-communicable diseases and increasing shortages of specialized personnel pose serious threats to the financial sustainability of healthcare systems1. Without timely transformation of healthcare systems, rising socioeconomic and geographical inequalities in disease burden and unmet patient needs may be further exacerbated by inequalities in access to adequate healthcare services2.

Over the last decade, advances in mobile technology have created new opportunities to meet this challenge. Most notably, mobile- or tablet-based health applications (apps) have gained attention for their potentially beneficial effects on patients’ lives. For example, mobile health (mhealth) apps can activate patients with chronic conditions to engage in online education, peer support, lifestyle monitoring and coaching consultations, and can help them track their health status, fostering self-engagement and adherence in the disease management process to improve health outcomes3. Additionally, mhealth apps can bridge geographical barriers to healthcare access, enabling real-time responses to patient needs in remote locations4. Lastly, health apps can relieve the burden on medical personnel by supporting prescription management, medication intake and symptom monitoring5.

The importance of new technologies has also been highlighted in the World Health Organization (WHO) global strategy on digital health 2020–20256. Member countries are encouraged to develop digital healthcare strategies that take into account national contexts such as culture, public needs and available resources. However, large-scale integration of new technologies into standard care processes requires sufficient confidence in their effectiveness and cost-efficiency. Effectiveness can, for example, be hampered by technological challenges faced by users and delivery agents, data protection issues or privacy concerns, the use of ineffective components, and poorly sustained user engagement or long-term effects7. Also, population-wide implementation may, in some instances, add to existing health inequalities by introducing a digital divide8,9 with regard to access to (first level), usage of (second level), and benefits from usage (third level) of digital health technology10. In such cases, large-scale implementation would entail the waste or misallocation of (usually scarce) healthcare system resources and, in the worst case, pose the risk of detrimental individual health effects and loss of trust in technology or the healthcare system. A corroborated evidence base on these aspects is therefore a prerequisite for health systems to initiate adequate policy reforms.

Mirroring the increase in available health apps, the number of scientific evaluations of their efficacy and effectiveness has also increased over the last decade. To make these research results actionable, an up-to-date, comprehensive yet concise mapping of the available high-quality evidence on the efficacy and effectiveness of mhealth apps is required. A previous umbrella review attempted to summarize systematic reviews on a broad spectrum of telemedicine interventions beyond mhealth apps, with heterogeneous study designs including non-randomized controlled trials and various disease indications11. However, this broad summation of different intervention technologies and evidence levels makes it hard to draw conclusions specifically for mhealth apps. A second existing mapping of effectiveness reviews on mhealth interventions focused specifically on diabetes indications, but also included systematic reviews of heterogeneous study designs and various types of mhealth interventions beyond mhealth apps12. Another umbrella review focused on randomized controlled trials (RCTs) and restricted its scope to diabetes, dyslipidemia and hypertension, while summarizing evidence not only on mhealth apps but on a broader range of telemedicine interventions13. Thus, there remains an unmet need to identify both well-researched and potentially under-researched indications with regard specifically to mhealth app effectiveness.

The objective of this umbrella review is to systematically map and summarize existing systematic reviews of RCTs investigating the effectiveness of mobile phone or tablet app-based mhealth interventions in patients. We summarize the investigated patient populations, specific intervention configurations and features, reported comparators, and outcomes used to assess efficacy and effectiveness, and we assess overall review quality; synthesizing or re-analysing outcome data is beyond the scope of this study.

Results

Study selection

The study selection process according to PRISMA requirements14 is summarized in Fig. 1. The database search yielded a total of 1895 records, and an additional 2513 records were identified through forward and backward citation searching of those records from the initial search that the first author deemed eligible after full-text screening. After de-duplication, 4253 articles were screened by title and abstract. Of these, 3892 records were excluded and 361 records were included for full-text screening. The final number of included articles was 48. Inter-rater reliability (IRR) for title/abstract screening and full-text screening was κ = 0.3469 and κ = 0.9326, respectively. A list of the 313 studies excluded after full-text screening, with exclusion reasons for each study, can be found in Supplementary Table 1.
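The counts reported above can be checked for internal consistency with simple arithmetic. The sketch below uses only the numbers given in the text; the number of duplicates removed (155) is not stated explicitly and is derived here as the difference between identified and de-duplicated records.

```python
# Counts as reported in the Study selection paragraph
database_records = 1895         # records from the database search
citation_records = 2513         # records from forward/backward citation searching
after_dedup = 4253              # records screened by title and abstract
excluded_title_abstract = 3892  # excluded at title/abstract stage
excluded_full_text = 313        # excluded after full-text screening

# Derived flow-chart quantities
identified = database_records + citation_records              # all identified records
duplicates_removed = identified - after_dedup                 # derived, not reported
full_text_assessed = after_dedup - excluded_title_abstract    # should equal 361
included = full_text_assessed - excluded_full_text            # should equal 48

print(identified, duplicates_removed, full_text_assessed, included)
# prints: 4408 155 361 48
```

The derived values for full-text screening (361) and final inclusion (48) match the figures reported in the text.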

Fig. 1: PRISMA flow chart of retrieved, screened and included articles.

Flow chart illustrating the process of study identification for the present umbrella review with database searches (last updated on August 28, 2023), deduplication, title and abstract screening as well as full-text screening, leading to a final inclusion decision for n = 48 systematic reviews.

Review characteristics

Included reviews were published between 2013 and 2023, with the highest number of reviews published in 2020 (n = 10) and the first three quarters of 2023 (n = 9) (see Fig. 2).

Fig. 2: Number of included reviews by publication year.

Vertical bar chart illustrating the number of included systematic reviews (n = 48 in total) on the y-axis stratified by year of publication on the x-axis.

All included reviews considered articles without geographic restrictions, except one focusing on China15. The number of RCT studies included in a review ranged from two to 36. Out of the 48 included reviews, 3515,16,17,18,19,20,21,22,73, thereby providing additional insights.

We acknowledge that, by focusing on effectiveness outcomes from systematic reviews of RCTs within patient groups only, this umbrella review reflects only a specific part of the available evidence on smartphone applications, since evidence dimensions other than effectiveness/efficacy (e.g., equity, cost-effectiveness) were not considered. Furthermore, we did not include systematic reviews that comprised studies on the general population, observational study designs that may allow for causal inference, or other intervention study types, such as non-randomized trials. Quasi-experimental studies could also have been an interesting source of evidence, but we did not include them. The rationale for our rather strict inclusion criteria with regard to study design was twofold: first, we wanted to specifically include systematic reviews with the highest internal validity with respect to causal inference, which could be undermined by a lack of randomization74. Second, we attempted to keep the basis for this umbrella review manageable and, at the same time, as homogeneous as possible. We acknowledge that the downside of these decisions is that some relevant studies may have been excluded from this umbrella review.

Lastly, there were several systematic reviews that we excluded because they marginally failed to meet our inclusion criteria. Examples include a large systematic review by Cucciniello et al.75, which included 69 studies on chronic disease indications; however, at least one of these studies used WhatsApp as the intervention instead of a dedicated health app. Another example is a systematic review by Widdison et al.76, which summarized three RCTs on health app effectiveness in urinary incontinence but additionally included one observational follow-up study of a previous RCT intervention arm. Although these reviews potentially summarized evidence relevant to our research question, they were excluded from our umbrella review.

In conclusion, we found 48 systematic reviews published since 2013 that narratively or quantitatively synthesized effectiveness results for app-based health interventions in patients. These reviews targeted a range of different health conditions, with diabetes and hypertension being the most extensively covered and evaluated. There was substantial heterogeneity in what was defined as a primary outcome, but the majority of reviews concluded that app-based health interventions are likely to be effective. Variability in reported outcome measures was lowest in reviews focusing on diabetes, obesity and hypertension. Future research in other indications might follow these examples and attempt greater standardization of measurements, easing quantitative inference and allowing for more actionable conclusions. Additionally, studies with longer follow-up periods are required. Furthermore, the heterogeneous methodological quality of the evidence included in this umbrella review highlights the need to take quality assessments into account in policy decisions. Lastly, future evidence synthesis efforts should also map the additional evidence provided by systematic reviews of other study designs and of general rather than diseased populations. This would provide a more complete picture of the effectiveness of health app-based interventions and would support evidence-based public health and healthcare policy decisions aimed at alleviating economic pressures on healthcare systems.

Methods

Study design

We conducted an umbrella review of existing systematic reviews following (where applicable) the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist14,77 (Supplementary Table 8), as recommended elsewhere78. We uploaded a pre-specified review protocol to the Open Science Framework database prior to conducting the initial literature search79. For protocol development, we consulted guidance documents78, including those published by Cochrane71 and JBI80. The scope of the review, as well as the pre-defined search strategy, eligibility criteria and extraction targets outlined below, remained essentially unchanged throughout the conduct of the review.

Eligibility criteria

We defined eligibility criteria around types of studies, population, interventions, outcomes, and study language (Table 1).

Table 1 Eligibility criteria.

This umbrella review included systematic reviews with and without meta-analyses. As RCTs are the gold standard for efficacy evaluations, we included only systematic reviews of RCTs. Systematic reviews that did not include RCTs, or that included RCTs together with other study types, were excluded.

We considered only reviews that included efficacy/effectiveness trials of app-based health interventions in patients. The study participants had to have a specific disease or health issue (as defined by the International Classification of Diseases, 10th Revision [ICD-10])81 that was targeted by the intervention in question. Health issues could be diagnosed by a health professional or self-reported. In contrast, reviews that considered studies on general populations without any specific health problems were excluded, even if they reported sub-analyses on separate patient groups. Furthermore, we excluded reviews targeting individuals with potentially addictive behaviors such as tobacco use, alcohol use, gambling or other substance use, as these behaviors are classified in Chapter XXI as “factors influencing health status and contact with health services” rather than within the disease-related ICD-10 chapters, and we deemed a separate umbrella review for such behavioral factors more appropriate than combining them with clear-cut diseases. Similarly, reviews on pregnant women without any additional specific medical condition were also excluded, as ICD-10 classifies normal pregnancy in Chapter XXI rather than within the disease-related chapters.

We included reviews focusing on interventions that aimed to improve specific health-related outcomes via smartphone or tablet apps. The app could be a standalone or complementary intervention tool (e.g., coupled with personal interactions, text messaging or social media). In contrast, reviews comprising studies that evaluated solely non-app technologies such as text messaging, social media, wearable devices or websites were excluded. Reviews comprising studies that solely involved online communication applications (e.g., Instagram, WhatsApp, WeChat, Telegram, Skype) were excluded unless the app was specifically designed for health or medical purposes. Reviews comprising studies that evaluated health apps aiming to support users in primary prevention or in the (self-)diagnosis and/or (self-)screening of yet undetected conditions were excluded.

No restrictions were set with regard to the types of comparators.

Reviews reporting on health or care process outcomes pertaining to the efficacy or effectiveness dimension of evidence were included. These outcomes included, but were not limited to, health outcomes, medication adherence, chronic disease management, and symptom relief. Reviews reporting exclusively on other dimensions of evidence, such as diagnostic accuracy, concordance, feasibility, cost-effectiveness, resource consumption, costs, equity, or measurement accuracy, were excluded.

Only articles with full text available in English were considered, as we assumed that the efficiency gains of a language restriction would outweigh the risk of missing important evidence.

Databases and search strategy

We searched MEDLINE and PubMed Central via PubMed, as well as the Cochrane Database of Systematic Reviews (CDSR). Articles were included from database inception until March 15, 2022, and the search was updated on August 28, 2023.

The search strategy combined keywords and Medical Subject Headings (MeSH) structured around three components: (i) intervention; (ii) study design; (iii) outcome dimension (see Supplementary Tables 9 and 10 for the complete search strategy and number of associated hits in each electronic database).
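A search structured around several components is commonly assembled as a Boolean query in which synonyms within one component are combined with OR and the components are combined with AND, so that a record must match every component. The sketch below illustrates that general pattern only; the terms are hypothetical placeholders using standard PubMed field tags ([MeSH Terms], [tiab], [Publication Type]) and are not the authors' search terms, which are given in Supplementary Tables 9 and 10.

```python
# Hypothetical placeholder terms for illustration only; NOT the actual search strategy.
INTERVENTION = ['"Mobile Applications"[MeSH Terms]', "smartphone app*[tiab]"]
STUDY_DESIGN = ['"Systematic Review"[Publication Type]', "meta-analysis[tiab]"]
OUTCOME = ["effectiveness[tiab]", "efficacy[tiab]"]


def or_block(terms: list[str]) -> str:
    """Combine the synonyms of one component with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*components: list[str]) -> str:
    """Require every component to match by joining the OR blocks with AND."""
    return " AND ".join(or_block(c) for c in components)


query = build_query(INTERVENTION, STUDY_DESIGN, OUTCOME)
print(query)
```

Keeping each component as a separate list makes it straightforward to count hits per component and per database, as reported in the supplementary tables.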

Forward and backward citation searches were additionally conducted for articles deemed eligible after initial full text screening. Forward citation searches were conducted until August 8, 2022 using PubMed and Scopus.

Selection process

After search completion and deduplication, two authors (SOKC and NA) independently screened titles and abstracts according to the predefined eligibility criteria using the online software Rayyan82. Disagreements were resolved by consensus after discussion with up to two additional authors (SP and AJS).

After title/abstract screening, we assessed the full texts of all potentially eligible articles against the eligibility criteria57. Whenever an inclusion criterion was not met, we stopped screening the respective full text and excluded the systematic review. One author (SOKC) conducted the full-text screening of all systematic reviews. A second author (NA) independently double-checked all exclusion decisions. Disagreements were resolved by consensus, with up to two additional authors (AJS and SP) included in the discussion.

Data collection/extraction process and data items

One author (SOKC or NA) extracted data from all eligible articles after full-text screening using a predefined and pretested extraction form79. A second reviewer (NA or AJS) double-checked the extracted data. Conflicts were resolved by consensus, where necessary after discussion with a third reviewer (AJS). We extracted the following information from the included reviews: general information about the review (e.g., publication date, number of included studies), pooled population characteristics, app characteristics, comparators, outcomes, subgroup analyses, and the authors’ narrative conclusions on the overall efficacy/effectiveness of app-based interventions.

Methodological quality assessment

We used the Assessing the Methodological Quality of Systematic Reviews (AMSTAR2, see Supplementary Note 1) tool to evaluate methodological quality of all included systematic reviews83. AMSTAR2 covers 16 domains, of which seven are considered critical. Critical domains are deemed especially influential for review validity and include protocol pre-registration (item 2), literature search strategy (item 4), list and justification for excluded studies (item 7), risk of bias assessment (item 9), meta-analytical methods (item 11), consideration of risk of bias in results interpretation (item 13) and assessment of presence and likely impact of publication bias (item 15).

Each included review was rated for adequacy on each domain as either “Yes”, “No”, or “Partial Yes” (available only for domains 2, 4, 7, 8, and 9). For articles that did not conduct meta-analyses, items 11, 12 and 15 were rated “Not Applicable”. Fulfillment of each domain across the different reviews was illustrated using a table and a heat map. Based on these domains, we also assigned each review a summary quality rating of “critically low” (≥ 2 “No” ratings on critical domains), “low” (exactly one “No” rating on a critical domain), “moderate” (no “No” rating on critical domains but ≥ 2 “No” ratings on non-critical domains) or “high” (no “No” rating on critical domains and ≤ 1 “No” rating on non-critical domains). Quality appraisals were conducted in duplicate by two review authors (SOKC and AJS), and diverging ratings were resolved through discussion.
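The summary rating can be read as a hierarchical decision rule: critical-domain flaws are checked first, and non-critical weaknesses only matter when no critical flaw is present, as in the AMSTAR2 guidance83. The function below is a minimal sketch under two assumptions: only “No” counts as a flaw or weakness (“Partial Yes” and “Not Applicable” do not), and the critical items are exactly the seven listed in the previous paragraph. The function name is hypothetical and does not describe the authors' own tooling.

```python
# Critical AMSTAR2 items as listed in the text (items 2, 4, 7, 9, 11, 13, 15)
CRITICAL_ITEMS = {2, 4, 7, 9, 11, 13, 15}


def amstar2_overall(ratings: dict[int, str]) -> str:
    """Hypothetical helper deriving an overall AMSTAR2 rating.

    `ratings` maps item numbers (1-16) to "Yes", "Partial Yes", "No"
    or "Not Applicable". Assumption: only "No" counts as a flaw/weakness.
    """
    critical_flaws = sum(
        1 for item, r in ratings.items() if r == "No" and item in CRITICAL_ITEMS
    )
    noncritical_weaknesses = sum(
        1 for item, r in ratings.items() if r == "No" and item not in CRITICAL_ITEMS
    )
    # Critical flaws dominate the rating regardless of non-critical weaknesses
    if critical_flaws >= 2:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    # No critical flaws: grade on the number of non-critical weaknesses
    if noncritical_weaknesses >= 2:
        return "moderate"
    return "high"
```

For a review without meta-analysis, items 11, 12 and 15 are simply passed as "Not Applicable" and are never counted. For example, a review rated "Yes" on every item except a "No" on item 9 would be classified as "low".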

Inter-rater reliability

IRR was calculated using Cohen’s kappa (κ) for title and abstract screening, full-text screening and the methodological quality assessment (overall and item-specific).
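Cohen’s kappa compares the observed proportion of agreement between two raters, p_o, with the agreement expected by chance from each rater’s marginal label frequencies, p_e, as κ = (p_o − p_e) / (1 − p_e). The function below is a minimal sketch of this standard formula for two raters and categorical labels (e.g., include/exclude decisions); the function name and example data are illustrative and not taken from the review.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in labels) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Illustrative screening decisions for ten records (hypothetical data)
a = ["include"] * 3 + ["exclude"] * 7
b = ["include", "include", "exclude"] + ["exclude"] * 6 + ["include"]
print(round(cohens_kappa(a, b), 3))  # 0.524
```

In this example the raters agree on 80% of records, yet κ is only about 0.52, because the dominant "exclude" category already yields 58% agreement by chance. This is a general property of kappa when one category is much more frequent than the other.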

Data synthesis

We provide narrative summaries and graphical representations of publication years, population characteristics, type of underlying condition, type of intervention and type of outcomes assessed. As data from the same RCT may have contributed to the pooled effect estimates of more than one included systematic review, and given the high expected heterogeneity of the diseases and outcomes covered in the systematic reviews, a meta-analysis pooling systematic review results was neither planned nor performed.