Introduction

The Acute Respiratory Distress Syndrome (ARDS) is a clinically and biologically heterogeneous syndrome that contributes significantly to morbidity and mortality in critically ill patients [1,2,3]. ARDS has long been a focal point of critical care research, but hundreds of randomized controlled trials (RCTs) have led to merely two guideline recommendations supported by high-level evidence: low tidal volume ventilation and prone positioning in patients with severe ARDS [4, 5].

The paucity of high-level evidence is due to indeterminate and conflicting trial results. Many RCTs in the ARDS population report an indeterminate outcome—detecting neither significant benefit nor harm of investigated therapeutic strategies [6]. Several other large RCTs demonstrated contradictory results, with seemingly beneficial therapies being found ineffective in subsequent trials [7, 8].

It has become clear that treatment effects of interventions in ARDS are highly dependent on the details of the intervention, as small variations of the same treatment have led to disparate results. For example, different definitions of ‘low’ and ‘high’ tidal volumes [9,10,11,12,13] or differences in neuromuscular blockade and sedation [7, 8, 14] have led to different trial outcomes. On the other hand, it is much less clear how differences in methodological trial characteristics and patient characteristics affect study outcomes. Between-trial heterogeneity refers to the non-random variation in treatment effect of an intervention due to methodological or clinical differences between trials. Unmeasured or unexplainable heterogeneity, both among patients in a single trial and between trial populations, may adversely affect the validity and generalizability of study results (see: ‘Panel: A practical example of the problem with unexplainable between-trial heterogeneity’) [15, 16].

In this study, we set out to quantify the consistency of reporting baseline characteristics and to measure between-trial heterogeneity in 28-day control group mortality among all ARDS RCTs in the lung-protective ventilation era. Our aim was to determine to what extent between-trial differences in control group mortality could be explained by differences in trial and patient characteristics. We hypothesized that between-trial heterogeneity would be large and that trial populations would often be poorly characterized, leading to a discrepancy between inclusion criteria and patient characteristics on the one hand, and control group outcomes on the other hand.

Panel: A practical example of the problem with unexplainable between-trial heterogeneity

We note two high-profile trials published in the same journal issue [17, 18]. Both trials investigated high-frequency oscillatory ventilation in the same target population of moderate to severe ARDS patients, but they reported a different effect on mortality. Judging by the control group patient characteristics, there were clinically meaningful differences between the trial populations: there was a 32% relative difference in baseline Acute Physiology and Chronic Health Evaluation (APACHE II) scores (22 vs. 29 points) and a 41% relative difference in control group mortality (41% vs. 29% at 30 days). Paradoxically, the trial with the lower baseline mean APACHE II score had the higher control group mortality. This makes the interpretation of the conflicting trial results exceedingly difficult. One trial demonstrated significant harm from the intervention while the other trial found no effect. Was the difference in treatment effect due to subtle unreported differences in the intervention, due to unreported differences in the patient populations, or due to differences in the standard of care? Were the patients more severely ill at baseline in the trial with the higher APACHE II score or in the trial with the higher control group mortality rate? It is clear that unexplainable outcome heterogeneity reduces the generalizability of intervention effects to the global ARDS population [19].
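The relative differences quoted in the panel can be reproduced with simple arithmetic. The sketch below assumes that the gap is expressed relative to the smaller of the two values, which is consistent with the quoted percentages; the helper name relative_difference is introduced here for illustration only.

```python
def relative_difference(a: float, b: float) -> float:
    """Absolute gap between a and b, expressed relative to the smaller value."""
    low, high = sorted((a, b))
    return (high - low) / low

# Values quoted in the panel.
apache_diff = relative_difference(22, 29)     # baseline APACHE II score, points
mortality_diff = relative_difference(41, 29)  # control group mortality, percent

print(f"APACHE II: {apache_diff:.0%}, mortality: {mortality_diff:.0%}")
# prints "APACHE II: 32%, mortality: 41%"
```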

Methods

This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines. The study protocol and statistical analysis plan were registered online at the International Prospective Register of Systematic Reviews (PROSPERO, registration number: CRD42020161809).

Systematic search

A comprehensive search was conducted in MEDLINE, Embase and Scopus for randomized clinical trials including adult ARDS patients published from January 1st, 2000 until January 31st, 2020. Eligible studies included a) adult ARDS patients diagnosed according to the AECC guidelines from 1994 [20] or the Berlin definition from 2012 [21], subjected to b) invasive lung-protective mechanical ventilation according to the ARDSnet protocol [11], or reporting a tidal volume of ≤ 8 ml/kg. Included studies were c) randomized clinical trials reporting on d) 28-day, hospital, intensive care unit (ICU) or 60-day mortality. There were no restrictions with regard to the intervention or phase of the study. More details about the review process are provided in the supplementary appendix, Sect. 1.1 and 1.2.

Outcome measures

For each study, we recorded trial characteristics, intervention, inclusion and exclusion criteria, mean patient baseline characteristics and mortality outcomes.

The primary outcome was the between-trial heterogeneity in the 28-day control group mortality rate (I²). The 28-day control group mortality rate reflects the baseline risk of death of the patient population of an individual trial. Secondary outcomes included associations between 28-day control group mortality and characteristics of trial design and outcome, inclusion and exclusion criteria, and baseline characteristics.

Estimation of 28-day control group and intervention group mortality

All analyses investigating heterogeneity were conducted using the 28-day control group mortality rate. For trials reporting solely on hospital, ICU or 60-day mortality, 28-day control group mortality was estimated with linear regression using data from trials reporting both 28-day mortality and any of the other mortality outcomes [22]. The 28-day intervention group mortality was estimated in the same manner for analyses investigating differences between control and intervention group mortality. A sensitivity analysis was conducted using only the trials reporting 28-day control group mortality.
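The conversion step described above can be illustrated with a minimal sketch. It assumes an unweighted ordinary least squares fit on mortality proportions; the exact specification is given in reference [22] and may differ. The data, the function names and the clamping to the interval [0, 1] are hypothetical additions for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical trials reporting BOTH hospital and 28-day control group mortality.
hospital = [0.30, 0.35, 0.40, 0.45, 0.50]
day28 = [0.26, 0.31, 0.34, 0.40, 0.43]

intercept, slope = fit_line(hospital, day28)

def estimate_day28(hospital_mortality):
    """Estimate 28-day mortality for a trial that reported only hospital mortality."""
    estimate = intercept + slope * hospital_mortality
    return min(max(estimate, 0.0), 1.0)  # keep the estimate a valid proportion

estimated = estimate_day28(0.38)
print(f"slope={slope:.2f}, estimated 28-day mortality={estimated:.3f}")
```

One such regression would be fitted per alternative endpoint (hospital, ICU, 60-day), each using only the trials that report both that endpoint and 28-day mortality.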

Estimation and quantification of between-trial heterogeneity

The 28-day control group mortality rates across studies were analyzed using a random-effects meta-regression model with the log odds of mortality as the dependent variable. Each individual trial was weighted by the inverse of the sampling variance of the mortality rates. A maximum likelihood estimator was applied to estimate the mean mortality (random-effects pooled estimate), the between-study standard deviation due to heterogeneity (τ), and heterogeneity (the percentage of variation in control group mortality due to heterogeneity rather than chance, I²). To make heterogeneity interpretable in a clinically meaningful manner, we calculated the 95% prediction interval. The prediction interval represents the distribution of estimated underlying mortality after correction for random chance and predictive covariates (for further details, see the methods section of reference [22]). This model and its corresponding outcomes were used to present the distribution of 28-day control group mortality between all trials, and to investigate differences between individual trial characteristics.
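To make these quantities concrete, the sketch below pools control group mortality on the log-odds scale with inverse-variance weights and derives τ, I² and an approximate 95% prediction interval. It is not the model used in this study: it substitutes the closed-form DerSimonian-Laird estimator for the maximum likelihood estimator, includes no covariates, uses the Q-based definition of I², and uses a normal quantile (1.96) for the prediction interval. The trial counts and all function names are hypothetical.

```python
import math

def logit_and_variance(deaths, n):
    """Log odds of death and its approximate sampling variance (needs 0 < deaths < n)."""
    survivors = n - deaths
    return math.log(deaths / survivors), 1 / deaths + 1 / survivors

def expit(x):
    """Back-transform log odds to a proportion."""
    return 1 / (1 + math.exp(-x))

def pool_control_mortality(trials):
    """trials: list of (deaths, n) per control group."""
    y, v = zip(*(logit_and_variance(d, n) for d, n in trials))
    k = len(y)
    w = [1 / vi for vi in v]  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # DerSimonian-Laird, not maximum likelihood
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    w_re = [1 / (vi + tau2) for vi in v]  # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1 / sum(w_re))
    half_width = 1.96 * math.sqrt(tau2 + se_mu ** 2)  # normal approximation
    return {
        "pooled_mortality": expit(mu),
        "tau": math.sqrt(tau2),
        "i2": i2,
        "prediction_interval": (expit(mu - half_width), expit(mu + half_width)),
    }

# Hypothetical control groups: (deaths, patients).
trials = [(30, 100), (45, 100), (20, 100), (60, 150), (25, 80)]
summary = pool_control_mortality(trials)
print(summary)
```

The prediction interval is wider than the confidence interval of the pooled estimate because it adds the between-trial variance τ² to the uncertainty of the mean, which is what makes it useful for judging where the mortality of a new trial population might fall.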

Associations between patient characteristics and mortality rates

The associations between 28-day control group mortality and reported patient characteristics were estimated by adding each individual covariate separately to the random-effects model as a moderator in univariate analysis. The goodness-of-fit of the log-linear, quadratic and power models was compared, and the model with the lowest Akaike information criterion (AIC) was selected [23]. For each model, the regression coefficient (b) and unadjusted R² were reported. R² represents the proportion of between-trial heterogeneity in 28-day control group mortality explained by the individual baseline characteristic, calculated among the n trials reporting the covariate.
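The model selection step can be sketched as follows. This is an illustration under several assumptions, not the analysis performed here: it uses unweighted least squares rather than a weighted random-effects meta-regression, the Gaussian form of the AIC, and an assumed parameterization of the three candidate shapes (linear in the covariate, quadratic in the covariate, and linear in the logarithm of the covariate for the power model). The data and all names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def gaussian_aic(design, y):
    """Least squares fit of y on the design matrix; returns the Gaussian AIC."""
    n, p = len(y), len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    return n * math.log(rss / n) + 2 * (p + 1)  # +1 for the residual variance

# Hypothetical per-trial data: mean age (years) and log odds of control group mortality.
mean_age = [45, 50, 55, 60, 65, 70]
log_odds = [-1.48, -1.48, -1.29, -1.07, -0.67, -0.26]

center = sum(mean_age) / len(mean_age)
xc = [x - center for x in mean_age]  # centering improves numerical stability
designs = {
    "log-linear": [[1.0, x] for x in xc],
    "quadratic": [[1.0, x, x * x] for x in xc],
    "power": [[1.0, math.log(x)] for x in mean_age],
}
aic = {name: gaussian_aic(design, log_odds) for name, design in designs.items()}
best = min(aic, key=aic.get)
print(best, aic)
```

Because the AIC penalizes each additional parameter, the quadratic model is only selected when its improvement in fit outweighs the cost of the extra coefficient.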

Prediction of control group mortality based on significant patient characteristics

To predict between-trial differences in mortality based on patient characteristics, a comprehensive multivariate logistic regression model was constructed. Missing observations were imputed using multiple imputation with predictive mean matching, generating 20 datasets. For a detailed description of the process, we refer to the supplementary appendix, Sect. 2.5. Significant baseline characteristics reported in at least 25% of all trials with a univariate regression R² ≥ 0.10 were eligible for the model. The threshold R² of 0.10 was a compromise between the number of variables and the limited number of observations, as described before [22]. A stepwise backward selection procedure was applied, removing regressors with p ≥ 0.05 to arrive at the final model. To facilitate comparisons between the individual covariates, the standardized regression coefficient (β) and the standardized standard error (SSE) were reported in the supplementary appendix, Sect. 2.5.
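Predictive mean matching fills each missing value with an observed value borrowed from a trial whose predicted value is similar. The sketch below is a deliberately simplified illustration with a single predictor: it omits the random draw of regression parameters that full implementations perform, and it is not the procedure described in the supplementary appendix. The data, the number of donors and the function name pmm_impute are hypothetical.

```python
import random

def pmm_impute(x, y, donors=3, seed=0):
    """Impute missing y (None) by predictive mean matching on a single predictor x."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
             / sum((xi - mean_x) ** 2 for xi, _ in observed))
    intercept = mean_y - slope * mean_x

    def predict(value):
        return intercept + slope * value

    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)  # observed values are kept unchanged
            continue
        target = predict(xi)
        # Donor pool: observed trials whose predicted value is closest to the target.
        pool = sorted(observed, key=lambda pair: abs(predict(pair[0]) - target))[:donors]
        completed.append(rng.choice(pool)[1])  # borrow a donor's observed value
    return completed

# Hypothetical per-trial data: mean age (years) and mean PaO2/FiO2, with gaps.
age = [50, 55, 60, 65, 70, 52, 58, 63]
pf_ratio = [180, 170, None, 150, 140, None, 165, 155]

# Twenty completed datasets, one per seed.
datasets = [pmm_impute(age, pf_ratio, seed=s) for s in range(20)]
print(datasets[0])
```

Because imputed values are always drawn from values that were actually observed, the method cannot produce implausible values, and the spread across the completed datasets reflects the uncertainty introduced by the missing data.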

Control group mortality differences between trials demonstrating benefit vs. no benefit

A trial demonstrating significant benefit was defined as a reported p-value of < 0.05 for the primary endpoint (as defined by the authors) in favor of the intervention group. Comparisons between trials demonstrating significant benefit and trials with an indeterminate outcome or harm were performed using the Mann–Whitney U test. Linear mixed-effects and regression models were applied to estimate the probability of a significant trial outcome based on the observed control group mortality and intervention group mortality, respectively.
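The group comparison can be illustrated with a minimal implementation of the Mann–Whitney U test. This sketch assumes a two-sided test using the normal approximation, without continuity correction or tie correction of the variance, so p-values may differ slightly from those of statistical packages. The mortality values and the function name are hypothetical.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test; returns (U, approximate p-value)."""
    pooled = sorted((value, group) for group, values in enumerate((a, b))
                    for value in values)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, p

# Hypothetical control group mortality in two sets of trials.
no_benefit = [0.25, 0.28, 0.30, 0.33]
benefit = [0.35, 0.38, 0.41, 0.45, 0.47]

u_stat, p_value = mann_whitney_u(no_benefit, benefit)
print(f"U={u_stat}, p={p_value:.3f}")
```

A rank-based test is a natural choice here because control group mortality rates across trials need not be normally distributed.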

Statistical analyses

A p-value of < 0.05 was considered statistically significant.

Statistical analyses were performed in R using the RStudio interface (Version 1.1.447; R Core Team. R: A Language and Environment for Statistical Computing. 2013. http://www.r-project.org/) with the packages ‘tidyverse’, ‘dplyr’, ‘metafor’, ‘mice’, ‘Hmisc’, ‘wCorr’, ‘data.table’, ‘MASS’ and ‘ggplot2’.

Results

Systematic search

The literature search yielded 3479 results. A total of 67 RCTs met all inclusion and exclusion criteria and were included in the analyses (eFigure 1) [7, 8, 17, 18, 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,

Conclusion

Randomized controlled ARDS trials in the lung-protective ventilation era present a statistically significant and clinically relevant amount of heterogeneity in reporting and mortality outcomes. Differences in baseline characteristics partly explained the variability in outcome, but large unexplainable heterogeneity remained after extensive statistical adjustments. This study underlines the urgent need for standardized and comprehensive reporting of trial and baseline characteristics to diminish between-trial heterogeneity and to support the transportability of study results across populations.