Background

The goal of implementation science is to improve the quality and effectiveness of health services by developing strategies that promote the adoption, implementation, and sustainment of empirically supported interventions in routine care [1]. Understanding the causal processes that influence healthcare professionals’ and participants’ behavior greatly facilitates this aim [2, 3]; however, knowledge regarding these processes is in its infancy [4, 5]. One popular approach to understanding causal processes is to conduct mediation studies in which the relationship between an independent variable (X) and a dependent variable (Y) is decomposed into two relationships—an indirect effect that occurs through an intervening or mediator variable (M) and a direct effect that does not occur through an intervening variable [6, 7]. Figure 1 shows a mediation model in which the effect of X on Y is decomposed into direct (c’) and indirect effects (the product of the a and b paths). Estimates of the a, b, and c’ paths shown in Fig. 1 can be obtained from regression analyses or structural equation modeling. Under certain assumptions, these estimates allow for inference regarding the extent to which the effect of X on Y is mediated, or transmitted, through the intervening variable M [8,9,10]. Interpreted appropriately, mediation analysis enables investigators to test hypotheses about how X contributes to change in Y and thereby to elucidate the mechanisms of change that influence implementation [5, 9, 10]. Recently, several major research funders, including the National Institutes of Health in the USA, have emphasized the importance of an experimental therapeutics approach to translational and implementation research in which mechanisms of action are clearly specified and tested [11,12,13]. Mediation analysis offers an important method for such tests.

Fig. 1

Single-level mediation model. Note: X = independent variable; M = mediator; Y = outcome. The indirect effect is estimated as the product of the a and b paths (i.e., a*b). The c’ path represents the direct effect of X on Y (i.e., the effect of X on Y that is not transmitted through the mediator, M)
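To make the decomposition concrete, the sketch below estimates the a, b, and c’ paths from Fig. 1 using two ordinary least squares regressions on simulated data. This is a minimal illustration, not the authors’ analysis; the path values, sample size, and use of Python with statsmodels are assumptions made here for demonstration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

# Simulate data consistent with the model in Fig. 1 (values are illustrative)
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)            # true a path = 0.5
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)  # true b = 0.4, true c' = 0.2

# Regression of M on X yields the a path
a = sm.OLS(M, sm.add_constant(X)).fit().params[1]

# Regression of Y on M and X yields the b path and the direct effect c'
fit_y = sm.OLS(Y, sm.add_constant(np.column_stack([M, X]))).fit()
b, c_prime = fit_y.params[1], fit_y.params[2]

print(f"indirect effect a*b = {a*b:.3f}; direct effect c' = {c_prime:.3f}")
```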

Mediation analysis has long been of importance in implementation science, with recent studies emphasizing the need to increase the frequency and rigor with which this method is used [5, 14]. Guided by theoretical work on implementation mechanisms [15, 16], emerging methods-focused guidance for implementation research calls for the use of mediation analyses in randomized implementation trials to better understand how implementation strategies influence healthcare processes and outcomes [5, 17]. A systematic review of studies examining implementation mechanisms indicated mediation analysis was the dominant method for testing mechanisms in the field, used by 30 of 46 studies [4]. Other systematic reviews highlight deficits in the quality of published mediation analyses in implementation science to date and have called for increased and improved use of the method [5, 18]. Reflecting its growing importance within the field, mediation analyses feature prominently in several implementation research protocols published in the field’s leading journal, Implementation Science, during the last year [19,20,21,22]. Cashin et al. [23] recently published guidance for reporting mediation analyses in implementation studies, including the importance of determining required sample sizes for mediation tests a priori.

Designing mediation studies requires estimates of the sample size needed to detect the indirect effect. This seemingly simple issue takes on special nuance and heightened importance in implementation research because of the complexity of statistical power analysis for multilevel research designs—which are the norm in implementation research [17, 24]—and the constraints on sample size posed by the practical realities of conducting implementation research in healthcare systems. While statistical power analysis methods and tools for single-level mediation are well-developed and widely available [8, 25,26,27,28,29], these approaches are inappropriate for testing mediation in studies with two or more hierarchical levels, such as patients nested within providers nested within organizations [9, 30, 31]. Generating correct inferences about mediation from multilevel research designs requires multilevel analytic approaches and associated power analyses to determine the required sample size [32,33,34,35,36].
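As an illustration of the single-level case, a simulation-based power analysis can be written in a few lines. The sketch below uses the joint significance test of the a and b paths; the sample size, path values, and Python/statsmodels implementation are assumptions made here for demonstration, not a prescribed tool from the cited literature.

```python
import numpy as np
import statsmodels.api as sm

def mediation_power(n=100, a=0.3, b=0.3, c=0.1, reps=1000, alpha=0.05, seed=1):
    """Monte Carlo power for a single-level indirect effect, using the
    joint significance test: both the a and b paths must be significant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        X = rng.normal(size=n)
        M = a * X + rng.normal(size=n)
        Y = b * M + c * X + rng.normal(size=n)
        p_a = sm.OLS(M, sm.add_constant(X)).fit().pvalues[1]
        p_b = sm.OLS(Y, sm.add_constant(np.column_stack([M, X]))).fit().pvalues[1]
        hits += (p_a < alpha) and (p_b < alpha)
    return hits / reps

print(mediation_power())  # roughly 0.7 under these illustrative settings
```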

Some tools have begun to emerge to estimate required sample sizes for 2- and 3-level mediation designs [37, 60], including methods based on the Monte Carlo confidence interval approach [26, 61].

Cells C and D in Fig. 3 represent statistical power estimates for MVM and MSEM using a 1-sided hypothesis test. Many mediation hypotheses could reasonably be specified as directional (i.e., 1-sided) because the implementation strategy is anticipated to have a positive (or negative) effect on the mediator and outcome. The use of a 1-sided test should reduce the sample size needed to detect mediation. Estimates of statistical power for 1-sided tests were generated using an algebraic transformation of the results from the 2-sided simulations and thus did not require additional computational time (details available upon request).
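The authors note that details of this transformation are available upon request, so the exact formula is not reproduced here. Under a normal approximation, one plausible version backs out the implied noncentrality parameter from the 2-sided power and re-evaluates it against the less extreme 1-sided critical value; the sketch below reflects that assumption, not the authors’ code.

```python
from scipy.stats import norm

def one_sided_power(power_two_sided: float, alpha: float = 0.05) -> float:
    """Approximate 1-sided power from 2-sided power, assuming an
    approximately normal test statistic and negligible probability mass
    in the opposite tail, so power_2s ~= Phi(delta - z_{1-alpha/2})."""
    # Back out the implied noncentrality parameter delta
    delta = norm.ppf(1 - alpha / 2) + norm.ppf(power_two_sided)
    # Re-evaluate against the less extreme 1-sided critical value
    return norm.cdf(delta - norm.ppf(1 - alpha))

print(round(one_sided_power(0.70), 3))  # a 0.70 2-sided design reaches ~0.80
```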

Results

Completion of the simulations required 591 days of computational time. Completion rates, defined as the proportion of replications within a simulation that converged successfully, were high: 97.8% (n=17,114) of the MVM simulations exhibited complete convergence (i.e., all 500 of 500 replications were successfully estimated), as did 79.4% (n=13,889) of the MSEM simulations. The lowest number of completed replications for any design was 493 of 500. These high completion rates increase confidence in the resulting simulation-based estimates of statistical power.

How many of the designs studied had adequate statistical power to detect mediation?

Table 1 shows the frequency and percent of designs studied that had adequate statistical power (≥ 0.8) to detect mediation by study characteristic based on a conventional MVM model, using a 2-sided test (cell A in Fig. 3). Only 463 of the 17,496 (2.6%) designs had adequate statistical power to detect mediation. As expected, statistical power was higher for the designs in cell C of Fig. 3 which were estimated using MVM and a 1-sided hypothesis test: 808 of these designs (4.6%) had adequate power to detect mediation.

As an alternative to MVM, investigators may use MSEM. Focusing on cell B of Fig. 3 (MSEM, 2-sided test), results indicated that 228 of the 17,496 designs (1.3%) studied had adequate statistical power to detect mediation. Shifting to cell D of Fig. 3 (MSEM, 1-sided test), 369 of the designs (2.1%) had adequate statistical power.

In summary, less than 5% of the 3-level mediation designs studied had adequate statistical power to detect mediation regardless of the statistical model employed (i.e., MVM vs. MSEM) or whether tests were 1- vs. 2-sided.

What study characteristics were associated with increased statistical power to detect mediation?

Table 1 presents the frequency and percent of designs with adequate statistical power to detect mediation by study characteristic for the 17,496 designs in cell A of Fig. 3 (MVM, 2-sided test). Because results were similar for all four cells in Fig. 3, we focus on the results from cell A and describe variations for the other cells as appropriate. Additional file 2 presents the frequency and percent of study designs with adequate statistical power to test mediation by study characteristic for all four cells shown in Fig. 3.

First, consistent with expectations, statistical power to detect mediation increased as the magnitude of effect sizes increased for the two paths that constitute the indirect effect (i.e., a3 and b3). Notably, none of the designs in Table 1 had adequate power when either the a3 or b3 path was small; less than 1% of designs had adequate power when either path was medium.

Second, the number of adequately powered designs increased as sample sizes increased at each level, with the level-3 sample size having the largest effect on power. In Table 1, no designs with fewer than 40 level-3 clusters (e.g., organizations) had adequate power to detect mediation. This finding also held for the MSEM designs (cells B and D in Fig. 3; see Additional file 2). However, for cell C in Fig. 3 (MVM, 1-sided test), 11 designs (0.1%) had adequate power to detect mediation with level-3 sample sizes of 20 (see Additional file 2).

Third, larger total sample sizes were associated with increased power, although this relationship was not strictly monotonic because the total sample size is the product of the sample sizes at each level, and the same total can be allocated across levels in ways that yield different power. In Table 1, the minimum total required sample size to detect mediation was N=900 level-1 units. The minimum total sample for cell C in Fig. 3 (MVM, 1-sided test) was N=600. The minimum total sample for cell B in Fig. 3 (MSEM, 2-sided test) was N=1800, and the minimum total sample for cell D in Fig. 3 (MSEM, 1-sided test) was N=1200.
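To see why, note that the total N is the product N3 × N2 × N1, so the same total can correspond to very different numbers of level-3 clusters. The allocations below are hypothetical:

```python
# Three hypothetical allocations sharing the same total N of 1,200
for n3, n2, n1 in [(40, 5, 6), (60, 5, 4), (20, 10, 6)]:
    print(f"N3={n3}, N2={n2}, N1={n1} -> total N = {n3 * n2 * n1}")
```

Because power depends most strongly on the level-3 sample size, these equal-N designs would not be equally powered.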

What was the range of minimum sample sizes required to detect mediation?

Table 2 presents the minimum sample sizes required to achieve statistical power ≥ 0.8 to detect mediation by values of effect size for the a3 and b3 paths that constitute the indirect effect, the size of the direct effect, and the level-3 ICCs of the mediator and outcome. Results in Table 2 are based on cell A of Fig. 3 (MVM, 2-sided). In each cell of Table 2, two sample sizes are provided, one assuming a small direct effect (cs) and the other assuming a medium direct effect (cm). Sample sizes are presented as N3 [N2 [N1]] where N3 = number of level-3 units (e.g., organizations), N2 = number of level-2 units (e.g., providers) per cluster, and N1 = number of level-1 units (e.g., patients) per level-2 unit. Because the N3 sample size is typically the most resource intensive to recruit in implementation studies, and because multiple combinations of N1, N2, and N3 can achieve the same total sample size in a given cell, the minimum sample sizes shown in Table 2 were selected based on the sample combination with adequate power and the smallest N3, followed by the smallest N2, followed by the smallest N1. Blank cells (-) indicate that no sample size combination achieved adequate statistical power to detect mediation for that design; within the range of sample sizes and input values we tested, it is not possible to design an adequately powered mediation study for these cells. Additional file 3 provides a similar table for cell C of Fig. 3 (MVM, 1-sided test).
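The selection rule just described is a lexicographic minimum. A minimal sketch follows, assuming the simulation grid is represented as a list of records with per-design power estimates (a hypothetical data structure, not the authors’ format):

```python
def minimum_design(results):
    """Among designs with power >= 0.80, pick the smallest N3,
    breaking ties by smallest N2, then smallest N1."""
    adequate = [r for r in results if r["power"] >= 0.80]
    if not adequate:
        return None  # corresponds to a blank (-) cell in Table 2
    best = min(adequate, key=lambda r: (r["N3"], r["N2"], r["N1"]))
    return best["N3"], best["N2"], best["N1"]

# Example with made-up power estimates
grid = [
    {"N3": 40, "N2": 10, "N1": 5, "power": 0.83},
    {"N3": 40, "N2": 5, "N1": 10, "power": 0.81},
    {"N3": 60, "N2": 5, "N1": 5, "power": 0.85},
]
print(minimum_design(grid))  # (40, 5, 10)
```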

Table 2 Minimum sample sizes required for adequate statistical power to detect mediation

Table 2 provides additional insights into the design features necessary to test mediation in 3-level designs under conditions that are plausible for implementation research. First, most of the cells in Table 2 are empty, indicating no design in that cell had adequate power to detect mediation. This underscores the limited circumstances under which one can obtain a sample large enough to test mediation in 3-level implementation designs. Second, no designs with combinations of small or medium effects for the a3 and b3 paths had adequate statistical power. This indicates at least one large effect size for either the a3 or b3 path is needed to achieve adequate statistical power to test mediation. Third, the size of the level-3 ICC of the mediator (ICCm3) is extremely important. When ICCm3 is small, there are no designs with adequate power except those that have large effect sizes for both a3 and b3 paths.
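For reference, the level-3 ICC is the proportion of a measure’s total variance that lies between level-3 units. The variance components below are hypothetical inputs, not values from the study:

```python
def icc_level3(var_l3: float, var_l2: float, var_l1: float) -> float:
    """Level-3 intraclass correlation: share of total variance
    attributable to level-3 units (e.g., organizations)."""
    return var_l3 / (var_l3 + var_l2 + var_l1)

# A mediator with only 5% of its variance between organizations
print(icc_level3(0.05, 0.15, 0.80))  # ICCm3 = 0.05
```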

Discussion

Thought leaders and funders in the field of implementation science have increasingly called for a stronger focus on understanding implementation mechanisms [13,14,15,16], with methodologists pointing to mediation analysis as a recommended tool in this effort [5, 17]. Because statistical power to test mediation in multilevel designs depends on the specific range of input values that are feasible within a given research area, we estimated what sample sizes, effect sizes, and ICCs are required to detect mediation in 3-level implementation research designs. We estimated statistical power and sample size required to detect mediation using a range of input values feasible for implementation research. Designs were tested under four different conditions representing two statistical models (MVM vs. MSEM) and 1- versus 2-sided hypothesis tests (see Fig. 3). Fewer than 5% of the designs studied had adequate statistical power to detect mediation. In almost all cases, the smallest number of level-3 clusters necessary to achieve adequate power was 40, the upper limit of what is possible in many implementation studies. This raises important questions about the feasibility of mediation analyses in implementation research as it is currently practiced. Enrolling 40 organizations usually requires substantial resources and may not be feasible within a limited geographic area or timeframe [24, 55]. In many settings, it also may not be possible to enroll enough level-2 units per setting (e.g., nurses on a ward, primary care physicians in a practice, specialty mental health clinicians in a clinic) or level-1 units (e.g., patients per provider). Below, we discuss the implications of these findings for researchers, funders of research, and the field.

Implications for researchers

Implementation research commonly randomizes highest-level units to implementation strategies and measures characteristics of these units that may predict implementation, such as organizational climate or culture, organizational or team leadership, or prevailing policies or norms within geopolitical units. If researchers wish to study multilevel mediation, they must either obtain a large number of highest-level units or choose potential mediating variables that are likely to have large effects. While it is not known how often such level-3 independent variables have large effects on putative lower-level mediators, there are some encouraging data on the potential for large associations between lower-level mediators and lowest-level outcomes. For example, in a meta-analysis of 79 studies, Godin et al. found variables from social cognitive theories explained up to 81% of the variance in providers’ intentions to execute healthcare behaviors, as well as 28% of the variance in physicians’ behavior, 24% of the variance in nurses’ behavior, and 55% of the variance in other healthcare professionals’ behavior [62]. These effect sizes are comparable to or larger than the effect size for the b3 path used in this study, suggesting that the variables proposed as antecedents to behavior in these theoretical models may serve as effective mediators linking level-3 independent variables to level-1 implementation outcomes.

Researchers can take steps to increase statistical power. One approach is to include a baseline covariate that is highly correlated with the outcome, ideally a pretest measure of the outcome itself. Doing so can substantially increase statistical power, in some cases reducing the required sample size by 50% [30, 38]. Future research is needed to characterize the types of pretest covariates that are available in implementation research, as well as the strength of the relationships between these covariates and pertinent implementation and clinical outcomes, as these will be important for study planning. Future research should also examine how unbalanced clusters influence power in multilevel mediation.
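As a rough single-level heuristic (an assumption made here for illustration, not the multilevel result in [30, 38]), ANCOVA-style adjustment for a baseline covariate with correlation r to the outcome shrinks the residual variance, and hence the required N, by a factor of 1 − r²:

```python
def n_reduction(r: float) -> float:
    """Proportional reduction in required N from adjusting for a
    baseline covariate correlated r with the outcome: required N
    scales with the residual variance factor 1 - r**2."""
    return r ** 2

for r in (0.3, 0.5, 0.7):
    print(f"r = {r:.1f}: required N reduced by ~{n_reduction(r):.0%}")
```

Under this heuristic, a pretest correlated 0.7 with the outcome roughly halves the required sample size, consistent with the figure cited above.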

Conclusions

This study assessed the sample sizes needed to test mediation in 3-level designs that are typical and plausible in implementation science in healthcare. Results suggest that large effect sizes, coupled with 40 or more highest-level units, are needed to test mediation. Innovations in research design are likely needed to increase the feasibility of studying mediation within the multilevel contexts common to implementation science.