Background

In medical, social, and behavioral research we often encounter datasets with a multilevel structure and multiple correlated dependent variables. An example of such a study is the Cognition and Radiation Study B [1, 2] that investigated whether local brain radiation (stereotactic radiosurgery) preserves cognitive functioning and quality of life better than whole brain radiation in cancer patients with multiple brain metastases. Patients were recruited from multiple hospitals and the treatment was executed in two treatment centers, giving the data a multilevel structure. Many other examples of such datasets can be found in a paper by Biswas and colleagues [3], who presented a nonexhaustive overview of hundreds of Bayesian trial protocols executed in a specialized center for cancer treatment. The authors noted that a) almost half of the reviewed studies were multicenter trials; and b) many studies were designed to assess effectiveness and side effects simultaneously, thus including at least two dependent variables.

Often, these multilevel, multivariate data are collected from a study population that consists of several subpopulations with potentially distinctive (i.e., heterogeneous) effects of an intervention. Examples of such studies are the two International Stroke Trials (International Stroke Trial (IST) and Third International Stroke Trial (IST-3); [4,5,6,7]), which investigated the effects of antiplatelet and antithrombotic treatments on various (neuro)psychological, functional and psychosocial dependent variables respectively. Both trials covered multiple treatment centers from multiple countries and included a variety of patient characteristics that could potentially predict treatment effects. We discuss the IST-3 in more depth as it serves as a running example throughout the paper. The IST-3 investigated the effects of an intravenous thrombolysis treatment on short-term (e.g., recurrent stroke, functional deficits) and long-term (e.g., dependency, depression, pain) indicators of health status among patients who suffered from an acute ischaemic stroke. The IST-3 data revealed considerable variation in characteristics of patients and disease, such as subtype or severity of stroke, blood pressure, and age, that can be predictive of treatment effects and call for exploration of treatment heterogeneity to gain insight into subpopulation-specific effects [8].

All of the abovementioned trials made treatment comparisons in the context of Randomized Controlled Trials (RCTs): randomized experiments in which an experimental or a control treatment is randomly assigned and administered to a random sample of patients. RCTs often aim to evaluate whether the experimental treatment is superior or (non-)inferior to the control condition and ultimately guide clinicians in evidence-based assignment of treatments and interventions [9]. Whereas RCTs are considered a gold standard for treatment comparison, their implementation is challenged by a growing demand for personalized treatment [10,11,12,13]. That is, clinical practice relies more and more on the idea that different patients react differently to treatments. Treatment prescription is increasingly guided by a trade-off between patient-specific risks and benefits, making the research context for these decisions multivariate and heterogeneous [14]. While demanding more complex methodology, personalization of treatments can impede the collection of sufficient data for rigorous treatment evaluation. Development of more targeted treatments limits eligibility for participation in trials, thereby making the recruitment of subjects more difficult. As a solution, trials more often span multiple treatment centers or countries. This adds another layer of complexity to the research context: clustered data that require multilevel analysis. To meet the methodological demands of these increasingly complex research problems, RCTs ideally provide a) a broad understanding of the treatment’s effects on multiple dependent variables; b) insights into potential dependencies of treatment effects on characteristics of patients; and c) an accurate handling of clustered data structures.
In practice, such comprehensive methods are less common, and often researchers resort to either ignoring the multilevel and/or heterogeneous structure, analyzing only a single dependent variable, or a combination of these. Below, we discuss how the abovementioned three aspects can be implemented in Randomized Controlled Trial methodology to support research in personalized treatment.

First, many RCTs evaluate more than one dependent variable, which are analysed separately in multiple univariate analyses [15]. As an example, the investigators of the IST-3 were primarily interested in living independently six months after stroke and secondarily in several other dependent variables, such as recurrent events, adverse reactions to the treatment, and mental health indicators. Analyzing dependent variables independently provides useful insights in treatment effects on each of these dependent variables individually, but discards available information about the relation between them. When the effects on individual dependent variables are complemented with information about their co-occurrences via multivariate analysis, a more detailed picture of treatment effects emerges. Multivariate analysis models relationships between dependent variables and can a) be helpful to detect outcome patterns that would be ignored when dependent variables are considered in isolation; and b) improve the accuracy of sample size computations and error rates in statistical decision-making [15,16,17,18].

Second, incorporating patient and/or disease characteristics in treatment comparison can result in a considerable improvement of the practical value of RCTs. The IST-3 used a sample of diverse patients with different personal and disease characteristics. This variation contains valuable information regarding differences in treatment effects. For example, knowing whether patients with different weights or blood pressures have different chances of a recurrent stroke or independent living has the potential to inform treatment recommendations. When treatments have distinct effects on patients with different characteristics, treatment effects are considered heterogeneous among (sub)populations of patients. In this case, average treatment effects (ATEs) give a global idea of treatment results among the trial population, but have limited value in targeting treatments to specific patients with their individual (disease) characteristics [19,20,21]. Conditional average treatment effects (CATEs) among specific patient groups provide insight in the variation of treatment effects among the population and help to distinguish patients who ultimately benefit from the treatment from those who do not or may even experience adverse treatment effects. Unfortunately, subgroup-specific treatment comparisons are insufficiently implemented as part of standard trial methodology yet [22]. If subgroups are targeted at all, their effects are often analyzed independently via stratified (or subgroup) analysis. Such a subgroup analysis disregards information from related subgroups and suffers from suboptimal power due to subsetting. Modelling heterogeneity is a more powerful alternative that directly uses the relation between subgroups and allows subgroups to borrow strength from each other [23,24,28,29,30]. 

Third, clustered data require specific analysis methods that are flexible enough to treat observations from the same cluster as more similar to each other than to observations from other clusters. If observations within clusters are indeed more similar, the clustered structure is reflected in variance partitioning, where the within-cluster and the between-cluster variances are modelled separately. This induces a dependence between the observations within clusters when marginalizing over the cluster-specific effects. When clustered observations are instead treated as independent, variance originating from differences between clusters is erroneously attributed to differences between individual observational units, and the amount of unique information in the data is overestimated. As a result, standard errors are underestimated, Type I error rates are inflated, and the validity of statistical inference is compromised. The larger the variance between clusters relative to the variance between observational units within clusters, the larger the effect on standard errors. Properly modelling the multilevel structure of clustered data and allowing the parameters to vary over clusters is therefore crucial for accurate statistical decision-making [28, 29].

The current paper presents a Bayesian multilevel multivariate logistic regression (BMMLR) framework to capture the three abovementioned methodological aspects in a comprehensive analysis and decision procedure for treatment comparison. We build upon an existing Bayesian multivariate logistic regression (BMLR) framework for single-level data to analyze multivariate binary data in the presence of treatment heterogeneity and present a multilevel extension to deal with multilevel data. The multilevel aspect adds another layer of complexity, making the analysis a non-trivial endeavour. We discuss the existing BMLR framework first. This framework consists of three coherent elements [25]:

  1. a multivariate modelling procedure to find unknown regression parameters;

  2. a transformation procedure to convert regression parameters to the probability scale to make analysis results more interpretable;

  3. a compatible decision procedure to draw conclusions regarding treatment superiority or inferiority with targeted Type I error rates.

The first element, the modelling procedure, assumes multivariate Bernoulli distributed dependent variables and assigns them a multinomial parametrization. A multinomial parametrization is helpful for two reasons, since it a) allows statisticians to draw and build upon existing, established multinomial techniques with tractable (conditional) posterior distributions; and b) has the flexibility to model correlations between dependent variables on the subpopulation level, which contributes to the accuracy of inference under treatment heterogeneity [18, 25, 31]. Several other multivariate modelling procedures, such as the multivariate probit model [32] or multivariate logistic regression models [33, 34], have a more restrictive correlation structure and are therefore theoretically less suitable to detect treatment heterogeneity with adequate error control. Moreover, the multivariate logistic regression model by Malik and Abraham [33] does not provide insight in the treatment effects on individual dependent variables. Copula structures have been proposed as promising multivariate alternatives as well, but these models can be difficult to apply to binary dependent variables [35,36,37]. The second element, the transformation procedure, builds upon the close relation between the multinomial and multivariate parametrizations to express results on the scale of (multivariate) success probabilities and differences between them, as a more intuitive alternative to multinomial (log-)odds. The transformed parameters provide understandable insights in the treatment’s performance on the trial population (i.e., ATEs) as well as subpopulations of interest (i.e., CATEs). The third element, the decision procedure, conveniently uses the Bayesian nature of the modelling procedure, allowing for inference on the posterior samples of transformed parameters. 
Decisions can be made in several ways to flexibly combine and weigh multiple dependent variables into a single decision for a population of interest, while taking correlations between dependent variables into account.

The main contribution of the current paper is the extension of the single-level BMLR framework to the multilevel context. The novel Bayesian multilevel multivariate logistic regression (BMMLR) framework provides BMLR with a multilevel model component and adjusts the transformation and decision procedure accordingly, to make the framework suitable for the multilevel context, resulting in accurate type I errors. The remainder of the paper is structured as follows. Section “BMMLR: Bayesian multilevel multivariate logistic regression” introduces the multilevel multivariate logistic regression model to obtain a sample from the posterior distribution of regression coefficients. Section “Transformation of posterior regression coefficients to the probability scale” outlines how to transform the obtained regression coefficients to more interpretable treatment effect parameters. Section “Decision-making based on multivariate treatment effects” discusses the decision procedure to use the treatment effect parameters for treatment comparison. Section “Numerical evaluation” demonstrates the performance of the model numerically via simulation and in Section “Illustration with IST-3 data” the methodology is illustrated with data from the IST-3. The paper concludes with a discussion in Section “Discussion”.

BMMLR: Bayesian multilevel multivariate logistic regression

Consider the general case with K binary dependent variables \(y^{k}_{ji}\), \(k \in \{1,\dots ,K\}\), for subject \(i \in \{1,\dots ,n_{j}\}\) in cluster \(j \in \{1,\dots ,J\}\). Outcome \(y^{k}_{ji}\) is Bernoulli distributed with success probability \(\theta ^{k}_{ji}\). The multivariate vector of K dependent variables, \(\varvec{y}_{ji} = \left(y^{1}_{ji}, \dots , y^{K}_{ji}\right)\), is multivariate Bernoulli distributed [31]. The multivariate Bernoulli distribution relies on a hybrid parameterization where a K-variate success probability in \(\varvec{\theta }_{ji} = \left(\theta ^{1}_{ji}, \dots , \theta ^{K}_{ji}\right)\) is expressed in terms of \(Q = 2^{K}\) multinomial joint response probabilities in \(\varvec{\phi }_{ji} = \left(\phi ^{1}_{ji}, \dots , \phi ^{Q}_{ji}\right)\) [31]. The \(q^{\text {th}}\) joint response probability in \(\varvec{\phi }_{ji}\) corresponds to multinomial response combination \(\varvec{h}^{q}\), which has length K and is given in the \(q^{\text {th}}\) row of the matrix of joint response combinations denoted by \(\varvec{H}\):

$$\begin{aligned} \varvec{H} = \left[ \begin{array}{ccccc} 1 &{} 1 &{} \dots &{} 1 &{} 1 \\ 1 &{}1 &{} \dots &{} 1 &{} 0 \\ &{} &{} \dots &{} &{}\\ 0 &{} 0 &{} \dots &{} 0 &{} 1 \\ 0 &{} 0 &{} \dots &{} 0 &{} 0\\ \end{array}\right] \end{aligned}$$
(1)

Hence, joint response probability \(\phi ^{q}_{ji} = p\left(\varvec{y}_{ji} = \varvec{h}^{q}\right)\). Note that the joint response probability \(\varvec{\phi }_{j}\) and the success probability \(\varvec{\theta }_{j}\) are identical in the univariate situation (i.e., \(K=1\)).
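The relation between the joint response probabilities and the success probabilities can be sketched numerically. The snippet below is a minimal illustration, not part of the original framework code: the function names `response_matrix` and `success_probabilities` and the values in `phi` are hypothetical. It builds \(\varvec{H}\) in the row order of Eq. 1 and sums the joint response probabilities of all combinations with a success on outcome k.

```python
import itertools
import numpy as np

def response_matrix(K):
    """Return the Q x K matrix H of joint response combinations (Eq. 1).

    Rows run from (1, ..., 1) down to (0, ..., 0), so Q = 2**K.
    """
    return np.array(list(itertools.product([1, 0], repeat=K)))

def success_probabilities(phi, H):
    """Sum joint probabilities over all combinations with a success on outcome k."""
    return H.T @ phi

H = response_matrix(2)                  # rows: 11, 10, 01, 00
phi = np.array([0.1, 0.2, 0.3, 0.4])    # illustrative joint response probabilities
theta = success_probabilities(phi, H)   # (phi1 + phi2, phi1 + phi3)
```

For \(K=2\), the first element of `theta` sums the probabilities of combinations 11 and 10, and the second sums those of 11 and 01.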

Likelihood of the data

The multinomial parametrization of multivariate Bernoulli distributed data makes it possible to model the relation between dependent variables \(\varvec{y}_{ji}\) and one or multiple predictor variables via multinomial logistic regression. Joint response probability \(\phi ^{q}_{ji}\) is then regressed on a vector of P covariates, \(\varvec{x}_{ji} = \left(x_{ji0}, \dots , x_{ji(P-1)}\right)\). Covariate \(x_{ji0} = 1\) is a constant to estimate the intercept and covariate \(x_{jip}\) for \(p \in \{1,\dots ,P-1\}\) can, for example, be a treatment indicator, a patient characteristic, or an interaction between these.

The relation between outcome vector \(\varvec{y}_{ji}\) and covariate vector \(\varvec{x}_{ji}\) is mapped with a multinomial logistic function that expresses the probability of \(\varvec{y}_{ji}\) being in response category q, conditional on \(\varvec{x}_{ji}\):

$$\begin{aligned} \phi ^{q}_{ji}{} & {} = p\left(\varvec{y}_{ji} = \varvec{h}^{q} | \varvec{x}_{ji}\right) \\ \nonumber{} & {} = \frac{\textrm{exp}\left(\psi ^{q}_{ji}\right)}{\sum \limits _{r=1}^{Q-1} \textrm{exp}\left(\psi ^{r}_{ji}\right) + 1}, \nonumber \end{aligned}$$
(2)

Here, \(\psi ^{q}_{ji}\) is a linear predictor:

$$\begin{aligned} \psi ^{q}_{ji} = \varvec{x}^{'}_{ji} \varvec{\gamma }^{q}_{j} \end{aligned}$$
(3)

In Eq. 3, regression coefficients for response category q, \(\varvec{\gamma }^{q}_{j} = \left(\gamma ^{q}_{0j},\dots ,\gamma ^{q}_{(P-1)j}\right)\) are unknown parameters of interest. Regression coefficients of response categories \(1,\dots ,Q-1\) are estimated, while regression coefficients of response category Q are fixed at zero (i.e., \(\varvec{\gamma }^{Q}_{j} = \varvec{0}\)) to ensure identifiability of the model. The entire set of regression coefficients in cluster j is denoted with \(\varvec{\gamma }_{j}\).
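As a minimal sketch of Eqs. 2 and 3, the function below computes the Q joint response probabilities for one subject from a coefficient matrix in which the reference category Q is fixed at zero. The function name and the example values are hypothetical, and subtracting the maximum linear predictor is a numerical-stability step added here that leaves the probabilities unchanged.

```python
import numpy as np

def joint_response_probabilities(x, gamma):
    """Joint response probabilities phi^1, ..., phi^Q for one subject (Eqs. 2-3).

    x     : covariate vector of length P (first entry 1 for the intercept)
    gamma : (Q-1) x P array with coefficients of categories 1..Q-1;
            category Q is the reference with coefficients fixed at zero.
    """
    psi = gamma @ x                  # linear predictors psi^1..psi^(Q-1), Eq. 3
    psi = np.append(psi, 0.0)        # psi^Q = 0, which yields the "+ 1" in Eq. 2
    psi = psi - psi.max()            # stabilise the exponentials
    expo = np.exp(psi)
    return expo / expo.sum()

# Illustrative values: Q = 4 categories, intercept plus one covariate.
gamma = np.array([[np.log(2.0), 0.0],
                  [0.0, 0.0],
                  [0.0, 0.0]])
phi = joint_response_probabilities(np.array([1.0, 0.0]), gamma)
```

With these values the linear predictors are \((\log 2, 0, 0, 0)\), so the first category is twice as likely as each of the others.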

A key aspect of multilevel models is that the regression coefficients \(\varvec{\gamma }_{j}^{q}\) are allowed to vary over clusters according to a common normal distribution on the second level. The common distribution for the random effects on the second level induces a dependency structure of the observations within clusters. The observations of different individuals in the same cluster are assumed to be conditionally independent given the cluster-specific random effects. The random effects distribution on the second level can be written as:

$$\begin{aligned} \gamma ^{q}_{pj}{} & {} = \gamma ^{q}_{p0} + u^{q}_{pj}\\\nonumber \varvec{u}^{q}_{j}{} & {} = \left(u^{q}_{0j},\dots ,u^{q}_{(P-1)j}\right) \sim N\left(\varvec{0},\varvec{\Sigma }^{q}\right) \nonumber \end{aligned}$$
(4)

Equation 4 consists of two elements that reflect the distributional parameters:

  1. The parameter \(\gamma ^{q}_{p0}\) is the common effect in the population and does not vary over clusters.

  2. The random effect \(u^{q}_{pj}\) quantifies the cluster-specific deviation from the common effect \(\gamma ^{q}_{p0}\).

Equation 4 can be adjusted to model cluster-specific predictors or cross-level interactions between cluster-level predictors and individual level-predictors. Further, Eq. 4 can be extended to model mixed effects, which combine regression coefficients that vary over clusters, which are called random effects, and regression coefficients that are identical for all clusters, which are called fixed effects. More information on the specification of more complex linear predictors can be found in general resources on multilevel models, such as Hox et al. [28] or Gelman and Hill [27]. In general, it should be noted that each additional random effect increases the number of parameters, affecting computational burden and estimation precision.
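The second-level structure in Eq. 4 can be illustrated by simulating cluster-specific coefficients for a single response category q. All numerical values below (number of clusters, common effects, covariance matrix) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

J, P = 50, 2                          # clusters and covariates (illustrative)
gamma_0 = np.array([-0.5, 0.8])       # common effects gamma^q_{p0}
Sigma = np.array([[0.30, 0.05],
                  [0.05, 0.10]])      # covariance matrix of the random effects

# Random effects u^q_j ~ N(0, Sigma), one row per cluster.
u = rng.multivariate_normal(np.zeros(P), Sigma, size=J)

# Cluster-specific coefficients: common effect plus cluster deviation (Eq. 4).
gamma_j = gamma_0 + u
```

Each row of `gamma_j` holds the intercept and slope of one cluster; a fixed effect would correspond to a column of `u` that is identically zero.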

Posterior distribution of regression coefficients

The primary goal of BMMLR is estimating the joint posterior distribution of unknown regression coefficients \(\varvec{\gamma }^{q}_{j}\), their means \(\varvec{\gamma }^{q}\), and their covariance matrices \(\varvec{\Sigma }^{q}\) for category \(q \in 1,\dots ,(Q-1)\). The posterior probability distribution of these parameters for category q is given by:

$$\begin{aligned} p\left(\varvec{\gamma }^{q}_{j}, \varvec{\gamma }^{q}, \varvec{\Sigma }^{q} | \varvec{y}\right) \propto p\left(\varvec{y}_{j}|\varvec{\gamma }^{q}_{j}\right) p\left(\varvec{\gamma }^{q}_{j} | \varvec{\gamma }^{q}, \varvec{\Sigma }^{q}\right) p\left(\varvec{\gamma }^{q}\right) p\left(\varvec{\Sigma }^{q}\right), \end{aligned}$$
(5)

where \(\varvec{\gamma }^{q}\) reflects the vector of average effects for category q, \(\varvec{\Sigma }^{q}\) is the covariance matrix of the effects across clusters for category q, and \(\varvec{\gamma }_{j}^{q}\) reflects the vector of cluster specific effects of cluster j for category q. The posterior probability distribution in Eq. 5 is proportional to the product of three types of probability distributions:

  1. The likelihood of the data quantifies the probability of the dependent variables conditional on cluster-specific regression coefficients, \(p(\varvec{y}_{j}|\varvec{\gamma }^{q}_{j})\), which is the multinomial logistic function given by Eq. 2;

  2. The probability distribution of the cluster-specific regression coefficients \(\varvec{\gamma }^{q}_{j}\) conditional on their means \(\varvec{\gamma }^{q}\) and covariance matrix \(\varvec{\Sigma }^{q}\) for category q, \(p(\varvec{\gamma }^{q}_{j} | \varvec{\gamma }^{q}, \varvec{\Sigma }^{q})\);

  3. The prior probability distributions of the regression coefficients’ means \(\varvec{\gamma }^{q}\), \(p(\varvec{\gamma }^{q})\), and covariance matrix \(\varvec{\Sigma }^{q}\), \(p(\varvec{\Sigma }^{q})\), for category q, before observing the data.

As the multinomial logistic function (Eq. 2) does not have a (conditionally) conjugate prior distribution, the functional form of the posterior distribution is unknown and the regression coefficients cannot be sampled directly from the posterior distribution. In the Supplemental material, we present a Gibbs sampling algorithm based on a Pólya-Gamma auxiliary variable expansion of the likelihood proposed by Polson et al. [38]. The expanded likelihood has a Gaussian form and can be combined with normal prior distributions on regression coefficients \(\varvec{\gamma }^{q}\) and an inverse-Wishart distribution on covariance matrix \(\varvec{\Sigma }^{q}\). The parameters are known to have conditionally conjugate posterior distributions and allow for direct sampling from their multivariate normal and inverse-Wishart distributions respectively, resulting in MCMC chains of the joint posterior distribution in Eq. 5. We also include a few comments on prior specification for the proposed Gibbs sampling procedure in the Supplemental material.

As an alternative to the proposed Gibbs sampling procedure, sampling from the posterior distribution(s) of multinomial logistic regression coefficients can theoretically be done with other standard MCMC-methods for non-conjugate prior-likelihood combinations, such as Metropolis-Hastings (e.g., [39], Ch.3 and 5; [40]; [41]) or Hamiltonian Monte Carlo (e.g., [

$$\begin{aligned} \psi ^{q}_{ji}= & {} \gamma ^{q}_{0j} + \gamma ^{q}_{1j} T_{ji} + \beta ^{q}_{2} NIHSS_{ji} + \beta ^{q}_{3} NIHSS_{ji} T_{ji}\\ \nonumber \gamma ^{q}_{0j}= & {} \gamma ^{q}_{00} + u_{0j}\\ \nonumber \gamma ^{q}_{1j}= & {} \gamma ^{q}_{10} + u_{1j}. \nonumber \end{aligned}$$
(6)

In Eq. 6, \(\varvec{x}_{ji} = (1,T_{ji}, NIHSS_{ji}, NIHSS_{ji}T_{ji})\) with treatment indicator \(T_{ji}\) and \(NIHSS_{ji}\) being the stroke severity score of subject i in hospital j. The \(Q=4\) resulting joint response categories are \((\{Strk7 = 1, Indep6 = 1\}, \{Strk7 = 1, Indep6 = 0\}, \{Strk7 = 0, Indep6 = 1\}, \left\{Strk7 = 0, Indep6 = 0\right\})\), which we refer to as \((\{11\}, \{10\}, \{01\}, \{00\})\).

Transformation to cluster-specific (differences between) probabilities

The main quantity of interest, the (cluster-specific) marginal multivariate treatment difference, is defined as the difference between cluster-specific multivariate success probabilities of the two treatments:

$$\begin{aligned} \delta ^{Strk7}_{j}= & {} \theta _{Aj}^{Strk7} - \theta _{Cj}^{Strk7}\\ \nonumber \delta ^{Indep6}_{j}= & {} \theta _{Aj}^{Indep6} - \theta _{Cj}^{Indep6} \nonumber \end{aligned}$$
(7)

where subscripts Aj and Cj indicate cluster-specific parameters of the (experimental) Alteplase and control treatments respectively. The elements on the right-hand sides of Eq. 7, success probabilities \(\theta _{Tj}^{k}\), are sums of the multinomial joint response probabilities of all response categories with a success on outcome k:

$$\begin{aligned} \theta _{Tj}^{Strk7}= & {} p(\varvec{y}_{j} = \{11\}|T) + p(\varvec{y}_{j} = \{10\}|T) = \phi _{Tj}^{1} + \phi _{Tj}^{2}\\ \theta _{Tj}^{Indep6}= & {} p(\varvec{y}_{j} = \{11\}|T) + p(\varvec{y}_{j} = \{01\}|T) = \phi _{Tj}^{1} + \phi _{Tj}^{3}\nonumber \end{aligned}$$
(8)

The multinomial joint response probabilities \(\varvec{\phi }_{Tj}\) that form the elements of success probabilities \(\varvec{\theta }_{Tj}\) follow from plugging in posterior regression coefficients \(\varvec{\gamma }^q_{j}\) in the linear predictor (Eq. 6) and the multinomial logistic link function (Eq. 2) for prespecified covariates \(\varvec{x}_{j}\) and for the relevant response category q.

$$\begin{aligned} \phi ^{q}_{Tj} = \frac{\textrm{exp}{\left(\psi ^{q}_{Tj}\right)}}{\sum \limits _{r=1}^{Q-1} \textrm{exp}{\left(\psi ^{r}_{Tj}\right)} + 1}. \end{aligned}$$
(9)

The information in covariate vector \(\varvec{x}_{j}\), which directly affects \(\psi _{Tj}^q\), determines the treatment as well as the subpopulation of interest. Subpopulations can be defined by a single value, such as a stroke severity score of one standard deviation below or above the mean, that can be plugged directly into Eqs. 2 and 6. When interested in a subpopulation that is defined by an interval, such as the groups of stroke severity in the IST-3, the joint response probability is marginalized over the specified interval or averaged over a sample of observations in this interval. In the latter case, joint response probability \(\phi ^{q}_{Tj}\) is computed for each observed subject \(i \in \{1,\dots ,n_{j}\}\) via Eq. 2. The joint response probability for each treatment T is then computed by averaging over all subjects i in treatment T and cluster j.

Since the model in Section “BMMLR: Bayesian multilevel multivariate logistic regression” resulted in a sample of L posterior draws of each regression coefficient, multivariate treatment differences are computed for each draw (l) separately. The resulting posterior samples can be summarized with standard descriptive methods.
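The transformation in Eqs. 7 to 9 can be sketched for one cluster and a subpopulation defined by a single stroke severity value. The sketch assumes that the posterior draws of cluster j are stored as an L × (Q−1) × P array with covariates ordered as in Eq. 6; this storage layout and the function names `softmax_with_reference` and `treatment_differences` are hypothetical choices made for illustration.

```python
import numpy as np

def softmax_with_reference(psi):
    """Eq. 9 for arrays of shape (..., Q-1); the reference category has psi = 0."""
    psi_full = np.concatenate([psi, np.zeros(psi.shape[:-1] + (1,))], axis=-1)
    psi_full = psi_full - psi_full.max(axis=-1, keepdims=True)
    expo = np.exp(psi_full)
    return expo / expo.sum(axis=-1, keepdims=True)

def treatment_differences(gamma_draws, nihss):
    """Posterior draws of (delta^Strk7_j, delta^Indep6_j) for one cluster.

    gamma_draws : L x (Q-1) x P array of coefficient draws for cluster j,
                  covariates ordered as (1, T, NIHSS, NIHSS * T).
    nihss       : stroke severity value that defines the subpopulation.
    """
    H = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])    # categories 11, 10, 01, 00
    theta = {}
    for T in (1, 0):                                   # Alteplase, control
        x = np.array([1.0, T, nihss, nihss * T])
        phi = softmax_with_reference(gamma_draws @ x)  # L x Q, Eqs. 6 and 9
        theta[T] = phi @ H                             # L x K, Eq. 8
    return theta[1] - theta[0]                         # L x K, Eq. 7

# Illustrative draws: only the treatment coefficient of category 11 is non-zero.
draws = np.zeros((5, 3, 4))
draws[:, 0, 1] = np.log(2.0)
delta = treatment_differences(draws, nihss=0.0)
posterior_mean = delta.mean(axis=0)
interval_95 = np.quantile(delta, [0.025, 0.975], axis=0)
```

Because the differences are computed per draw, posterior means and quantile-based intervals follow directly from the resulting L × K array.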

Pooling treatment effects over clusters

As a last step, cluster-specific estimates are pooled into estimates of average or conditional treatment effects among (sub)populations of interest via the following procedure:

$$\begin{aligned} \varvec{\delta } = \frac{\sum \limits _{j=1}^{J} n_{j} \varvec{\delta }_{j}}{\sum \limits _{j=1}^{J} n_{j}} \end{aligned}$$
(10)

This pooling strategy weighs cluster-specific estimates by cluster size, thereby balancing data with unequal cluster sizes.
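Eq. 10 amounts to a weighted average with cluster sizes as weights. A minimal sketch, with a hypothetical function name and illustrative numbers, is given below; it accepts either a J × K array of cluster-specific differences or an L × J × K array of posterior draws.

```python
import numpy as np

def pool_over_clusters(delta_j, n_j):
    """Cluster-size weighted average of cluster-specific differences (Eq. 10).

    delta_j : J x K array, or L x J x K array of posterior draws
    n_j     : length-J vector of cluster sizes
    """
    n_j = np.asarray(n_j, dtype=float)
    # Sum n_j * delta_j over the cluster axis, then divide by the total sample size.
    return np.einsum("j,...jk->...k", n_j, np.asarray(delta_j)) / n_j.sum()

# Illustrative values: two clusters of sizes 1 and 3, two dependent variables.
delta_j = np.array([[0.1, 0.2],
                    [0.3, 0.4]])
delta = pool_over_clusters(delta_j, n_j=[1, 3])
```

Applying the function to an L × J × K array pools each posterior draw separately, so the pooled differences retain a full posterior sample.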