1 Introduction

Official statistics contain estimates of socioeconomic indicators at different levels of aggregation. In many sampling designs, small sample sizes do not allow accurate direct estimators to be calculated at low levels of aggregation. These territories or population groups are called small areas. Small Area Estimation (SAE) offers a solution to this problem by incorporating auxiliary information into the data analysis and by introducing model-based predictors. The books of Rao and Molina (2015) and Morales et al. (2021) give a general description of SAE.

The Spanish household budget survey (SHBS) provides information about the nature and destination of household consumption expenses, as well as on various characteristics of household living conditions. Spain is hierarchically partitioned into 17 autonomous communities and 50 provinces, plus 2 autonomous cities. The sampling design and the sample sizes of the SHBS are designed to provide estimates at the level of the 17 autonomous communities, but not for the provinces. The direct estimates at the province level have low accuracy and, therefore, estimating SHBS indicators at that level is a SAE problem. This paper has two objectives. The first is to model the unit-level proportions of annual household expenditures on food, housing and others. The second is to estimate the averages of these proportions, by Spanish provinces.

Under area-level models, there are several proposals for estimating domain proportions and counts. For example, Esteban et al. (2012), Marhuenda et al. (2013, 2014) and Morales et al. (2015) derived predictors based on linear mixed models, while Chambers et al. (2014), Dreassi et al. (2014), Tzavidis et al. (2015) and Boubeta et al. (2016, 2017) applied binomial, negative binomial or Poisson regression models. There are also methodologies for estimating proportions and counts in the setup of contingency tables or multinomial regression models. Without being exhaustive, we find the papers of Zhang and Chambers (2004) and Berg and Fuller (2014) for contingency tables, and the papers of Ferrante and Trivisano (2010), Souza and Moura (2016), Fabrizi et al. (2016), Saei and Chambers (2003), Molina et al. (2007) and López-Vizcaíno et al. (2013, 2015) for multinomial regression models. However, in household survey samples, some variables of interest and domain indicators are compositions, that is, positive quantities summing to one or to a known integer. Concerning area-level models for compositional data, Esteban et al. (2020) and Krause et al. (2022) transformed compositions into target vectors of multivariate Fay-Herriot models in order to make model-based predictions, like those described by González-Manteiga et al. (2008a), Benavent and Morales (2016, 2021) or Arima et al. (2017).

The statistical literature presents some contributions to small area estimation of proportions and counts under unit-level models for binary outcomes. For example, Chambers et al. (2016), Hobza and Morales (2016), Hobza et al. (2018) and Burgard et al. (2021) derived predictors under M-quantile or binomial-logit models for binary outcomes. These approaches are based on univariate models and not on models for compositional data that allow jointly estimating the counts or proportions of all the categories of a classification variable. This issue was addressed by Scealy and Welsh (2017), who introduced a directional mixed effects model for compositional data and predicted the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. A different approach was employed by Hijazi and Jernigan (2009), Camargo et al. (2012), Tsagris and Stewart (2018) and Morais et al. (2018), who modelled compositional data using Dirichlet regression models. This manuscript also deals with unit-level compositional data, but it proposes to fit multivariate linear mixed models to logratio transformations of compositions. Some references on the foundations of compositional data analysis are the books of Aitchison (1986) and Pawlowsky-Glahn and Buccianti (2011) and the papers of Egozcue et al. (2003) and Egozcue and Pawlowsky-Glahn (2019), where some basic transformations of compositions are studied.

This paper introduces small area predictors of averages of unit-level vectors of compositions. To this end, the paper considers three logratio transformations of compositions into vectors of \(R^m\): the additive, centered and isometric logratio transformations. We propose a multivariate nested error regression (MNER) model for analyzing the transformed SHBS compositional data, where the vectors of random effects and the vectors of model errors have unstructured covariance matrices with unknown components. The estimates of the MNER model parameters are obtained by the residual maximum likelihood (REML) method, as described in Esteban et al. (2022a). The fitted model is then used to predict averages of proportions of annual household expenditures on food, housing and others, by Spanish provinces. The empirical best and plug-in predictors of small area compositional parameters are derived similarly as in Esteban et al. (2022b).

The estimation of the mean squared error (MSE) of a model-based predictor is an important issue that has no easy solution. Under nonlinear models, the problem is even more difficult. We follow the resampling approach appearing in González-Manteiga et al. (2007, 2008b) to implement a parametric bootstrap procedure.

This paper introduces statistical methodology that is new in four main aspects: (1) the employment of three transformations of unit-level compositional survey data, (2) the use of MNER models with unstructured covariance matrix for modelling the transformed data and capturing the sample correlations, (3) the derivation of domain-level predictors of averages of compositions based on the MNER model fitted to the transformed unit-level data, and (4) the introduction of parametric bootstrap estimators of the MSEs of the new predictors.

The remainder of the paper is organized as follows: Section 2 establishes the probabilistic framework, describes the SAE problem of interest and presents the MNER model. Section 3 derives empirical best predictors (EBP) and plug-in predictors of average compositions and gives a parametric bootstrap method for estimating the MSEs of the EBPs. Section 4 presents three simulation experiments. The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model. Simulation 2 investigates the performance of the EBPs and plug-in predictors, and Simulation 3 analyzes the parametric bootstrap estimator of the MSEs. Section 5 applies the proposed methodology to data from the SHBS of 2016 in Spain. Section 6 gives some conclusions. The paper contains four appendices in a supplementary material file. Appendix A describes the additive, centered and isometric logratio transformations of compositions. Appendix B gives further simulation results. Appendix C analyzes the SHBS data with different transformations. Appendix D performs the application to SHBS data without applying logratio transformations of compositions.

2 The probabilistic framework

Let U be a population of size N partitioned into D domains or areas \(U_1,\ldots ,U_D\) of sizes \(N_1,\ldots ,N_D\), respectively. Let \(N=\sum _{d=1}^DN_d\) be the global population size. Let us consider the probability vector \(a_{dj}^{+}=(a_{dj1},\ldots ,a_{djm+1})^\prime \in R^{m+1}\) representing proportions associated with the \(m+1\) categories of a classification variable that is defined on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\). For example, \(a_{dj}^{+}\) may contain the proportions of annual household expenditures in the different expense categories. The components of \(a_{dj}^{+}\) are nonnegative and fulfill the constraint \(a_{dj1}+\ldots +a_{djm+1}=1\). These vectors \(a_{dj}^{+}\) are called compositions or \((m+1)\)-part compositions, and the vectors \(a_{dj}=(a_{dj1},\ldots ,a_{djm})^\prime \) are called m-part compositions. Compositional data, consisting of compositions, play an important role in official statistics. Compositions take values in the simplex embedded in \(R^{m+1}\)

$$\begin{aligned} {{\mathcal {S}}}^{m}_e=\big \{(a_1,\ldots ,a_{m+1})^\prime \in R^{m+1}:\, a_1>0,\ldots ,a_{m+1}>0,\,a_1+\ldots +a_{m+1}=1\big \}, \end{aligned}$$

and m-part compositions take values in the m-dimensional simplex defined by

$$\begin{aligned} {{\mathcal {S}}}^{m}=\big \{(a_1,\ldots ,a_m)^\prime \in R^m:\, a_1>0,\ldots ,a_m>0,\,a_1+\ldots +a_m<1\big \}. \end{aligned}$$

This paper deals with the problem of predicting domain average compositions

$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}a_{djk},\quad d=1,\ldots ,D,\,\, k=1,\ldots ,m+1, \end{aligned}$$
(2.1)

under a compositional data analysis approach. That is, we apply a one-to-one transformation, \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\), to m-part compositions and we assume that the transformed vectors follow a multivariate regression model. Appendix A presents three widely employed transformations: the additive, centered and isometric logratio transformations. The components of the transformed vectors \(y_{dj}=h(a_{dj})=(y_{dj1},\ldots ,y_{djm})^{\prime }\) are continuous variables measured on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\).
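As a minimal illustration of such a transformation, consider the additive logratio (alr) case of Appendix A; the following Python sketch (function names are ours, not the paper's) maps an \((m+1)\)-part composition to \(R^m\) and back:

```python
import numpy as np

def alr(a_plus):
    """Additive logratio: maps an (m+1)-part composition (positive entries
    summing to one) to a vector in R^m, taking the last part as reference."""
    a_plus = np.asarray(a_plus, dtype=float)
    return np.log(a_plus[:-1] / a_plus[-1])

def alr_inv(y):
    """Inverse map h^{-1}: recovers the full (m+1)-part composition."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

a_plus = np.array([0.5, 0.3, 0.2])   # a 3-part composition (m = 2)
y = alr(a_plus)                      # transformed vector in R^2
assert np.allclose(alr_inv(y), a_plus)
```

The centered (clr) and isometric (ilr) transformations of Appendix A follow the same pattern, differing only in the reference used inside the logratios.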

For \(k=1,\ldots ,m\), let \({x}_{djk}=(x_{djk1},\ldots ,x_{djkp_k})\) be a row vector containing \(p_k\) explanatory variables and let \({X}_{dj}=\text{ diag }\left( {x}_{dj1},\ldots ,{x}_{djm}\right) _{m\times p}\) with \(p=p_1+\ldots +p_m\). Let \(\beta _{k}\) be a column vector of size \(p_k\) containing regression parameters and let \(\beta =\left( \beta _{1}^{\prime },\ldots ,\beta _{m}^{\prime }\right) ^{\prime }_{p\times 1}\). We assume that the transformed vectors \(y_{dj}\)’s follow the population MNER model

$$\begin{aligned} y_{dj}=X_{dj}\beta +u_d+e_{dj},\quad d=1,\ldots ,D,\, j=1,\ldots ,N_d, \end{aligned}$$
(2.2)

where the vectors of random effects \(u_{d}\)’s and random errors \(e_{dj}\)’s are independent with multivariate normal distributions

$$\begin{aligned} u_{d}=(u_{d1},\ldots ,u_{dm})^\prime \sim N_m(0,V_{ud}),\quad e_{dj}=(e_{dj1},\ldots ,e_{djm})^\prime \sim N_m(0,V_{edj}). \end{aligned}$$

The \(m\times m\) covariance matrices \({V}_{ud}\) depend on \(q=m(m+1)/2\) unknown parameters, denoted by

$$\begin{aligned} \theta _u=(\theta _{u1},\ldots ,\theta _{uq})^\prime = (\sigma _{u1}^2,\sigma _{u2}^2,\ldots ,\sigma _{um}^2,\rho _{u12},\rho _{u13},\ldots ,\rho _{u23},\rho _{u24},\ldots ,\rho _{um-1,m})^{\prime }. \end{aligned}$$

The matrix \(V_{ud}\) is

$$\begin{aligned} V_{ud}= \left( \begin{array}{cccc} \sigma _{u1}^2 &{} \rho _{u12}\sigma _{u1}\sigma _{u2} &{} \cdots &{} \rho _{u1m}\sigma _{u1}\sigma _{um} \\ \rho _{u12}\sigma _{u1}\sigma _{u2} &{} \sigma _{u2}^2 &{} \cdots &{} \rho _{u2m}\sigma _{u2}\sigma _{um} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{u1m}\sigma _{u1}\sigma _{um} &{} \rho _{u2m}\sigma _{u2}\sigma _{um} &{} \ldots &{} \sigma _{um}^2 \end{array}\right) . \end{aligned}$$

The \(m\times m\) covariance matrices \({V}_{edj}\) depend on q unknown parameters, i.e.

$$\begin{aligned} \theta _e=(\theta _{e1},\ldots ,\theta _{eq})^\prime = (\sigma _{e1}^2,\sigma _{e2}^2,\ldots ,\sigma _{em}^2,\rho _{e12},\rho _{e13},\ldots ,\rho _{e23},\rho _{e24},\ldots ,\rho _{em-1,m})^{\prime }. \end{aligned}$$

The matrix \(V_{edj}\) is

$$\begin{aligned} { V_{edj}}= \left( \begin{array}{cccc} \sigma _{e1}^2 &{} \rho _{e12}\sigma _{e1}\sigma _{e2} &{} \cdots &{} \rho _{e1m}\sigma _{e1}\sigma _{em} \\ \rho _{e12}\sigma _{e1}\sigma _{e2} &{} \sigma _{e2}^2 &{} \cdots &{} \rho _{e2m}\sigma _{e2}\sigma _{em} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{e1m}\sigma _{e1}\sigma _{em} &{} \rho _{e2m}\sigma _{e2}\sigma _{em} &{} \ldots &{} \sigma _{em}^2 \end{array}\right) . \end{aligned}$$
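The assembly of these unstructured covariance matrices from the variance and correlation parameters can be sketched as follows (a minimal Python illustration; the function name is ours, and the example values match the simulation settings of Sect. 4):

```python
import numpy as np

def unstructured_cov(sigma2, rho):
    """Assemble an m x m unstructured covariance matrix from the variances
    sigma2 = (sigma_1^2, ..., sigma_m^2) and the correlations
    rho = (rho_12, rho_13, ..., rho_{m-1,m}), listed row by row."""
    m = len(sigma2)
    sd = np.sqrt(np.asarray(sigma2, dtype=float))
    V = np.diag(np.asarray(sigma2, dtype=float))
    idx = 0
    for i in range(m):
        for j in range(i + 1, m):
            V[i, j] = V[j, i] = rho[idx] * sd[i] * sd[j]
            idx += 1
    return V

V_ud = unstructured_cov([0.75, 0.75], [-0.4])   # m = 2: off-diagonal -0.3
assert np.allclose(V_ud, V_ud.T)
```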

The \(2q\times 1\) vector of variance component parameters is \(\theta =(\theta _u^\prime ,\theta _e^\prime )^\prime \). The \((p+2q)\times 1\) vector of model parameters is \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\). Let \({I}_a\) be the \(a\times a\) identity matrix. We define the \(mN_d\times 1\) vectors \(y_d\) and \(e_d\), the \(mN_d\times p\) matrix \(X_d\) and the \(mN_d\times m\) matrix \(Z_d\) as follows:

$$\begin{aligned} y_d=\underset{1\le j \le N_d}{\hbox {col}}(y_{dj}),\,\,\, e_d=\underset{1\le j \le N_d}{\hbox {col}}(e_{dj}),\,\,\, X_d=\underset{1\le j \le N_d}{\hbox {col}}(X_{dj}),\,\,\, Z_d=\underset{1\le j \le N_d}{\hbox {col}}(I_m). \end{aligned}$$

Model (2.2) can be written in the domain-level form

$$\begin{aligned} y_{d}=X_{d}\beta +Z_du_d+e_{d},\quad d=1,\ldots ,D, \end{aligned}$$
(2.3)

where the vectors \(u_d\) and \(e_d\sim N_{mN_d}(0,V_{ed})\) are independent and \(V_{ed}=\underset{1\le j \le N_d}{\hbox {diag}}(V_{edj})\). We define the \(mN\times 1\) vectors y and e, the \(mD\times 1\) vector u, the \(mN\times p\) matrix X and \(mN\times mD\) matrix Z as follows:

$$\begin{aligned} y= & {} \underset{1\le d \le D}{\hbox {col}}(y_{d}),\,\,\, e=\underset{1\le d \le D}{\hbox {col}}(e_{d}),\,\,\, u=\underset{1\le d \le D}{\hbox {col}}(u_{d}),\,\,\, X=\underset{1\le d \le D}{\hbox {col}}(X_{d}),\,\,\,\\ Z= & {} \underset{1\le d \le D}{\hbox {diag}}(Z_d). \end{aligned}$$

Model (2.2) can be written in the linear mixed model form

$$\begin{aligned} y=X\beta +Zu+e, \end{aligned}$$
(2.4)

where \(u\sim N_{mD}(0,{V}_{u})\) and \(e\sim N_{mN}(0,V_{e})\) are independent, \(V_{u}=\underset{1\le d \le D}{\hbox {diag}}(V_{ud})\) and \(V_{e}=\underset{1\le d \le D}{\hbox {diag}}(V_{ed})\).

Under the predictive approach to inference in finite populations, statistical procedures are based on a fixed subset (called sample), \(s=\cup _{d=1}^Ds_d\), of the finite population U. Let \(n_d\) be the size of the domain subset \(s_d\subset U_d\), \(d=1,\ldots ,D\), and let \(n=n_1+\ldots +n_D\) be the total sample size. The complementary domain subsets are \(r_d=U_d-s_d\), \(d=1,\ldots ,D\). Let \(y_s\) and \(y_{ds}\) be the sub-vectors of y and \(y_d\) corresponding to the sample elements, and let \(y_r\) and \(y_{dr}\) be the sub-vectors of y and \(y_d\) corresponding to the out-of-sample elements. Without loss of generality, we can write \(y_d=(y_{ds}^\prime ,y_{dr}^\prime )^\prime \). Define also the corresponding decompositions of \(X_d\), \(Z_d\) and \(V_d\). As the sample indexes are assumed fixed, the sample sub-vectors \(y_{ds}\) follow the marginal models derived from the population model (2.3), i.e.

$$\begin{aligned} y_{ds}=X_{ds}\beta +Z_{ds}u_d+e_{ds},\quad d=1,\ldots ,D, \end{aligned}$$
(2.5)

where \(u_d\sim N_m(0,{V}_{ud})\) and \(e_{ds}\sim N_{mn_d}(0,V_{eds})\) are independent and \(V_{eds}=\underset{1\le j \le n_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{ds}\) are independent with \(y_{ds}\sim N_{mn_d}(\mu _{ds},V_{ds})\), \(\mu _{ds}=X_{ds}\beta \), \(V_{ds}=Z_{ds}V_{ud}Z_{ds}^\prime +V_{eds}\).

When the variance component parameters are known, the best linear unbiased estimator (BLUE) of \(\beta \) and the best linear unbiased predictor (BLUP) of \(u_d\), \(d=1,\ldots ,D\), are

$$\begin{aligned} \hat{\beta }_B=\bigg (\sum _{d=1}^DX_{ds}^\prime {V}_{ds}^{-1}X_{ds}\bigg )^{-1}\sum _{d=1}^DX_{ds}^\prime {V}_{ds}^{-1}y_{ds},\,\, \hat{u}_{Bd}=V_{ud}Z_{ds}^\prime V_{ds}^{-1}\big (y_{ds}-X_{ds}\hat{\beta }_B\big ). \end{aligned}$$

Let \({\hat{\theta }}\) be the REML estimator of \(\theta \), then the empirical BLUE (EBLUE) of \(\beta \) and the empirical BLUP (EBLUP) of \(u_d\), \(d=1,\ldots ,D\), are

$$\begin{aligned} \hat{\beta }=\bigg (\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}X_{ds}\bigg )^{-1}\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}y_{ds},\,\, \hat{u}_{d}=\hat{V}_{ud}Z_{ds}^\prime \hat{V}_{ds}^{-1}\big (y_{ds}-X_{ds}\hat{\beta }\big ), \end{aligned}$$

where \(\hat{V}_{ds}\) and \(\hat{V}_{ud}\) are obtained by substituting \(\theta \) by \({\hat{\theta }}\) in \({V}_{ds}\) and \({V}_{ud}\), respectively. We calculate the inverse of \(V_{ds}=V_{eds}+Z_{ds}V_{ud}Z_{ds}^\prime =A+BCD\) by applying the formula

$$\begin{aligned} (A+BCD)^{-1}=A^{-1}-A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}. \end{aligned}$$

As the error covariance matrices \(V_{edj}\) do not depend on j, we have \(Z_{ds}^\prime V_{eds}^{-1}Z_{ds}=\sum _{j=1}^{n_d}V_{edj}^{-1}=n_dV_{edj}^{-1}\), and we obtain

$$\begin{aligned} V_{ds}^{-1}= & {} V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\big (V_{ud}^{-1}+Z_{ds}^\prime V_{eds}^{-1}Z_{ds}\big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\\= & {} V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}. \end{aligned}$$
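This inversion identity can be checked numerically. The following minimal Python sketch (with illustrative parameter values, not the paper's fitted ones) inverts \(V_{ds}\) while computing only \(m\times m\) inverses, exactly as in the display above:

```python
import numpy as np

m, n_d = 2, 5  # vector dimension and domain sample size (illustrative)
V_ud = np.array([[0.75, -0.3], [-0.3, 0.75]])    # random-effect covariance
V_edj = np.array([[0.5, 0.2], [0.2, 0.5]])       # common error covariance

Z_ds = np.tile(np.eye(m), (n_d, 1))              # col_{1<=j<=n_d}(I_m)
V_eds = np.kron(np.eye(n_d), V_edj)              # diag_{1<=j<=n_d}(V_edj)
V_ds = Z_ds @ V_ud @ Z_ds.T + V_eds

# Inversion via the displayed identity: only m x m matrices are inverted
V_eds_inv = np.kron(np.eye(n_d), np.linalg.inv(V_edj))
middle = np.linalg.inv(np.linalg.inv(V_ud) + n_d * np.linalg.inv(V_edj))
V_ds_inv = V_eds_inv - V_eds_inv @ Z_ds @ middle @ Z_ds.T @ V_eds_inv

assert np.allclose(V_ds_inv, np.linalg.inv(V_ds))
```

This matters in practice because \(V_{ds}\) is \(mn_d\times mn_d\), while the identity reduces the work to a few \(m\times m\) inversions per domain.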

As the sample indexes are fixed, the out-of-sample sub-vectors \(y_{dr}\) follow the marginal models derived from the population model (2.3), i.e.

$$\begin{aligned} y_{dr}=X_{dr}\beta +Z_{dr}u_d+e_{dr},\quad d=1,\ldots ,D, \end{aligned}$$

where \(u_d\sim N_m(0,{V}_{ud})\) and \(e_{dr}\sim N_{m(N_d-n_d)}(0,V_{edr})\) are independent and \(V_{edr}=\underset{n_d+1\le j \le N_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{dr}\) are independent with \(y_{dr}\sim N_{m(N_d-n_d)}(\mu _{dr},V_{dr})\), \(\mu _{dr}=X_{dr}\beta \), \(V_{dr}=Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}\). The covariance matrix between \(y_{dr}\) and \(y_{ds}\) is

$$\begin{aligned} V_{drs}=\text{ cov }(y_{dr},y_{ds})= & {} \text{ cov }(X_{dr}\beta +Z_{dr}u_d+e_{dr}, X_{ds}\beta +Z_{ds}u_d+e_{ds})\\= & {} Z_{dr}\text{ var }(u_d)Z_{ds}^\prime = Z_{dr}V_{ud}Z_{ds}^\prime . \end{aligned}$$

The distribution of \(y_{dr}\), given the sample data \(y_{s}\), is

$$\begin{aligned} y_{dr}|y_{s}\sim y_{dr}|y_{ds}\sim N(\mu _{dr|s},V_{dr|s}). \end{aligned}$$

The conditional \((N_d-n_d)\times 1\) mean vector is

$$\begin{aligned} \mu _{dr|s}= & {} \mu _{dr}+V_{drs}V_{ds}^{-1}(y_{ds}-\mu _{ds})= X_{dr}\beta +Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}(y_{ds}-X_{ds}\beta ) \\= & {} X_{dr}\beta +Z_{dr}V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \}(y_{ds}-X_{ds}\beta ). \end{aligned}$$

The conditional covariance matrix is

$$\begin{aligned} V_{dr|s}= & {} V_{dr}-V_{drs}V_{ds}^{-1}V_{dsr} = Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr} -Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}Z_{ds}V_{ud}Z_{dr}^\prime \\= & {} Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr} -Z_{dr}V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \} Z_{ds}V_{ud}Z_{dr}^\prime \\= & {} Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}-{n_d}Z_{dr}V_{ud}V_{edj}^{-1}V_{ud}Z_{dr}^\prime +n_d^2Z_{dr}V_{ud}V_{edj}^{-1}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}V_{edj}^{-1}V_{ud}Z_{dr}^\prime . \end{aligned}$$

Note that

$$\begin{aligned} Z_{ds}^\prime V_{eds}^{-1}(y_{ds}-X_{ds}\beta )=\sum _{j=1}^{n_d}V_{edj}^{-1}(y_{dj}-X_{dj}\beta ). \end{aligned}$$

If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times 1\) mean vector is

$$\begin{aligned} \mu _{dj|s}= & {} X_{dj}\beta +V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\big (V_{ud}^{-1}+n_dV_{edj}^{-1}\big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \}(y_{ds}-X_{ds}\beta ) \\= & {} X_{dj}\beta +V_{ud}\Big \{I_m-n_dV_{edj}^{-1}\big (V_{ud}^{-1}+n_dV_{edj}^{-1}\big )^{-1}\Big \}\sum _{i=1}^{n_d}V_{edi}^{-1}(y_{di}-X_{di}\beta ). \end{aligned}$$

If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times 1\) mean vector is

$$\begin{aligned} \mu _{dj|s}=X_{dj}\beta . \end{aligned}$$

If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times m\) covariance matrix is

$$\begin{aligned} V_{dj|s}=V_{d|s}=V_{ud}+V_{edj}-n_dV_{ud}V_{edj}^{-1}V_{ud} +n_d^2V_{ud}V_{edj}^{-1}\Big (V_{ud}^{-1}+n_dV_{edj}^{-1}\Big )^{-1}V_{edj}^{-1}V_{ud}. \end{aligned}$$

If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times m\) covariance matrix is

$$\begin{aligned} V_{dj|s}=V_{d|s}=V_{ud}+V_{edj}. \end{aligned}$$
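The equivalence between this closed-form conditional covariance and the direct formula \(V_{dr}-V_{drs}V_{ds}^{-1}V_{dsr}\) can be verified numerically; the following minimal Python sketch does so for a single out-of-sample unit with illustrative parameter values:

```python
import numpy as np

n_d = 5                                           # in-sample size of domain d
V_ud = np.array([[0.75, -0.3], [-0.3, 0.75]])     # random-effect covariance
V_edj = np.array([[0.5, 0.2], [0.2, 0.5]])        # common error covariance
m = V_ud.shape[0]

# Direct conditional covariance from the joint normal of (y_dj, y_ds)
Z_ds = np.tile(np.eye(m), (n_d, 1))
V_ds = Z_ds @ V_ud @ Z_ds.T + np.kron(np.eye(n_d), V_edj)
V_djs = V_ud @ Z_ds.T                             # cov(y_dj, y_ds), j out of sample
direct = V_ud + V_edj - V_djs @ np.linalg.inv(V_ds) @ V_djs.T

# Closed-form expression for V_{d|s} involving only m x m inverses
Ve_inv = np.linalg.inv(V_edj)
M = np.linalg.inv(np.linalg.inv(V_ud) + n_d * Ve_inv)
closed = (V_ud + V_edj - n_d * V_ud @ Ve_inv @ V_ud
          + n_d**2 * V_ud @ Ve_inv @ M @ Ve_inv @ V_ud)

assert np.allclose(direct, closed)
```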

3 Predictors of average compositions

This section deals with the problem of predicting the domain average compositions \(A_{dk}\), \(d=1,\ldots ,D\), \(k=1,\ldots ,m+1\), defined in (2.1). As explained in Sect. 2 and Appendix A, we first transform the m-part compositions \({a}_{dj}=({a}_{dj1},\ldots ,{a}_{djm})^\prime \) into vectors of \(R^m\). This is done by applying a one-to-one function \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\). The transformed vectors \(y_{dj}=h(a_{dj})\) have components \(y_{dj1}=h_1(a_{dj}),\ldots ,y_{djm}=h_m(a_{dj})\). Let \(h^{-1}=(h_1^{-1},\ldots ,h_m^{-1})^\prime :R^m\mapsto {{\mathcal {S}}}^m\) be the inverse function of h, so that \(a_{dj1}=h_1^{-1}(y_{dj}),\ldots ,a_{djm}=h_m^{-1}(y_{dj})\).

For estimating \(A_{dk}\), \(k=1,\ldots ,m+1\), we assume that \(y_{dj}=(y_{dj1},\ldots ,y_{djm})^\prime \) follows the MNER model (2.2). For \(d=1,\ldots ,D\), the target parameters can be written as

$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}h_k^{-1}(y_{dj}),\,\,\, k=1,\ldots ,m;\quad A_{dm+1}=1-A_{d1}-\ldots -A_{dm}. \end{aligned}$$

The EBP of \(A_{dk}\) is

$$\begin{aligned} \hat{A}_{dk}^{eb}= & {} \frac{1}{N_d}\Big \{\sum _{j\in s_{d}}h_k^{-1}(y_{dj}) +\sum _{j\in r_{d}}E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\Big \},\quad k=1,\ldots ,m;\\ \hat{A}_{dm+1}^{eb}= & {} 1-\hat{A}_{d1}^{eb}-\ldots -\hat{A}_{dm}^{eb}. \end{aligned}$$

For a general function h, the expected values above might not be analytically tractable. When this occurs, the following Monte Carlo procedure can be applied.

  (a)

    Estimate the unknown parameter \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\) using sample data \((y_s,X_s)\).

  (b)

Replacing \(\psi =(\beta ^{\prime },\theta ^\prime )^\prime \) by the estimate \({\hat{\psi }}=({\hat{\beta }}^\prime ,{\hat{\theta }}^\prime )^\prime \) obtained in (a), draw L copies of each out-of-sample vector \(y_{dj}\) as

    $$\begin{aligned} y_{dj}^{(\ell )}\sim N_m({\hat{\mu }}_{dj|s},\hat{V}_{d|s}),\quad j\in r_{d},\,\, d=1,\ldots ,D,\,\, \ell =1,\ldots ,L, \end{aligned}$$

    where

    $$\begin{aligned} {\hat{\mu }}_{dj|s}=\left\{ \begin{array}{ll} X_{dj}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{dj}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$

    and

    $$\begin{aligned} \hat{V}_{d|s}=\left\{ \begin{array}{ll} \hat{V}_{ud}+\hat{V}_{edj}-n_d\hat{V}_{ud}\hat{V}_{edj}^{-1}\hat{V}_{ud} +n_d^2\hat{V}_{ud}\hat{V}_{edj}^{-1}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}\hat{V}_{edj}^{-1}\hat{V}_{ud}&{} \text{ if } \, n_d\ne 0, \\ \hat{V}_{ud}+\hat{V}_{edj}&{} \text{ if } \, n_d=0. \end{array}\right. \end{aligned}$$
  (c)

    The Monte Carlo approximation of the expected value is

    $$\begin{aligned} E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\approx \frac{1}{L}\sum _{\ell =1}^L h_k^{-1}(y_{dj}^{(\ell )}),\,\,\, j\in r_{d},\,\,\,d=1,\ldots ,D. \end{aligned}$$

    The Monte Carlo approximation of the EBP of \(A_{dk}\) is

    $$\begin{aligned} \hat{A}_{dk}^{eb}\approx \frac{1}{L}\sum _{\ell =1}^LA_{dk}^{(\ell )},\,\,\, A_{dk}^{(\ell )}=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}} h_k^{-1}(y_{dj}^{(\ell )})\bigg ),\,\,\,k=1,\ldots ,m. \end{aligned}$$
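Steps (a)-(c) can be sketched as follows for a single domain under the alr transformation. All values below are illustrative (they are not the paper's fitted quantities), and the fitted conditional moments \({\hat{\mu }}_{dj|s}\) and \(\hat{V}_{d|s}\) are assumed given:

```python
import numpy as np

rng = np.random.default_rng(42)

def alr_inv(y):
    """Inverse additive logratio: maps R^2 to a 3-part composition."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

# Hypothetical fitted conditional moments for one domain d (m = 2)
mu_dj_s = np.array([0.2, -0.1])                 # plays the role of mu_hat_{dj|s}
V_d_s = np.array([[0.6, -0.2], [-0.2, 0.6]])    # plays the role of V_hat_{d|s}

y_sample = rng.multivariate_normal(mu_dj_s, V_d_s, size=8)  # observed part, n_d = 8
r_size, L = 192, 200                            # out-of-sample units and MC copies

A_copies = np.empty((L, 3))
for ell in range(L):
    # Step (b): draw the out-of-sample responses from their conditional law
    y_out = rng.multivariate_normal(mu_dj_s, V_d_s, size=r_size)
    comp = np.apply_along_axis(alr_inv, 1, np.vstack([y_sample, y_out]))
    A_copies[ell] = comp.mean(axis=0)           # A_{dk}^{(ell)}, k = 1, 2, 3
A_eb = A_copies.mean(axis=0)                    # Step (c): Monte Carlo EBP

assert np.isclose(A_eb.sum(), 1.0)              # predicted composition sums to one
```

Because \(h^{-1}\) returns a valid composition for every draw, the Monte Carlo EBP automatically yields components that are positive and sum to one.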

The plug-in estimator of \(A_{dk}\) is

$$\begin{aligned} \hat{A}_{dk}^{in}=\frac{1}{N_d}\sum _{j\in U_d}h_k^{-1}(\hat{y}_{dj}^{eb}) =\frac{1}{N_d}\bigg \{\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}}h_k^{-1}(\hat{\mu }_{dj|s})\bigg \},\,\,\,k=1,\ldots ,m, \end{aligned}$$

and \(\hat{A}_{dm+1}^{in}=1-\hat{A}_{d1}^{in}-\ldots -\hat{A}_{dm}^{in}\).

Remark 3.1

In many practical cases, the values of the auxiliary variables are not available for all the population units. If, in addition, some of these variables are continuous, the EBP method is not applicable. An important particular case, where this method is applicable, is when the vector of auxiliary variables takes a finite number of values. More concretely, suppose that the covariates are categorical, so that \(X_{dj}\in \{X_{01},\ldots ,X_{0T}\}\); then we can calculate \(A_{dk}^{(\ell )}\) as

$$\begin{aligned} A_{dk}^{(\ell )}=\frac{1}{N_d}\left[ \sum _{j=1}^{n_{d}}h_k^{-1}(y_{dj}) + \sum _{t=1}^{T}\sum _{j=1}^{N_{dt}-n_{dt}} h_k^{-1}(y_{dtj}^{(\ell )})\right] , \end{aligned}$$

where \(N_{dt}=\#\{j\in U_{d}:\,X_{dj}=X_{0t}\}\) is available from external data sources (aggregated auxiliary information), \(n_{dt}=\#\{j\in s_{d}:\,X_{dj}=X_{0t}\}\), \(y_{dtj}^{(\ell )}\sim N_m({\hat{\mu }}_{dt|s},\hat{V}_{d|s})\), \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{dt}-n_{dt}\), \(t=1,\ldots ,T\), \(\ell =1,\ldots ,L\), and

$$\begin{aligned} {\hat{\mu }}_{dt|s}=\left\{ \begin{array}{ll} X_{0t}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{0t}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$

and \(\hat{V}_{d|s}\) was defined in Step (b) of the above Monte Carlo procedure.

Remark 3.2

If some auxiliary variables are continuous, we can use the Hájek-type approximation to \(A_{dk}^{(\ell )}\), i.e.

$$\begin{aligned} A_{dk}^{(\ell )}\approx \frac{1}{N_d}\sum _{j\in s_{d}}w_{dj}h_k^{-1}(y_{dj}^{(\ell )}), \end{aligned}$$

where \(w_{dj}\) is the sample weight of unit j of domain d. A GREG-type approximation to \(A_{dk}^{(\ell )}\) is

$$\begin{aligned} A_{dk}^{(\ell )}\approx \frac{1}{N_d}\bigg (\sum _{j\in s_{d}}\big \{h_k^{-1}(y_{dj})-h_k^{-1}(y_{dj}^{(\ell )})\big \} +\sum _{j\in s_{d}} \tilde{w}_{dj}h_k^{-1}(y_{dj}^{(\ell )})\bigg ), \end{aligned}$$

where \( \tilde{w}_{dj}=w_{dj}N_d/\hat{N}_d\), \(\hat{N}_d=\sum _{j\in s_d}w_{dj}\).

Analytical approximations to the MSE are difficult to derive in the case of complex parameters. We therefore propose a parametric bootstrap MSE estimator by following the bootstrap method for finite populations of González-Manteiga et al. (2008b). The steps for implementing this method are

  1.

    Fit the model (2.5) to sample data \((y_s,X_s)\) and calculate an estimator \({\hat{\psi }}=({\hat{\beta }}^{\prime },{\hat{\theta }}^\prime )^{\prime }\) of \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\).

  2.

    For \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{d}\), generate independently \(u_{d}^{*}\sim N(0,\hat{V}_{ud})\) and \(e_{dj}^{*}\sim N(0,\hat{V}_{edj})\), where \(\hat{V}_{ud}=V_{ud}(\hat{\theta })\) and \(\hat{V}_{edj}=V_{edj}(\hat{\theta })\).

  3.

    Construct the bootstrap superpopulation model \(\xi ^*\) using \(\{u_{d}^{*}\}\), \(\{e_{dj}^{*}\}\), \(\{X_{dj}\}\) and \(\hat{\beta }\), i.e.

    $$\begin{aligned} \xi ^{*}:\, y_{dj}^{*}=X_{dj}\hat{\beta }+u_{d}^{*}+e_{dj}^{*},\,\,d=1,\ldots ,D, j=1,\ldots ,N_{d}. \end{aligned}$$
    (3.1)
  4.

    Under the bootstrap superpopulation model (3.1), generate a large number B of i.i.d. bootstrap populations \(\{y_{dj}^{*(b)}:\,d=1,\ldots ,D, j=1,\ldots ,N_{d}\}\) and calculate the bootstrap population parameters

    $$\begin{aligned} A_{dk}^{*(b)}=\frac{1}{N_d}\sum _{j=1}^{N_{d}}h_k^{-1}(y_{dj}^{*(b)}),\quad k=1,\ldots ,m,\,\,\, b=1,\ldots ,B. \end{aligned}$$
  5.

    From each bootstrap population b generated in Step 4, take the sample with the same indices \(s\subset U\) as the initial sample, and calculate the bootstrap EBPs, \(\hat{A}_{dk}^{eb*(b)}\), \(k=1,\ldots ,m\), as described in Sect. 3, using the bootstrap sample vector \(y_s^*\) and the known values \(X_{dj}\).

  6.

    A Monte Carlo approximation to the theoretical bootstrap estimator

    $$\begin{aligned} MSE_*(\hat{A}_{dk}^{eb*})=E_{\xi ^*}\big [(\hat{A}_{dk}^{eb*}-A_{dk}^{*})(\hat{A}_{dk}^{eb*}-A_{dk}^{*})^\prime \big ],\quad k=1,\ldots ,m, \end{aligned}$$

    is

    $$\begin{aligned} mse_*(\hat{A}_{dk}^{eb*})=\frac{1}{B}\sum _{b=1}^B(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})^\prime ,\quad k=1,\ldots ,m. \end{aligned}$$
    (3.2)

    The estimator (3.2) is used to estimate \(MSE(\hat{A}_{dk}^{eb})\), \(k=1,\ldots ,m\).
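The population-generation part of the bootstrap (Steps 2-4) can be sketched as follows under the alr transformation. The "fitted" parameters below are assumed rather than estimated, the design matrices are generated as in the simulations of Sect. 4, and Steps 5-6 (refitting and averaging) are only indicated in comments (a minimal Python illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed "fitted" parameters and illustrative sizes (not from a real fit)
D, N_d, m = 10, 40, 2
beta_hat = np.full(4, 10.0)
V_ud_hat = np.array([[0.75, -0.3], [-0.3, 0.75]])
V_edj_hat = np.array([[0.5, 0.2], [0.2, 0.5]])

def alr_inv(y):
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

B = 20
boot_params = np.empty((B, D, m + 1))           # A_{dk}^{*(b)}
for b in range(B):
    for d in range(D):
        u_d = rng.multivariate_normal(np.zeros(m), V_ud_hat)       # Step 2
        comps = []
        for j in range(N_d):
            x1 = np.array([1.0, rng.integers(0, 2)])               # x_{dj1}
            x2 = np.array([1.0, rng.integers(0, 2)])               # x_{dj2}
            X_dj = np.block([[x1, np.zeros(2)], [np.zeros(2), x2]])
            e_dj = rng.multivariate_normal(np.zeros(m), V_edj_hat)
            y_star = X_dj @ beta_hat + u_d + e_dj                  # model (3.1)
            comps.append(alr_inv(y_star))                          # back-transform
        boot_params[b, d] = np.mean(comps, axis=0)                 # Step 4
# Steps 5-6: refit the model on each bootstrap sample, compute the bootstrap
# EBPs and average their squared deviations from boot_params (omitted here).

assert boot_params.shape == (B, D, m + 1)
```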

4 Simulations

The simulation experiments empirically investigate the asymptotic behavior of: (1) the REML estimators of model parameters in Sect. 4.1 and Appendix B.1, (2) the EBP and plug-in predictors of domain average compositions in Sect. 4.2 and Appendix B.2, and (3) the parametric bootstrap MSE estimators in Sect. 4.3 and Appendix B.3.

To meet these three objectives, we consider a basic scenario in which we run simulations for different sample sizes. Take \(m=2\), \(p_1=p_2=2\), \(p=4\), \(\beta _1=(\beta _{11},\beta _{12})^\prime =(10,10)^\prime \) and \(\beta _2=(\beta _{21},\beta _{22})^\prime =(10,10)^\prime \). For \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), generate \(X_{dj}=\text{ diag }(x_{dj1},x_{dj2})_{2\times 4}\), where \(x_{dj1}=(x_{dj11},x_{dj12})\), \(x_{dj2}=(x_{dj21},x_{dj22})\) and

$$\begin{aligned} x_{dj11}=x_{dj21}=1,\quad x_{dj12}\sim \text{ Bin }(1,1/2),\quad x_{dj22}\sim \text{ Bin }(1,1/2). \end{aligned}$$

For \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), simulate \({u}_{d}\sim N_{2}(0,{V}_{ud})\) and \({e}_{dj}\sim N_{2}(0,{V}_{edj})\), where

$$\begin{aligned} V_{ud}=\left( \begin{array}{cc} \theta _1&{}\theta _3\sqrt{\theta _1}\sqrt{\theta _2}\\ \theta _3\sqrt{\theta _1}\sqrt{\theta _2}&{}\theta _2\\ \end{array}\right) ,\,\,\, V_{edj}=\left( \begin{array}{cc} \theta _4&{}\theta _{6}\sqrt{\theta _4}\sqrt{\theta _5}\\ \theta _{6}\sqrt{\theta _4}\sqrt{\theta _5}&{}\theta _5\\ \end{array}\right) . \end{aligned}$$

where \(\theta _1=0.75\), \(\theta _2=0.75\), \(\theta _3=-0.4\), \(\theta _4=0.5\), \(\theta _5=0.5\) and \(\theta _6=0.4\). Simulation 1 generates only four different matrices \(X_{dj}\), namely

$$\begin{aligned} X_{dj}=\left( \begin{array}{cc|cc} x_{dj11}&{}x_{dj12}&{}0&{}0\\ \hline 0&{}0&{}x_{dj21}&{}x_{dj22}\end{array}\right) \in \big \{ X_{01},X_{02},X_{03},X_{04}\big \}, \end{aligned}$$

where

$$\begin{aligned} X_{01}=\left( \begin{array}{cc|cc} 1&{}0&{}0&{}0\\ \hline 0&{}0&{}1&{}0\end{array}\right) , X_{02}=\left( \begin{array}{cc|cc} 1&{}0&{}0&{}0\\ \hline 0&{}0&{}1&{}1\end{array}\right) , X_{03}=\left( \begin{array}{cc|cc} 1&{}1&{}0&{}0\\ \hline 0&{}0&{}1&{}0\end{array}\right) , X_{04}=\left( \begin{array}{cc|cc} 1&{}1&{}0&{}0\\ \hline 0&{}0&{}1&{}1\end{array}\right) . \end{aligned}$$

4.1 Simulation 1 for REML estimators

The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model (2.5). This simulation runs \(I=200\) iterations. Appendix B.1 gives the steps of Simulation 1 and the definitions of the absolute and relative performance measures. For every REML estimator \(\hat{\eta }\in \{\hat{\beta }_{11},\hat{\beta }_{12},\hat{\beta }_{21},\hat{\beta }_{22},\hat{\theta }_{1},\ldots ,\hat{\theta }_{6}\}\), Tables 1 and 2 present the relative bias \(RB(\hat{\eta })\) and the relative root-mean-squared error \(RRE(\hat{\eta })\) in %. Appendix B.1 gives the corresponding absolute performance measures. Simulation 1 shows that the REML Fisher-scoring algorithm works properly because \(RB({\hat{\eta }})\) and \(RRE(\hat{\eta })\) decrease as \(n_d\) or D increase.

Table 1 \(RB(\hat{\eta })\) (left) and \(RRE(\hat{\eta })\) (right) with \(n_d=10\)
Table 2 \(RB(\hat{\eta })\) (left) and \(RRE(\hat{\eta })\) (right) with \(D=50\)

4.2 Simulation 2 for EBPs

Simulation 2 investigates the EBP and plug-in predictors, \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), respectively, \(k=1,2,3\). It takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.2 gives the steps of Simulation 2 and the definitions of the absolute and relative performance measures. Tables 3, 4 and 5 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. Appendix B.2 gives the corresponding absolute performance measures.

Table 3 \(RAB_k\) (left) and \(RRE_k\) (right) for clr with \(D=50\)
Table 4 \(RAB_k\) (left) and \(RRE_k\) (right) for alr with \(D=50\)
Table 5 \(RAB_k\) (left) and \(RRE_k\) (right) for ilr with \(D=50\)

The performance measures decrease as the sample sizes \(n_d\) increase, and the EBP obtains better results (RAB and RRE) than the plug-in predictor. Note that for each transformation the data generation, and therefore the true underlying model, is different. For this reason, the results in Tables 3, 4 and 5 are not comparable. It is interesting to observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, the results are slightly better than in the clr and ilr cases.

4.3 Simulation 3 for MSEs

Simulation 3 investigates the MSE estimators of predictors \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), \(k=1,2,3\). One of the goals is to give a recommendation on the number of bootstrap replicates B to implement. The simulation takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.3 gives the steps of Simulation 3 and the definitions of the absolute and relative performance measures.

Table 6 \(RAB_k\) (left) and \(RRE_k\) (right) for clr with \(D=50\) and \(n_d=10\)
Table 7 \(RAB_k\) (left) and \(RRE_k\) (right) for alr with \(D=50\) and \(n_d=10\)
Table 8 \(RAB_k\) (left) and \(RRE_k\) (right) for ilr with \(D=50\) and \(n_d=10\)

Tables 6, 7 and 8 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. The number of bootstrap replicates is \(B=50, 100, 200, 300, 400\). Appendix B.3 gives the corresponding absolute performance measures. As in Simulation 2, we remark that the results in Tables 6, 7 and 8 are not comparable because the data generation is different. Nevertheless, we observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, Simulation 3 gives slightly better results than in the clr or ilr cases. That is, the functional form of the transformation plays a non-negligible role. In any case, the selection of the transformation in an application to real data must be based on the diagnostics of the fitted MNER model.

Figures 1 and 2 show the boxplots of \(RRE_{dk}\) and \(RAB_{dk}\) for the predictors \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), with the clr transformation. From the obtained performance measures, we recommend implementing the bootstrap algorithm with at least \(B=300\) replicates. Appendix B.3 gives the same recommendation for the alr and ilr transformations.
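The parametric bootstrap behind these MSE estimators regenerates data from the fitted model \(B\) times and averages squared prediction errors. The sketch below illustrates this loop on a deliberately simple hypothetical problem (predicting a finite-population mean by a sample mean under a fitted normal model), not on the MNER model itself; the toy setting is chosen because its theoretical MSE, \(\hat{\sigma}^2(1/n-1/N)\), is known and checkable.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_mse(one_replicate, B=300):
    """Parametric bootstrap: average of squared prediction errors over B replicates."""
    pairs = [one_replicate() for _ in range(B)]
    return float(np.mean([(pred - true) ** 2 for pred, true in pairs]))

# Hypothetical toy problem: predict the mean of a bootstrap population of
# N = 200 units by the mean of a sample of n = 10, regenerating the data
# from the fitted model N(mu_hat, s2_hat).  The theoretical MSE is
# s2_hat * (1/n - 1/N) = 0.095 here.
mu_hat, s2_hat, N, n = 0.0, 1.0, 200, 10

def one_replicate():
    y = rng.normal(mu_hat, np.sqrt(s2_hat), N)   # bootstrap population
    return y[:n].mean(), y.mean()                # (predictor, true value)

mse_hat = bootstrap_mse(one_replicate, B=300)
```

Increasing B reduces only the bootstrap noise of the MSE estimate, which is why the simulations compare B = 50 up to 400 before settling on B = 300.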

Fig. 1: \(RRE_{dk}\) (in %) of MSE estimators for \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), for clr

Fig. 2: \(RAB_{dk}\) (in %) of MSE estimators for \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), for clr

5 The Spanish Household Budget Survey (SHBS)

The SHBS is carried out annually by the “Instituto Nacional de Estadística” (INE), with the objective of obtaining information on the nature and destination of consumption expenses, as well as on various characteristics related to the conditions of household life. In the Spanish economy, it is important to have good estimates of consumption spending, since this spending represents approximately \(60\%\) of gross domestic product. However, global policy measures are often not satisfactory for regional authorities, which can also develop their own economic strategies. They need tools to determine, with precision and reliability, the main variables and consumption indicators in order to implement these strategies. Among the main consumption indicators are the proportions of annual household expenses on food and housing. This section presents an application of the new statistical methodology to the estimation of domain parameters defined as averages of proportions of annual household expenditures. We take data from the SHBS of 2016. The domains are the 50 Spanish provinces plus the autonomous cities of Ceuta and Melilla, so that \(D=52\).

Let \(a_{dj1}\), \(a_{dj2}\) and \(a_{dj3}\) be the proportions of annual expenditures on food, housing and other for household j of domain d. Housing includes expenditure on current housing costs, water, electricity, gas and other fuels. Food includes food and nonalcoholic beverages, and other represents the remaining expenditures. The vectors \(a_{dj}=(a_{dj1},a_{dj2})^\prime \in R^2\) are 2-part compositions that can be transformed into vectors \(y_{dj}=h(a_{dj})\) of \(R^2\) by one of the transformations h described in Appendix A. Let \(x_{djk}\), \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), \(k=1,2\), be the \(4\times 1\) vector whose components are the binary auxiliary variables that indicate the composition of household j in domain d. As auxiliary variables, we thus consider the household composition HC with categories

HC1: Single person or adult couple with at least one member aged over 65,

HC2: Other compositions with a single person or a couple without children,

HC3: Couple with children under 16 years old or adult with children under 16 years old,

HC4: Other households.

The variable HC is treated as a factor with reference category HC4.
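The transformations h are defined in Appendix A, which is not reproduced here. As a reference point, the sketch below uses the standard textbook definitions of the clr, alr and ilr transformations for a 3-part composition (food, housing, other), with one common choice of ilr contrast matrix; the numerical shares are purely illustrative.

```python
import numpy as np

def clr(a):
    """Centered logratio: log(a_k / geometric mean of a); coordinates sum to 0."""
    la = np.log(a)
    return la - la.mean()

def alr(a):
    """Additive logratio: log of the first parts over the last part."""
    return np.log(a[:-1] / a[-1])

def ilr(a):
    """Isometric logratio for a 3-part composition (one orthonormal basis choice)."""
    V = np.array([[1 / np.sqrt(2), -1 / np.sqrt(2), 0.0],
                  [1 / np.sqrt(6), 1 / np.sqrt(6), -2 / np.sqrt(6)]])
    return V @ clr(a)

# Illustrative shares for (food, housing, other); they must be positive and sum to 1.
a = np.array([0.146, 0.310, 0.544])
y_alr, y_ilr = alr(a), ilr(a)
```

The alr and ilr maps send a 3-part composition to an unconstrained vector of \(R^2\); the clr vector lives in the zero-sum hyperplane of \(R^3\), and the ilr coordinates are an isometric re-expression of it.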

For calculating the EBPs of the domain parameters of interest, we need the true population sizes, \(N_{dt}\), of the crossings of provinces with the categories of the variable HC. We calculate these sizes by using the sampling weights of the Spanish Labor Force Survey (SLFS). The SLFS sampling weights are calibrated to the population sizes of the provinces crossed with sex and age groups. These demographic quantities come from the INE population projection system and they are considered the most accurate demographic figures in Spain. On the other hand, the SHBS sampling weights are calibrated to the population sizes of the autonomous communities (NUTS 2) crossed with sex and age groups, which are not the domains of interest.

This section presents a statistical analysis applying the centered logratio transformation. This choice is due to the good fit of the MNER model to the transformed data. For the sake of completeness, Appendix C presents the corresponding data analysis for the alr and ilr transformations. Table 9 presents the estimates of the regression parameters, the z-values, the standard errors and the asymptotic p-values. The factor HC is significant for \(y_1\) and \(y_2\). Table 10 presents the asymptotic 95% confidence intervals (L.CI, U.CI) for the variance component parameters. None of them contains zero.

Table 9 Regression parameters
Table 10 Variance and correlation parameters

For calculating the asymptotic p-values and confidence intervals of Tables 9 and 10, we take the asymptotic distributions of the REML estimators \({\hat{\theta }}\) and \({\hat{\beta }}\), i.e.

$$\begin{aligned} \hat{\theta }\sim N_{6}(\theta , {F}_s^{-1}(\theta )),\quad \hat{\beta }\sim N_p(\beta , ({X}_s^{\prime }{V}_s^{-1}{X}_s)^{-1}), \end{aligned}$$

where \(F_s\) is the REML Fisher information matrix. If \(\beta _0\) denotes the observed value of \(\hat{\beta }_i\), the asymptotic p-value for testing the hypothesis \(H_0:\,\beta _i=0\) is

$$\begin{aligned} \text{ p-value }=2P_{H_0}(|\hat{\beta }_i|>|\beta _0|)=2P(N(0,1)> |\beta _0|/\sqrt{q_{ii}}\,), \end{aligned}$$

where \(({X}_s^{\prime }{V}_s^{-1}({\hat{\theta }}){X}_s)^{-1}=(q_{ij})_{i,j=1,\ldots ,p}\) and \(\beta _i\) denotes the i-th component of the vector \(\beta \). The asymptotic \((1-\alpha )\)-level confidence intervals for the components \(\theta _{\ell }\) of \(\theta \) are

$$\begin{aligned} \hat{\theta }_{\ell }\pm z_{\alpha /2}\,\nu _{\ell \ell }^{1/2},\,\, \ell =1,\ldots ,6,\, \end{aligned}$$

where \({F}_s^{-1}(\hat{\theta })=(\nu _{ab})_{a,b=1,\ldots ,6}\) and \(z_{\alpha }\) is the upper \(\alpha \)-quantile of the N(0, 1) distribution.
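Given the estimates and their asymptotic standard errors, these p-values and intervals are direct to compute. A minimal sketch using only the Python standard library (the function names wald_p_value and wald_ci are ours, not from the paper):

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal N(0, 1)

def wald_p_value(beta_hat, se):
    """Two-sided asymptotic p-value for H0: beta_i = 0,
    i.e. 2 * P(N(0,1) > |beta_hat| / se)."""
    return 2.0 * (1.0 - _N.cdf(abs(beta_hat) / se))

def wald_ci(theta_hat, se, alpha=0.05):
    """Asymptotic (1 - alpha)-level interval theta_hat +/- z_{alpha/2} * se,
    where z_{alpha/2} is the upper alpha/2 quantile of N(0,1)."""
    z = _N.inv_cdf(1.0 - alpha / 2.0)
    return theta_hat - z * se, theta_hat + z * se
```

For example, wald_p_value(1.96, 1.0) is approximately 0.05 and wald_ci(0.0, 1.0) is approximately \((-1.96, 1.96)\), matching the 95% intervals of Table 10.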

Figure 3 plots the histograms of the \(D=52\) standardized EBPs of the random effects of the fitted MNER model for food (left) and housing (right) expenditures. It also plots the corresponding probability density function estimates. The shapes of the densities are quite symmetrical, which indicates that the distributions of the random effects are not very far from normal. Since D is too small to obtain good nonparametric estimates of the density functions, definitive conclusions cannot be drawn.

Fig. 3: Histograms of standardized random effects

Figure 4 gives the histograms of standardized residuals for components \(y_1\) and \(y_2\), together with the corresponding probability density function estimates. We do not observe a large deviation from the normal distribution.

Fig. 4: Histograms of standardized residuals

Figure 5 presents the dispersion plots of standardized residuals versus predicted values (in \(10^4\) euros). Most standardized residuals fall within the interval \((-3,3)\), so we consider that outliers do not play a relevant role in the performance of the EBPs. Appendix C of the supplementary material gives the corresponding plots for the additive and isometric logratio transformations. These plots are similar to the ones shown in Figs. 4 and 5 for the centered logratio transformation. However, Fig. 5 presents more uniform clouds of points in both components than the corresponding figures for the other two transformations. From this graphical diagnosis, we prefer doing the data analysis with the centered logratio transformation. However, since the choice of the clr transformation may be debatable, Appendix C presents the full analysis of the data under the other two transformations.

Fig. 5: Standardized residuals versus predicted values (in \(10^4\) euros)

Figure 6 plots the plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\). The domains are sorted by sample size, which is printed on the OX axis. The figure shows that both predictors follow a similar pattern. This information is completed by Fig. 7, which shows the relative root-MSEs (RRMSE).

Fig. 6: Plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\) in %

Fig. 7: RRMSE of plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\) in %

Figure 8 (left) maps the proportions of the household annual expenditures on food by Spanish provinces. Figure 8 (right) maps the estimated RRMSEs in %. These figures show that expenditures on food are rather variable between provinces. This happens mostly in the autonomous regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as the Basque Country, where the variability of the estimated proportions is smaller. This information could be of great use to local governments in developing economic plans aimed at households and at improving the quality of life.

Fig. 8: EBP predictions of \({a_{d1}}\) by Spanish provinces in %

Figure 9 (left) maps the proportions of the household annual expenditures on housing by Spanish provinces. Figure 9 (right) maps the estimated RRMSEs in \(\%\). As is the case with food, these figures show that expenditure on housing is rather variable between provinces. The map shows clear differences between the north-central regions, where the proportion of spending is higher, and the southern regions, where household expenditures are lower.

Fig. 9: EBP predictions of \({a}_{d2}\) by Spanish provinces in %

Tables 11 and 12 present some condensed numerical results. The tables are constructed in two steps: first, the domains are sorted by sample size, starting with the domain with the smallest sample size; then, a selection of 14 domains out of 52 is made from the positions \(1, 5, 9,\ldots , 52\). The names and codes of the provinces are labeled province and d, respectively, and the sample sizes \(n_d\). Table 11 presents the model-based predictions of food and housing expenditures by provinces and Table 12 displays the corresponding estimates of the RRMSEs. The plug-in predictors are denoted by in1 and in2 and the EBPs by ebp1 and ebp2.

Table 11 Predictions of \({a}_{d1}\) and \({a}_{d2}\) in %
Table 12 RRMSE estimates for \({a}_{d1}\) and \({a}_{d2}\) in %

6 Conclusions

Compositional data play an important role in official statistics. This paper introduces small area predictors of averages of unit-level vectors of compositions. The proposed methodology is applied to estimate the proportions of annual household expenditures on food, housing and others from the 2016 SHBS at the province level. For this purpose, the manuscript considers the centered logratio transformation of compositions into vectors of \(R^m\). For the sake of completeness, Appendix C of the supplementary material presents the corresponding statistical analysis under the additive and isometric logratio transformations. An MNER model is proposed for analyzing the transformed compositional data, where the vectors of random effects and the vector of model errors have unstructured covariance matrices with unknown components. As usual in linear mixed models, the parameters of the MNER model are estimated by the REML method. The selection of the centered logratio transformation was motivated by the interpretability and diagnostics of the fitted MNER model. In this sense, we followed the recommendations of Greenacre (2019). That is to say, we have tried to provide a simple solution to a practical problem of compositional data.

Of the two proposed predictors, the EBP presents a slightly better performance than the plug-in predictor, as can be seen in the simulation studies. For estimating the MSE, we recommend a parametric bootstrap, following the ideas of González-Manteiga et al. (2008a), with at least \(B=300\) replicates.

As a result of the statistical analysis for the Spanish provinces, we conclude that food expenditure in Spain accounts for \(14.6\%\) of total household expenditure and presents great variability within autonomous communities. This happens mostly in the autonomous regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as the Basque Country, where the variability of the estimated proportions is smaller. On the other hand, spending on housing in Spain accounts for \(31\%\) of total household spending, and there are important differences between the north-central provinces (with higher incomes) and those in the south.

In this case, we applied the introduced methodology to the SHBS, but it is also useful in other areas of official statistics, such as the classification of the population by educational level or by economic activity. In both situations, it is necessary to take the simplex constraints into account.

We finally note that there are other regression models for compositions, such as directional mixed effects models or Dirichlet regression mixed models. These models can likely be adapted to the SAE context described in Sect. 2, including fitting algorithms, predictors of domain quantities, MSE estimators, and so on. They can be competitive options with respect to fitting a multivariate normal mixed model to logratio transformations of compositions. We believe that these tasks are interesting subjects for future research.