1 Introduction

Official statistics contain estimates of socioeconomic indicators at different levels of aggregation. In many sampling designs, small sample sizes do not allow accurate direct estimators to be calculated at low levels of aggregation. These territories or population groups are called small areas. Small Area Estimation (SAE) offers a solution to this problem by incorporating auxiliary information into the data analysis and by introducing model-based predictors. The books of Rao and Molina (2015) and Morales et al. (2021) give a general description of SAE.

The Spanish household budget survey (SHBS) provides information about the nature and destination of household consumption expenses, as well as on various characteristics of household living conditions. Spain is hierarchically partitioned into 17 autonomous communities and 50 provinces, plus 2 autonomous cities. The sampling design and the sample sizes of the SHBS are designed to provide estimates at the level of the 17 autonomous communities, but not for the provinces. The direct estimates at the province level have low accuracy and, therefore, estimating SHBS indicators at that level is a SAE problem. This paper has two objectives. The first is to model the unit-level proportions of annual household expenditures on food, housing and others. The second is to estimate the averages of these proportions, by Spanish provinces.

Under area-level models, there are several proposals for estimating domain proportions and counts. For example, Esteban et al. (2012), Marhuenda et al. (2013, 2014) and Morales et al. (2015) derived predictors based on linear mixed models, while Chambers et al. (2014), Dreassi et al. (2014), Tzavidis et al. (2015) and Boubeta et al. (2016, 2017) applied binomial, negative binomial or Poisson regression models. There are also methodologies for estimating proportions and counts in the setup of contingency tables or multinomial regression models. Without being exhaustive, we find the papers of Zhang and Chambers (2004) and Berg and Fuller (2014) for contingency tables, and the papers of Ferrante and Trivisano (2010), Souza and Moura (2016), Fabrizi et al. (2016), Saei and Chambers (2003), Molina et al. (2007) and López-Vizcaíno et al. (2013, 2015) for multinomial regression models. However, in household survey samples, some variables of interest and domain indicators are compositions, that is, positive quantities summing to one or to a known integer. Concerning area-level models for compositional data, Esteban et al. (2020) and Krause et al. (2022) transformed compositions into target vectors of multivariate Fay-Herriot models in order to make model-based predictions, like those described by González-Manteiga et al. (2008a), Benavent and Morales (2016, 2021) or Arima et al. (2017).

The statistical literature presents some contributions to small area estimation of proportions and counts under unit-level models for binary outcomes. For example, Chambers et al. (2016), Hobza and Morales (2016), Hobza et al. (2018) and Burgard et al. (2021) derived predictors under M-quantile or binomial-logit models for binary outcomes. These approaches are based on univariate models and not on models for compositional data that allow jointly estimating the counts or proportions of all the categories of a classification variable. This issue was addressed by Scealy and Welsh (2017), who introduced a directional mixed effects model for compositional data and predicted the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. A different approach was employed by Hijazi and Jernigan (2009), Camargo et al. (2012), Tsagris and Stewart (2018) and Morais et al. (2018), who modelled compositional data using Dirichlet regression models. This manuscript also deals with unit-level compositional data, but it proposes to fit multivariate linear mixed models to logratio transformations of compositions. Some references on the foundations of compositional data analysis are the books of Aitchison (1986) and Pawlowsky-Glahn and Buccianti (2011) and the papers of Egozcue et al. (2003) and Egozcue and Pawlowsky-Glahn (2019), where some basic transformations of compositions are studied.

This paper introduces small area predictors of averages of unit-level vectors of compositions. To this end, the paper considers three logratio transformations of compositions into vectors of \(R^m\): the additive, centered and isometric logratio transformations. We propose a multivariate nested error regression (MNER) model for analyzing the transformed SHBS compositional data, where the vectors of random effects and the vectors of model errors have unstructured covariance matrices with unknown components. The estimates of the MNER model parameters are obtained by the residual maximum likelihood (REML) method, as described in Esteban et al. (2022a). The fitted model is then used to predict averages of proportions of annual household expenditures on food, housing and others, by Spanish provinces. The empirical best and plug-in predictors of small area compositional parameters are derived similarly as in Esteban et al. (2022b).

The estimation of the mean squared error (MSE) of a model-based predictor is an important issue that has no easy solution. Under nonlinear models, the problem is even more difficult. We follow the resampling approach appearing in González-Manteiga et al. (2007, 2008b) to implement a parametric bootstrap procedure.

This paper introduces statistical methodology that is new in four main aspects: (1) the employment of three transformations of unit-level compositional survey data, (2) the use of MNER models with unstructured covariance matrix for modelling the transformed data and capturing the sample correlations, (3) the derivation of domain-level predictors of averages of compositions based on the MNER model fitted to the transformed unit-level data, and (4) the introduction of parametric bootstrap estimators of the MSEs of the new predictors.

The remainder of the paper is organized as follows: Section 2 establishes the probabilistic framework, describes the SAE problem of interest and presents the MNER model. Section 3 derives empirical best predictors (EBP) and plug-in predictors of average compositions and gives a parametric bootstrap method for estimating the MSEs of the EBPs. Section 4 presents three simulation experiments. The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model. Simulation 2 investigates the performance of the EBPs and plug-in predictors, and Simulation 3 analyzes the parametric bootstrap estimator of the MSEs. Section 5 applies the proposed methodology to data from the SHBS of 2016 in Spain. Section 6 gives some conclusions. The paper contains four appendices in a supplementary material file. Appendix A describes the additive, centered and isometric logratio transformations of compositions. Appendix B gives further simulation results. Appendix C analyzes the SHBS data with different transformations. Appendix D performs the application to SHBS data without applying logratio transformations of compositions.

2 The probabilistic framework

Let U be a population of size N partitioned into D domains or areas \(U_1,\ldots ,U_D\) of sizes \(N_1,\ldots ,N_D\), respectively. Let \(N=\sum _{d=1}^DN_d\) be the global population size. Let us consider the probability vector \(a_{dj}^{+}=(a_{dj1},\ldots ,a_{djm+1})^\prime \in R^{m+1}\) representing proportions associated with the \(m+1\) categories of a classification variable that is defined on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\). For example, \(a_{dj}^{+}\) may contain the proportions of annual household expenditures in the different expense categories. The components of \(a_{dj}^{+}\) are nonnegative and fulfill the constraint \(a_{dj1}+\ldots +a_{djm+1}=1\). These vectors \(a_{dj}^{+}\) are called compositions or \((m+1)\)-part compositions, and the vectors \(a_{dj}=(a_{dj1},\ldots ,a_{djm})^\prime \) are called m-part compositions. Compositional data, consisting of compositions, play an important role in official statistics. Compositions take values in the simplex embedded in \(R^{m+1}\)

$$\begin{aligned} {{\mathcal {S}}}^{m}_e=\big \{(a_1,\ldots ,a_{m+1})^\prime \in R^{m+1}:\, a_1>0,\ldots ,a_{m+1}>0,\,a_1+\ldots +a_{m+1}=1\big \}, \end{aligned}$$

and m-part compositions take values in the m-dimensional simplex defined by

$$\begin{aligned} {{\mathcal {S}}}^{m}=\big \{(a_1,\ldots ,a_m)^\prime \in R^m:\, a_1>0,\ldots ,a_m>0,\,a_1+\ldots +a_m<1\big \}. \end{aligned}$$

This paper deals with the problem of predicting domain average compositions

$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}a_{djk},\quad d=1,\ldots ,D,\,\, k=1,\ldots ,m+1, \end{aligned}$$
(2.1)

under a compositional data analysis approach. That is, we apply a one-to-one transformation, \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\), to m-part compositions and we assume that the transformed vectors follow a multivariate regression model. Appendix A presents three widely employed transformations: the additive, centered and isometric logratio transformations. The components of the transformed vectors \(y_{dj}=h(a_{dj})=(y_{dj1},\ldots ,y_{djm})^{\prime }\) are continuous variables measured on unit j of domain d, \(d=1,\ldots ,D\), \(j=1,\ldots ,N_d\).
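As a minimal illustration of such a transformation, consider the additive logratio (alr) case of Appendix A; the following Python sketch (function names are ours, not the paper's) maps an \((m+1)\)-part composition to \(R^m\) and back:

```python
import numpy as np

def alr(a_plus):
    """Additive logratio: maps an (m+1)-part composition (positive entries
    summing to one) to a vector in R^m, taking the last part as reference."""
    a_plus = np.asarray(a_plus, dtype=float)
    return np.log(a_plus[:-1] / a_plus[-1])

def alr_inv(y):
    """Inverse map h^{-1}: recovers the full (m+1)-part composition."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

a_plus = np.array([0.5, 0.3, 0.2])   # a 3-part composition (m = 2)
y = alr(a_plus)                      # transformed vector in R^2
assert np.allclose(alr_inv(y), a_plus)
```

The centered (clr) and isometric (ilr) transformations of Appendix A follow the same pattern, differing only in the reference used inside the logratios.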

For \(k=1,\ldots ,m\), let \({x}_{djk}=(x_{djk1},\ldots ,x_{djkp_k})\) be a row vector containing \(p_k\) explanatory variables and let \({X}_{dj}=\text{ diag }\left( {x}_{dj1},\ldots ,{x}_{djm}\right) _{m\times p}\) with \(p=p_1+\ldots +p_m\). Let \(\beta _{k}\) be a column vector of size \(p_k\) containing regression parameters and let \(\beta =\left( \beta _{1}^{\prime },\ldots ,\beta _{m}^{\prime }\right) ^{\prime }_{p\times 1}\). We assume that the transformed vectors \(y_{dj}\)’s follow the population MNER model

$$\begin{aligned} y_{dj}=X_{dj}\beta +u_d+e_{dj},\quad d=1,\ldots ,D,\, j=1,\ldots ,N_d, \end{aligned}$$
(2.2)

where the vectors of random effects \(u_{d}\)’s and random errors \(e_{dj}\)’s are independent with multivariate normal distributions

$$\begin{aligned} u_{d}=(u_{d1},\ldots ,u_{dm})^\prime \sim N_m(0,V_{ud}),\quad e_{dj}=(e_{dj1},\ldots ,e_{djm})^\prime \sim N_m(0,V_{edj}). \end{aligned}$$

The \(m\times m\) covariance matrices \({V}_{ud}\) depend on \(q=m(m+1)/2\) unknown parameters, denoted by

$$\begin{aligned} \theta _u=(\theta _{u1},\ldots ,\theta _{uq})^\prime = (\sigma _{u1}^2,\sigma _{u2}^2,\ldots ,\sigma _{um}^2,\rho _{u12},\rho _{u13},\ldots ,\rho _{u23},\rho _{u24},\ldots ,\rho _{um-1,m})^{\prime }. \end{aligned}$$

The matrix \(V_{ud}\) is

$$\begin{aligned} V_{ud}= \left( \begin{array}{cccc} \sigma _{u1}^2 &{} \rho _{u12}\sigma _{u1}\sigma _{u2} &{} \cdots &{} \rho _{u1m}\sigma _{u1}\sigma _{um} \\ \rho _{u12}\sigma _{u1}\sigma _{u2} &{} \sigma _{u2}^2 &{} \cdots &{} \rho _{u2m}\sigma _{u2}\sigma _{um} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{u1m}\sigma _{u1}\sigma _{um} &{} \rho _{u2m}\sigma _{u2}\sigma _{um} &{} \ldots &{} \sigma _{um}^2 \end{array}\right) . \end{aligned}$$

The \(m\times m\) covariance matrices \({V}_{edj}\) depend on q unknown parameters, i.e.

$$\begin{aligned} \theta _e=(\theta _{e1},\ldots ,\theta _{eq})^\prime = (\sigma _{e1}^2,\sigma _{e2}^2,\ldots ,\sigma _{em}^2,\rho _{e12},\rho _{e13},\ldots ,\rho _{e23},\rho _{e24},\ldots ,\rho _{em-1,m})^{\prime }. \end{aligned}$$

The matrix \(V_{edj}\) is

$$\begin{aligned} { V_{edj}}= \left( \begin{array}{cccc} \sigma _{e1}^2 &{} \rho _{e12}\sigma _{e1}\sigma _{e2} &{} \cdots &{} \rho _{e1m}\sigma _{e1}\sigma _{em} \\ \rho _{e12}\sigma _{e1}\sigma _{e2} &{} \sigma _{e2}^2 &{} \cdots &{} \rho _{e2m}\sigma _{e2}\sigma _{em} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho _{e1m}\sigma _{e1}\sigma _{em} &{} \rho _{e2m}\sigma _{e2}\sigma _{em} &{} \ldots &{} \sigma _{em}^2 \end{array}\right) . \end{aligned}$$
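The assembly of these unstructured covariance matrices from the variance and correlation parameters can be sketched as follows (a minimal Python illustration; the function name is ours, and the example values match the simulation settings of Sect. 4):

```python
import numpy as np

def unstructured_cov(sigma2, rho):
    """Assemble an m x m unstructured covariance matrix from the variances
    sigma2 = (sigma_1^2, ..., sigma_m^2) and the correlations
    rho = (rho_12, rho_13, ..., rho_{m-1,m}), listed row by row."""
    m = len(sigma2)
    sd = np.sqrt(np.asarray(sigma2, dtype=float))
    V = np.diag(np.asarray(sigma2, dtype=float))
    idx = 0
    for i in range(m):
        for j in range(i + 1, m):
            V[i, j] = V[j, i] = rho[idx] * sd[i] * sd[j]
            idx += 1
    return V

V_ud = unstructured_cov([0.75, 0.75], [-0.4])   # m = 2: off-diagonal -0.3
assert np.allclose(V_ud, V_ud.T)
```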

The \(2q\times 1\) vector of variance component parameters is \(\theta =(\theta _u^\prime ,\theta _e^\prime )^\prime \). The \((p+2q)\times 1\) vector of model parameters is \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\). Let \({I}_a\) be the \(a\times a\) identity matrix. We define the \(mN_d\times 1\) vectors \(y_d\) and \(e_d\), the \(mN_d\times p\) matrix \(X_d\) and the \(mN_d\times m\) matrix \(Z_d\) as follows:

$$\begin{aligned} y_d=\underset{1\le j \le N_d}{\hbox {col}}(y_{dj}),\,\,\, e_d=\underset{1\le j \le N_d}{\hbox {col}}(e_{dj}),\,\,\, X_d=\underset{1\le j \le N_d}{\hbox {col}}(X_{dj}),\,\,\, Z_d=\underset{1\le j \le N_d}{\hbox {col}}(I_m). \end{aligned}$$

Model (2.2) can be written in the domain-level form

$$\begin{aligned} y_{d}=X_{d}\beta +Z_du_d+e_{d},\quad d=1,\ldots ,D, \end{aligned}$$
(2.3)

where the vectors \(u_d\) and \(e_d\sim N_{mN_d}(0,V_{ed})\) are independent and \(V_{ed}=\underset{1\le j \le N_d}{\hbox {diag}}(V_{edj})\). We define the \(mN\times 1\) vectors y and e, the \(mD\times 1\) vector u, the \(mN\times p\) matrix X and \(mN\times mD\) matrix Z as follows:

$$\begin{aligned} y= & {} \underset{1\le d \le D}{\hbox {col}}(y_{d}),\,\,\, e=\underset{1\le d \le D}{\hbox {col}}(e_{d}),\,\,\, u=\underset{1\le d \le D}{\hbox {col}}(u_{d}),\,\,\, X=\underset{1\le d \le D}{\hbox {col}}(X_{d}),\,\,\,\\ Z= & {} \underset{1\le d \le D}{\hbox {diag}}(Z_d). \end{aligned}$$

Model (2.2) can be written in the linear mixed model form

$$\begin{aligned} y=X\beta +Zu+e, \end{aligned}$$
(2.4)

where \(u\sim N_{mD}(0,{V}_{u})\) and \(e\sim N_{mN}(0,V_{e})\) are independent, \(V_{u}=\underset{1\le d \le D}{\hbox {diag}}(V_{ud})\) and \(V_{e}=\underset{1\le d \le D}{\hbox {diag}}(V_{ed})\).

Under the predictive approach to inference in finite populations, statistical procedures are based on a fixed subset (called sample), \(s=\cup _{d=1}^Ds_d\), of the finite population U. Let \(n_d\) be the size of the domain subset \(s_d\subset U_d\), \(d=1,\ldots ,D\), and let \(n=n_1+\ldots +n_D\) be the total sample size. The complementary domain subsets are \(r_d=U_d-s_d\), \(d=1,\ldots ,D\). Let \(y_s\) and \(y_{ds}\) be the sub-vectors of y and \(y_d\) corresponding to the sample elements, and let \(y_r\) and \(y_{dr}\) be the sub-vectors of y and \(y_d\) corresponding to the out-of-sample elements. Without loss of generality, we can write \(y_d=(y_{ds}^\prime ,y_{dr}^\prime )^\prime \). Define also the corresponding decompositions of \(X_d\), \(Z_d\) and \(V_d\). As the sample indexes are assumed fixed, the sample sub-vectors \(y_{ds}\) follow the marginal models derived from the population model (2.3), i.e.

$$\begin{aligned} y_{ds}=X_{ds}\beta +Z_{ds}u_d+e_{ds},\quad d=1,\ldots ,D, \end{aligned}$$
(2.5)

where \(u_d\sim N_m(0,{V}_{ud})\) and \(e_{ds}\sim N_{mn_d}(0,V_{eds})\) are independent and \(V_{eds}=\underset{1\le j \le n_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{ds}\) are independent with \(y_{ds}\sim N_{mn_d}(\mu _{ds},V_{ds})\), \(\mu _{ds}=X_{ds}\beta \), \(V_{ds}=Z_{ds}V_{ud}Z_{ds}^\prime +V_{eds}\).

When the variance component parameters are known, the best linear unbiased estimator (BLUE) of \(\beta \) and the best linear unbiased predictor (BLUP) of \(u_d\), \(d=1,\ldots ,D\), are

$$\begin{aligned} \hat{\beta }_B=\bigg (\sum _{d=1}^DX_{ds}^\prime {V}_{ds}^{-1}X_{ds}\bigg )^{-1}\sum _{d=1}^DX_{ds}^\prime {V}_{ds}^{-1}y_{ds},\,\, \hat{u}_{Bd}=V_{ud}Z_{ds}^\prime V_{ds}^{-1}\big (y_{ds}-X_{ds}\hat{\beta }_B\big ). \end{aligned}$$

Let \({\hat{\theta }}\) be the REML estimator of \(\theta \), then the empirical BLUE (EBLUE) of \(\beta \) and the empirical BLUP (EBLUP) of \(u_d\), \(d=1,\ldots ,D\), are

$$\begin{aligned} \hat{\beta }=\bigg (\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}X_{ds}\bigg )^{-1}\sum _{d=1}^DX_{ds}^\prime \hat{V}_{ds}^{-1}y_{ds},\,\, \hat{u}_{d}=\hat{V}_{ud}Z_{ds}^\prime \hat{V}_{ds}^{-1}\big (y_{ds}-X_{ds}\hat{\beta }\big ), \end{aligned}$$

where \(\hat{V}_{ds}\) and \(\hat{V}_{ud}\) are obtained by substituting \(\theta \) by \({\hat{\theta }}\) in \({V}_{ds}\) and \({V}_{ud}\), respectively. We calculate the inverse of \(V_{ds}=V_{eds}+Z_{ds}V_{ud}Z_{ds}^\prime =A+BCD\) by applying the formula

$$\begin{aligned} (A+BCD)^{-1}=A^{-1}-A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}. \end{aligned}$$

As the error covariance matrices \(V_{edj}\) do not depend on j, we have \(Z_{ds}^\prime V_{eds}^{-1}Z_{ds}=\sum _{j=1}^{n_d}V_{edj}^{-1}=n_dV_{edj}^{-1}\), and we obtain

$$\begin{aligned} V_{ds}^{-1}= & {} V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\big (V_{ud}^{-1}+Z_{ds}^\prime V_{eds}^{-1}Z_{ds}\big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\\= & {} V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}. \end{aligned}$$
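This inversion identity can be checked numerically. The following minimal Python sketch (with illustrative parameter values, not the paper's fitted ones) inverts \(V_{ds}\) while computing only \(m\times m\) inverses, exactly as in the display above:

```python
import numpy as np

m, n_d = 2, 5  # vector dimension and domain sample size (illustrative)
V_ud = np.array([[0.75, -0.3], [-0.3, 0.75]])    # random-effect covariance
V_edj = np.array([[0.5, 0.2], [0.2, 0.5]])       # common error covariance

Z_ds = np.tile(np.eye(m), (n_d, 1))              # col_{1<=j<=n_d}(I_m)
V_eds = np.kron(np.eye(n_d), V_edj)              # diag_{1<=j<=n_d}(V_edj)
V_ds = Z_ds @ V_ud @ Z_ds.T + V_eds

# Inversion via the displayed identity: only m x m matrices are inverted
V_eds_inv = np.kron(np.eye(n_d), np.linalg.inv(V_edj))
middle = np.linalg.inv(np.linalg.inv(V_ud) + n_d * np.linalg.inv(V_edj))
V_ds_inv = V_eds_inv - V_eds_inv @ Z_ds @ middle @ Z_ds.T @ V_eds_inv

assert np.allclose(V_ds_inv, np.linalg.inv(V_ds))
```

This matters in practice because \(V_{ds}\) is \(mn_d\times mn_d\), while the identity reduces the work to a few \(m\times m\) inversions per domain.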

As the sample indexes are fixed, the out-of-sample sub-vectors \(y_{dr}\) follow the marginal models derived from the population model (2.3), i.e.

$$\begin{aligned} y_{dr}=X_{dr}\beta +Z_{dr}u_d+e_{dr},\quad d=1,\ldots ,D, \end{aligned}$$

where \(u_d\sim N_m(0,{V}_{ud})\) and \(e_{dr}\sim N_{m(N_d-n_d)}(0,V_{edr})\) are independent and \(V_{edr}=\underset{n_d+1\le j \le N_d}{\hbox {diag}}(V_{edj})\). The vectors \(y_{dr}\) are independent with \(y_{dr}\sim N_{m(N_d-n_d)}(\mu _{dr},V_{dr})\), \(\mu _{dr}=X_{dr}\beta \), \(V_{dr}=Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}\). The covariance matrix between \(y_{dr}\) and \(y_{ds}\) is

$$\begin{aligned} V_{drs}=\text{ cov }(y_{dr},y_{ds})= & {} \text{ cov }(X_{dr}\beta +Z_{dr}u_d+e_{dr}, X_{ds}\beta +Z_{ds}u_d+e_{ds})\\= & {} Z_{dr}\text{ var }(u_d)Z_{ds}^\prime = Z_{dr}V_{ud}Z_{ds}^\prime . \end{aligned}$$

The distribution of \(y_{dr}\), given the sample data \(y_{s}\), is

$$\begin{aligned} y_{dr}|y_{s}\sim y_{dr}|y_{ds}\sim N(\mu _{dr|s},V_{dr|s}). \end{aligned}$$

The conditional \((N_d-n_d)\times 1\) mean vector is

$$\begin{aligned} \mu _{dr|s}= & {} \mu _{dr}+V_{drs}V_{ds}^{-1}(y_{ds}-\mu _{ds})= X_{dr}\beta +Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}(y_{ds}-X_{ds}\beta ) \\= & {} X_{dr}\beta +Z_{dr}V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \}(y_{ds}-X_{ds}\beta ). \end{aligned}$$

The conditional covariance matrix is

$$\begin{aligned} V_{dr|s}= & {} V_{dr}-V_{drs}V_{ds}^{-1}V_{dsr} = Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr} -Z_{dr}V_{ud}Z_{ds}^\prime V_{ds}^{-1}Z_{ds}V_{ud}Z_{dr}^\prime \\= & {} Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr} -Z_{dr}V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \} Z_{ds}V_{ud}Z_{dr}^\prime \\= & {} Z_{dr}V_{ud}Z_{dr}^\prime +V_{edr}-{n_d}Z_{dr}V_{ud}V_{edj}^{-1}V_{ud}Z_{dr}^\prime +n_d^2Z_{dr}V_{ud}V_{edj}^{-1}\Big (V_{ud}^{-1}+{n_d}V_{edj}^{-1}\Big )^{-1}V_{edj}^{-1}V_{ud}Z_{dr}^\prime . \end{aligned}$$

Note that

$$\begin{aligned} Z_{ds}^\prime V_{eds}^{-1}(y_{ds}-X_{ds}\beta )=\sum _{j=1}^{n_d}V_{edj}^{-1}(y_{dj}-X_{dj}\beta ). \end{aligned}$$

If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times 1\) mean vector is

$$\begin{aligned} \mu _{dj|s}= & {} X_{dj}\beta +V_{ud}Z_{ds}^\prime \Big \{V_{eds}^{-1}-V_{eds}^{-1}Z_{ds}\big (V_{ud}^{-1}+n_dV_{edj}^{-1}\big )^{-1}Z_{ds}^\prime V_{eds}^{-1}\Big \}(y_{ds}-X_{ds}\beta ) \\= & {} X_{dj}\beta +V_{ud}\Big \{I_m-n_dV_{edj}^{-1}\big (V_{ud}^{-1}+n_dV_{edj}^{-1}\big )^{-1}\Big \}\sum _{i=1}^{n_d}V_{edi}^{-1}(y_{di}-X_{di}\beta ). \end{aligned}$$

If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times 1\) mean vector is

$$\begin{aligned} \mu _{dj|s}=X_{dj}\beta . \end{aligned}$$

If \(n_d\ne 0\) and \(j\in r_d\), \(j>n_d\), the conditional \(m\times m\) covariance matrix is

$$\begin{aligned} V_{dj|s}=V_{d|s}=V_{ud}+V_{edj}-n_dV_{ud}V_{edj}^{-1}V_{ud} +n_d^2V_{ud}V_{edj}^{-1}\Big (V_{ud}^{-1}+n_dV_{edj}^{-1}\Big )^{-1}V_{edj}^{-1}V_{ud}. \end{aligned}$$

If \(n_d=0\) and \(j\in r_d\), the conditional \(m\times m\) covariance matrix is

$$\begin{aligned} V_{dj|s}=V_{d|s}=V_{ud}+V_{edj}. \end{aligned}$$
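The equivalence between this closed-form conditional covariance and the direct formula \(V_{dr}-V_{drs}V_{ds}^{-1}V_{dsr}\) can be verified numerically; the following minimal Python sketch does so for a single out-of-sample unit with illustrative parameter values:

```python
import numpy as np

n_d = 5                                           # in-sample size of domain d
V_ud = np.array([[0.75, -0.3], [-0.3, 0.75]])     # random-effect covariance
V_edj = np.array([[0.5, 0.2], [0.2, 0.5]])        # common error covariance
m = V_ud.shape[0]

# Direct conditional covariance from the joint normal of (y_dj, y_ds)
Z_ds = np.tile(np.eye(m), (n_d, 1))
V_ds = Z_ds @ V_ud @ Z_ds.T + np.kron(np.eye(n_d), V_edj)
V_djs = V_ud @ Z_ds.T                             # cov(y_dj, y_ds), j out of sample
direct = V_ud + V_edj - V_djs @ np.linalg.inv(V_ds) @ V_djs.T

# Closed-form expression for V_{d|s} involving only m x m inverses
Ve_inv = np.linalg.inv(V_edj)
M = np.linalg.inv(np.linalg.inv(V_ud) + n_d * Ve_inv)
closed = (V_ud + V_edj - n_d * V_ud @ Ve_inv @ V_ud
          + n_d**2 * V_ud @ Ve_inv @ M @ Ve_inv @ V_ud)

assert np.allclose(direct, closed)
```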

3 Predictors of average compositions

This section deals with the problem of predicting the domain average compositions \(A_{dk}\), \(d=1,\ldots ,D\), \(k=1,\ldots ,m+1\), defined in (2.1). As explained in Sect. 2 and Appendix A, we first transform the m-part compositions \({a}_{dj}=({a}_{dj1},\ldots ,{a}_{djm})^\prime \) into vectors of \(R^m\). This is done by applying a one-to-one function \(h=(h_1,\ldots ,h_m)^\prime :{{\mathcal {S}}}^m\mapsto R^m\). The transformed vectors \(y_{dj}=h(a_{dj})\) have components \(y_{dj1}=h_1(a_{dj}),\ldots ,y_{djm}=h_m(a_{dj})\). Let \(h^{-1}=(h_1^{-1},\ldots ,h_m^{-1})^\prime :R^m\mapsto {{\mathcal {S}}}^m\) be the inverse function of h, so that \(a_{dj1}=h_1^{-1}(y_{dj}),\ldots ,a_{djm}=h_m^{-1}(y_{dj})\).

For estimating \(A_{dk}\), \(k=1,\ldots ,m+1\), we assume that \(y_{dj}=(y_{dj1},\ldots ,y_{djm})^\prime \) follows the MNER model (2.2). For \(d=1,\ldots ,D\), the target parameters can be written as

$$\begin{aligned} A_{dk}=\frac{1}{N_d}\sum _{j=1}^{N_d}h_k^{-1}(y_{dj}),\,\,\, k=1,\ldots ,m;\quad A_{dm+1}=1-A_{d1}-\ldots -A_{dm}. \end{aligned}$$

The EBP of \(A_{dk}\) is

$$\begin{aligned} \hat{A}_{dk}^{eb}= & {} \frac{1}{N_d}\Big \{\sum _{j\in s_{d}}h_k^{-1}(y_{dj}) +\sum _{j\in r_{d}}E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\Big \},\quad k=1,\ldots ,m;\\ \hat{A}_{dm+1}^{eb}= & {} 1-\hat{A}_{d1}^{eb}-\ldots -\hat{A}_{dm}^{eb}. \end{aligned}$$

For a general function h, the expected values above might not be analytically tractable. When this occurs, the following Monte Carlo procedure can be applied.

  (a)

    Estimate the unknown parameter \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\) using sample data \((y_s,X_s)\).

  (b)

Replacing \(\psi =(\beta ^{\prime },\theta ^\prime )^\prime \) by the estimate \({\hat{\psi }}=({\hat{\beta }}^\prime ,{\hat{\theta }}^\prime )^\prime \) obtained in (a), draw L copies of each out-of-sample vector \(y_{dj}\) as

    $$\begin{aligned} y_{dj}^{(\ell )}\sim N_m({\hat{\mu }}_{dj|s},\hat{V}_{d|s}),\quad j\in r_{d},\,\, d=1,\ldots ,D,\,\, \ell =1,\ldots ,L, \end{aligned}$$

    where

    $$\begin{aligned} {\hat{\mu }}_{dj|s}=\left\{ \begin{array}{ll} X_{dj}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{dj}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$

    and

    $$\begin{aligned} \hat{V}_{d|s}=\left\{ \begin{array}{ll} \hat{V}_{ud}+\hat{V}_{edj}-n_d\hat{V}_{ud}\hat{V}_{edj}^{-1}\hat{V}_{ud} +n_d^2\hat{V}_{ud}\hat{V}_{edj}^{-1}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}\hat{V}_{edj}^{-1}\hat{V}_{ud}&{} \text{ if } \, n_d\ne 0, \\ \hat{V}_{ud}+\hat{V}_{edj}&{} \text{ if } \, n_d=0. \end{array}\right. \end{aligned}$$
  (c)

    The Monte Carlo approximation of the expected value is

    $$\begin{aligned} E_{y_r}\big [h_k^{-1}(y_{dj})|y_s;{\hat{\psi }}\big ]\approx \frac{1}{L}\sum _{\ell =1}^L h_k^{-1}(y_{dj}^{(\ell )}),\,\,\, j\in r_{d},\,\,\,d=1,\ldots ,D. \end{aligned}$$

    The Monte Carlo approximation of the EBP of \(A_{dk}\) is

    $$\begin{aligned} \hat{A}_{dk}^{eb}\approx \frac{1}{L}\sum _{\ell =1}^LA_{dk}^{(\ell )},\,\,\, A_{dk}^{(\ell )}=\frac{1}{N_d}\bigg (\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}} h_k^{-1}(y_{dj}^{(\ell )})\bigg ),\,\,\,k=1,\ldots ,m. \end{aligned}$$
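Steps (a)-(c) can be sketched as follows for a single domain under the alr transformation. All values below are illustrative (they are not the paper's fitted quantities), and the fitted conditional moments \({\hat{\mu }}_{dj|s}\) and \(\hat{V}_{d|s}\) are assumed given:

```python
import numpy as np

rng = np.random.default_rng(42)

def alr_inv(y):
    """Inverse additive logratio: maps R^2 to a 3-part composition."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

# Hypothetical fitted conditional moments for one domain d (m = 2)
mu_dj_s = np.array([0.2, -0.1])                 # plays the role of mu_hat_{dj|s}
V_d_s = np.array([[0.6, -0.2], [-0.2, 0.6]])    # plays the role of V_hat_{d|s}

y_sample = rng.multivariate_normal(mu_dj_s, V_d_s, size=8)  # observed part, n_d = 8
r_size, L = 192, 200                            # out-of-sample units and MC copies

A_copies = np.empty((L, 3))
for ell in range(L):
    # Step (b): draw the out-of-sample responses from their conditional law
    y_out = rng.multivariate_normal(mu_dj_s, V_d_s, size=r_size)
    comp = np.apply_along_axis(alr_inv, 1, np.vstack([y_sample, y_out]))
    A_copies[ell] = comp.mean(axis=0)           # A_{dk}^{(ell)}, k = 1, 2, 3
A_eb = A_copies.mean(axis=0)                    # Step (c): Monte Carlo EBP

assert np.isclose(A_eb.sum(), 1.0)              # predicted composition sums to one
```

Because \(h^{-1}\) returns a valid composition for every draw, the Monte Carlo EBP automatically yields components that are positive and sum to one.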

The plug-in estimator of \(A_{dk}\) is

$$\begin{aligned} \hat{A}_{dk}^{in}=\frac{1}{N_d}\sum _{j\in U_d}h_k^{-1}(\hat{y}_{dj}^{eb}) =\frac{1}{N_d}\bigg \{\sum _{j\in s_{d}}h_k^{-1}(y_{dj})+\sum _{j\in r_{d}}h_k^{-1}(\hat{\mu }_{dj|s})\bigg \},\,\,\,k=1,\ldots ,m, \end{aligned}$$

and \(\hat{A}_{dm+1}^{in}=1-\hat{A}_{d1}^{in}-\ldots -\hat{A}_{dm}^{in}\).

Remark 3.1

In many practical cases, the values of the auxiliary variables are not available for all the population units. If, in addition, some of these variables are continuous, the EBP method is not applicable. An important particular case, where this method is applicable, is when the vector of auxiliary variables takes a finite number of values. More concretely, suppose that the covariates are categorical, so that \(X_{dj}\in \{X_{01},\ldots ,X_{0T}\}\); then we can calculate \(A_{dk}^{(\ell )}\) as

$$\begin{aligned} A_{dk}^{(\ell )}=\frac{1}{N_d}\left[ \sum _{j=1}^{n_{d}}h_k^{-1}(y_{dj}) + \sum _{t=1}^{T}\sum _{j=1}^{N_{dt}-n_{dt}} h_k^{-1}(y_{dtj}^{(\ell )})\right] , \end{aligned}$$

where \(N_{dt}=\#\{j\in U_{d}:\,X_{dj}=X_{0t}\}\) is available from external data sources (aggregated auxiliary information), \(n_{dt}=\#\{j\in s_{d}:\,X_{dj}=X_{0t}\}\), \(y_{dtj}^{(\ell )}\sim N_m({\hat{\mu }}_{dt|s},\hat{V}_{d|s})\), \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{dt}-n_{dt}\), \(t=1,\ldots ,T\), \(\ell =1,\ldots ,L\), and

$$\begin{aligned} {\hat{\mu }}_{dt|s}=\left\{ \begin{array}{ll} X_{0t}{\hat{\beta }}+\hat{V}_{ud}Z_{ds}^\prime \Big \{\hat{V}_{eds}^{-1}-\hat{V}_{eds}^{-1}Z_{ds}\Big (\hat{V}_{ud}^{-1}+n_d\hat{V}_{edj}^{-1}\Big )^{-1}Z_{ds}^\prime \hat{V}_{eds}^{-1}\Big \}(y_{ds}-X_{ds}{\hat{\beta }})&{} \text{ if } \, n_d\ne 0, \\ X_{0t}{\hat{\beta }}&{} \text{ if } \, n_d=0, \end{array}\right. \end{aligned}$$

and \(\hat{V}_{d|s}\) was defined in Step (b) of the above Monte Carlo procedure.

Remark 3.2

If some auxiliary variables are continuous, we can use the Hájek-type approximation to \(A_{dk}^{(\ell )}\), i.e.

$$\begin{aligned} A_{dk}^{(\ell )}\approx \frac{1}{N_d}\sum _{j\in s_{d}}w_{dj}h_k^{-1}(y_{dj}^{(\ell )}), \end{aligned}$$

where \(w_{dj}\) is the sample weight of unit j of domain d. A GREG-type approximation to \(A_{dk}^{(\ell )}\) is

$$\begin{aligned} A_{dk}^{(\ell )}\approx \frac{1}{N_d}\bigg (\sum _{j\in s_{d}}\big \{h_k^{-1}(y_{dj})-h_k^{-1}(y_{dj}^{(\ell )})\big \} +\sum _{j\in s_{d}} \tilde{w}_{dj}h_k^{-1}(y_{dj}^{(\ell )})\bigg ), \end{aligned}$$

where \( \tilde{w}_{dj}=w_{dj}N_d/\hat{N}_d\), \(\hat{N}_d=\sum _{j\in s_d}w_{dj}\).

Analytical approximations to the MSE are difficult to derive in the case of complex parameters. We therefore propose a parametric bootstrap MSE estimator by following the bootstrap method for finite populations of González-Manteiga et al. (2008b). The steps for implementing this method are

  1.

    Fit the model (2.5) to sample data \((y_s,X_s)\) and calculate an estimator \({\hat{\psi }}=({\hat{\beta }}^{\prime },{\hat{\theta }}^\prime )^{\prime }\) of \(\psi =(\beta ^{\prime },\theta ^\prime )^{\prime }\).

  2.

    For \(d=1,\ldots ,D\), \(j=1,\ldots ,N_{d}\), generate independently \(u_{d}^{*}\sim N(0,\hat{V}_{ud})\) and \(e_{dj}^{*}\sim N(0,\hat{V}_{edj})\), where \(\hat{V}_{ud}=V_{ud}(\hat{\theta })\) and \(\hat{V}_{edj}=V_{edj}(\hat{\theta })\).

  3.

    Construct the bootstrap superpopulation model \(\xi ^*\) using \(\{u_{d}^{*}\}\), \(\{e_{dj}^{*}\}\), \(\{X_{dj}\}\) and \(\hat{\beta }\), i.e.

    $$\begin{aligned} \xi ^{*}:\, y_{dj}^{*}=X_{dj}\hat{\beta }+u_{d}^{*}+e_{dj}^{*},\,\,d=1,\ldots ,D, j=1,\ldots ,N_{d}. \end{aligned}$$
    (3.1)
  4.

    Under the bootstrap superpopulation model (3.1), generate a large number B of i.i.d. bootstrap populations \(\{y_{dj}^{*(b)}:\,d=1,\ldots ,D, j=1,\ldots ,N_{d}\}\) and calculate the bootstrap population parameters

    $$\begin{aligned} A_{dk}^{*(b)}=\frac{1}{N_d}\sum _{j=1}^{N_{d}}h_k^{-1}(y_{dj}^{*(b)}),\quad k=1,\ldots ,m,\,\,\, b=1,\ldots ,B. \end{aligned}$$
  5.

    From each bootstrap population b generated in Step 4, take the sample with the same indices \(s\subset U\) as the initial sample, and calculate the bootstrap EBPs, \(\hat{A}_{dk}^{eb*(b)}\), \(k=1,\ldots ,m\), as described in Sect. 3, using the bootstrap sample vector \(y_s^*\) and the known values \(X_{dj}\).

  6.

    A Monte Carlo approximation to the theoretical bootstrap estimator

    $$\begin{aligned} MSE_*(\hat{A}_{dk}^{eb*})=E_{\xi ^*}\big [(\hat{A}_{dk}^{eb*}-A_{dk}^{*})(\hat{A}_{dk}^{eb*}-A_{dk}^{*})^\prime \big ],\quad k=1,\ldots ,m, \end{aligned}$$

    is

    $$\begin{aligned} mse_*(\hat{A}_{dk}^{eb*})=\frac{1}{B}\sum _{b=1}^B(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})(\hat{A}_{dk}^{eb*(b)}-A_{dk}^{*(b)})^\prime ,\quad k=1,\ldots ,m. \end{aligned}$$
    (3.2)

    The estimator (3.2) is used to estimate \(MSE(\hat{A}_{dk}^{eb})\), \(k=1,\ldots ,m\).
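The population-generation part of the bootstrap (Steps 2-4) can be sketched as follows under the alr transformation. The "fitted" parameters below are assumed rather than estimated, the design matrices are generated as in the simulations of Sect. 4, and Steps 5-6 (refitting and averaging) are only indicated in comments (a minimal Python illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed "fitted" parameters and illustrative sizes (not from a real fit)
D, N_d, m = 10, 40, 2
beta_hat = np.full(4, 10.0)
V_ud_hat = np.array([[0.75, -0.3], [-0.3, 0.75]])
V_edj_hat = np.array([[0.5, 0.2], [0.2, 0.5]])

def alr_inv(y):
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

B = 20
boot_params = np.empty((B, D, m + 1))           # A_{dk}^{*(b)}
for b in range(B):
    for d in range(D):
        u_d = rng.multivariate_normal(np.zeros(m), V_ud_hat)       # Step 2
        comps = []
        for j in range(N_d):
            x1 = np.array([1.0, rng.integers(0, 2)])               # x_{dj1}
            x2 = np.array([1.0, rng.integers(0, 2)])               # x_{dj2}
            X_dj = np.block([[x1, np.zeros(2)], [np.zeros(2), x2]])
            e_dj = rng.multivariate_normal(np.zeros(m), V_edj_hat)
            y_star = X_dj @ beta_hat + u_d + e_dj                  # model (3.1)
            comps.append(alr_inv(y_star))                          # back-transform
        boot_params[b, d] = np.mean(comps, axis=0)                 # Step 4
# Steps 5-6: refit the model on each bootstrap sample, compute the bootstrap
# EBPs and average their squared deviations from boot_params (omitted here).

assert boot_params.shape == (B, D, m + 1)
```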

4 Simulations

The simulation experiments empirically investigate the asymptotic behavior of: (1) the REML estimators of model parameters in Sect. 4.1 and Appendix B.1, (2) the EBP and plug-in predictors of domain average compositions in Sect. 4.2 and Appendix B.2, and (3) the parametric bootstrap MSE estimators in Sect. 4.3 and Appendix B.3.

To meet these three objectives, we consider a basic scenario in which we run simulations for different sample sizes. Take \(m=2\), \(p_1=p_2=2\), \(p=4\), \(\beta _1=(\beta _{11},\beta _{12})^\prime =(10,10)^\prime \) and \(\beta _2=(\beta _{21},\beta _{22})^\prime =(10,10)^\prime \). For \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), generate \(X_{dj}=\text{ diag }(x_{dj1},x_{dj2})_{2\times 4}\), where \(x_{dj1}=(x_{dj11},x_{dj12})\), \(x_{dj2}=(x_{dj21},x_{dj22})\) and

$$\begin{aligned} x_{dj11}=x_{dj21}=1,\quad x_{dj12}\sim \text{ Bin }(1,1/2),\quad x_{dj22}\sim \text{ Bin }(1,1/2). \end{aligned}$$

For \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), simulate \({u}_{d}\sim N_{2}(0,{V}_{ud})\) and \({e}_{dj}\sim N_{2}(0,{V}_{edj})\), where

$$\begin{aligned} V_{ud}=\left( \begin{array}{cc} \theta _1&{}\theta _3\sqrt{\theta _1}\sqrt{\theta _2}\\ \theta _3\sqrt{\theta _1}\sqrt{\theta _2}&{}\theta _2\\ \end{array}\right) ,\,\,\, V_{edj}=\left( \begin{array}{cc} \theta _4&{}\theta _{6}\sqrt{\theta _4}\sqrt{\theta _5}\\ \theta _{6}\sqrt{\theta _4}\sqrt{\theta _5}&{}\theta _5\\ \end{array}\right) . \end{aligned}$$

where \(\theta _1=0.75\), \(\theta _2=0.75\), \(\theta _3=-0.4\), \(\theta _4=0.5\), \(\theta _5=0.5\) and \(\theta _6=0.4\). Simulation 1 generates only four different matrices \(X_{dj}\), namely

$$\begin{aligned} X_{dj}=\left( \begin{array}{cc|cc} x_{dj11}&{}x_{dj12}&{}0&{}0\\ \hline 0&{}0&{}x_{dj21}&{}x_{dj22}\end{array}\right) \in \big \{ X_{01},X_{02},X_{03},X_{04}\big \}, \end{aligned}$$

where

$$\begin{aligned} X_{01}=\left( \begin{array}{cc|cc} 1&{}0&{}0&{}0\\ \hline 0&{}0&{}1&{}0\end{array}\right) , X_{02}=\left( \begin{array}{cc|cc} 1&{}0&{}0&{}0\\ \hline 0&{}0&{}1&{}1\end{array}\right) , X_{03}=\left( \begin{array}{cc|cc} 1&{}1&{}0&{}0\\ \hline 0&{}0&{}1&{}0\end{array}\right) , X_{04}=\left( \begin{array}{cc|cc} 1&{}1&{}0&{}0\\ \hline 0&{}0&{}1&{}1\end{array}\right) . \end{aligned}$$

4.1 Simulation 1 for REML estimators

The target of Simulation 1 is to check the behavior of the REML algorithm for fitting the MNER model (2.5). This simulation runs \(I=200\) iterations. Appendix B.1 gives the steps of Simulation 1 and the definitions of the absolute and relative performance measures. For every REML estimator \(\hat{\eta }\in \{\hat{\beta }_{11},\hat{\beta }_{12},\hat{\beta }_{21},\hat{\beta }_{22},\hat{\theta }_{1},\ldots ,\hat{\theta }_{6}\}\), Tables 1 and 2 present the relative bias \(RB(\hat{\eta })\) and the relative root-mean-squared error \(RRE(\hat{\eta })\) in %. Appendix B.1 gives the corresponding absolute performance measures. Simulation 1 shows that the REML Fisher-scoring algorithm works properly because \(RB({\hat{\eta }})\) and \(RRE(\hat{\eta })\) decrease as \(n_d\) or D increase.

Table 1 \(RB(\hat{\eta })\) (left) and \(RRE(\hat{\eta })\) (right) with \(n_d=10\)
Table 2 \(RB(\hat{\eta })\) (left) and \(RRE(\hat{\eta })\) (right) with \(D=50\)

4.2 Simulation 2 for EBPs

Simulation 2 investigates the EBP and plug-in predictors, \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), respectively, \(k=1,2,3\). It takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.2 gives the steps of Simulation 2 and the definitions of the absolute and relative performance measures. Tables 3, 4 and 5 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. Appendix B.2 gives the corresponding absolute performance measures.

Table 3 \(RAB_k\) (left) and \(RRE_k\) (right) for clr with \(D=50\)
Table 4 \(RAB_k\) (left) and \(RRE_k\) (right) for alr with \(D=50\)
Table 5 \(RAB_k\) (left) and \(RRE_k\) (right) for ilr with \(D=50\)

The performance measures decrease as the sample sizes \(n_d\) increase, and the EBP obtains better results (RAB and RRE) than the plug-in predictor. Note that for each transformation the data generation, and therefore the true underlying model, is different. For this reason, the results in Tables 3, 4 and 5 are not comparable. It is interesting to observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, the results are slightly better than in the clr and ilr cases.

4.3 Simulation 3 for MSEs

Simulation 3 investigates the MSE estimators of predictors \(\hat{A}_{dk}^{eb}\) and \(\hat{A}_{dk}^{in}\), \(k=1,2,3\). One of the goals is to give a recommendation on the number of bootstrap replicates B to implement. The simulation takes \(I=200\) iterations and generates \(L=200\) random vectors for the Monte Carlo approximations of integrals. The population sizes are \(N_d=200\) and \(D=50\). Let h be the clr, alr or ilr transformation. Appendix B.3 gives the steps of Simulation 3 and the definitions of the absolute and relative performance measures.

Table 6 \(RAB_k\) (left) and \(RRE_k\) (right) for clr with \(D=50\) and \(n_d=10\)
Table 7 \(RAB_k\) (left) and \(RRE_k\) (right) for alr with \(D=50\) and \(n_d=10\)
Table 8 \(RAB_k\) (left) and \(RRE_k\) (right) for ilr with \(D=50\) and \(n_d=10\)

Tables 6, 7 and 8 present the relative absolute bias \(RAB_k\) and the relative root-mean-squared error \(RRE_k\) in %, \(k=1,2,3\), for the clr, alr and ilr transformations, respectively. The number of bootstrap replicates is \(B=50, 100, 200, 300, 400\). Appendix B.3 gives the corresponding absolute performance measures. As in Simulation 2, we remark that the results in Tables 6, 7 and 8 are not comparable because the data generation is different. Nevertheless, we observe that if the data are generated by the MNER model derived from the alr transformation and its corresponding EBP is used, Simulation 3 gives slightly better results than in the clr or ilr cases. That is, the functional form of the transformation plays a non-negligible role. In any case, the selection of the transformation in an application to real data must be based on the diagnostics of the fitted MNER model.

Figures 1 and 2 show the boxplots of \(RRE_{dk}\) and \(RAB_{dk}\) for the predictors \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), with the clr transformation. From the obtained performance measures, we recommend implementing the bootstrap algorithm with at least \(B=300\) replicates. Appendix B.3 gives the same recommendation for the alr and ilr transformations.
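The parametric bootstrap behind these MSE estimators regenerates data from the fitted model \(B\) times and averages squared prediction errors. The sketch below illustrates this loop on a deliberately simple hypothetical problem (predicting a finite-population mean by a sample mean under a fitted normal model), not on the MNER model itself; the toy setting is chosen because its theoretical MSE, \(\hat{\sigma}^2(1/n-1/N)\), is known and checkable.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_mse(one_replicate, B=300):
    """Parametric bootstrap: average of squared prediction errors over B replicates."""
    pairs = [one_replicate() for _ in range(B)]
    return float(np.mean([(pred - true) ** 2 for pred, true in pairs]))

# Hypothetical toy problem: predict the mean of a bootstrap population of
# N = 200 units by the mean of a sample of n = 10, regenerating the data
# from the fitted model N(mu_hat, s2_hat).  The theoretical MSE is
# s2_hat * (1/n - 1/N) = 0.095 here.
mu_hat, s2_hat, N, n = 0.0, 1.0, 200, 10

def one_replicate():
    y = rng.normal(mu_hat, np.sqrt(s2_hat), N)   # bootstrap population
    return y[:n].mean(), y.mean()                # (predictor, true value)

mse_hat = bootstrap_mse(one_replicate, B=300)
```

Increasing B reduces only the bootstrap noise of the MSE estimate, which is why the simulations compare B = 50 up to 400 before settling on B = 300.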

Fig. 1: \(RRE_{dk}\) (in %) of MSE estimators for \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), for clr

Fig. 2: \(RAB_{dk}\) (in %) of MSE estimators for \(\hat{A}^{eb}_{dk}\), \(k=1,2,3\), for clr

5 The Spanish Household Budget Survey (SHBS)

The SHBS is carried out annually by the “Instituto Nacional de Estadística” (INE), with the objective of obtaining information on the nature and destination of consumption expenses, as well as on various characteristics related to the conditions of household life. In the Spanish economy, it is important to have good estimates of consumption spending, since this spending represents approximately \(60\%\) of gross domestic product. However, global policy measures are often not satisfactory for regional authorities, which can also develop their own economic strategies. They need tools to determine, with precision and reliability, the main variables and consumption indicators in order to implement these strategies. Among the main consumption indicators are the proportions of annual household expenses on food and housing. This section presents an application of the new statistical methodology to the estimation of domain parameters defined as averages of proportions of annual household expenditures. We take data from the SHBS of 2016. The domains are the 50 Spanish provinces plus the autonomous cities of Ceuta and Melilla, so that \(D=52\).

Let \(a_{dj1}\), \(a_{dj2}\) and \(a_{dj3}\) be the proportions of annual expenditures on food, housing and other for household j of domain d. Housing includes expenditure on current housing costs, water, electricity, gas and other fuels. Food includes food and nonalcoholic beverages, and other represents the remaining expenditures. The vectors \(a_{dj}=(a_{dj1},a_{dj2})^\prime \in R^2\) are 2-part compositions that can be transformed into vectors \(y_{dj}=h(a_{dj})\) of \(R^2\) by one of the transformations h described in Appendix A. Let \(x_{djk}\), \(d=1,\ldots ,D\), \(j=1,\ldots ,n_d\), \(k=1,2\), be the \(4\times 1\) vector whose components are the binary auxiliary variables that indicate the composition of household j in domain d. As auxiliary variables, we thus consider the household composition HC with categories

HC1: Single person or adult couple with at least one member aged over 65,

HC2: Other compositions with a single person or a couple without children,

HC3: Couple with children under 16 years old or adult with children under 16 years old,

HC4: Other households.

The variable HC is treated as a factor with reference category HC4.
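The transformations h are defined in Appendix A, which is not reproduced here. As a reference point, the sketch below uses the standard textbook definitions of the clr, alr and ilr transformations for a 3-part composition (food, housing, other), with one common choice of ilr contrast matrix; the numerical shares are purely illustrative.

```python
import numpy as np

def clr(a):
    """Centered logratio: log(a_k / geometric mean of a); coordinates sum to 0."""
    la = np.log(a)
    return la - la.mean()

def alr(a):
    """Additive logratio: log of the first parts over the last part."""
    return np.log(a[:-1] / a[-1])

def ilr(a):
    """Isometric logratio for a 3-part composition (one orthonormal basis choice)."""
    V = np.array([[1 / np.sqrt(2), -1 / np.sqrt(2), 0.0],
                  [1 / np.sqrt(6), 1 / np.sqrt(6), -2 / np.sqrt(6)]])
    return V @ clr(a)

# Illustrative shares for (food, housing, other); they must be positive and sum to 1.
a = np.array([0.146, 0.310, 0.544])
y_alr, y_ilr = alr(a), ilr(a)
```

The alr and ilr maps send a 3-part composition to an unconstrained vector of \(R^2\); the clr vector lives in the zero-sum hyperplane of \(R^3\), and the ilr coordinates are an isometric re-expression of it.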

For calculating the EBPs of the domain parameters of interest, we need the true population sizes, \(N_{dt}\), of the crossings of provinces with the categories of the variable HC. We calculate these sizes by using the sampling weights of the Spanish Labor Force Survey (SLFS). The SLFS sampling weights are calibrated to the population sizes of the provinces crossed with sex and age groups. These demographic quantities come from the INE population projection system and they are considered the most accurate demographic figures in Spain. On the other hand, the SHBS sampling weights are calibrated to the population sizes of the autonomous communities (NUTS 2) crossed with sex and age groups, which are not the domains of interest.

This section presents a statistical analysis applying the centered logratio transformation. This choice is due to the good fit of the MNER model to the transformed data. For the sake of completeness, Appendix C presents the corresponding data analysis for the alr and ilr transformations. Table 9 presents the estimates of the regression parameters, the z-values, the standard errors and the asymptotic p-values. The factor HC is significant for \(y_1\) and \(y_2\). Table 10 presents the asymptotic 95% confidence intervals (L.CI, U.CI) for the variance component parameters. None of them contains zero.

Table 9 Regression parameters
Table 10 Variance and correlation parameters

For calculating the asymptotic p-values and confidence intervals of Tables 9 and 10, we take the asymptotic distributions of the REML estimators \({\hat{\theta }}\) and \({\hat{\beta }}\), i.e.

$$\begin{aligned} \hat{\theta }\sim N_{6}(\theta , {F}_s^{-1}(\theta )),\quad \hat{\beta }\sim N_p(\beta , ({X}_s^{\prime }{V}_s^{-1}{X}_s)^{-1}), \end{aligned}$$

where \(F_s\) is the REML Fisher information matrix. If \(\beta _0\) denotes the observed value of \(\hat{\beta }_i\), the asymptotic p-value for testing the hypothesis \(H_0:\,\beta _i=0\) is

$$\begin{aligned} \text{ p-value }=2P_{H_0}(|\hat{\beta }_i|>|\beta _0|)=2P(N(0,1)> |\beta _0|/\sqrt{q_{ii}}\,), \end{aligned}$$

where \(({X}_s^{\prime }{V}_s^{-1}({\hat{\theta }}){X}_s)^{-1}=(q_{ij})_{i,j=1,\ldots ,p}\) and \(\beta _i\) denotes the i-th component of the vector \(\beta \). The asymptotic \((1-\alpha )\)-level confidence intervals for the components \(\theta _{\ell }\) of \(\theta \) are

$$\begin{aligned} \hat{\theta }_{\ell }\pm z_{\alpha /2}\,\nu _{\ell \ell }^{1/2},\,\, \ell =1,\ldots ,6,\, \end{aligned}$$

where \({F}_s^{-1}(\hat{\theta })=(\nu _{ab})_{a,b=1,\ldots ,6}\) and \(z_{\alpha }\) is the upper \(\alpha \)-quantile of the N(0, 1) distribution.
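Given the estimates and their asymptotic standard errors, these p-values and intervals are direct to compute. A minimal sketch using only the Python standard library (the function names wald_p_value and wald_ci are ours, not from the paper):

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal N(0, 1)

def wald_p_value(beta_hat, se):
    """Two-sided asymptotic p-value for H0: beta_i = 0,
    i.e. 2 * P(N(0,1) > |beta_hat| / se)."""
    return 2.0 * (1.0 - _N.cdf(abs(beta_hat) / se))

def wald_ci(theta_hat, se, alpha=0.05):
    """Asymptotic (1 - alpha)-level interval theta_hat +/- z_{alpha/2} * se,
    where z_{alpha/2} is the upper alpha/2 quantile of N(0,1)."""
    z = _N.inv_cdf(1.0 - alpha / 2.0)
    return theta_hat - z * se, theta_hat + z * se
```

For example, wald_p_value(1.96, 1.0) is approximately 0.05 and wald_ci(0.0, 1.0) is approximately \((-1.96, 1.96)\), matching the 95% intervals of Table 10.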

Figure 3 plots the histograms of the \(D=52\) standardized EBPs of the random effects of the fitted MNER model for food (left) and housing (right) expenditures. It also plots the corresponding probability density function estimates. The shapes of the densities are quite symmetrical, which indicates that the distributions of the random effects are not very far from normal. Since D is too small to obtain good nonparametric estimates of the density functions, definitive conclusions cannot be drawn.

Fig. 3: Histograms of standardized random effects

Figure 4 gives the histograms of standardized residuals for components \(y_1\) and \(y_2\), together with the corresponding probability density function estimates. We do not observe a large deviation from the normal distribution.

Fig. 4: Histograms of standardized residuals

Figure 5 presents the dispersion plots of standardized residuals versus predicted values (in \(10^4\) euros). Most standardized residuals fall within the interval \((-3,3)\), so we consider that outliers do not play a relevant role in the performance of the EBPs. Appendix C of the supplementary material gives the corresponding plots for the additive and isometric logratio transformations. These plots are similar to the ones shown in Figs. 4 and 5 for the centered logratio transformation. However, Fig. 5 presents more uniform clouds of points in both components than the corresponding figures for the other two transformations. From this graphical diagnosis, we prefer doing the data analysis with the centered logratio transformation. However, since the choice of the clr transformation may be debatable, Appendix C presents the full analysis of the data under the other two transformations.

Fig. 5: Standardized residuals versus predicted values (in \(10^4\) euros)

Figure 6 plots the plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\). The domains are sorted by sample size, which is printed on the OX axis. The figure shows that both predictors follow a similar pattern. This information is completed by Fig. 7, which shows the relative root-MSEs (RRMSE).

Fig. 6: Plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\) in %

Fig. 7: RRMSE of plug-in and EBP predictions of \({a_{d1}}\) and \({a}_{d2}\) in %

Figure 8 (left) maps the proportions of the household annual expenditures on food by Spanish provinces. Figure 8 (right) maps the estimated RRMSEs in %. These figures show that expenditures on food are rather variable between provinces. This happens mostly in the autonomous regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as the Basque Country, where the variability of the estimated proportions is smaller. This information could be of great use to local governments in developing economic plans aimed at households and at improving the quality of life.

Fig. 8: EBP predictions of \({a_{d1}}\) by Spanish provinces in %

Figure 9 (left) maps the proportions of the household annual expenditures on housing by Spanish provinces. Figure 9 (right) maps the estimated RRMSEs in \(\%\). As is the case with food, these figures show that expenditure on housing is rather variable between provinces. The map shows clear differences between the north-central regions, where the proportion of spending is higher, and the southern regions, where household expenditures are lower.

Fig. 9: EBP predictions of \({a}_{d2}\) by Spanish provinces in %

Tables 11 and 12 present some condensed numerical results. The tables are constructed in two steps: first, the domains are sorted by sample size, starting with the domain with the smallest sample size; then, a selection of 14 domains out of 52 is made from the positions \(1, 5, 9,\ldots , 52\). The names and codes of the provinces are labeled province and d, respectively, and the sample sizes \(n_d\). Table 11 presents the model-based predictions of food and housing expenditures by provinces and Table 12 displays the corresponding estimates of the RRMSEs. The plug-in predictors are denoted by in1 and in2 and the EBPs by ebp1 and ebp2.

Table 11 Predictions of \({a}_{d1}\) and \({a}_{d2}\) in %
Table 12 RRMSE estimates for \({a}_{d1}\) and \({a}_{d2}\) in %

6 Conclusions

Compositional data play an important role in official statistics. This paper introduces small area predictors of averages of unit-level vectors of compositions. The proposed methodology is applied to estimate the proportions of annual household expenditures on food, housing and others from the 2016 SHBS at the province level. For this purpose, the manuscript considers the centered logratio transformation of compositions into vectors of \(R^m\). For the sake of completeness, Appendix C of the supplementary material presents the corresponding statistical analysis under the additive and isometric logratio transformations. An MNER model is proposed for analyzing the transformed compositional data, where the vectors of random effects and the vector of model errors have unstructured covariance matrices with unknown components. As usual in linear mixed models, the parameters of the MNER model are estimated by the REML method. The selection of the centered logratio transformation was motivated by the interpretability and diagnostics of the fitted MNER model. In this sense, we followed the recommendations of Greenacre (2019). That is to say, we have tried to provide a simple solution to a practical problem of compositional data.

Of the two proposed predictors, the EBP presents a slightly better performance than the plug-in predictor, as can be seen in the simulation studies. For estimating the MSE, we recommend a parametric bootstrap, following the ideas of González-Manteiga et al. (2008a), with at least \(B=300\) replicates.

As a result of the statistical analysis for the Spanish provinces, we conclude that food expenditure in Spain accounts for \(14.6\%\) of total household expenditure and presents great variability within autonomous communities. This happens mostly in the autonomous regions of Andalucía, Aragón or Castilla León, where there are many provinces and some of them are more deprived than others. In contrast, there are other regions, such as the Basque Country, where the variability of the estimated proportions is smaller. On the other hand, spending on housing in Spain accounts for \(31\%\) of total household spending, and there are important differences between the north-central provinces (with higher incomes) and those in the south.

In this case, we applied the introduced methodology to the SHBS, but it is also useful in other areas of official statistics, such as the classification of the population by educational level or by economic activity. In both situations, it is necessary to take the simplex constraints into account.

We finally note that there are other regression models for compositions, such as directional mixed effects models or Dirichlet regression mixed models. These models can likely be adapted to the SAE context described in Sect. 2, including fitting algorithms, predictors of domain quantities, MSE estimators, and so on. They can be competitive options with respect to fitting a multivariate normal mixed model to logratio transformations of compositions. We believe that these tasks are interesting subjects for future research.