1 Introduction

Estimating income distributions accurately is essential for measuring inequality and poverty and, more generally, for comparing welfare across space and time. An overview of the literature on modeling income distributions, various estimation methods and distribution specifications is available in the book by Kleiber and Kotz (2003) and in the papers in Chotikapanich (2008).

When individual income data are available, estimating the income distribution is quite straightforward. Very often, however, the available income data are scarce, especially for many developing countries, which hampers the derivation of representative income distribution models and inequality statistics. Frequently the income data are available only in grouped form, for example as income deciles or income shares, mean incomes and Gini coefficients. The World Income Inequality Database (WIID), the World Inequality Database (WID) and the World Bank are among the largest databases providing grouped income data. For smaller areas, however, the data may be provided only as income quantiles because of the privacy of personal data and the proximity of the considered areas; an example is the household income data at local levels provided by the French National Institute of Statistics and Economic Studies (INSEE). This paper focuses on estimating the income distribution using only quantile income data and aims at determining a method suitable for such data.

In terms of modeling grouped income data, various approaches have been used depending on the data available. Two main strategies have been developed: nonparametric techniques, for instance employing a nonparametric kernel density function (Sala-i-Martin, 2006), and parametric techniques, which assume that the income distribution follows a parametric model. Parametric models are shown to perform very well when estimating income distributions and inequality measures (Chotikapanich et al., 2007) and even to outperform the nonparametric techniques (Minoiu & Reddy, 2014; Jordá et al., 2021).

For parametric modeling, it is crucial to choose a reliable estimation technique and a suitable parametric distribution model. In addition, the estimation technique has to be adjusted to the type of grouped data, usually grouped data with fixed bounds and random cell size or grouped data with fixed cell size and random bounds. Among the most common estimation techniques is maximum likelihood based on sample proportions using a multinomial likelihood function [see, for example, (McDonald, 1984; Jöhnk & Niermann, 2002; Bandourian et al., 2003; Chotikapanich et al., 2018)]. Eckernkemper and Gribisch (2021) propose a general framework for ML and Bayesian estimation based on grouped data information accounting for known and unknown group boundaries. Another widely used technique is the method-of-moments approach, where population and income shares are matched to their theoretical counterparts. Chotikapanich et al. (2007, 2012) apply it to the beta-2 distribution using population shares and class mean income data. Hajargasht et al. (2012) extended the work of Chotikapanich et al. (2007, 2012) to a generalized method-of-moments (GMM) approach and provided inference for the estimated distributions. Further, Graf and Nedyalkova (2014) suggest minimizing the distance between a set of income indicators and their parametric representations, and Hajargasht and Griffiths (2020) suggest minimum distance estimation of parametric Lorenz curves based on grouped data information.

In this work, we suggest the minimum quantile distance (MQD) method, which is designed especially for quantile data (grouped data with fixed bounds); as mentioned above, these could be the only grouped data available (for example, data from INSEE). Assuming that the income distribution of a country can be modeled with a specific parametric distribution, we estimate the income distribution of each observed country by minimizing the distance between the empirical estimates of the respective country’s income quantiles and their parametric representations. We compare our estimates with those obtained with an ML method. Finally, we verify the results by comparing them with representative microdata.

Some of the earliest research work introducing the minimum quantile distance approach was done by Aitchison and Brown (1957) who applied the method to the log-normal distribution. After Parzen (1979) introduced the density-quantile function, LaRiccia and Wehrly (1985) showed the asymptotic properties of a family of minimum quantile distance estimators and applied it to the three-parameter log-normal distribution. Carmody et al. (1984) applied it to the three-parameter Weibull distribution. Jöhnk and Niermann (2002) compare it with other methods employing the Weibull distribution.

In the present study, we contribute to the literature by examining the performance of the MQD method applied to the generalized beta distribution of the second kind (GB2), which is the most widely used distribution in recent studies on income distribution (Chotikapanich et al., 2018), and to the Dagum (1977) and the Singh–Maddala distributions. We provide the density-quantile functions for the considered distributions and reproducible R code (R Core Team, 2022). Further, we compare the MQD method with the ML method. We estimate the income distribution of 30 European countries using data on their income deciles and quintiles. We use data from Eurostat, namely the European Union Statistics on Income and Living Conditions (EU-SILC 2011) data. Because we have microdata for all of the observed countries, we can compare the accuracy of our estimates from the grouped data with the more representative microdata estimates. The findings of our study reveal that the MQD method performs as well as the ML method for both decile and quintile data. However, the MQD method exhibits greater robustness and lower sensitivity to starting values, as supported by a simulation study we conducted. The Dagum and the Singh–Maddala distributions are outperformed by the GB2 in terms of absolute differences between the estimated parametric quantiles and their observed nonparametric counterparts. We note that the better performance of the GB2 sometimes comes at the cost of significant empirical and analytical complexity [see also (Bandourian et al., 2003)]. The Gini coefficient and the mean are best approximated with the Dagum distribution irrespective of the estimation technique, when evaluating the estimates based on absolute error (difference between parametric estimates and estimates from the microdata).

This work is structured as follows. In Sect. 2.1, the MQD method is described. Section 2.2 outlines briefly the ML technique. In Sect. 2.3, the GB2, the Dagum and the Singh–Maddala distributions are defined. Simulation results are shown in Sect. 3. The data being used and the empirical results are discussed in Sect. 4. Finally, we summarize and make some concluding remarks in Sect. 5.

2 Methodology

Let N be the number of income quantiles available for a given country and let \({\textbf{q}} = (q(u_1), \cdots , q(u_N))^{\top }\) be an N-vector of sample quantiles with q(u) denoting the uth quantile and \(0< u_1< \cdots< u_N < 1\).

2.1 The Minimum Quantile Distance Method

Assuming that given data comes from a specific parametric distribution, one can represent the observed income quantiles parametrically with the quantile function of the assumed distribution. Then the representative parametric distribution can be estimated by minimizing the distance between the observed income quantiles and their parametric counterparts. This method was applied and proved to be consistent, asymptotically normal and robust against gross errors under the regularity conditions specified by LaRiccia and Wehrly (1985).

Let \( {\textbf{Q}}(\theta ) = (Q(u_{i}; \theta ))^{N}_{i=1}\) be an N-vector of theoretical quantiles of a given parametric distribution and \(\theta \) the vector of the parameters of the considered distribution. Following LaRiccia and Wehrly (1985), the minimum quantile distance estimator is given by

$$\begin{aligned} {\hat{\theta }} = \mathop {\textrm{argmin}}\limits _{\theta } \{{\textbf{q}} - {\textbf{Q}}(\theta )\}^{\top } {\textbf{H}}(\theta )\{{\textbf{q}} - {\textbf{Q}}(\theta )\}, \end{aligned}$$
(1)

where \({\textbf{q}}\) is an N-vector of sample quantiles as defined above.

\({\textbf{H}}(\theta )\) is the optimal weighting matrix defined as

$$\begin{aligned} {\textbf{H}}(\theta ) = {\textbf{D}}(\theta ) {\textbf{V}}^{-1} {\textbf{D}}(\theta ), \end{aligned}$$
(2)

which is the inverse of the asymptotic covariance matrix of \(\sqrt{N}({\textbf{q}} - {\textbf{Q}}(\theta ))\) and \({\textbf{V}}^{-1}\) is the inverse of the matrix \({\textbf{V}}\) defined as

$$\begin{aligned} {\textbf{V}} = \{\min (u_i, u_j) - u_iu_j\}_{N \times N} \end{aligned}$$
(3)

and

$$\begin{aligned} {\textbf{D}}(\theta ) = \text {diag}[fQ(u_1; \theta ), \cdots , fQ(u_N; \theta )], \end{aligned}$$
(4)

with \(fQ(u; \theta ) = f[Q(u; \theta ); \theta ]\) being the density-quantile function defined in LaRiccia and Wehrly (1985) and Parzen (1979).
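To make Eqs. 1–4 concrete, the following is a minimal sketch of the MQD objective, not the authors' R implementation. It assumes the Dagum special case (GB2 with \(q = 1\)), for which the quantile and density functions are available in closed form, and uses only the Python standard library. The names `dagum_quantile`, `dagum_density`, `solve` and `mqd_objective` are hypothetical helpers introduced for illustration. Because \({\textbf{D}}(\theta )\) is diagonal, the quadratic form can be evaluated as \({\textbf{w}}^{\top }{\textbf{V}}^{-1}{\textbf{w}}\) with \({\textbf{w}} = {\textbf{D}}(\theta )\{{\textbf{q}} - {\textbf{Q}}(\theta )\}\).

```python
def dagum_quantile(u, a, b, p):
    """Closed-form Dagum quantile: Q(u) = b * (u**(-1/p) - 1)**(-1/a)."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def dagum_density(x, a, b, p):
    """Dagum density, i.e. the GB2 density of Eq. 7 with q = 1."""
    z = (x / b) ** a
    return a * p * z ** p / (x * (1.0 + z) ** (p + 1.0))


def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def mqd_objective(theta, u, q_obs):
    """Quadratic form of Eq. 1 for the Dagum distribution (illustrative)."""
    a, b, p = theta
    Q = [dagum_quantile(ui, a, b, p) for ui in u]
    # w = D(theta) (q - Q(theta)); D holds the density-quantile values fQ(u_i)
    w = [dagum_density(Qi, a, b, p) * (qi - Qi) for qi, Qi in zip(q_obs, Q)]
    # V = {min(u_i, u_j) - u_i u_j} as in Eq. 3
    V = [[min(ui, uj) - ui * uj for uj in u] for ui in u]
    z = solve(V, w)  # z = V^{-1} w
    return sum(wi * zi for wi, zi in zip(w, z))


# Illustration: deciles generated from a known Dagum distribution
u = [i / 10 for i in range(1, 10)]
theta_true = (3.03, 21.71, 1.35)
q_obs = [dagum_quantile(ui, *theta_true) for ui in u]
print(mqd_objective(theta_true, u, q_obs))          # zero at the generating parameters
print(mqd_objective((3.03, 25.0, 1.35), u, q_obs))  # positive away from them
```

In practice the objective would be passed to a numerical optimizer over \(\theta \); the choice of optimizer and starting values is left open here.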

2.2 Maximum Likelihood Estimation

Let \(n_j\) denote the number of observations in the jth income group, let the cumulative number of observations up to the ith group be \( \displaystyle s_i = \sum _{j=1}^{i} n_j \) with \(i = 1,..., N\), and let \(s = s_{N+1}\) be the total number of observations.

Having the information on the income quantiles and the corresponding number of observations for each income group, we could use the maximum likelihood estimation technique. Following Eckernkemper and Gribisch (2021, Equations (4)–(6)) and Nishino and Kakamu (2011), we obtain the likelihood from a joint distribution of order statistics

$$\begin{aligned} \begin{aligned} L(\theta ) =&s! \frac{(F(q(u_1); \theta ))^{s_1 - 1}}{(s_1 - 1)!}f(q(u_1); \theta ) \\&\times \biggl \{ \prod _{i=2}^{N} \frac{(F(q(u_i); \theta ) - F(q(u_{i-1}); \theta ))^{s_i - s_{i-1} - 1}}{(s_i - s_{i-1} - 1)!}f(q(u_i); \theta ) \biggr \}\\&\times \frac{(1 - F(q(u_N); \theta ))^{s - s_N}}{(s - s_N)!} \end{aligned} \end{aligned}$$
(5)

Taking logarithms of Eq. 5, we obtain the log-likelihood

$$\begin{aligned} \begin{aligned} \text {log} L(\theta ) = C&+ \sum _{i=1}^{N} \text {log}(f(q(u_i); \theta )) + (s_1 - 1) \text {log} (F(q(u_1); \theta )) \\&+ \sum _{i=2}^{N} (s_i - s_{i-1} - 1) \text {log} [F(q(u_i); \theta ) - F(q(u_{i-1}); \theta )] \\&+ (s - s_N) \text {log} (1-F(q(u_N); \theta )), \end{aligned} \end{aligned}$$
(6)

where F is a cumulative distribution function of the considered parametric distribution, f the respective density function, \(\theta \) the vector of the parameters of the considered distribution and \(q(u_i)\) is the \(u_i\)th sample quantile as defined above.
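The log-likelihood in Eq. 6 can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the authors' code: it uses the Dagum distribution (closed-form cdf and density), drops the constant C, and takes the cumulative counts as \(s_i = u_i s\) for decile data, which is an assumption made here for the example. The function names are hypothetical.

```python
import math


def dagum_cdf(x, a, b, p):
    """Dagum cdf: F(x) = (1 + (x/b)**(-a))**(-p)."""
    return (1.0 + (x / b) ** (-a)) ** (-p)


def dagum_pdf(x, a, b, p):
    """Dagum density, i.e. the GB2 density with q = 1."""
    z = (x / b) ** a
    return a * p * z ** p / (x * (1.0 + z) ** (p + 1.0))


def dagum_quantile(u, a, b, p):
    """Dagum quantile function, used here only to build example data."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def grouped_loglik(theta, q_obs, s_cum, s_total):
    """Log-likelihood of Eq. 6 without the constant C (illustrative)."""
    a, b, p = theta
    F = [dagum_cdf(x, a, b, p) for x in q_obs]
    # density contributions of the observed quantiles
    ll = sum(math.log(dagum_pdf(x, a, b, p)) for x in q_obs)
    # observations below the first quantile
    ll += (s_cum[0] - 1) * math.log(F[0])
    # observations strictly between consecutive quantiles
    for i in range(1, len(q_obs)):
        ll += (s_cum[i] - s_cum[i - 1] - 1) * math.log(F[i] - F[i - 1])
    # observations above the last quantile
    ll += (s_total - s_cum[-1]) * math.log(1.0 - F[-1])
    return ll


# Illustration with deciles from a known Dagum distribution and s = 5000
theta_true = (3.03, 21.71, 1.35)
s_total = 5000
s_cum = [500 * i for i in range(1, 10)]  # assumed: s_i = u_i * s
q_obs = [dagum_quantile(i / 10, *theta_true) for i in range(1, 10)]
ll_true = grouped_loglik(theta_true, q_obs, s_cum, s_total)
ll_shift = grouped_loglik((3.03, 1.2 * 21.71, 1.35), q_obs, s_cum, s_total)
print(ll_true > ll_shift)
```

Maximizing this function over \(\theta \) with a numerical optimizer gives the ML estimate; as with any such routine, the result may depend on the starting values.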

2.3 The GB2 Distribution

The GB2 was introduced by McDonald (1984) and is acknowledged to perform very well when estimating income distributions [see (Kleiber & Kotz, 2003; Jenkins, 2009; Chotikapanich et al., 2018)]. It is a four-parameter distribution, and we will denote it as \(GB2(\theta )\), where \(\theta \) is the quadruple (a, b, p, q). Its density is

$$\begin{aligned} f(x; \theta ) = \frac{ax^{ap-1}}{b^{ap}B(p, q)[1 + (x/b)^a]^{p+q}}, \quad x > 0, \end{aligned}$$
(7)

where a, b, p and q are positive and \(B(p, q) = \displaystyle \int _{0}^{1}t^{p-1}(1-t)^{q-1}dt\) is the beta function. When \(\theta \) is obvious in the context, we write only f(x).

The cumulative distribution function (cdf) is given by

$$\begin{aligned} F(x; \theta ) = B \left( \frac{(x/b)^a}{1+(x/b)^a}; p, q \right) , \quad x > 0, \end{aligned}$$
(8)

where \( B(\nu ; p, q) = \displaystyle \int _{0}^{\nu }t^{p-1}(1-t)^{q-1}dt/B(p, q) \) is the incomplete beta function ratio with \(\nu = \frac{(x/b)^a}{1+(x/b)^a}\). \( B(\nu ; p, q) \) is commonly available as a readily computed function in statistical software.

The quantile function is given by (Chotikapanich et al., 2018)

$$\begin{aligned} Q(u; \theta ) = b \left( \frac{B^{-1}(u; p, q)}{1-B^{-1}(u; p, q)}\right) ^{1/a}, \quad 0< u < 1, \end{aligned}$$
(9)

where \(B^{-1} (u; p, q)\) is the quantile function of the standardized beta distribution evaluated at u.

The density-quantile function is a basic object in quantile-based methodology. It is obtained by substituting the quantile function (Eq. 9) into the density function (Eq. 7), that is, by evaluating the density at the quantile. For the GB2 distribution the density-quantile function is given by

$$\begin{aligned} fQ(u; \theta ) =\frac{a \left( \frac{B^{-1}(u; p, q)}{1 - B^{-1}(u; p, q)}\right) ^{(ap-1)/a}}{b B(p, q) \left( 1 + \frac{B^{-1}(u; p, q)}{1 - B^{-1}(u; p, q)}\right) ^{p + q}} \end{aligned}$$
(10)

where \(B^{-1} (u; p, q)\) is the quantile function of the standardized beta distribution evaluated at u and B(p, q) is the beta function.
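The substitution behind Eq. 10 can be written out step by step. The shorthand \(r(u) = B^{-1}(u; p, q)/[1 - B^{-1}(u; p, q)]\) and \(y = B^{-1}(u; p, q)\) is introduced here only for this sketch. Since \(Q(u; \theta ) = b\, r(u)^{1/a}\), we have \((Q(u; \theta )/b)^a = r(u)\), and inserting this into Eq. 7 gives

$$\begin{aligned} \begin{aligned} fQ(u; \theta )&= \frac{a\, [b\, r(u)^{1/a}]^{ap-1}}{b^{ap} B(p, q) [1 + r(u)]^{p+q}} = \frac{a\, r(u)^{(ap-1)/a}}{b\, B(p, q) [1 + r(u)]^{p+q}} \\&= \frac{a}{b\, B(p, q)}\, y^{p - 1/a} (1 - y)^{q + 1/a}, \end{aligned} \end{aligned}$$

where the last equality uses \(1 + r(u) = 1/(1 - y)\). The compact last form may be convenient numerically because it needs only one evaluation of the beta quantile function per \(u\).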

The moment distribution function for the kth moment is given by

$$\begin{aligned} F_k(x; \theta ) = B\left( \frac{(x/b)^a}{1+(x/b)^a}; p+k/a, q- k/a \right) , \end{aligned}$$
(11)

where \(B(\nu ; p+k/a, q- k/a )\) is the incomplete beta function ratio defined as above with \(\nu = \frac{(x/b)^a}{1+(x/b)^a}\).

The \(k\)th moment is given by

$$\begin{aligned} \mu ^{(k)} = \frac{b^k B(p + k/a, q - k/a)}{B(p, q)}, \quad -ap< k < aq. \end{aligned}$$
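Using the identity \(B(x, y) = \Gamma (x)\Gamma (y)/\Gamma (x+y)\), the \(k\)th GB2 moment equals \(b^k \Gamma (p + k/a)\Gamma (q - k/a)/[\Gamma (p)\Gamma (q)]\), which exists for \(-ap< k < aq\). The sketch below evaluates it on the log scale with the Python standard library; the function name `gb2_moment` is a hypothetical helper, not part of the authors' R code.

```python
import math


def gb2_moment(k, a, b, p, q):
    """k-th moment of GB2(a, b, p, q), computed via log-gamma functions."""
    if not (-a * p < k < a * q):
        raise ValueError("moment exists only for -a*p < k < a*q")
    log_m = (k * math.log(b)
             + math.lgamma(p + k / a) + math.lgamma(q - k / a)
             - math.lgamma(p) - math.lgamma(q))
    return math.exp(log_m)


# Mean (k = 1) for the Austrian parameter setting quoted in the simulation study
print(gb2_moment(1, 3.03, 21.71, 1.35, 1.61))
```

As a sanity check, with \(a = b = p = 1\) and \(q = 2\) the density reduces to \(2/(1+x)^3\), whose mean is 1, and the function returns that value.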

The Gini coefficient was provided by McDonald (1984) and is given by

$$\begin{aligned} G = \frac{B(2q - 1/a, 2p + 1/a)}{B(p, q)B(p + 1/a, q - 1/a)}\left( \frac{1}{p}J^{(1)} - \frac{1}{p+1/a}J^{(2)}\right) \end{aligned}$$
(12)

where

$$\begin{aligned} \begin{aligned} J^{(1)}&= {}_{3}F_2\left[ 1, p +q, 2p + \frac{1}{a}; p +1, 2(p+q); 1\right] , \\ J^{(2)}&= {}_{3}F_2\left[ 1, p +q, 2p + \frac{1}{a}; p +\frac{1}{a} + 1, 2(p+q); 1\right] , \end{aligned} \end{aligned}$$

where \({}_{3}F_2\) is the generalized hypergeometric function.

Among the special cases of the GB2 distribution are the Dagum distribution (\(q = 1\)) and the Singh–Maddala distribution (\(p = 1\)). These are three-parameter distributions, and the functions describing them are available in closed form. We provide the moments, Gini, density, cdf, quantile, density-quantile and moment distribution functions for the Dagum and the Singh–Maddala distributions in Table 1.

Table 1 Singh-Maddala and Dagum distributions characteristics
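As an illustration of why these special cases are convenient, the sketch below codes the standard closed-form cdf, quantile function and Gini coefficient that follow from setting \(q = 1\) (Dagum) or \(p = 1\) (Singh–Maddala) in the GB2 expressions. It is not a reproduction of Table 1; the function names are hypothetical, and the Gini formulas assume that the mean exists (\(a > 1\) for the Dagum and \(aq > 1\) for the Singh–Maddala distribution).

```python
import math


def dagum_cdf(x, a, b, p):
    return (1.0 + (x / b) ** (-a)) ** (-p)


def dagum_quantile(u, a, b, p):
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def dagum_gini(a, p):
    """Gini = Gamma(p) Gamma(2p + 1/a) / (Gamma(2p) Gamma(p + 1/a)) - 1."""
    log_ratio = (math.lgamma(p) + math.lgamma(2 * p + 1 / a)
                 - math.lgamma(2 * p) - math.lgamma(p + 1 / a))
    return math.exp(log_ratio) - 1.0


def sm_cdf(x, a, b, q):
    return 1.0 - (1.0 + (x / b) ** a) ** (-q)


def sm_quantile(u, a, b, q):
    return b * ((1.0 - u) ** (-1.0 / q) - 1.0) ** (1.0 / a)


def sm_gini(a, q):
    """Gini = 1 - Gamma(q) Gamma(2q - 1/a) / (Gamma(q - 1/a) Gamma(2q))."""
    log_ratio = (math.lgamma(q) + math.lgamma(2 * q - 1 / a)
                 - math.lgamma(q - 1 / a) - math.lgamma(2 * q))
    return 1.0 - math.exp(log_ratio)


# Gini coefficients for the parameter settings used in the simulation study
print(dagum_gini(3.03, 1.35))
print(sm_gini(3.03, 1.61))
```

Both families reduce to the log-logistic (Fisk) distribution when the remaining shape parameter equals one, in which case both Gini functions return \(1/a\); this gives a quick consistency check.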

3 Simulation Results

To compare the effectiveness of the MQD method (Sect. 2.1) with that of the ML method (Sect. 2.2), we perform a simulation study in which the “true” distribution is known. The data were simulated from a GB2 distribution with parameter settings derived from our estimates obtained from income data of Austria \((a = 3.03, b = 21.71, p = 1.35, q = 1.61)\), from a Dagum distribution with parameters \((a = 3.03, b = 21.71, p = 1.35)\) and from a Singh–Maddala distribution with parameters \((a = 3.03, b = 21.71, q = 1.61)\). For every distribution, we simulate \(k = 5{,}000\) and \(k = 10{,}000\) observations in each trial and repeat this process for a total of \(K = 500\) trials. Subsequently, we establish \(N = 9\) and \(N = 4\) group income boundaries based on the respective quantiles of the simulated data. For each of the K data sets, we estimate the parameters of the underlying GB2, Dagum and Singh–Maddala distributions using the two estimation methods.
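One trial of such a simulation can be sketched as follows. This is an illustrative Python version for the Dagum case only, not the authors' R code: observations are drawn by inverse-transform sampling, and the group boundaries are taken as order statistics of the sorted sample. The order-statistic convention and the function names are assumptions made for this example.

```python
import random


def dagum_quantile(u, a, b, p):
    """Closed-form Dagum quantile function."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def draw_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def simulate_group_bounds(k, n_groups, theta, rng):
    """Simulate k Dagum incomes and return the n_groups - 1 group boundaries."""
    a, b, p = theta
    sample = sorted(dagum_quantile(draw_uniform(rng), a, b, p) for _ in range(k))
    # boundary i is the order statistic with rank i * k / n_groups (assumed convention)
    return [sample[(i * k) // n_groups - 1] for i in range(1, n_groups)]


theta = (3.03, 21.71, 1.35)
rng = random.Random(2011)  # fixed seed for reproducibility
deciles = simulate_group_bounds(5000, 10, theta, rng)    # N = 9 boundaries
quintiles = simulate_group_bounds(5000, 5, theta, rng)   # N = 4 boundaries
print(deciles)
print(quintiles)
```

Repeating this for \(K\) trials and applying both estimators to each set of boundaries yields the parameter estimates from which the mean squared errors are computed.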

Table 2 presents the Mean Squared Error (MSE) results for the given parameter settings of the considered case, which were obtained from 500 independently simulated data sets.

Table 2 Mean squared errors for estimated distribution parameters (simulation results)

The MSEs decrease with increasing sample size, indicating the consistency of the estimates. For the Dagum and the Singh–Maddala distributions, the MSEs of the parameter estimates are consistently small and differ only negligibly between the MQD and ML methods. However, the GB2 distribution estimated with the ML method has much larger MSEs than the estimates computed with the MQD method, which reflects that the MQD method is more robust and less sensitive to starting values.

Notably, the disparities in the estimates of the mean, the median and the Gini coefficient between the grouped and raw data are minimal. This implies that grouping the data only modestly increases estimation uncertainty for the income distribution and related metrics such as the Gini coefficient. This finding carries substantial implications given the prevalent use of grouped data in international income analysis. It challenges the common assumption that grouped data entail significant statistical limitations compared to raw data.

4 Applications to Income Data

We use income deciles data for 30 European countries for the year 2010. The income we use is equivalized disposable income in purchasing power parities and has been scaled by a thousand (the given income divided by 1000). Table 4 in Appendix A. Tables shows a complete list of the countries used with their country codes and names as given in EU-SILC. The average sample size is 7836. The used income deciles along with the mean incomes and Gini coefficients for each country are provided in Table 5 in Appendix A. Tables.

We have estimated the income deciles directly from the cross-sectional microdata set “EUSILC UDB 2011 version 2 of August 2013”, Eurostat (2011). This cross-sectional data set is part of the EU-SILC data which provides representative data on income, poverty, social exclusion and living conditions for most of the European countries. The EU-SILC data for each country is provided to Eurostat by the relevant national statistical offices which collect the data according to the methodology suggested by Eurostat. We provide more computational details and the full R code for replicating the results in Appendix C. code.

Our estimates are based on nine income deciles [\(q(u_1), q(u_2),..., q(u_8), q(u_9)\)] and income quintiles [\(q(u_2), q(u_4), q(u_6), q(u_8)\)]. Table 3 provides the absolute differences between the empirical estimates for the Gini coefficients, the mean, the observed quantiles and their parametric counterparts approximated using the suggested parametric model. The columns “quantiles” provide the average absolute difference for all the estimated quantiles and countries. The GB2 estimates provide the smallest absolute differences for both the MQD and ML methods. However, the GB2 is very sensitive to the starting values of the computation algorithm, especially for the ML method. For some of the countries we had to adjust the starting values to achieve convergence of the algorithm. In terms of mean and Gini coefficients, the Dagum distribution provides the smallest differences between observed and estimated values.

Table 3 Absolute error of the estimates (difference between parametric estimates and estimates from the microdata)

Figure 1 displays the empirical quantiles plotted against the theoretical quantile functions aggregated together for all the considered countries. The theoretical quantiles are estimated with the MQD and the ML methods using deciles (\(N = 9\)) grouped data. It is confirmed that the GB2 distribution provides the best estimates also for the distribution tails.

Fig. 1 Q-Q plots of observed vs. estimated quantiles with MQD and MLE (all countries)

Figure 2 shows boxplots of the differences between the estimated and the observed Gini index for all the considered countries, methods and distributions using deciles (\(N = 9\)) grouped data. The GB2 estimates have the largest median (\(\approx 0.009\)) and deviation from the observed values and thus confirm the results in Table 3. The differences from the observed Gini coefficients lie in the interval \([-0.02; 0.035]\). The Dagum distribution provides the best estimate with the smallest median (\(\approx 0.006\) when estimated with the MQD method).

Fig. 2 Differences between estimated and observed Gini index (estimates from deciles (\(N = 9\)))

5 Conclusion and Further Research

Considering the importance of accurately estimating inequality, and given that only sparse income data are still available for many countries, it is crucial to find a well-performing method for estimating income distributions. This work proposes a method for estimating the income distribution when only quantile data are available. We suggested the MQD method and applied it to the GB2, Dagum and Singh–Maddala distributions. We use decile (\(N=9\)) and quintile (\(N=4\)) grouped data as input for 30 European countries. We note that the absolute differences between the parametric estimates and their nonparametric counterparts estimated from the microdata are preserved when we use quintiles instead of deciles. These results are confirmed by a simulation study. Further, we note that the MQD method is more robust than the ML technique in terms of starting values for the underlying computation algorithm, especially for the GB2 distribution. The Dagum and the Singh–Maddala distributions are outperformed by the GB2 in terms of absolute differences between the estimated parametric quantiles and their observed counterparts. The Gini coefficient and the mean are best approximated with the Dagum distribution irrespective of the estimation technique.

One of the potential computational challenges associated with the methodology used in this article relates to the starting values for the underlying computation algorithm. During our simulation study, we faced difficulties when the starting value for the scale parameter b differed from the mean of the considered data. Therefore, we always used the mean of the given data as the initial value. An interesting extension of the study could investigate the impact on accuracy when the mean of the underlying data is not available as a starting point for estimation. This would involve exploring alternative measures or techniques that can serve as effective substitutes for the mean in the estimation process.