Estimating Income Distributions From Grouped Data: A Minimum Quantile Distance Approach

Spasova, Tsvetana

doi:10.1007/s10614-023-10505-0

Open access
Published: 15 November 2023


Computational Economics

Tsvetana Spasova (ORCID: orcid.org/0000-0002-7416-4492)

Abstract

This paper focuses on the estimation of income distribution from grouped data in the form of quantiles. We propose a novel application of the minimum quantile distance (MQD) approach and compare its performance with the maximum likelihood (ML) technique. The estimation methods are applied using three parametric distributions: the generalized beta distribution of the second kind (GB2), the Dagum distribution, and the Singh–Maddala distribution. We provide the density-quantile functions for these distributions, along with reproducible R code. A simulation study is conducted to evaluate the performance of the MQD and ML methods. The proposed methods are then applied to data from 30 European countries, utilizing the aforementioned parametric distributions. To validate the accuracy of the estimates, we compare them with estimates obtained from more detailed and informative microdata sets. The findings confirm the excellent performance of the considered parametric distributions in estimating income distribution. Additionally, the MQD approach is identified as a straightforward and reliable method for this purpose. Notably, the MQD method displays superior robustness in comparison to the ML technique when it comes to selecting suitable starting values for the underlying computation algorithm, specifically when dealing with the GB2 distribution.

1 Introduction

Accurate estimation of the income distribution is essential for measuring inequality and poverty and, more generally, for comparing welfare across space and time. An overview of the literature on modeling income distributions, estimation methods and distribution specifications is available in the book by Kleiber and Kotz (2003) and in the papers collected in Chotikapanich (2008).

When individual income data are available, estimating the income distribution is straightforward. Very often, however, the available income data are scarce, especially for many developing countries, which hampers the derivation of representative income distribution models and inequality statistics. Frequently the income data are available only in grouped form, for example as income deciles or income shares, mean incomes and Gini coefficients. The World Income Inequality Database (WIID), the World Inequality Database (WID) and the World Bank are among the largest providers of grouped income data. For smaller areas, however, the data may be provided only as income quantiles because of personal data privacy and the proximity of the areas considered; an example is the household income data at local level provided by the French National Institute of Statistics and Economic Studies (INSEE). This paper focuses on estimating the income distribution using only quantile income data and aims to determine a method suitable for such data.

Various approaches have been used to model grouped income data, depending on the data available. Two main strategies have been developed: nonparametric techniques, for instance employing a nonparametric kernel density function (Sala-i-Martin, 2006), and parametric techniques, which assume that the income distribution follows a parametric model. Parametric models have been shown to perform very well when estimating income distributions and inequality measures (Chotikapanich et al., 2007) and even to outperform nonparametric techniques (Minoiu & Reddy, 2014; Jordá et al., 2021).

For parametric modeling, it is crucial to choose a reliable estimation technique and a suitable parametric distribution model. In addition, the estimation technique has to be adjusted to the type of grouped data, usually grouped data with fixed bounds and random cell size or grouped data with fixed cell size and random bounds. Among the most common estimation techniques is maximum likelihood based on sample proportions using a multinomial likelihood function (see, for example, McDonald, 1984; Jöhnk & Niermann, 2002; Bandourian et al., 2003; Chotikapanich et al., 2018). Eckernkemper and Gribisch (2021) propose a general framework for ML and Bayesian estimation based on grouped data information accounting for known and unknown group boundaries. Another widely used technique is the method-of-moments approach, in which population and income shares are matched to their theoretical counterparts. Chotikapanich et al. (2007, 2012) apply it to the beta-2 distribution using population shares and class mean income data. Hajargasht et al. (2012) extended the work of Chotikapanich et al. (2007, 2012) to a generalized method-of-moments (GMM) approach and provided inference for the estimated distributions. Further, Graf and Nedyalkova (2014) suggest minimizing the distance between a set of income indicators and their parametric representations, and Hajargasht and Griffiths (2020) suggest minimum distance estimation of parametric Lorenz curves based on grouped data information.

In this work, we suggest the minimum quantile distance (MQD) method, which is designed especially for quantile data (grouped data with fixed bounds). As mentioned above, such data may be the only grouped data available (for example, the data from INSEE). Assuming that the income distribution of a country can be modeled with a specific parametric distribution, we estimate the income distribution of each observed country by minimizing the distance between the empirical estimates of the country’s income quantiles and their parametric representations. We compare our estimates with the estimates obtained with an ML method. Finally, we verify the results by comparing them with representative microdata.

Some of the earliest research work introducing the minimum quantile distance approach was done by Aitchison and Brown (1957) who applied the method to the log-normal distribution. After Parzen (1979) introduced the density-quantile function, LaRiccia and Wehrly (1985) showed the asymptotic properties of a family of minimum quantile distance estimators and applied it to the three-parameter log-normal distribution. Carmody et al. (1984) applied it to the three-parameter Weibull distribution. Jöhnk and Niermann (2002) compare it with other methods employing the Weibull distribution.

In the present study, we contribute to the literature by examining the performance of the MQD method applied to the generalized beta distribution of the second kind (GB2), which is the most widely used distribution in recent studies on income distribution (Chotikapanich et al., 2018), the Dagum (1977) distribution and the Singh–Maddala distribution. We provide the density-quantile functions for the considered distributions and reproducible R code (R Core Team, 2022). Further, we compare the MQD method with the ML method. We estimate the income distribution of 30 European countries using data on their income deciles and quintiles. We use data from Eurostat, namely the European Union Statistics on Income and Living Conditions (EU-SILC 2011) data. Because we have microdata for all of the observed countries, we can compare the accuracy of our estimates from the grouped data with the more representative microdata estimates. The findings of our study reveal that the MQD method performs as well as the ML method for both decile and quintile data. However, the MQD method exhibits greater robustness and lower sensitivity to starting values, as supported by a simulation study we conducted. The Dagum and the Singh–Maddala distributions are outperformed by the GB2 in terms of absolute differences between the estimated parametric quantiles and their observed nonparametric counterparts. We note that the better performance of the GB2 sometimes comes at the cost of significant empirical and analytical complexity (see also Bandourian et al., 2003). The Gini coefficient and the mean are best approximated with the Dagum distribution irrespective of the estimation technique, when the estimates are evaluated by absolute error (the difference between the parametric estimates and the estimates from the microdata).

This work is structured as follows. In Sect. 2.1, the MQD method is described. Section 2.2 outlines briefly the ML technique. In Sect. 2.3, the GB2, the Dagum and the Singh–Maddala distributions are defined. Simulation results are shown in Sect. 3. The data being used and the empirical results are discussed in Sect. 4. Finally, we summarize and make some concluding remarks in Sect. 5.

2 Methodology

Let N be the number of income quantiles available for a given country and let ${\textbf{q}} = (q(u_1), \cdots , q(u_N))^{\top }$ be an N-vector of sample quantiles with q(u) denoting the uth quantile and $0< u_1< \cdots< u_N < 1$.

2.1 The Minimum Quantile Distance Method

Assuming that given data comes from a specific parametric distribution, one can represent the observed income quantiles parametrically with the quantile function of the assumed distribution. Then the representative parametric distribution can be estimated by minimizing the distance between the observed income quantiles and their parametric counterparts. This method was applied and proved to be consistent, asymptotically normal and robust against gross errors under the regularity conditions specified by LaRiccia and Wehrly (1985).

Let $ {\textbf{Q}}(\theta ) = (Q(u_{i}; \theta ))^{N}_{i=1}$ be an N-vector of theoretical quantiles of a given parametric distribution and $\theta $ the vector of parameters of the considered distribution. Following LaRiccia and Wehrly (1985), the minimum quantile distance estimator is given by

$$\begin{aligned} {\hat{\theta }} = \mathop {\textrm{argmin}}\limits _{\theta } \{{\textbf{q}} - {\textbf{Q}}(\theta )\}^{\top } {\textbf{H}}(\theta )\{{\textbf{q}} - {\textbf{Q}}(\theta )\}, \end{aligned}$$

(1)

where ${\textbf{q}}$ is an N-vector of sample quantiles as defined above.

${\textbf{H}}(\theta )$ is the optimal weighting matrix defined as

$$\begin{aligned} {\textbf{H}}(\theta ) = {\textbf{D}}(\theta ) {\textbf{V}}^{-1} {\textbf{D}}(\theta ), \end{aligned}$$

(2)

which is the inverse of the asymptotic covariance matrix of $\sqrt{N}({\textbf{q}} - {\textbf{Q}}(\theta ))$ and ${\textbf{V}}^{-1}$ is the inverse of the matrix ${\textbf{V}}$ defined as

$$\begin{aligned} {\textbf{V}} = \{\min (u_i, u_j) - u_iu_j\}_{N \times N} \end{aligned}$$

(3)

and

$$\begin{aligned} {\textbf{D}}(\theta ) = \text {diag}[fQ(u_1; \theta ), \cdots , fQ(u_N; \theta )], \end{aligned}$$

(4)

with $fQ(u; \theta ) = f[Q(u; \theta ); \theta ]$ being the density-quantile function defined in LaRiccia and Wehrly (1985) and Parzen (1979).
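The quadratic form in Eqs. 1 to 4 can be sketched in a few lines. The snippet below is only an illustration, not the article's R implementation: the function names `bridge_cov`, `solve` and `mqd_objective` are hypothetical, and the model's quantile and density-quantile functions are passed in as callables. It evaluates $\{{\textbf{q}} - {\textbf{Q}}\}^{\top }{\textbf{D}}{\textbf{V}}^{-1}{\textbf{D}}\{{\textbf{q}} - {\textbf{Q}}\}$ by solving a linear system instead of forming ${\textbf{V}}^{-1}$ explicitly.

```python
def bridge_cov(u):
    """V from Eq. 3: V[i][j] = min(u_i, u_j) - u_i * u_j."""
    return [[min(ui, uj) - ui * uj for uj in u] for ui in u]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def mqd_objective(q_obs, u, quantile_fn, dens_quantile_fn):
    """Value of the criterion in Eq. 1 with H = D V^{-1} D (Eqs. 2-4).

    quantile_fn(u)      -> Q(u; theta)   (supplied by the caller)
    dens_quantile_fn(u) -> fQ(u; theta)  (supplied by the caller)
    """
    # z = D (q - Q): residuals scaled by the density-quantile function
    z = [dens_quantile_fn(ui) * (qi - quantile_fn(ui))
         for qi, ui in zip(q_obs, u)]
    y = solve(bridge_cov(u), z)  # y = V^{-1} z
    return sum(zi * yi for zi, yi in zip(z, y))
```

The criterion is zero when the observed quantiles coincide with the model quantiles and positive otherwise, because ${\textbf{V}}$ is positive definite for distinct $u_i$ in (0, 1).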

2.2 Maximum Likelihood Estimation

Let $c_j$ denote the number of observations in the $j$th income group, let $ \displaystyle s_i = \sum _{j=1}^{i} c_j $ with $i = 1, \ldots , N$ be the cumulative number of observations up to group $i$, and let $s = s_{N+1}$ be the total number of observations.

Having the information on the income quantiles and the corresponding number of observations for each income group, we could use the maximum likelihood estimation technique. Following Eckernkemper and Gribisch (2021, Equations (4)–(6)) and Nishino and Kakamu (2011), we obtain the likelihood from a joint distribution of order statistics

$$\begin{aligned} \begin{aligned} L(\theta ) =&s! \frac{(F(q(u_1); \theta ))^{s_1 - 1}}{(s_1 - 1)!}f(q(u_1); \theta ) \\&\times \biggl \{ \prod _{i=2}^{N} \frac{(F(q(u_i); \theta ) - F(q(u_{i-1}); \theta ))^{s_i - s_{i-1} - 1}}{(s_i - s_{i-1} - 1)!}f(q(u_i); \theta ) \biggr \}\\&\times \frac{(1 - F(q(u_N); \theta ))^{s - s_N}}{(s - s_N)!} \end{aligned} \end{aligned}$$

(5)

Taking logarithms of Eq. 5, we obtain the log-likelihood

$$\begin{aligned} \begin{aligned} \text {log} L(\theta ) = C&+ \sum _{i=1}^{N} \text {log}(f(q(u_i); \theta )) + (s_1 - 1) \text {log} (F(q(u_1); \theta )) \\&+ \sum _{i=2}^{N} (s_i - s_{i-1} - 1) \text {log} [F(q(u_i); \theta ) - F(q(u_{i-1}); \theta )] \\&+ (s - s_N) \text {log} (1-F(q(u_N); \theta )), \end{aligned} \end{aligned}$$

(6)

where F is a cumulative distribution function of the considered parametric distribution, f the respective density function, $\theta $ the vector of the parameters of the considered distribution and $q(u_i)$ is the $u_i$th sample quantile as defined above.
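As a rough illustration of Eq. 6, the following sketch evaluates the log-likelihood without the constant C. It assumes that the quantile $q(u_i)$ is the order statistic of rank $s_i$, and it uses the Singh–Maddala distribution (the GB2 with $p = 1$, see Sect. 2.3) only because its cdf and density have closed forms. The function names are hypothetical and not taken from the article's code.

```python
import math


def sm_cdf(x, a, b, q):
    """Singh-Maddala cdf (GB2 with p = 1)."""
    return 1.0 - (1.0 + (x / b) ** a) ** (-q)


def sm_pdf(x, a, b, q):
    """Singh-Maddala density: Eq. 7 with p = 1, so that B(1, q) = 1/q."""
    return a * q * x ** (a - 1) / (b ** a * (1.0 + (x / b) ** a) ** (q + 1))


def grouped_loglik(quantiles, cum_counts, total, cdf, pdf):
    """Log-likelihood of Eq. 6 without the constant C.

    quantiles  : q(u_1) < ... < q(u_N)
    cum_counts : s_1 < ... < s_N, the ranks of those order statistics
    total      : s, the total number of observations
    cdf, pdf   : callables of x with the parameters already fixed
    """
    F = [cdf(x) for x in quantiles]
    ll = sum(math.log(pdf(x)) for x in quantiles)
    ll += (cum_counts[0] - 1) * math.log(F[0])
    for i in range(1, len(quantiles)):
        gap = cum_counts[i] - cum_counts[i - 1] - 1
        ll += gap * math.log(F[i] - F[i - 1])
    ll += (total - cum_counts[-1]) * math.log(1.0 - F[-1])
    return ll
```

A call such as `grouped_loglik(x, s, 1000, lambda t: sm_cdf(t, a, b, q), lambda t: sm_pdf(t, a, b, q))` can then be passed, with its sign flipped, to any numerical minimizer over (a, b, q).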

2.3 The GB2 Distribution

The GB2 was introduced by McDonald (1984) and is acknowledged to perform very well when estimating income distributions (see Kleiber & Kotz, 2003; Jenkins, 2009; Chotikapanich et al., 2018). It is a four-parameter distribution, and we will denote it as $GB2(\theta )$, where $\theta $ is the quadruple (a, b, p, q). Its density is

$$\begin{aligned} f(x; \theta ) = \frac{ax^{ap-1}}{b^{ap}B(p, q)[1 + (x/b)^a]^{p+q}}, \quad x > 0, \end{aligned}$$

(7)

where a, b, p and q are positive and $B(p, q) = \displaystyle \int _{0}^{1}t^{p-1}(1-t)^{q-1}dt$ is the beta function. When $\theta $ is obvious in the context, we write only f(x).
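For readers who want to evaluate Eq. 7 directly, here is a minimal sketch using only the Python standard library. Working on the log scale with `math.lgamma` is a design choice made here to avoid overflow for large arguments; the function names are hypothetical.

```python
import math


def log_beta(p, q):
    """log B(p, q) via log-gamma functions."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def gb2_pdf(x, a, b, p, q):
    """GB2 density of Eq. 7, evaluated on the log scale for stability."""
    if x <= 0:
        return 0.0
    log_w = a * (math.log(x) - math.log(b))  # log of (x/b)^a
    # log(1 + w), written so that exp() never overflows
    if log_w > 0:
        log1p_w = log_w + math.log1p(math.exp(-log_w))
    else:
        log1p_w = math.log1p(math.exp(log_w))
    log_f = (math.log(a) + (a * p - 1) * math.log(x) - a * p * math.log(b)
             - log_beta(p, q) - (p + q) * log1p_w)
    return math.exp(log_f)
```

Setting `p = 1` or `q = 1` in this function gives the Singh–Maddala and Dagum densities, respectively.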

The cumulative distribution function (cdf) is given by

$$\begin{aligned} F(x; \theta ) = B \left( \frac{(x/b)^a}{1+(x/b)^a}; p, q \right) , \quad x > 0, \end{aligned}$$

(8)

where $ B(\nu ; p, q) = \displaystyle \int _{0}^{\nu }t^{p-1}(1-t)^{q-1}dt/B(p, q) $ is the incomplete beta function ratio with $\nu = \frac{(x/b)^a}{1+(x/b)^a}$. $ B(\nu ; p, q) $ is commonly available as a built-in function in statistical software.

Following Chotikapanich et al. (2018), the quantile function is given by

$$\begin{aligned} Q(u; \theta ) = b \left( \frac{B^{-1}(u; p, q)}{1-B^{-1}(u; p, q)}\right) ^{1/a}, \quad 0< u < 1, \end{aligned}$$

(9)

where $B^{-1} (u; p, q)$ is the quantile function of the standardized beta distribution evaluated at u.

The density-quantile function is a basic object in quantile-based methodology. It is obtained by substituting the quantile function (Eq. 9) into the density function (Eq. 7). For the GB2 distribution the density-quantile function is given by

$$\begin{aligned} fQ(u; \theta ) =\frac{a \left( \frac{B^{-1}(u; p, q)}{1 - B^{-1}(u; p, q)}\right) ^{(ap-1)/a}}{b B(p, q) \left( 1 + \frac{B^{-1}(u; p, q)}{1 - B^{-1}(u; p, q)}\right) ^{p + q}} \end{aligned}$$

(10)

where $B^{-1} (u; p, q)$ is the quantile function of the standardized beta distribution evaluated at u and B(p, q) is the beta function.
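For the two three-parameter special cases considered in this paper, Eqs. 9 and 10 simplify because the beta quantile has a closed form: $B^{-1}(u; p, 1) = u^{1/p}$ for the Dagum case ($q = 1$) and $B^{-1}(u; 1, q) = 1 - (1-u)^{1/q}$ for the Singh–Maddala case ($p = 1$). The sketch below writes out the resulting expressions, as worked out here from Eqs. 7, 9 and 10 rather than copied from the article's tables, so that they can be checked numerically against $f[Q(u)]$. All function names are hypothetical.

```python
def dagum_quantile(u, a, b, p):
    """Eq. 9 with q = 1: Q(u) = b (u^(-1/p) - 1)^(-1/a)."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def dagum_pdf(x, a, b, p):
    """Eq. 7 with q = 1, using B(p, 1) = 1/p."""
    return a * p * x ** (a * p - 1) / (b ** (a * p) * (1.0 + (x / b) ** a) ** (p + 1))


def dagum_dens_quantile(u, a, b, p):
    """Eq. 10 with q = 1: (a p / b) u^(1 - 1/(a p)) (1 - u^(1/p))^(1 + 1/a)."""
    return (a * p / b) * u ** (1.0 - 1.0 / (a * p)) * (1.0 - u ** (1.0 / p)) ** (1.0 + 1.0 / a)


def sm_quantile(u, a, b, q):
    """Eq. 9 with p = 1: Q(u) = b ((1 - u)^(-1/q) - 1)^(1/a)."""
    return b * ((1.0 - u) ** (-1.0 / q) - 1.0) ** (1.0 / a)


def sm_pdf(x, a, b, q):
    """Eq. 7 with p = 1, using B(1, q) = 1/q."""
    return a * q * x ** (a - 1) / (b ** a * (1.0 + (x / b) ** a) ** (q + 1))


def sm_dens_quantile(u, a, b, q):
    """Eq. 10 with p = 1: (a q / b) ((1-u)^(-1/q) - 1)^(1 - 1/a) (1-u)^(1 + 1/q)."""
    w = (1.0 - u) ** (-1.0 / q) - 1.0
    return (a * q / b) * w ** (1.0 - 1.0 / a) * (1.0 - u) ** (1.0 + 1.0 / q)
```

Comparing `dagum_pdf(dagum_quantile(u, ...), ...)` with `dagum_dens_quantile(u, ...)` on a grid of `u` values is a quick consistency check before the functions are used inside the MQD criterion.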

The moment distribution function for the kth moment is given by

$$\begin{aligned} F_k(x; \theta ) = B\left( \frac{(x/b)^a}{1+(x/b)^a}; p+k/a, q- k/a \right) , \end{aligned}$$

(11)

where $B(\nu ; p+k/a, q- k/a )$ is the incomplete beta function ratio defined as above with $\nu = \frac{(x/b)^a}{1+(x/b)^a}$.

The $k$th moment, which exists for $-ap< k < aq$, is given by

$$\begin{aligned} \mu ^{(k)} = \frac{b^k B(p + k/a, q - k/a)}{B(p, q)}. \end{aligned}$$
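A small numerical sketch of the moment formula follows. It reads the formula as $b^k B(p + k/a, q - k/a)/B(p, q)$, which is consistent with the moment distribution function in Eq. 11, and expresses the ratio of beta functions through log-gamma functions. The function name and the explicit existence check are choices made here.

```python
import math


def gb2_moment(k, a, b, p, q):
    """k-th GB2 moment b^k B(p + k/a, q - k/a) / B(p, q).

    The moment is finite only for -a*p < k < a*q.
    """
    if not (-a * p < k < a * q):
        raise ValueError("moment does not exist for these parameters")
    # B(p + k/a, q - k/a) / B(p, q) = G(p + k/a) G(q - k/a) / (G(p) G(q)),
    # because both beta functions share the factor 1 / G(p + q).
    log_m = (k * math.log(b)
             + math.lgamma(p + k / a) + math.lgamma(q - k / a)
             - math.lgamma(p) - math.lgamma(q))
    return math.exp(log_m)
```

With `k = 1` this gives the mean implied by a fitted parameter vector, which can be compared with the sample mean as an informal check of the fit.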

The Gini coefficient was provided by McDonald (1984) and is given by

$$\begin{aligned} G = \frac{B(2q - 1/a, 2p + 1/a)}{B(p, q)B(p + 1/a, q - 1/a)}\left( \frac{1}{p}J^{(1)} - \frac{1}{p+1/a}J^{(2)}\right) \end{aligned}$$

(12)

where

$$\begin{aligned} \begin{aligned} J^{(1)}&= {}_{3}F_2\left[ 1, p +q, 2p + \frac{1}{a}; p +1, 2(p+q); 1\right] , \\ J^{(2)}&= {}_{3}F_2\left[ 1, p +q, 2p + \frac{1}{a}; p +\frac{1}{a} + 1, 2(p+q); 1\right] , \end{aligned} \end{aligned}$$

where ${}_{3}F_2$ is the generalized hypergeometric function.

Among the special cases of the GB2 distribution are the Dagum distribution ($q = 1$) and the Singh–Maddala distribution ($p = 1$). These are three-parameter distributions, and the functions describing them are available in closed form. We provide the moments, Gini, density, cdf, quantile, density-quantile and moment distribution functions for the Dagum and the Singh–Maddala distributions in Table 1.
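For these two special cases the Gini coefficient no longer requires the hypergeometric function of Eq. 12. The sketch below uses the standard closed forms from the size-distribution literature (for example Kleiber & Kotz, 2003), quoted here from that literature rather than from Table 1. The function names are hypothetical, and the stated parameter restrictions are the conditions for a finite mean.

```python
import math


def dagum_gini(a, p):
    """Dagum Gini: G(p) G(2p + 1/a) / (G(2p) G(p + 1/a)) - 1, for a > 1."""
    if a <= 1:
        raise ValueError("the Dagum mean (and Gini) requires a > 1")
    log_ratio = (math.lgamma(p) + math.lgamma(2 * p + 1 / a)
                 - math.lgamma(2 * p) - math.lgamma(p + 1 / a))
    return math.exp(log_ratio) - 1.0


def sm_gini(a, q):
    """Singh-Maddala Gini: 1 - G(q) G(2q - 1/a) / (G(q - 1/a) G(2q)), for a*q > 1."""
    if a * q <= 1:
        raise ValueError("the Singh-Maddala mean (and Gini) requires a*q > 1")
    log_ratio = (math.lgamma(q) + math.lgamma(2 * q - 1 / a)
                 - math.lgamma(q - 1 / a) - math.lgamma(2 * q))
    return 1.0 - math.exp(log_ratio)
```

Both expressions reduce to $1/a$ when the remaining shape parameter equals one, which is the Gini coefficient of the log-logistic (Fisk) distribution and serves as a simple sanity check.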

Table 1 Singh-Maddala and Dagum distributions characteristics

3 Simulation Results

To assess the performance of the MQD method relative to the ML method described in Sect. 2.2, we perform a simulation study in which the “true” distribution is assumed to be known. The data were simulated from a GB2 distribution with parameter values taken from our estimates for Austrian income data $(a = 3.03, b = 21.71, p = 1.35, q = 1.61)$, from a Dagum distribution with parameters $(a = 3.03, b = 21.71, p = 1.35)$ and from a Singh–Maddala distribution with parameters $(a = 3.03, b = 21.71, q = 1.61)$. The MQD and ML methods were applied as described in Sects. 2.1 and 2.2, respectively. For every distribution, we simulate $k = 5{,}000$ and $k = 10{,}000$ observations in each trial and repeat this process for a total of $K = 500$ trials. Subsequently, we establish $N = 9$ and $N = 4$ group income boundaries based on the respective quantiles of the simulated data. For each of the K data sets, we estimate the parameters of the underlying GB2, Dagum and Singh–Maddala distributions using the two estimation methods.
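One trial of such a design can be sketched as follows for the Dagum case. This is an illustration under stated assumptions, not the article's simulation code: sampling uses the inverse-transform method with the Dagum quantile function, the sample quantiles are taken as the order statistics of rank $i \cdot k / (N+1)$, and all function names are hypothetical.

```python
import math
import random


def dagum_quantile(u, a, b, p):
    """Dagum quantile b (u^(-1/p) - 1)^(-1/a); expm1 keeps precision near u = 1."""
    return b * math.expm1(-math.log(u) / p) ** (-1.0 / a)


def simulate_dagum(n, a, b, p, rng):
    """Inverse-transform sampling: X = Q(U) with U ~ Uniform(0, 1)."""
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # Q(0) is undefined; skip the (very rare) zero draw
            draws.append(dagum_quantile(u, a, b, p))
    return draws


def empirical_quantiles(sample, groups):
    """Group boundaries for `groups` equally sized groups.

    Returns the order statistics of rank i * n // groups, i = 1..groups-1,
    together with those ranks (the cumulative counts of Sect. 2.2).
    """
    xs = sorted(sample)
    n = len(xs)
    ranks = [i * n // groups for i in range(1, groups)]
    return [xs[r - 1] for r in ranks], ranks


# One trial with the Dagum parameter values quoted in the text
rng = random.Random(2023)  # the seed is arbitrary
sample = simulate_dagum(5000, 3.03, 21.71, 1.35, rng)
deciles, ranks = empirical_quantiles(sample, 10)   # N = 9 boundaries
quintiles, _ = empirical_quantiles(sample, 5)      # N = 4 boundaries
```

Repeating the last four lines K times with different seeds, and fitting each set of boundaries, would give the estimates from which mean squared errors are computed.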

Table 2 presents the Mean Squared Error (MSE) results for the given parameter settings of the considered case, which were obtained from 500 independently simulated data sets.

Table 2 Mean squared errors for estimated distribution parameters (simulation results)

The MSEs decrease with increasing sample size, which is in line with the consistency of the estimators. For the Dagum and the Singh–Maddala distributions, the MSEs of the parameter estimates are consistently small and differ only negligibly between the MQD and ML methods. For the GB2 distribution, however, the ML estimates have a much larger MSE than the MQD estimates, which reflects that the MQD method is more robust and less sensitive to starting values.

Notably, the disparities in the estimates of the mean, the median and the Gini coefficient between the grouped and raw data are minimal. This implies that grouping the data only modestly affects the estimation uncertainty for the income distribution and related metrics such as the Gini coefficient. This finding carries substantial implications given the prevalent use of grouped data in international income analysis. It challenges the common assumption that grouped data entail significant statistical limitations compared to raw data.

4 Applications to Income Data

We use income decile data for 30 European countries for the year 2010. The income measure is equivalized disposable income in purchasing power parities, scaled by a thousand (the given income divided by 1000). Table 4 in Appendix A lists the countries used with their country codes and names as given in EU-SILC. The average sample size is 7836. The income deciles used, along with the mean incomes and Gini coefficients for each country, are provided in Table 5 in Appendix A.

We have estimated the income deciles directly from the cross-sectional microdata set “EUSILC UDB 2011 version 2 of August 2013” (Eurostat, 2011). This cross-sectional data set is part of the EU-SILC data, which provide representative data on income, poverty, social exclusion and living conditions for most European countries. The EU-SILC data for each country are provided to Eurostat by the relevant national statistical offices, which collect the data according to the methodology suggested by Eurostat. We provide more computational details and the full R code for replicating the results in Appendix B.

Our estimates are based on nine income deciles [$q(u_1), q(u_2),..., q(u_8), q(u_9)$] and on income quintiles [$q(u_2), q(u_4), q(u_6), q(u_8)$]. Table 3 provides the absolute differences between the empirical estimates for the Gini coefficients, the mean, the observed quantiles and their parametric counterparts approximated using the suggested parametric model. The columns “quantiles” provide the average absolute difference over all the estimated quantiles and countries. The GB2 estimates provide the smallest absolute differences for both the MQD and the ML method. However, the GB2 is very sensitive to the starting values of the computation algorithm, especially for the ML method. For some of the countries we had to adjust the starting values to achieve convergence of the algorithm. In terms of the mean and the Gini coefficient, the Dagum distribution provides the smallest differences between observed and estimated values.

Table 3 Absolute error of the estimates (difference between parametric estimates and estimates from the microdata)

Figure 1 displays the empirical quantiles plotted against the theoretical quantile functions aggregated together for all the considered countries. The theoretical quantiles are estimated with the MQD and the ML methods using deciles ($N = 9$) grouped data. It is confirmed that the GB2 distribution provides the best estimates also for the distribution tails.

Figure 2 shows boxplots of the differences between the estimated and the observed Gini index for all the considered countries, methods and distributions using deciles ($N = 9$) grouped data. The GB2 estimates have the largest median ($\approx 0.009$) and deviation from the observed values and thus confirm the results in Table 3. The differences from the observed Gini coefficients lie in the interval $[-0.02, 0.035]$. The Dagum distribution provides the best estimate with the smallest median ($\approx 0.006$, estimated with the MQD method).

5 Conclusion and Further Research

Given the importance of accurately estimating inequality, and given that only sparse income data are still available for many countries, it is crucial to find a well-performing method for estimating income distributions. This work proposes a method for estimating the income distribution when only quantile data are available. We suggested the MQD method and applied it to the GB2, Dagum and Singh–Maddala distributions. We use decile ($N=9$) and quintile ($N=4$) grouped data for 30 European countries as input. We note that the absolute differences between the parametric estimates and their nonparametric counterparts estimated from the microdata are preserved when we use quintiles instead of deciles. These results are confirmed by a simulation study. Further, we note that the MQD method is more robust than the ML technique in terms of starting values for the underlying computation algorithm, especially for the GB2 distribution. The Dagum and the Singh–Maddala distributions are outperformed by the GB2 in terms of absolute differences between the estimated parametric quantiles and their observed counterparts. The Gini coefficient and the mean are best approximated with the Dagum distribution irrespective of the estimation technique.

One of the potential computational challenges associated with the methodology used in this article relates to the starting values for the underlying computation algorithm. During our simulation study, we faced difficulties when the starting value for the scale parameter b differed from the mean of the considered data. Therefore, we always used the mean of the given data as the initial value. An interesting extension of the study could investigate the impact on accuracy when the mean of the underlying data is not available as a starting point for estimation. This would involve exploring alternative measures or techniques that can serve as effective substitutes for the mean in the estimation process.

Data availability

The quantile data generated and analysed with the MQD method in this work are available in Table 5 in the current article. These data are computed from the microdata set “EUSILC UDB 2011 version 2 of August 2013” (Eurostat, 2011), which can be accessed as described in “How to Apply for Microdata Access?” (Eurostat, 2022). Due to the confidential nature of the data, we cannot grant access to the microdata. However, we provide R code for replicating the main results of this work in Appendix B. The data necessary for the code are available in this article.

References

Aitchison, J., & Brown, J. A. C. (1957). The lognormal distribution. Cambridge University Press.
Bandourian, R., McDonald, J. B., & Turley, R. S. (2003). A comparison of parametric models of income distribution across countries and over time. Estadistica, 55, 135–152.
Carmody, T. J., Eubank, R. L., & LaRiccia, V. N. (1984). A family of minimum quantile distance estimators for the three-parameter Weibull distribution. Statistische Hefte, 25, 69–82.
Chotikapanich, D. (Ed.). (2008). Modeling income distributions and Lorenz curves. Springer.
Chotikapanich, D., Griffiths, W. E., Hajargasht, G., Karunarathne, W., & Prasada Rao, D. S. (2018). Using the GB2 income distribution. Econometrics, 6(2), 21.
Chotikapanich, D., Griffiths, W. E., & Prasada Rao, D. S. (2007). Estimating and combining national income distributions using limited data. Journal of Business and Economic Statistics, 25(1), 97–109.
Chotikapanich, D., Griffiths, W. E., Prasada Rao, D. S., & Valencia, V. (2012). Global income distributions and inequality, 1993 and 2000: Incorporating country-level inequality modeled with Beta distributions. The Review of Economics and Statistics, 94(1), 52–73.
Dagum, C. (1977). A new model of personal income distribution: Specification and estimation. Économie Appliquée, 30, 413–437.
Eckernkemper, T., & Gribisch, B. (2021). Classical and Bayesian inference for income distributions using grouped data. Oxford Bulletin of Economics and Statistics, 83(1), 0305–9049.
Eurostat. (2011). “EUSILC UDB 2011 version 2 of August 2013”. European Union Statistics on Income and Living Conditions.
Eurostat. (2022). “How to Apply for Microdata Access?”. European Union Statistics on Income and Living Conditions. https://ec.europa.eu/eurostat/documents/203647/771732/How_to_apply_for_microdata_access.pdf
Graf, M., & Nedyalkova, D. (2014). Modeling of personal income and indicators of poverty and social exclusion using the generalized beta distribution of the second kind. Review of Income and Wealth, 60(4), 821–842.
Hajargasht, G., & Griffiths, W. (2020). Minimum distance estimation of parametric Lorenz curves based on grouped data. Econometric Reviews, 39(4), 344–361.
Hajargasht, G., Griffiths, W., Brice, J., Prasada Rao, D. S., & Chotikapanich, D. (2012). Inference for income distributions using grouped data. Journal of Business and Economic Statistics, 30(4), 563–575.
Handcock, M. S. (2015). Relative distribution methods. Version 1.6–4. Project home page at http://www.stat.ucla.edu/~handcock/RelDist
Harrell Jr, F. E., with contributions from Charles Dupont and many others. (2015). Hmisc: Harrell Miscellaneous. R package version 3.17–0. http://CRAN.R-project.org/package=Hmisc
Jenkins, S. P. (2009). Distributionally-sensitive inequality indices and the GB2 income distribution. Review of Income and Wealth, 55, 392–398.
Jöhnk, M. D., & Niermann, S. (2002). Parameter estimation with grouped data according to the linearization method – A comparison with alternative approaches. Statistical Papers, 43(2), 237–255.
Jordá, V., Sarabia, J. M., & Jäntti, M. (2021). Inequality measurement with grouped data: Parametric and non-parametric methods. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(3), 964–984.
Kleiber, C., & Kotz, S. (2003). Statistical size distributions in economics and actuarial sciences. Wiley.
LaRiccia, V. N., & Wehrly, T. E. (1985). Asymptotic properties of a family of minimum quantile distance estimators. Journal of the American Statistical Association, 80(391), 742–747.
McDonald, J. B. (1984). Some generalized functions for the size distribution of income. Econometrica, 52(3), 647–665.
McDonald, J. B., & Ransom, M. R. (1979). Functional forms, estimation techniques and the distribution of income. Econometrica, 47(6), 1513–1525.
Minoiu, C., & Reddy, S. G. (2014). Kernel density estimation on grouped data: The case of poverty assessment. The Journal of Economic Inequality, 12, 163–189.
Nishino, H., & Kakamu, K. (2011). Grouped data estimation and testing of Gini coefficients using lognormal distributions. Sankhya B, 73, 193–210.
Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74(365), 105–121.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Sala-i-Martin, X. (2006). The world distribution of income: Falling poverty and... convergence, period. The Quarterly Journal of Economics, 121(2), 351–397.

Acknowledgements

The author thanks Eurostat for providing the data. Any views expressed in this article are those of the author, and do not necessarily reflect the official position of Eurostat, the European Commission or any of the national authorities whose data have been used. The author would also like to thank Christian Kleiber and Kurt Schmidheiny for their insightful comments and suggestions. Comments from two anonymous referees and the editor have led to substantial improvements in the paper. Financial support by WWZ-Forum (WWZ Förderverein) is gratefully acknowledged.

Funding

Open access funding provided by FHNW University of Applied Sciences and Arts Northwestern Switzerland.

Author information

Authors and Affiliations

School of Business, FHNW University of Applied Sciences and Arts Northwestern Switzerland, Riggenbachstrasse 16, 4600, Olten, Switzerland
Tsvetana Spasova

Corresponding author

Correspondence to Tsvetana Spasova.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Tables

Table 4 shows a complete list of the countries used in this work with their country codes and names as given in EU-SILC. The average sample size is 7,836. The population size of a country is computed as the sum of the product of the household size and the household weight.

Table 4 Country codes and names in EU-SILC

Table 5 provides the observed quantiles used in this work for estimating the corresponding distribution parameters with the MQD method. Table 5 also displays the observed mean incomes and Gini coefficients for each country. The empirical estimates called “observed” are computed from the microdata set “EUSILC UDB 2011 version 2 of August 2013”: the quantiles using the “quantile” option of the wtd.quantile function from the R package Hmisc (Harrell Jr et al., 2015) and the mean using the function weighted.mean (package stats). The income deciles and the mean values are given in thousands of purchasing power parities. The Gini coefficients are computed with the gini function from the R package reldist (Handcock, 2015) using the corresponding sample weights.

Table 5 Observed income deciles and mean

Appendix B: Code

In this appendix, the code used in this work for estimating the parameters a, b, p and q of the GB2, Dagum and Singh–Maddala distributions with the minimum quantile distance method is provided. To reduce precision loss (due to disproportionately large parameter values) in our computations, we scale the income deciles by 1,000 (the observed ones divided by 1,000). We perform the optimization of the minimum quantile distance estimator ${\hat{\theta }}$ (given in Eq. 1) with the statistical software R (R Core Team, 2022) using the function optim (from the R package stats). We employ the L-BFGS-B optimization method, a limited-memory variant of the BFGS quasi-Newton method that supports box constraints. It is crucial to set the starting value of the parameter b equal or close to the mean income (scaled by a thousand) of each country. Otherwise, the algorithm may not converge. Further, we set the initial values $a = 3$, $p = 1$, $q = 1$ for all the observed countries.
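The fitting step described above can be illustrated with a small self-contained sketch for the Dagum case. It is not the article's R code and differs from it in two labeled ways. First, a basic Nelder–Mead search on log-parameters is substituted for `optim` with L-BFGS-B, so that positivity holds without box constraints. Second, the quadratic form of Eq. 1 is evaluated through the tridiagonal inverse of ${\textbf{V}}$, which gives $\sum _{i=1}^{N+1}(z_i - z_{i-1})^2/(u_i - u_{i-1})$ with $z_i = fQ(u_i)\{q(u_i) - Q(u_i)\}$, $z_0 = z_{N+1} = 0$, $u_0 = 0$ and $u_{N+1} = 1$. All function names are hypothetical; the starting values follow the text (a = 3, p = 1, b = mean income).

```python
import math


def dagum_quantile(u, a, b, p):
    return b * math.expm1(-math.log(u) / p) ** (-1.0 / a)


def dagum_dens_quantile(u, a, b, p):
    return (a * p / b) * u ** (1 - 1 / (a * p)) * (1 - u ** (1 / p)) ** (1 + 1 / a)


def mqd_loss(log_theta, q_obs, u):
    """Eq. 1 for the Dagum model, with parameters on the log scale."""
    try:
        a, b, p = (math.exp(t) for t in log_theta)
        z = [0.0] + [dagum_dens_quantile(ui, a, b, p) * (qi - dagum_quantile(ui, a, b, p))
                     for qi, ui in zip(q_obs, u)] + [0.0]
        knots = [0.0] + list(u) + [1.0]
        # z' V^{-1} z written through the tridiagonal inverse of V (Eq. 3)
        val = sum((z[i] - z[i - 1]) ** 2 / (knots[i] - knots[i - 1])
                  for i in range(1, len(knots)))
    except (OverflowError, ZeroDivisionError, ValueError):
        return math.inf
    return val if math.isfinite(val) else math.inf


def nelder_mead(f, x0, step=0.2, max_iter=2000, tol=1e-12):
    """Basic Nelder-Mead simplex search (reflect, expand, contract, shrink)."""
    n = len(x0)
    simplex = [list(x0)] + [[x0[j] + (step if j == i else 0.0) for j in range(n)]
                            for i in range(n)]
    values = [f(x) for x in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break
        centroid = [sum(x[j] for x in simplex[:-1]) / n for j in range(n)]
        worst = simplex[-1]

        def along(t):  # centroid + t * (centroid - worst)
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        xr = along(1.0)
        fr = f(xr)
        if fr < values[0]:
            xe = along(2.0)
            fe = f(xe)
            simplex[-1], values[-1] = (xe, fe) if fe < fr else (xr, fr)
        elif fr < values[-2]:
            simplex[-1], values[-1] = xr, fr
        else:
            xc = along(-0.5)  # inside contraction
            fc = f(xc)
            if fc < values[-1]:
                simplex[-1], values[-1] = xc, fc
            else:  # shrink every vertex towards the best one
                best = simplex[0]
                simplex = [best] + [[(bj + xj) / 2 for bj, xj in zip(best, x)]
                                    for x in simplex[1:]]
                values = [values[0]] + [f(x) for x in simplex[1:]]
    i_best = min(range(n + 1), key=lambda i: values[i])
    return simplex[i_best], values[i_best]


def fit_dagum_mqd(q_obs, u, mean_income):
    """MQD fit of (a, b, p), started at a = 3, b = mean income, p = 1."""
    start = [math.log(3.0), math.log(mean_income), math.log(1.0)]
    sol, loss = nelder_mead(lambda t: mqd_loss(t, q_obs, u), start)
    return [math.exp(t) for t in sol], loss
```

A call such as `fit_dagum_mqd(deciles, [i / 10 for i in range(1, 10)], mean_income)` returns the fitted (a, b, p) and the attained value of the criterion. Because the search is derivative-free, it may need more iterations than a quasi-Newton method, and the result should be checked against different starting values.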

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Spasova, T. Estimating Income Distributions From Grouped Data: A Minimum Quantile Distance Approach. Comput Econ (2023). https://doi.org/10.1007/s10614-023-10505-0

Accepted: 17 October 2023
Published: 15 November 2023
DOI: https://doi.org/10.1007/s10614-023-10505-0
