Minimum Wasserstein Distance Estimator Under Finite Location-Scale Mixtures

Chapter in: Advances and Innovations in Statistics and Data Science

Part of the book series: ICSA Book Series in Statistics (ICSABSS)

Abstract

When a population exhibits heterogeneity, we often model it via a finite mixture: we decompose it into several distinct but homogeneous subpopulations. Contemporary practice favors learning mixtures by maximizing the likelihood, for its statistical efficiency and the convenient EM algorithm for numerical computation. Yet the maximum likelihood estimate (MLE) is not well defined for finite location-scale mixtures in general. We hence investigate feasible alternatives to the MLE, such as minimum distance estimators. Recently, the Wasserstein distance has drawn increased attention in the machine learning community. It has an intuitive geometric interpretation and has been successfully employed in many new applications. Do we gain anything by learning finite location-scale mixtures via a minimum Wasserstein distance estimator (MWDE)? This chapter investigates this possibility in several respects. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We study its robustness against outliers and mild model mis-specifications. Our moderate-scale simulation study shows that the MWDE generally suffers some efficiency loss relative to a penalized version of the MLE, without a noticeable gain in robustness. We reaffirm the general superiority of likelihood-based learning strategies even for non-regular finite location-scale mixtures.

Acknowledgements

The authors would like to thank Richard Schonberg for proofreading the manuscript.

Author information

Correspondence to Jiahua Chen.

Appendix

Numerically Friendly Expression of W_2(F_N, F(⋅|G))

To learn the finite mixture distribution through the MWDE, we must compute

$$\displaystyle \begin{aligned} \mathbb{W}_{N}(G) =W_2^2(F_N(\cdot), F(\cdot|G)) = \int_{0}^{1} \{ F_N^{-1}(t) - F^{-1}(t|G)\} ^2 dt \end{aligned}$$

for the finite location-scale mixture

$$\displaystyle \begin{aligned} F(x |G) = \sum_{k=1}^K w_k F(x| \boldsymbol{\theta}_k) = \sum_{k=1}^K w_k F_0( (x - \mu_k)/\sigma_k). \end{aligned}$$

We write \({\mathbb E}_k(\cdot )\) for the expectation under the distribution F(⋅|θ_k). For instance,

$$\displaystyle \begin{aligned} \mathbb{E}_k\{X^2\} = \mu_k^2 + \sigma_k^2(\mu_0^2+\sigma_0^2)+2\mu_k\sigma_k\mu_0. \end{aligned}$$
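
This follows from writing X = μ_k + σ_k Z with Z distributed according to F_0, where (assuming the chapter's earlier notation) μ_0 and σ_0^2 denote the mean and variance of F_0:

$$\displaystyle \begin{aligned} \mathbb{E}_k\{X^2\} = \mathbb{E}\{(\mu_k + \sigma_k Z)^2\} = \mu_k^2 + 2\mu_k\sigma_k\mathbb{E}\{Z\} + \sigma_k^2\mathbb{E}\{Z^2\} = \mu_k^2 + 2\mu_k\sigma_k\mu_0 + \sigma_k^2(\mu_0^2+\sigma_0^2). \end{aligned}$$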

Let I_n = ((n − 1)/N, n/N] for n = 1, 2, …, N so that \(F^{-1}_N (t) = x_{(n)}\) when t ∈ I_n, where x_{(n)} is the nth order statistic. For ease of notation, we write x_{(n)} as x_n. Over this interval, we have

$$\displaystyle \begin{aligned} \int_{I_n} \{ F^{-1}_N(t) - F^{-1}(t|G) \}^2 dt = \int_{I_n} [ x^2_n - 2 x_n F^{-1}(t|G) + \{ F^{-1}(t|G) \}^2 ] dt. \end{aligned} $$
(8)

The integration of the first term in (8), after summing over n, is given by

$$\displaystyle \begin{aligned} \sum_{n=1}^N \int_{I_n} x_n^2 dt =N^{-1} \sum_n x_n^2=\overline{x^2}. \end{aligned}$$

The integration of the third term in (8) is

$$\displaystyle \begin{aligned} \sum_{n=1}^N \int_{I_n} \{ F^{-1}(t|G) \}^2 dt = \int_{-\infty}^{\infty} x^2 f(x|G) dx = \sum_{k=1}^{K} w_k \mathbb{E}_k \{X^2\}. \end{aligned}$$

Let ξ_0 = −∞, ξ_{N+1} = ∞, and ξ_n = F^{-1}(n/N|G) for n = 1, …, N. Denote

$$\displaystyle \begin{aligned} \Delta F_{nk} = F(\xi_{n}| \boldsymbol{\theta}_k) - F(\xi_{n-1}| \boldsymbol{\theta}_k) \end{aligned}$$

and

$$\displaystyle \begin{aligned} T(x) = \int_{-\infty}^x t f_0(t) dt, ~~~ \Delta T_{nk} = T((\xi_{n}-\mu_k)/\sigma_k) - T((\xi_{n-1}-\mu_k)/\sigma_k). \end{aligned}$$
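
As a concrete special case (not spelled out in this appendix), if F_0 is the standard normal distribution with density φ, then T has a closed form, since (d/dx){−φ(x)} = xφ(x):

$$\displaystyle \begin{aligned} T(x) = \int_{-\infty}^x t \phi(t) dt = -\phi(x), ~~~ \Delta T_{nk} = \phi((\xi_{n-1}-\mu_k)/\sigma_k) - \phi((\xi_{n}-\mu_k)/\sigma_k). \end{aligned}$$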

Then, by the change of variables t = F(x|G) on each I_n,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{I_n} F^{-1}(t|G)dt & = \sum_k w_k \int_{\xi_{n-1}}^{\xi_{n}} x f(x|\mu_k,\sigma_k) dx \\ & = \sum_k w_k \{ \mu_k \Delta F_{nk} + \sigma_k \Delta T_{nk} \}. \end{array} \end{aligned} $$

These lead to the numerically convenient expression

$$\displaystyle \begin{aligned} \mathbb{W}_{N}(G) = \overline{x^2}+ \sum_k w_k {\mathbb E}_{k}\{ X^2\} - 2\sum_{n=1}^N x_n \sum_k w_k \{ \mu_k \Delta F_{nk} + \sigma_k \Delta T_{nk} \}. \end{aligned}$$
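
To make the computation concrete, the following is a minimal Python sketch of this expression for the case of a standard normal base distribution (so μ_0 = 0, σ_0 = 1, and T(x) = −φ(x)). The function names (mixture_cdf, mixture_quantile, wasserstein_objective) are ours rather than the chapter's, and the mixture quantiles ξ_n are located by root-finding on the bracket supplied by Lemma 2 below.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mixture_cdf(x, w, mu, sigma):
    """CDF of the location-scale mixture F(x|G) with a standard normal base."""
    return np.sum(w * norm.cdf((x - mu) / sigma))

def mixture_quantile(t, w, mu, sigma):
    """F^{-1}(t|G): bracket by the component quantiles (Lemma 2), then root-find."""
    q = norm.ppf(t) * sigma + mu                  # component t-quantiles xi_k(t)
    lo, hi = float(np.min(q)), float(np.max(q))
    if np.isclose(lo, hi):
        return lo
    return brentq(lambda x: mixture_cdf(x, w, mu, sigma) - t, lo, hi)

def wasserstein_objective(x, w, mu, sigma):
    """W_N(G) via the numerically friendly expression (normal base: mu0=0, sigma0=1)."""
    x = np.sort(np.asarray(x, dtype=float))
    N = x.size
    # xi_0 = -inf, xi_n = F^{-1}(n/N|G) for n = 1, ..., N-1, xi_N = +inf
    xi = np.empty(N + 1)
    xi[0], xi[N] = -np.inf, np.inf
    for n in range(1, N):
        xi[n] = mixture_quantile(n / N, w, mu, sigma)
    z = (xi[:, None] - mu[None, :]) / sigma[None, :]
    dF = np.diff(norm.cdf(z), axis=0)             # Delta F_{nk}
    dT = np.diff(-norm.pdf(z), axis=0)            # Delta T_{nk}, using T(z) = -phi(z)
    second_moment = np.sum(w * (mu ** 2 + sigma ** 2))   # sum_k w_k E_k{X^2}
    cross = np.sum(x[:, None] * (dF * (w * mu) + dT * (w * sigma)))
    return np.mean(x ** 2) + second_moment - 2.0 * cross

# Example: evaluate the objective at a candidate two-component normal mixture
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 2, 50)])
w, mu, sigma = np.array([0.75, 0.25]), np.array([-2.0, 3.0]), np.array([1.0, 2.0])
print(wasserstein_objective(data, w, mu, sigma))
```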

To use the BFGS algorithm most effectively, it is best to provide the gradient of the objective function. Below are numerically friendly expressions for some of the partial derivatives.

Lemma 1

Let δ_{jk} = 1 when j = k and δ_{jk} = 0 when j ≠ k. For n = 1, …, N and j = 1, 2, …, K, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial w_j} F(\xi_n| \boldsymbol{\theta}_k) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k)\frac{\partial \xi_n}{\partial w_j},\\ \frac{\partial}{\partial \mu_j} F(\xi_n| \boldsymbol{\theta}_k)& =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k)\left (\frac{\partial \xi_n}{\partial \mu_j}-\delta_{jk}\right ), \\ \frac{\partial}{\partial \sigma_j} F(\xi_n| \boldsymbol{\theta}_k) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \Big (\frac{\partial\xi_n}{\partial\sigma_j} -\left \{\frac{\xi_n-\mu_k}{\sigma_k}\right \}\delta_{jk}\Big ), \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial w_j}T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \frac{\partial\xi_n}{\partial w_j}, \\ \frac{\partial}{\partial \mu_j}T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \left ( \frac{\partial\xi_n}{\partial \mu_j}-\delta_{jk}\right ), \\ \frac{\partial}{\partial \sigma_j} T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \left \{ \frac{\partial\xi_n}{\partial\sigma_j} - \left ( \frac{\xi_n-\mu_k}{\sigma_k} \right ) \delta_{jk} \right \}. \end{array} \end{aligned} $$

Furthermore, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial\xi_n}{\partial \mu_k} & =&\displaystyle \frac{w_k f(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}, \\ \frac{\partial\xi_n}{\partial \sigma_k} & =&\displaystyle \frac{w_k f(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}\left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) ,\\ \frac{\partial\xi_n}{\partial w_k} & =&\displaystyle - \frac{ F(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}. \end{array} \end{aligned} $$
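
These identities follow from implicitly differentiating the defining equation F(ξ_n|G) = n/N, treating the weights as free parameters. For example, differentiating with respect to w_k gives

$$\displaystyle \begin{aligned} \frac{\partial}{\partial w_k} F(\xi_n|G) = F(\xi_n|\boldsymbol{\theta}_k) + f(\xi_n|G) \frac{\partial \xi_n}{\partial w_k} = 0, \end{aligned}$$

which yields the last expression; the derivatives with respect to μ_k and σ_k follow from the same argument.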

Based on this lemma, it is seen that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \mu_j}\mathbb{W}_N& =&\displaystyle 2w_j(\mu_j+\sigma_j\mu_0) - 2w_j\sum_{n=1}^N x_{(n)}\Delta F_{nj} \\ & &\displaystyle - 2 \sum_{n=1}^{N}\sum_{k} w_k \mu_k x_{(n)} \left \{ \frac{\partial F(\xi_n| \boldsymbol{\theta}_k)}{\partial\mu_j} - \frac{\partial F(\xi_{n-1} | \boldsymbol{\theta}_k)}{\partial\mu_j} \right \} \\ & &\displaystyle -2\sum_{n=1}^{N}\sum_{k} w_k\sigma_k x_{(n)} \frac{\partial }{\partial \mu_j} \left \{ T \left ( \frac{\xi_{n}-\mu_k}{\sigma_k} \right ) - T \left ( \frac{\xi_{n-1}-\mu_k}{\sigma_k} \right ) \right \} \end{array} \end{aligned} $$

with F(ξ_0|θ_k) = 0, F(ξ_{N+1}|θ_k) = 1, \(T \big ( \frac {\xi _{0}-\mu _k}{\sigma _k} \big )=0\), and \(T\big (\frac {\xi _{N+1}-\mu _k}{\sigma _k} \big )=\int _{-\infty }^{\infty } tf_0(t)dt\), a constant that does not depend on any parameters. Substituting the partial derivatives in Lemma 1, we then get

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \mu_j}\mathbb{W}_N & =&\displaystyle 2w_j(\mu_j+\sigma_j\mu_0) -2w_j\sum_{n=1}^N x_{(n)}\Delta F_{nj}\\ & &\displaystyle -2\sum_{n=1}^{N-1}x_{(n)}\xi_n\sum_{k}w_k f(\xi_n|\mu_k,\sigma_k) \big(\frac{\partial\xi_n}{\partial\mu_j}-\delta_{jk}\big)\\ & &\displaystyle +2\sum_{n=1}^{N-1}x_{(n)}\xi_{n-1}\sum_{k}w_k f(\xi_{n-1}| \mu_k,\sigma_k) \big(\frac{\partial\xi_{n-1}}{\partial\mu_j}-\delta_{jk}\big)\\ & =&\displaystyle 2w_j\big\{\mu_j + \sigma_j\mu_0 - \sum_{n=1}^N x_{(n)}\Delta F_{nj}\big\}. \end{array} \end{aligned} $$

Similarly, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \sigma_j}\mathbb{W}_N & =&\displaystyle \ 2w_j\{\sigma_j(\mu_0^2+\sigma_0^2) + \mu_j\mu_0-\sum_{n=1}^N x_{(n)}\Delta T_{nj}\}, \\ \frac{\partial}{\partial w_k} \mathbb{W}_N & =&\displaystyle \ \{\mu_k^2 + \sigma_k^2(\mu_0^2+\sigma_0^2)+2\mu_k\sigma_k\mu_0\} - 2 \sum_{n=1}^{N-1} \{x_{(n+1)} - x_{(n)}\} \xi_n F(\xi_n|\boldsymbol{\theta}_k)\\ & &\displaystyle - 2\big \{ \mu_k \sum_{n=1}^{N} x_{(n)} \Delta F_{nk} + \sigma_k \sum_{n=1}^{N} x_{(n)}\Delta T_{nk} \big \}. \end{array} \end{aligned} $$
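
For illustration, here is a minimal sketch of these simplified gradients in the standard normal base case (μ_0 = 0, σ_0^2 = 1). The name wasserstein_gradient is ours, and the sketch reuses mixture_quantile from the earlier code block; in practice one would typically reparameterize the weights and scales (for example, softmax weights and log scales, though the chapter's exact scheme is not shown here) before handing the objective and gradient to an unconstrained quasi-Newton routine such as scipy.optimize.minimize with method='BFGS'.

```python
import numpy as np
from scipy.stats import norm

def wasserstein_gradient(x, w, mu, sigma):
    """Partial derivatives of W_N(G) w.r.t. (w, mu, sigma), standard normal base.

    Uses the simplified expressions above; mixture_quantile() is assumed to be
    defined as in the earlier sketch.
    """
    x = np.sort(np.asarray(x, dtype=float))
    N = x.size
    xi = np.empty(N + 1)
    xi[0], xi[N] = -np.inf, np.inf
    for n in range(1, N):
        xi[n] = mixture_quantile(n / N, w, mu, sigma)
    z = (xi[:, None] - mu[None, :]) / sigma[None, :]
    F = norm.cdf(z)                          # F(xi_n | theta_k)
    dF = np.diff(F, axis=0)                  # Delta F_{nk}
    dT = np.diff(-norm.pdf(z), axis=0)       # Delta T_{nk} with T(z) = -phi(z)
    sF = x @ dF                              # sum_n x_(n) Delta F_{nk}, per component
    sT = x @ dT                              # sum_n x_(n) Delta T_{nk}, per component
    grad_mu = 2.0 * w * (mu - sF)
    grad_sigma = 2.0 * w * (sigma - sT)
    # sum_{n=1}^{N-1} (x_(n+1) - x_(n)) * xi_n * F(xi_n | theta_k)
    mid = np.sum(np.diff(x)[:, None] * xi[1:N, None] * F[1:N, :], axis=0)
    grad_w = (mu ** 2 + sigma ** 2) - 2.0 * mid - 2.0 * (mu * sF + sigma * sT)
    return grad_w, grad_mu, grad_sigma
```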

Computing the quantiles of the mixture distribution F(⋅|G) for each G is one of the most demanding tasks. The property stated in the following lemma allows us to develop a bisection algorithm.

Lemma 2

Let \(F(x| G)=\sum _{k=1}^K w_k F(x|\mu _k,\sigma _k)\) be a K-component mixture, and let ξ(t) = F^{-1}(t|G) and ξ_k(t) = F^{-1}(t|θ_k) be the t-quantiles of the mixture and of its kth subpopulation, respectively. For any t ∈ (0, 1),

$$\displaystyle \begin{aligned} \min_{k} \xi_{k}(t)\leq \xi(t) \leq \max_{k} \xi_{k}(t). \end{aligned} $$
(9)

Proof

Since F(x|θ_k) is a continuous CDF, we must have F(ξ_k(t)|θ_k) = t. By the monotonicity of the CDF F(⋅|θ_k), we have

$$\displaystyle \begin{aligned} F(\min_{k}\xi_{k}(t)| \boldsymbol{\theta}_k) \leq F(\xi_{k}(t)| \boldsymbol{\theta}_k) \leq F(\max_{k} \xi_{k}(t)| \boldsymbol{\theta}_k ). \end{aligned}$$

Multiplying by w k and summing over k lead to

$$\displaystyle \begin{aligned} F(\min_{k}\xi_{k}(t)| G)\leq t\leq F(\max_{k} \xi_{k}(t)| G). \end{aligned}$$

This implies (9) and completes the proof. □

In view of this lemma, we can easily compute the quantiles of F(⋅|θ_k) to form an interval containing the target quantile of F(⋅|G). We can then quickly find the value of F^{-1}(t|G) through a bisection algorithm.
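
A direct bisection version of this search, using the bracket supplied by Lemma 2, might look as follows (again a sketch for the standard normal base; the earlier code block used scipy's brentq on the same bracket instead):

```python
import numpy as np
from scipy.stats import norm

def mixture_quantile_bisection(t, w, mu, sigma, tol=1e-10, max_iter=200):
    """F^{-1}(t|G) by bisection on [min_k xi_k(t), max_k xi_k(t)] (Lemma 2)."""
    q = norm.ppf(t) * sigma + mu              # component t-quantiles xi_k(t)
    lo, hi = float(np.min(q)), float(np.max(q))
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if np.sum(w * norm.cdf((mid - mu) / sigma)) < t:
            lo = mid                          # F(mid|G) < t: quantile lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```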

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Zhang, Q., Chen, J. (2022). Minimum Wasserstein Distance Estimator Under Finite Location-Scale Mixtures. In: He, W., Wang, L., Chen, J., Lin, C.D. (eds) Advances and Innovations in Statistics and Data Science. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-031-08329-7_4
