Minimum Wasserstein Distance Estimator Under Finite Location-Scale Mixtures

Chapter in: Advances and Innovations in Statistics and Data Science

Part of the book series: ICSA Book Series in Statistics (ICSABSS)

Abstract

When a population exhibits heterogeneity, we often model it via a finite mixture: we decompose it into several distinct but homogeneous subpopulations. Contemporary practice favors learning mixtures by maximizing the likelihood, for its statistical efficiency and the convenient EM algorithm for numerical computation. Yet the maximum likelihood estimate (MLE) is not well defined for finite location-scale mixtures in general. We hence investigate feasible alternatives to the MLE, such as minimum distance estimators. Recently, the Wasserstein distance has drawn increased attention in the machine learning community. It has an intuitive geometric interpretation and has been successfully employed in many new applications. Do we gain anything by learning finite location-scale mixtures via a minimum Wasserstein distance estimator (MWDE)? This chapter investigates this possibility in several respects. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We study its robustness against outliers and mild model mis-specifications. Our moderate-scale simulation study shows that the MWDE generally suffers some efficiency loss relative to a penalized version of the MLE, without a noticeable gain in robustness. We reaffirm the general superiority of likelihood-based learning strategies even for non-regular finite location-scale mixtures.

Acknowledgements

The authors would like to thank Richard Schonberg for proofreading the manuscript.

Author information

Correspondence to Jiahua Chen.

Appendix

Numerically Friendly Expression of W_2(F_N, F(⋅|G))

To learn the finite mixture distribution through the MWDE, we must compute

$$\displaystyle \begin{aligned} \mathbb{W}_{N}(G) =W_2^2(F_N(\cdot), F(\cdot|G)) = \int_{0}^{1} \{ F_N^{-1}(t) - F^{-1}(t|G)\} ^2 dt \end{aligned}$$

for the finite location-scale mixture

$$\displaystyle \begin{aligned} F(x |G) = \sum_{k=1}^K w_k F(x| \boldsymbol{\theta}_k) = \sum_{k=1}^K w_k F_0( (x - \mu_k)/\sigma_k). \end{aligned}$$

We write \({\mathbb E}_k(\cdot )\) for the expectation under the distribution F(⋅|θ_k). For instance,

$$\displaystyle \begin{aligned} \mathbb{E}_k\{X^2\} = \mu_k^2 + \sigma_k^2(\mu_0^2+\sigma_0^2)+2\mu_k\sigma_k\mu_0. \end{aligned}$$
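
This follows from writing X = μ_k + σ_k Z with Z distributed according to F_0, where (assuming the chapter's earlier notation) μ_0 and σ_0^2 denote the mean and variance of F_0:

$$\displaystyle \begin{aligned} \mathbb{E}_k\{X^2\} = \mathbb{E}\{(\mu_k + \sigma_k Z)^2\} = \mu_k^2 + 2\mu_k\sigma_k\mathbb{E}\{Z\} + \sigma_k^2\mathbb{E}\{Z^2\} = \mu_k^2 + 2\mu_k\sigma_k\mu_0 + \sigma_k^2(\mu_0^2+\sigma_0^2). \end{aligned}$$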

Let I_n = ((n − 1)/N, n/N] for n = 1, 2, …, N so that \(F^{-1}_N (t) = x_{(n)}\) when t ∈ I_n, where x_{(n)} is the nth order statistic. For ease of notation, we write x_{(n)} as x_n. Over this interval, we have

$$\displaystyle \begin{aligned} \int_{I_n} \{ F^{-1}_N(t) - F^{-1}(t|G) \}^2 dt = \int_{I_n} [ x^2_n - 2 x_n F^{-1}(t|G) + \{ F^{-1}(t|G) \}^2 ] dt. \end{aligned} $$
(8)

The integration of the first term in (8), after summing over n, is given by

$$\displaystyle \begin{aligned} \sum_{n=1}^N \int_{I_n} x_n^2 dt =N^{-1} \sum_n x_n^2=\overline{x^2}. \end{aligned}$$

The integration of the third term in (8) is

$$\displaystyle \begin{aligned} \sum_{n=1}^N \int_{I_n} \{ F^{-1}(t|G) \}^2 dt = \int_{-\infty}^{\infty} x^2 f(x|G) dx = \sum_{k=1}^{K} w_k \mathbb{E}_k \{X^2\}. \end{aligned}$$

Let ξ_0 = −∞, ξ_{N+1} = ∞, and ξ_n = F^{-1}(n/N|G) for n = 1, …, N. Denote

$$\displaystyle \begin{aligned} \Delta F_{nk} = F(\xi_{n}| \boldsymbol{\theta}_k) - F(\xi_{n-1}| \boldsymbol{\theta}_k) \end{aligned}$$

and

$$\displaystyle \begin{aligned} T(x) = \int_{-\infty}^x t f_0(t) dt, ~~~ \Delta T_{nk} = T((\xi_{n}-\mu_k)/\sigma_k) - T((\xi_{n-1}-\mu_k)/\sigma_k). \end{aligned}$$
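
As a concrete special case (not spelled out in this appendix), if F_0 is the standard normal distribution with density φ, then T has a closed form, since (d/dx){−φ(x)} = xφ(x):

$$\displaystyle \begin{aligned} T(x) = \int_{-\infty}^x t \phi(t) dt = -\phi(x), ~~~ \Delta T_{nk} = \phi((\xi_{n-1}-\mu_k)/\sigma_k) - \phi((\xi_{n}-\mu_k)/\sigma_k). \end{aligned}$$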

Then, by the change of variables t = F(x|G) on each I_n,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{I_n} F^{-1}(t|G)dt & = \sum_k w_k \int_{\xi_{n-1}}^{\xi_{n}} x f(x|\mu_k,\sigma_k) dx \\ & = \sum_k w_k \{ \mu_k \Delta F_{nk} + \sigma_k \Delta T_{nk} \}. \end{array} \end{aligned} $$

These lead to the numerically convenient expression

$$\displaystyle \begin{aligned} \mathbb{W}_{N}(G) = \overline{x^2}+ \sum_k w_k {\mathbb E}_{k}\{ X^2\} - 2\sum_{n=1}^N x_n \sum_k w_k \{ \mu_k \Delta F_{nk} + \sigma_k \Delta T_{nk} \}. \end{aligned}$$
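
To make the computation concrete, the following is a minimal Python sketch of this expression for the case of a standard normal base distribution (so μ_0 = 0, σ_0 = 1, and T(x) = −φ(x)). The function names (mixture_cdf, mixture_quantile, wasserstein_objective) are ours rather than the chapter's, and the mixture quantiles ξ_n are located by root-finding on the bracket supplied by Lemma 2 below.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mixture_cdf(x, w, mu, sigma):
    """CDF of the location-scale mixture F(x|G) with a standard normal base."""
    return np.sum(w * norm.cdf((x - mu) / sigma))

def mixture_quantile(t, w, mu, sigma):
    """F^{-1}(t|G): bracket by the component quantiles (Lemma 2), then root-find."""
    q = norm.ppf(t) * sigma + mu                  # component t-quantiles xi_k(t)
    lo, hi = float(np.min(q)), float(np.max(q))
    if np.isclose(lo, hi):
        return lo
    return brentq(lambda x: mixture_cdf(x, w, mu, sigma) - t, lo, hi)

def wasserstein_objective(x, w, mu, sigma):
    """W_N(G) via the numerically friendly expression (normal base: mu0=0, sigma0=1)."""
    x = np.sort(np.asarray(x, dtype=float))
    N = x.size
    # xi_0 = -inf, xi_n = F^{-1}(n/N|G) for n = 1, ..., N-1, xi_N = +inf
    xi = np.empty(N + 1)
    xi[0], xi[N] = -np.inf, np.inf
    for n in range(1, N):
        xi[n] = mixture_quantile(n / N, w, mu, sigma)
    z = (xi[:, None] - mu[None, :]) / sigma[None, :]
    dF = np.diff(norm.cdf(z), axis=0)             # Delta F_{nk}
    dT = np.diff(-norm.pdf(z), axis=0)            # Delta T_{nk}, using T(z) = -phi(z)
    second_moment = np.sum(w * (mu ** 2 + sigma ** 2))   # sum_k w_k E_k{X^2}
    cross = np.sum(x[:, None] * (dF * (w * mu) + dT * (w * sigma)))
    return np.mean(x ** 2) + second_moment - 2.0 * cross

# Example: evaluate the objective at a candidate two-component normal mixture
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 2, 50)])
w, mu, sigma = np.array([0.75, 0.25]), np.array([-2.0, 3.0]), np.array([1.0, 2.0])
print(wasserstein_objective(data, w, mu, sigma))
```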

To use the BFGS algorithm most effectively, it is best to provide the gradient of the objective function. Below are numerically friendly expressions for some of the partial derivatives.

Lemma 1

Let δ_{jk} = 1 when j = k and δ_{jk} = 0 when j ≠ k. For n = 1, …, N and j = 1, 2, …, K, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial w_j} F(\xi_n| \boldsymbol{\theta}_k) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k)\frac{\partial \xi_n}{\partial w_j},\\ \frac{\partial}{\partial \mu_j} F(\xi_n| \boldsymbol{\theta}_k)& =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k)\left (\frac{\partial \xi_n}{\partial \mu_j}-\delta_{jk}\right ), \\ \frac{\partial}{\partial \sigma_j} F(\xi_n| \boldsymbol{\theta}_k) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \Big (\frac{\partial\xi_n}{\partial\sigma_j} -\left \{\frac{\xi_n-\mu_k}{\sigma_k}\right \}\delta_{jk}\Big ), \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial w_j}T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \frac{\partial\xi_n}{\partial w_j}, \\ \frac{\partial}{\partial \mu_j}T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \left ( \frac{\partial\xi_n}{\partial \mu_j}-\delta_{jk}\right ), \\ \frac{\partial}{\partial \sigma_j} T \left (\frac{\xi_n-\mu_k}{\sigma_k} \right ) & =&\displaystyle f(\xi_n|\boldsymbol{\theta}_k) \left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) \left \{ \frac{\partial\xi_n}{\partial\sigma_j} - \left ( \frac{\xi_n-\mu_k}{\sigma_k} \right ) \delta_{jk} \right \}. \end{array} \end{aligned} $$

Furthermore, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial\xi_n}{\partial \mu_k} & =&\displaystyle \frac{w_k f(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}, \\ \frac{\partial\xi_n}{\partial \sigma_k} & =&\displaystyle \frac{w_k f(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}\left ( \frac{\xi_n-\mu_k}{\sigma_k}\right ) ,\\ \frac{\partial\xi_n}{\partial w_k} & =&\displaystyle - \frac{ F(\xi_n|\boldsymbol{\theta}_k)}{f(\xi_n|G)}. \end{array} \end{aligned} $$
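
These identities follow from implicitly differentiating the defining equation F(ξ_n|G) = n/N, treating the weights as free parameters. For example, differentiating with respect to w_k gives

$$\displaystyle \begin{aligned} \frac{\partial}{\partial w_k} F(\xi_n|G) = F(\xi_n|\boldsymbol{\theta}_k) + f(\xi_n|G) \frac{\partial \xi_n}{\partial w_k} = 0, \end{aligned}$$

which yields the last expression; the derivatives with respect to μ_k and σ_k follow from the same argument.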

Based on this lemma, it is seen that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \mu_j}\mathbb{W}_N& =&\displaystyle 2w_j(\mu_j+\sigma_j\mu_0) - 2w_j\sum_{n=1}^N x_{(n)}\Delta F_{nj} \\ & &\displaystyle - 2 \sum_{n=1}^{N}\sum_{k} w_k \mu_k x_{(n)} \left \{ \frac{\partial F(\xi_n| \boldsymbol{\theta}_k)}{\partial\mu_j} - \frac{\partial F(\xi_{n-1} | \boldsymbol{\theta}_k)}{\partial\mu_j} \right \} \\ & &\displaystyle -2\sum_{n=1}^{N}\sum_{k} w_k\sigma_k x_{(n)} \frac{\partial }{\partial \mu_j} \left \{ T \left ( \frac{\xi_{n}-\mu_k}{\sigma_k} \right ) - T \left ( \frac{\xi_{n-1}-\mu_k}{\sigma_k} \right ) \right \} \end{array} \end{aligned} $$

with F(ξ_0|θ_k) = 0, F(ξ_{N+1}|θ_k) = 1, \(T \big ( \frac {\xi _{0}-\mu _k}{\sigma _k} \big )=0\), and \(T\big (\frac {\xi _{N+1}-\mu _k}{\sigma _k} \big )=\int _{-\infty }^{\infty } tf_0(t)dt\), a constant that does not depend on any parameters. Substituting the partial derivatives in Lemma 1, we then get

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \mu_j}\mathbb{W}_N & =&\displaystyle 2w_j(\mu_j+\sigma_j\mu_0) -2w_j\sum_{n=1}^N x_{(n)}\Delta F_{nj}\\ & &\displaystyle -2\sum_{n=1}^{N-1}x_{(n)}\xi_n\sum_{k}w_k f(\xi_n|\mu_k,\sigma_k) \big(\frac{\partial\xi_n}{\partial\mu_j}-\delta_{jk}\big)\\ & &\displaystyle +2\sum_{n=1}^{N-1}x_{(n)}\xi_{n-1}\sum_{k}w_k f(\xi_{n-1}| \mu_k,\sigma_k) \big(\frac{\partial\xi_{n-1}}{\partial\mu_j}-\delta_{jk}\big)\\ & =&\displaystyle 2w_j\big\{\mu_j + \sigma_j\mu_0 - \sum_{n=1}^N x_{(n)}\Delta F_{nj}\big\}. \end{array} \end{aligned} $$

Similarly, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\partial}{\partial \sigma_j}\mathbb{W}_N & =&\displaystyle \ 2w_j\{\sigma_j(\mu_0^2+\sigma_0^2) + \mu_j\mu_0-\sum_{n=1}^N x_{(n)}\Delta T_{nj}\}, \\ \frac{\partial}{\partial w_k} \mathbb{W}_N & =&\displaystyle \ \{\mu_k^2 + \sigma_k^2(\mu_0^2+\sigma_0^2)+2\mu_k\sigma_k\mu_0\} - 2 \sum_{n=1}^{N-1} \{x_{(n+1)} - x_{(n)}\} \xi_n F(\xi_n|\boldsymbol{\theta}_k)\\ & &\displaystyle - 2\big \{ \mu_k \sum_{n=1}^{N} x_{(n)} \Delta F_{nk} + \sigma_k \sum_{n=1}^{N} x_{(n)}\Delta T_{nk} \big \}. \end{array} \end{aligned} $$
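
For illustration, here is a minimal sketch of these simplified gradients in the standard normal base case (μ_0 = 0, σ_0^2 = 1). The name wasserstein_gradient is ours, and the sketch reuses mixture_quantile from the earlier code block; in practice one would typically reparameterize the weights and scales (for example, softmax weights and log scales, though the chapter's exact scheme is not shown here) before handing the objective and gradient to an unconstrained quasi-Newton routine such as scipy.optimize.minimize with method='BFGS'.

```python
import numpy as np
from scipy.stats import norm

def wasserstein_gradient(x, w, mu, sigma):
    """Partial derivatives of W_N(G) w.r.t. (w, mu, sigma), standard normal base.

    Uses the simplified expressions above; mixture_quantile() is assumed to be
    defined as in the earlier sketch.
    """
    x = np.sort(np.asarray(x, dtype=float))
    N = x.size
    xi = np.empty(N + 1)
    xi[0], xi[N] = -np.inf, np.inf
    for n in range(1, N):
        xi[n] = mixture_quantile(n / N, w, mu, sigma)
    z = (xi[:, None] - mu[None, :]) / sigma[None, :]
    F = norm.cdf(z)                          # F(xi_n | theta_k)
    dF = np.diff(F, axis=0)                  # Delta F_{nk}
    dT = np.diff(-norm.pdf(z), axis=0)       # Delta T_{nk} with T(z) = -phi(z)
    sF = x @ dF                              # sum_n x_(n) Delta F_{nk}, per component
    sT = x @ dT                              # sum_n x_(n) Delta T_{nk}, per component
    grad_mu = 2.0 * w * (mu - sF)
    grad_sigma = 2.0 * w * (sigma - sT)
    # sum_{n=1}^{N-1} (x_(n+1) - x_(n)) * xi_n * F(xi_n | theta_k)
    mid = np.sum(np.diff(x)[:, None] * xi[1:N, None] * F[1:N, :], axis=0)
    grad_w = (mu ** 2 + sigma ** 2) - 2.0 * mid - 2.0 * (mu * sF + sigma * sT)
    return grad_w, grad_mu, grad_sigma
```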

Computing the quantiles of the mixture distribution F(⋅|G) for each G is one of the most demanding tasks. The property stated in the following lemma allows us to develop a bisection algorithm.

Lemma 2

Let \(F(x| G)=\sum _{k=1}^K w_k F(x|\mu _k,\sigma _k)\) be a K-component mixture, and let ξ(t) = F^{-1}(t|G) and ξ_k(t) = F^{-1}(t|θ_k) be the t-quantiles of the mixture and of its kth subpopulation, respectively. For any t ∈ (0, 1),

$$\displaystyle \begin{aligned} \min_{k} \xi_{k}(t)\leq \xi(t) \leq \max_{k} \xi_{k}(t). \end{aligned} $$
(9)

Proof

Since F(x|θ_k) is a continuous CDF, we must have F(ξ_k(t)|θ_k) = t. By the monotonicity of the CDF F(⋅|θ_k), we have

$$\displaystyle \begin{aligned} F(\min_{k}\xi_{k}(t)| \boldsymbol{\theta}_k) \leq F(\xi_{k}(t)| \boldsymbol{\theta}_k) \leq F(\max_{k} \xi_{k}(t)| \boldsymbol{\theta}_k ). \end{aligned}$$

Multiplying by w k and summing over k lead to

$$\displaystyle \begin{aligned} F(\min_{k}\xi_{k}(t)| G)\leq t\leq F(\max_{k} \xi_{k}(t)| G). \end{aligned}$$

This implies (9) and completes the proof. □

In view of this lemma, we can easily compute the quantiles of F(⋅|θ_k) to form an interval containing the target quantile of F(⋅|G). We can then quickly find the value of F^{-1}(t|G) through a bisection algorithm.
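
A direct bisection version of this search, using the bracket supplied by Lemma 2, might look as follows (again a sketch for the standard normal base; the earlier code block used scipy's brentq on the same bracket instead):

```python
import numpy as np
from scipy.stats import norm

def mixture_quantile_bisection(t, w, mu, sigma, tol=1e-10, max_iter=200):
    """F^{-1}(t|G) by bisection on [min_k xi_k(t), max_k xi_k(t)] (Lemma 2)."""
    q = norm.ppf(t) * sigma + mu              # component t-quantiles xi_k(t)
    lo, hi = float(np.min(q)), float(np.max(q))
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if np.sum(w * norm.cdf((mid - mu) / sigma)) < t:
            lo = mid                          # F(mid|G) < t: quantile lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```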

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Zhang, Q., Chen, J. (2022). Minimum Wasserstein Distance Estimator Under Finite Location-Scale Mixtures. In: He, W., Wang, L., Chen, J., Lin, C.D. (eds) Advances and Innovations in Statistics and Data Science. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-031-08329-7_4
