Model-based clustering of probability density functions

Montanari, Angela; Calò, Daniela G.

doi:10.1007/s11634-013-0140-8

Regular Article
Published: 27 June 2013

Volume 7, pages 301–319 (2013)
Advances in Data Analysis and Classification

Abstract

Complex data in which each statistical unit is described not by a single observation (or vector variable) but by a unit-specific sample of several or even many observations are becoming increasingly common. Reducing these sample data to summary statistics, such as the average or the median, discards most of the inherent information (about variability, skewness or multi-modality). Full information is preserved only if each unit is described by a whole distribution. This new kind of data, also known as “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method to group a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because of the infinite dimensionality and the non-Euclidean geometry of the domain space. By adopting a wavelet-based representation for the elements of the space, the task is accomplished by using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
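The link between pdfs and hyper-spherical data can be illustrated with a small sketch. It is not the authors' implementation, and it rests on assumptions that the abstract does not state: each pdf is represented through its square root, a histogram on a common grid stands in for the wavelet estimator (the square roots of the bin probabilities are the Haar scaling coefficients of the square-root histogram density, so they have unit norm), and spherical k-means with hard assignments stands in for a full mixture model on the sphere. The names `sqrt_density_coords`, `spherical_kmeans` and `edges`, as well as the simulated data, are hypothetical.

```python
import numpy as np


def sqrt_density_coords(sample, edges):
    """Map one unit's sample to a point on the unit hypersphere.

    Assumed representation: square roots of histogram bin probabilities.
    The sample must have at least one value inside the range of `edges`.
    """
    counts, _ = np.histogram(sample, bins=edges)
    p = counts / counts.sum()   # bin probabilities, sum to 1
    return np.sqrt(p)           # squared entries sum to 1, so the norm is 1


def spherical_kmeans(X, k, n_iter=50):
    """Cluster unit-norm rows of X by cosine similarity (hard assignments)."""
    # Farthest-point initialisation: start from the first row, then add the
    # row whose largest cosine to the centres chosen so far is smallest.
    idx = [0]
    for _ in range(1, k):
        sims = (X @ X[idx].T).max(axis=1)
        idx.append(int(np.argmin(sims)))
    centers = X[idx].copy()

    for _ in range(n_iter):
        # Assign each unit to the centre with the largest cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        # Update each centre as the normalised sum of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                centers[j] = total / np.linalg.norm(total)

    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers


# Simulated example: 20 units, each described by its own sample of 200 values.
rng = np.random.default_rng(0)
edges = np.linspace(-6.0, 6.0, 17)  # 16 common bins for all units
samples = [rng.normal(-2.0, 1.0, size=200) for _ in range(10)]
samples += [rng.normal(2.0, 1.0, size=200) for _ in range(10)]

X = np.vstack([sqrt_density_coords(s, edges) for s in samples])
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

In a model-based version of this sketch, the hard `argmax` assignment would be replaced by posterior membership probabilities under a mixture of distributions defined on the sphere, estimated for instance by an EM-type algorithm.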

Author information

Authors and Affiliations

Department of Statistics, University of Bologna, via Belle Arti 41, 40126 Bologna, Italy
Angela Montanari & Daniela G. Calò

Corresponding author

Correspondence to Daniela G. Calò.

About this article

Cite this article

Montanari, A., Calò, D.G. Model-based clustering of probability density functions. Adv Data Anal Classif 7, 301–319 (2013). https://doi.org/10.1007/s11634-013-0140-8

Received: 29 December 2012
Revised: 31 May 2013
Accepted: 12 June 2013
Published: 27 June 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s11634-013-0140-8
