Abstract
Complex data in which each statistical unit is described not by a single observation (or vector of variables) but by a unit-specific sample of several or even many observations are becoming increasingly common. Reducing such samples to summary statistics, like the average or the median, discards most of the inherent information about variability, skewness or multi-modality. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, known as “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method to group a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because of the infinite dimensionality and the non-Euclidean geometry of the domain space. By adopting a wavelet-based representation for the elements of the space, the task is accomplished using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
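The link between density estimates and hyper-spherical data can be illustrated with a rough sketch. It is not the authors' procedure and rests on assumptions that the abstract does not state: each density is estimated by a fixed-bin histogram (a crude piecewise-constant stand-in for a wavelet expansion), the representation is applied to the square root of the density so that the coefficient vector has unit norm, and spherical k-means (a hard-assignment relative of a von Mises-Fisher mixture with equal concentrations) replaces the mixture model. The function names, bin grid and simulated samples are all hypothetical.

```python
import numpy as np

def sqrt_density_coords(sample, edges):
    """Coefficients of the square-root histogram density in an orthonormal
    piecewise-constant basis (assumed stand-in for a wavelet basis)."""
    counts, _ = np.histogram(sample, bins=edges)
    probs = counts / counts.sum()      # bin probabilities, sum to 1
    return np.sqrt(probs)              # squares sum to 1, so the vector has unit norm

def spherical_kmeans(X, k, n_iter=50):
    """Hard-assignment clustering of unit vectors by cosine similarity."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [X[0]]
    for _ in range(1, k):
        sim = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sim)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)   # nearest center on the sphere
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # renormalised mean direction
    return labels, centers

# Simulated units: 10 samples from N(0, 1) followed by 10 samples from N(3, 1).
rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, 200) for _ in range(10)]
samples += [rng.normal(3.0, 1.0, 200) for _ in range(10)]

edges = np.linspace(-5.0, 8.0, 33)     # 32 equal-width bins (hypothetical grid)
X = np.array([sqrt_density_coords(s, edges) for s in samples])
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

Each row of `X` lies on the unit sphere because the bin probabilities sum to one, which is what makes directional clustering tools applicable. A genuine wavelet basis and a fitted mixture model would replace the histogram and the hard assignments in a faithful implementation.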
Cite this article
Montanari, A., Calò, D.G. Model-based clustering of probability density functions. Adv Data Anal Classif 7, 301–319 (2013). https://doi.org/10.1007/s11634-013-0140-8
DOI: https://doi.org/10.1007/s11634-013-0140-8