Abstract
Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to consider variables of all measurement types simultaneously during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set or to the leaders obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights, which enables clustering representatives to retain the same structure as if only first-order units were clustered, and (b) the selection of relative dissimilarities that produce more interpretable, i.e., more meaningful, optimal clustering representatives. The usefulness of the proposed methods and adaptations was assessed in carefully constructed simulation settings and demonstrated on three real-world data sets, in which interpretability improves through the use of weights (population pyramids and ESS data) or of a relative dissimilarity (US patents data).
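The role of the weights can be illustrated with a small sketch. This is not the authors' implementation (their software is the R package clamix); it is a minimal Python illustration that assumes a single modal-valued variable and a squared Euclidean dissimilarity weighted by unit size, for which the optimal leader of a cluster is the weighted mean of its members' distributions. The function names `to_modal`, `weighted_leader`, and `leaders_step`, and the toy counts, are hypothetical.

```python
import numpy as np

def to_modal(counts):
    """Turn rows of category counts into probability vectors and unit weights."""
    counts = np.asarray(counts, dtype=float)
    w = counts.sum(axis=1)        # weight = number of first-order units
    P = counts / w[:, None]       # modal value = relative frequencies
    return P, w

def weighted_leader(P, w):
    """Minimizer of sum_i w_i * ||P_i - t||^2 over t: the weighted mean."""
    return (w[:, None] * P).sum(axis=0) / w.sum()

def leaders_step(P, w, leaders):
    """One leaders iteration: assign each unit to its nearest leader,
    then recompute every non-empty cluster's leader."""
    # Squared Euclidean dissimilarities, shape (n_units, n_leaders).
    # A positive unit weight does not change which leader is nearest.
    d = ((P[:, None, :] - leaders[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    new_leaders = leaders.copy()
    for c in range(leaders.shape[0]):
        members = labels == c
        if members.any():
            new_leaders[c] = weighted_leader(P[members], w[members])
    return labels, new_leaders

# Toy data (assumed): four units, counts over three categories.
counts = np.array([[9, 1, 0],
                   [40, 5, 5],
                   [1, 2, 7],
                   [10, 30, 60]])
P, w = to_modal(counts)
labels, leaders = leaders_step(P, w, leaders=P[[0, 2]])

# Distribution obtained by pooling the raw counts of each cluster.
pooled0 = counts[labels == 0].sum(axis=0) / counts[labels == 0].sum()
pooled1 = counts[labels == 1].sum(axis=0) / counts[labels == 1].sum()
print(labels, leaders, pooled0, pooled1)
```

In this sketch the weighted leader of each cluster coincides with the distribution computed from the pooled counts of its members, which is one way to read the statement that the representatives retain the same structure as if only first-order units were clustered. An unweighted mean of the members' distributions would generally differ.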
Cite this article
Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of modal-valued symbolic data. Adv Data Anal Classif 15, 513–541 (2021). https://doi.org/10.1007/s11634-020-00425-4
DOI: https://doi.org/10.1007/s11634-020-00425-4
Keywords
- Symbolic objects
- Leaders method
- Hierarchical clustering
- Ward’s method
- Clustering demographic structures
- United States Patents data set
- European social survey data set