Clustering of modal-valued symbolic data

Kejžar, Nataša; Korenjak-Černe, Simona; Batagelj, Vladimir

doi:10.1007/s11634-020-00425-4

Regular Article
Published: 24 October 2020

Volume 15, pages 513–541, (2021)
Advances in Data Analysis and Classification

Abstract

Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of all measurement types during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set, or to leaders, obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights that enables clustering representatives to retain the same structure as if clustering only first order units and (b) the selection of relative dissimilarities that produce more interpretable, i.e., meaningful optimal clustering representatives. The usefulness of the proposed methods with adaptations was assessed and substantiated by carefully constructed simulation settings and demonstrated on three different real-world data sets gaining in interpretability from the use of weights (population pyramids and ESS data) or relative dissimilarity (US patents data).
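The role of weights described in point (a) can be illustrated with a minimal sketch. It assumes a weighted squared Euclidean dissimilarity and uses each unit's size as its weight; under that assumption the optimal cluster representative is the weighted mean of the members' distributions, which coincides with the distribution of the pooled first-order units. The function names (`to_modal`, `dissimilarity`, `leader`, `leaders_method`) and the toy counts are hypothetical and chosen only for illustration; this is not the authors' implementation and not one of the alternative relative dissimilarities studied in the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def to_modal(freq: Sequence[float]) -> Vector:
    """Turn a frequency (count) vector into a probability distribution."""
    total = sum(freq)
    if total <= 0:
        raise ValueError("frequency vector must have a positive total")
    return [f / total for f in freq]


def dissimilarity(p: Sequence[float], t: Sequence[float], w: float) -> float:
    """Weighted squared Euclidean dissimilarity (an assumed choice)."""
    return w * sum((pi - ti) ** 2 for pi, ti in zip(p, t))


def leader(units: Sequence[Sequence[float]], weights: Sequence[float]) -> Vector:
    """Weighted mean, the minimizer of the summed weighted squared error."""
    total_w = sum(weights)
    k = len(units[0])
    return [sum(w * p[j] for p, w in zip(units, weights)) / total_w
            for j in range(k)]


def leaders_method(units: Sequence[Sequence[float]],
                   weights: Sequence[float],
                   init_leaders: Sequence[Sequence[float]],
                   max_iter: int = 100) -> Tuple[List[int], List[Vector]]:
    """Alternate between assigning units to the nearest leader and
    recomputing each leader, until the assignment stops changing."""
    current = [list(t) for t in init_leaders]
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(current)),
                key=lambda c: dissimilarity(p, current[c], w))
            for p, w in zip(units, weights)
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(current)):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its previous leader
                current[c] = leader([units[i] for i in members],
                                    [weights[i] for i in members])
    return assignment, current


# Toy data: category counts for four units of very different sizes.
counts = [[8, 1, 1], [70, 20, 10], [1, 1, 8], [20, 40, 140]]
units = [to_modal(c) for c in counts]
weights = [float(sum(c)) for c in counts]  # unit size used as weight

assignment, leaders = leaders_method(units, weights, [units[0], units[2]])

# Distribution of the pooled first-order units of the first two units.
pooled_first = to_modal([a + b for a, b in zip(counts[0], counts[1])])
```

Here `leaders[0]` agrees with `pooled_first` up to floating-point rounding, which is the sense in which size weights let a representative keep the structure of the underlying first-order units. With a single weight per unit the weight does not change which leader is nearest; it only affects how the leader is computed.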

Author information

Authors and Affiliations

Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2, 1000, Ljubljana, Slovenia
Nataša Kejžar
School of Economics and Business (SEB), University of Ljubljana, Ljubljana, Slovenia
Simona Korenjak-Černe
Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
Vladimir Batagelj
Andrej Marušič Institute, University of Primorska, Koper, Slovenia
Vladimir Batagelj
National Research University Higher School of Economics, Moscow, Russia
Vladimir Batagelj

Corresponding author

Correspondence to Nataša Kejžar.

Electronic supplementary material

Supplementary material 1 (pdf 816 KB)

About this article

Cite this article

Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of modal-valued symbolic data. Adv Data Anal Classif 15, 513–541 (2021). https://doi.org/10.1007/s11634-020-00425-4

Received: 12 August 2014
Revised: 20 August 2020
Accepted: 12 October 2020
Published: 24 October 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s11634-020-00425-4
