Clustering of modal-valued symbolic data

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of all measurement types during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set, or to leaders, obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights that enables clustering representatives to retain the same structure as if clustering only first order units and (b) the selection of relative dissimilarities that produce more interpretable, i.e., meaningful optimal clustering representatives. The usefulness of the proposed methods with adaptations was assessed and substantiated by carefully constructed simulation settings and demonstrated on three different real-world data sets gaining in interpretability from the use of weights (population pyramids and ESS data) or relative dissimilarity (US patents data).
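
As a rough illustration of point (a), the sketch below assumes a squared Euclidean dissimilarity between distributions, weighted by unit size. This is only one possible choice and does not reproduce the alternative dissimilarities of the paper; the function names (`weighted_leader`, `leaders_clustering`), the toy data, and the initial leaders are hypothetical. Under this assumption the optimal leader of a cluster is the weighted mean of its members, which coincides with the distribution obtained by pooling the underlying first order units.

```python
from typing import List, Sequence, Tuple


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean dissimilarity between two distributions (assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_leader(dists: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted mean of distributions: minimizes sum_i w_i * ||p_i - t||^2 over t."""
    total = sum(weights)
    k = len(dists[0])
    return [sum(w * p[j] for p, w in zip(dists, weights)) / total
            for j in range(k)]


def leaders_clustering(dists: Sequence[Sequence[float]],
                       weights: Sequence[float],
                       init_leaders: Sequence[Sequence[float]],
                       max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Alternate between assigning units to the nearest leader and
    recomputing each leader as the weighted mean of its cluster."""
    leaders = [list(t) for t in init_leaders]
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(leaders)), key=lambda c: sq_dist(p, leaders[c]))
            for p in dists
        ]
        if new_assignment == assignment:  # no unit changed cluster: stop
            break
        assignment = new_assignment
        for c in range(len(leaders)):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its previous leader
                leaders[c] = weighted_leader(
                    [dists[i] for i in members],
                    [weights[i] for i in members],
                )
    return assignment, leaders


# Toy data (hypothetical): four units described by a distribution over
# three categories, with the number of first order units as weight.
dists = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
weights = [100, 300, 50, 150]

assignment, leaders = leaders_clustering(dists, weights, [dists[0], dists[2]])
print(assignment)
print(leaders)
```

Because each weight is the number of first order units behind a distribution, every leader is again a distribution over the same categories and can be read as the pooled distribution of its cluster. The resulting leaders could then be passed to an agglomerative procedure, as the abstract describes.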

Author information

Correspondence to Nataša Kejžar.

About this article

Cite this article

Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of modal-valued symbolic data. Adv Data Anal Classif 15, 513–541 (2021). https://doi.org/10.1007/s11634-020-00425-4

  • DOI: https://doi.org/10.1007/s11634-020-00425-4
