Abstract
Missing data pose a challenge in unsupervised learning. In clustering, when the form and the number of clusters are to be determined, missing values must be handled both in the clustering process and in cluster validation. In previous research, missing values in the clustering algorithm have been handled with robust clustering methods and the available data strategy, and cluster validation indices have been computed with the partial distance approximation. Recently, however, dedicated methods for distance estimation with missing values have been proposed, and this work is the first in which these methods are systematically applied and tested in clustering and cluster validation. More precisely, we propose, implement, and analyze the use of distance estimation methods to improve the discrimination power of clustering and cluster validation indices. A novel, robust, two-stage prototype-based clustering process is suggested. Our results confirm the usefulness of the distance estimation methods in clustering but, surprisingly, not in cluster validation.
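For readers unfamiliar with the partial distance approximation mentioned above, the following is a minimal sketch of its commonly used form, not the chapter's implementation. It assumes missing values are encoded as NaN, computes the squared Euclidean distance over the components observed in both vectors, and rescales it by the ratio of the full dimension to the number of jointly observed components. The function name `partial_distance` is hypothetical.

```python
import math


def partial_distance(x, y):
    """Euclidean distance over jointly observed components, rescaled to full dimension.

    Assumption: missing entries are encoded as float NaN.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Keep only coordinate pairs where both values are present.
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        # No jointly observed component: the distance is undefined.
        return math.nan
    squared = sum((a - b) ** 2 for a, b in pairs)
    # Scale up as if the unobserved components contributed the average observed amount.
    return math.sqrt(len(x) / len(pairs) * squared)


# Usage: the second component of the first vector is missing.
d = partial_distance([1.0, math.nan, 3.0], [1.0, 2.0, 7.0])
```

With complete vectors the scaling factor equals one and the function reduces to the ordinary Euclidean distance. The distance estimation methods studied in the chapter replace this rescaling with estimates of the missing contribution.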
Acknowledgements
The authors would like to thank the Academy of Finland for the financial support (grants 311877 and 315550).
Copyright information
© 2022 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Niemelä, M., Kärkkäinen, T. (2022). Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods. In: Tuovinen, T., Periaux, J., Neittaanmäki, P. (eds) Computational Sciences and Artificial Intelligence in Industry. Intelligent Systems, Control and Automation: Science and Engineering, vol 76. Springer, Cham. https://doi.org/10.1007/978-3-030-70787-3_9
DOI: https://doi.org/10.1007/978-3-030-70787-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70786-6
Online ISBN: 978-3-030-70787-3
eBook Packages: Intelligent Technologies and Robotics (R0)