Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods

Niemelä, Marko; Kärkkäinen, Tommi

doi:10.1007/978-3-030-70787-3_9

Part of the book series: Intelligent Systems, Control and Automation: Science and Engineering (ISCA, volume 76)

Abstract

Missing data poses a challenge in unsupervised learning. In clustering, where both the form and the number of clusters must be determined, missing values have to be handled in the clustering process as well as in cluster validation. In previous research, clustering has been handled with robust clustering methods and the available data strategy, and cluster validation indices have been computed with the partial distance approximation. Recently, however, dedicated methods for distance estimation with missing values have been proposed, and this work is the first in which these methods are systematically applied and tested in clustering and cluster validation. More precisely, we propose, implement, and analyze the use of distance estimation methods to improve the discrimination power of clustering and cluster validation indices. A novel, robust, two-stage prototype-based clustering process is suggested. Our results and conclusions confirm the usefulness of the distance estimation methods in clustering but, surprisingly, not in cluster validation.
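
To make the two baseline ideas named in the abstract concrete, the following is a minimal sketch, not the chapter's implementation. It assumes missing values are encoded as NaN, uses one common form of the partial distance (squared Euclidean distance over jointly observed coordinates, rescaled to the full dimension), and illustrates the available data strategy with a coordinatewise median as a robust prototype. The function names `partial_distance` and `available_data_median` are hypothetical and chosen here for illustration only.

```python
import math
import statistics


def partial_distance(x, y):
    """Squared Euclidean distance over jointly observed coordinates,
    rescaled to the full dimension. Missing values are NaN (assumption)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Keep only the coordinates that are observed in both vectors.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan  # no common observed coordinate: distance undefined
    observed_sum = sum((a - b) ** 2 for a, b in pairs)
    # Rescale so the value is comparable with a full-dimensional distance.
    return len(x) / len(pairs) * observed_sum


def available_data_median(points):
    """Coordinatewise median using, per coordinate, only the observed values."""
    dim = len(points[0])
    prototype = []
    for j in range(dim):
        observed = [p[j] for p in points if not math.isnan(p[j])]
        # A coordinate with no observed values stays missing in the prototype.
        prototype.append(statistics.median(observed) if observed else math.nan)
    return prototype
```

The partial distance ignores everything a data set could tell about the missing coordinates; the distance estimation methods studied in the chapter aim to improve on that by estimating the contribution of the missing parts instead of simply rescaling.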

Acknowledgements

The authors would like to thank the Academy of Finland for the financial support (grants 311877 and 315550).

Author information

Authors and Affiliations

Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, 40014, Jyväskylä, Finland
Marko Niemelä & Tommi Kärkkäinen

Corresponding author

Correspondence to Marko Niemelä.

Editor information

Editors and Affiliations

Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Tero Tuovinen
CIMNE, International Center for Numerical Methods in Engineering, Barcelona, Spain
Jacques Periaux
Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Pekka Neittaanmäki

About this chapter

Cite this chapter

Niemelä, M., Kärkkäinen, T. (2022). Improving Clustering and Cluster Validation with Missing Data Using Distance Estimation Methods. In: Tuovinen, T., Periaux, J., Neittaanmäki, P. (eds) Computational Sciences and Artificial Intelligence in Industry. Intelligent Systems, Control and Automation: Science and Engineering, vol 76. Springer, Cham. https://doi.org/10.1007/978-3-030-70787-3_9

DOI: https://doi.org/10.1007/978-3-030-70787-3_9
Published: 20 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70786-6
Online ISBN: 978-3-030-70787-3
eBook Packages: Intelligent Technologies and Robotics (R0)
