Abstract
In this chapter, we present an overview of various chemometric methods appropriate for analyzing and interpreting data from social media, industry, academia, medicine, and other sources. We discuss unsupervised machine-learning techniques used for grouping (hierarchical cluster analysis, k-means) and exploring (principal component analysis, self-organizing Kohonen maps) all types of data, both quantitative and qualitative. For each method described in this chapter, we explain the basic concepts, provide a rudimentary algorithm, and present practical applications. All the examples are based on a set of molecular descriptors calculated for a selected group of persistent organic pollutants (POPs).
Copyright information
© 2016 Springer Science+Business Media Dordrecht
About this entry
Cite this entry
Odziomek, K., Rybinska, A., Puzyn, T. (2016). Unsupervised Learning Methods and Similarity Analysis in Chemoinformatics. In: Leszczynski, J. (ed.) Handbook of Computational Chemistry. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6169-8_53-1
DOI: https://doi.org/10.1007/978-94-007-6169-8_53-1
Publisher Name: Springer, Dordrecht
Online ISBN: 978-94-007-6169-8
eBook Packages: Springer Reference Chemistry and Mat. Science; Reference Module Physical and Materials Science; Reference Module Chemistry, Materials and Physics