Abstract
In recent years, as the information age has advanced, data volumes have grown dramatically, and dirty data has accumulated in databases of every type. Because dirty data degrades the results of data mining and machine learning, data quality issues have attracted widespread attention. Motivated by this, this book analyzes the impacts of dirty data on machine learning models and explores suitable methods for dirty data processing. This chapter presents the background of dirty data processing for machine learning. In Sect. 1.1, we analyze three basic dimensions of data quality to motivate the need for dirty data processing in the database and machine learning communities. In Sect. 1.2, we survey existing studies and explain how our research differs from them. We conclude the chapter with an overview of the structure of this book in Sect. 1.3.
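To make the motivating claim concrete, the following is a minimal, self-contained sketch (not taken from the book; all data, names, and parameters are illustrative) of how one common form of dirty data, flipped class labels, can degrade a simple 1-nearest-neighbor classifier:

```python
import random

random.seed(0)

def make_data(n):
    """n points per class: class 0 centered at 0.0, class 1 at 4.0."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)] + \
         [random.gauss(4.0, 1.0) for _ in range(n)]
    ys = [0] * n + [1] * n
    return xs, ys

def predict_1nn(train_x, train_y, x):
    """Return the label of the single nearest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(1 for x, y in zip(test_x, test_y)
               if predict_1nn(train_x, train_y, x) == y)
    return hits / len(test_x)

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# Dirty copy of the training set: flip 30% of the labels at random.
noisy_y = [1 - y if random.random() < 0.3 else y for y in train_y]

clean_acc = accuracy(train_x, train_y, test_x, test_y)
noisy_acc = accuracy(train_x, noisy_y, test_x, test_y)
print(f"test accuracy with clean labels: {clean_acc:.2f}")
print(f"test accuracy with 30% label noise: {noisy_acc:.2f}")
```

Because 1-NN copies the label of a single training point, roughly a third of the test predictions inherit a flipped label, so test accuracy drops well below the clean-label baseline; this sensitivity of simple learners to label noise is exactly the kind of effect the book studies systematically.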
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Qi, Z., Wang, H., Dong, Z. (2024). Introduction. In: Dirty Data Processing for Machine Learning. Springer, Singapore. https://doi.org/10.1007/978-981-99-7657-7_1
Print ISBN: 978-981-99-7656-0
Online ISBN: 978-981-99-7657-7