Abstract
In recent years, as the information age has advanced, data volumes have grown dramatically, and dirty data has accumulated in databases of every type. Because dirty data degrades the results of data mining and machine learning, data quality issues have attracted widespread attention. Motivated by this, this book analyzes the impacts of dirty data on machine learning models and explores suitable methods for dirty data processing. This chapter presents the background of dirty data processing for machine learning. In Sect. 1.1, we analyze three basic dimensions of data quality to motivate the need for dirty data processing in the database and machine learning communities. In Sect. 1.2, we survey existing studies and explain how our research differs from them. We conclude the chapter with an overview of the structure of this book in Sect. 1.3.
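To make the motivating claim concrete, the following is a minimal, self-contained sketch (not taken from the book; all data, names, and parameters are illustrative) of how one common form of dirty data, flipped class labels, can degrade a simple 1-nearest-neighbor classifier:

```python
import random

random.seed(0)

def make_data(n):
    """n points per class: class 0 centered at 0.0, class 1 at 4.0."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)] + \
         [random.gauss(4.0, 1.0) for _ in range(n)]
    ys = [0] * n + [1] * n
    return xs, ys

def predict_1nn(train_x, train_y, x):
    """Return the label of the single nearest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(1 for x, y in zip(test_x, test_y)
               if predict_1nn(train_x, train_y, x) == y)
    return hits / len(test_x)

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# Dirty copy of the training set: flip 30% of the labels at random.
noisy_y = [1 - y if random.random() < 0.3 else y for y in train_y]

clean_acc = accuracy(train_x, train_y, test_x, test_y)
noisy_acc = accuracy(train_x, noisy_y, test_x, test_y)
print(f"test accuracy with clean labels: {clean_acc:.2f}")
print(f"test accuracy with 30% label noise: {noisy_acc:.2f}")
```

Because 1-NN copies the label of a single training point, roughly a third of the test predictions inherit a flipped label, so test accuracy drops well below the clean-label baseline; this sensitivity of simple learners to label noise is exactly the kind of effect the book studies systematically.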
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Qi, Z., Wang, H., Dong, Z. (2024). Introduction. In: Dirty Data Processing for Machine Learning. Springer, Singapore. https://doi.org/10.1007/978-981-99-7657-7_1
Print ISBN: 978-981-99-7656-0
Online ISBN: 978-981-99-7657-7