Introduction

Dirty Data Processing for Machine Learning

Abstract

In recent years, with the development of the information age, the amount of data has grown dramatically. At the same time, dirty data have accumulated in databases of all types. Because dirty data negatively affect the results of data mining and machine learning, data quality issues have attracted widespread attention. Motivated by this, this book analyzes the impacts of dirty data on machine learning models and explores suitable methods for dirty data processing. This chapter discusses the background of dirty data processing for machine learning. In Sect. 1.1, we analyze three basic dimensions of data quality to motivate the necessity of processing dirty data in the database and machine learning communities. In Sect. 1.2, we summarize existing studies and explain the differences between our research and current work. We conclude the chapter with an overview of the structure of this book in Sect. 1.3.




Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Qi, Z., Wang, H., Dong, Z. (2024). Introduction. In: Dirty Data Processing for Machine Learning. Springer, Singapore. https://doi.org/10.1007/978-981-99-7657-7_1

  • DOI: https://doi.org/10.1007/978-981-99-7657-7_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7656-0

  • Online ISBN: 978-981-99-7657-7

  • eBook Packages: Computer Science, Computer Science (R0)
