Data Cleaning: A Case Study with OpenRefine and Trifacta Wrangler

Petrova-Antonova, Dessislava; Tancheva, Rumyana

doi:10.1007/978-3-030-58793-2_3

Dessislava Petrova-Antonova &
Rumyana Tancheva

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1266)

Included in the following conference series:

International Conference on the Quality of Information and Communications Technology

Abstract

Data cleaning is the most time-consuming activity in data science projects that aim to deliver high-quality datasets and thereby ensure the accuracy of the models trained on them. Because data types, formats, origins and acquisition methods vary, different data quality problems arise, which has led to the development of a variety of cleaning techniques and tools. This paper provides a mapping between the nature, scope and dimension of data quality problems, together with a comparative analysis of widely used tools that deal with those problems. The existing data cleaning techniques serve as a basis for comparing the cleaning capabilities of the tools. Furthermore, a case study addressing the presented data quality problems and cleaning techniques is carried out with two commonly used software products, OpenRefine and Trifacta Wrangler. Although similar data cleaning techniques are applied to the same dataset, the results show that the performance of the tools differs.

Acknowledgements

This research work has been supported by the GATE project, funded by the Horizon 2020 WIDESPREAD-2018-2020 TEAMING Phase 2 programme under grant agreement no. 857155, and by the Big4Smart and ITDGate projects, funded by the Bulgarian National Science Fund under agreement no. DN12/9 and agreement no. DN 02/11.

Author information

Authors and Affiliations

GATE Institute, Sofia University “St. Kl. Ohridski”, Sofia, Bulgaria
Dessislava Petrova-Antonova & Rumyana Tancheva

Corresponding author

Correspondence to Dessislava Petrova-Antonova.

Editor information

Editors and Affiliations

Brunel University, London, UK
Martin Shepperd
Lisbon University Institute, Lisbon, Portugal
Fernando Brito e Abreu
University of Lisbon, Lisbon, Portugal
Alberto Rodrigues da Silva
University of Castilla-La Mancha, Talavera de la Reina, Spain
Ricardo Pérez-Castillo

Cite this paper

Petrova-Antonova, D., Tancheva, R. (2020). Data Cleaning: A Case Study with OpenRefine and Trifacta Wrangler. In: Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R. (eds) Quality of Information and Communications Technology. QUATIC 2020. Communications in Computer and Information Science, vol 1266. Springer, Cham. https://doi.org/10.1007/978-3-030-58793-2_3

DOI: https://doi.org/10.1007/978-3-030-58793-2_3
Published: 31 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58792-5
Online ISBN: 978-3-030-58793-2
eBook Packages: Computer Science, Computer Science (R0)
