Data Source Selection Support in the Big Data Integration Process – Towards a Taxonomy

  • Conference paper
Innovation Through Information Systems (WI 2021)

Part of the book series: Lecture Notes in Information Systems and Organisation (LNISO, volume 48)

Abstract

Selecting data sources is a crucial step in providing a useful information base to support decision-makers. While any data source can represent potential added value in decision-making, its integration always entails considerable effort. For decision-makers, data sources must contain relevant information in an appropriate scope. The data scientist must assess whether integrating the data sources is technically possible and how much effort it requires. Therefore, a taxonomy was developed to identify the data sources relevant to the decision-maker and to minimize the data integration effort. The taxonomy was developed and evaluated with real data sources and six companies from different industries. The final taxonomy consists of sixteen dimensions that support the data scientist and the decision-maker in selecting data sources for the big data integration process. This allows an efficient and effective big data integration process to be carried out with a minimal number of data sources to integrate.

Notes

  1. https://sites.google.com/site/anhaidgroup/projects/magellan.

  2. https://www.biggorilla.org/.

  3. https://developer.uspto.gov/product/patent-grant-bibliographic-dataxml.

  4. https://opencorporates.com/.

  5. Powered by Crunchbase: https://data.crunchbase.com/docs/open-data-map.

  6. Crunchbase 2013 Snapshot ©, Creative Commons Attribution License [CC-BY], https://data.crunchbase.com/docs/2013-snapshot.

  7. https://www.gleif.org/de/lei-data/gleif-concatenated-file/download-the-concatenated-file.

  8. https://www.upcitemdb.com/.

  9. At the time of access still freely available: https://public.enigma.com/browse/collection/stock-exchanges-company-listings/50a2457d-6407-4581-8f14-5d37a9410fa9.

  10. https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.

  11. http://www.naturalearthdata.com/downloads/10m-cultural-vectors/.

  12. https://www.kaggle.com/juanumusic/countries-iso-codes.

  13. https://susanqq.github.io/UTKFace/.

  14. https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/.

  15. https://de.appanion.com/startups.

Author information

Corresponding author

Correspondence to Felix Kruse.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kruse, F., Schröer, C., Gómez, J.M. (2021). Data Source Selection Support in the Big Data Integration Process – Towards a Taxonomy. In: Ahlemann, F., Schütte, R., Stieglitz, S. (eds) Innovation Through Information Systems. WI 2021. Lecture Notes in Information Systems and Organisation, vol 48. Springer, Cham. https://doi.org/10.1007/978-3-030-86800-0_1

  • DOI: https://doi.org/10.1007/978-3-030-86800-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86799-7

  • Online ISBN: 978-3-030-86800-0

  • eBook Packages: Computer Science, Computer Science (R0)
