Abstract
A bibliometric analysis based on records from a single citation database may be limited in its comprehensiveness and, therefore, in the reliability of its results. Combining and deduplicating records from multiple citation index databases for a bibliometric analysis is often done manually and requires significant effort, especially for larger datasets. This paper presents an open-source tool that automatically preprocesses and deduplicates records based on similarity and user-configurable strategies. To validate the tool, the authors first manually deduplicated records from Scopus and Web of Science (WoS) for a use case comprising 11,307 records. The performance of the tool was then evaluated against the manually deduplicated results. With the best-performing similarity configuration, the tool reached up to 99% precision and a 98% F-measure on this use case, reducing the time researchers would otherwise spend on data wrangling when combining Scopus and WoS records. The tool has practical implications for bibliometric studies. For instance, we conducted a bibliometric analysis of the most productive researchers at a university using data from a single citation database as well as merged data from multiple citation databases. The study used the VOSviewer tool and showed that merged data may produce different outcomes from those obtained from a single citation database.
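The similarity-based deduplication and its evaluation described above can be illustrated with a minimal sketch. It is not the tool presented in this paper: the record fields (`title`, `doi`), the function names, the 0.9 threshold, the DOI-first rule, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the sample records are invented toy data.

```python
from difflib import SequenceMatcher
import re
import unicodedata


def normalize(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Hypothetical strategy: if both DOIs exist, compare them; else compare titles."""
    if a.get("doi") and b.get("doi"):
        return a["doi"].lower() == b["doi"].lower()
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold


def find_duplicates(scopus: list, wos: list, threshold: float = 0.9) -> set:
    """Return (scopus_index, wos_index) pairs flagged as duplicates (brute force)."""
    return {
        (i, j)
        for i, s in enumerate(scopus)
        for j, w in enumerate(wos)
        if is_duplicate(s, w, threshold)
    }


def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Score predicted duplicate pairs against a manually curated gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy records, not taken from the study's dataset.
scopus = [
    {"title": "Merging Citation Databases: A Toy Example", "doi": "10.0000/toy.1"},
    {"title": "Déduplication of bibliographic records", "doi": ""},
]
wos = [
    {"title": "MERGING CITATION DATABASES - A TOY EXAMPLE", "doi": "10.0000/TOY.1"},
    {"title": "Deduplication of Bibliographic Records.", "doi": ""},
    {"title": "An unrelated record about something else", "doi": ""},
]

pairs = find_duplicates(scopus, wos)
gold = {(0, 0), (1, 1)}  # the pairs a human annotator would mark as duplicates
precision, recall, f1 = precision_recall_f1(pairs, gold)
```

The pairwise comparison is quadratic in the number of records, so a practical implementation for a dataset of this size would likely restrict comparisons to candidate groups (for example, records sharing a publication year) before computing similarities.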
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-024-05076-2/MediaObjects/11192_2024_5076_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-024-05076-2/MediaObjects/11192_2024_5076_Fig2_HTML.png)
Funding
This research has been supported by the Ministry of Science, Technological Development and Innovation (Contract No. 451-03-65/2024-03/200156) and by the Faculty of Technical Sciences, University of Novi Sad, through the project “Scientific and Artistic Research Work of Researchers in Teaching and Associate Positions at the Faculty of Technical Sciences, University of Novi Sad” (No. 01-3394/1).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nikolić, D., Ivanović, D. & Ivanović, L. An open-source tool for merging data from multiple citation databases. Scientometrics (2024). https://doi.org/10.1007/s11192-024-05076-2
DOI: https://doi.org/10.1007/s11192-024-05076-2