Abstract
Pull requests (PRs) are a fundamental part of collaborative software development: they let developers propose changes to a codebase hosted on platforms such as GitHub, and they serve as a mechanism for peer review, enabling team members to assess proposed changes before merging them into the main code repository. Duplicate pull requests occur when multiple contributors submit similar or identical changes to the same repository. Such duplicates are problematic because they create redundancy, waste developers’ time, and complicate the review process. In this paper, we propose an approach based on a pre-trained language model, namely BERT (Bidirectional Encoder Representations from Transformers), to automatically detect duplicate PRs in GitHub repositories. We build a dataset of 3328 labeled PRs collected from 26 GitHub repositories. This data is fed to a BERT model to obtain embeddings that represent the contextual relationships between the words used in pairs of pull requests. BERT’s classification outputs are then fed to a Multilayer Perceptron (MLP) classifier, which serves as our final duplicate pull request detector. In our experiments, BERT combined with the MLP classifier achieved an accuracy of 92%. The results show that BERT’s word representation features increased accuracy by 13% compared to the Siamese-BERT model, by 17% compared to DC-CNN, and by 23% compared to Word2Vec.
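As an illustration of how a pair of pull requests can be presented to BERT as a single input, the sketch below packs two PR texts into the standard BERT sentence-pair layout (`[CLS] A [SEP] B [SEP]` with segment ids). It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the input formatting, the whitespace tokenizer is a stand-in for BERT's WordPiece tokenizer, and the names `simple_tokenize`, `encode_pr_pair`, and `max_len` are hypothetical.

```python
from typing import Dict, List

CLS, SEP = "[CLS]", "[SEP]"


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): lowercase and split on whitespace.

    A real BERT pipeline would use WordPiece sub-word tokenization instead.
    """
    return text.lower().split()


def encode_pr_pair(pr_a: str, pr_b: str, max_len: int = 32) -> Dict[str, List]:
    """Pack two PR texts into one BERT-style sentence-pair input."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP] tokens")

    tokens_a = simple_tokenize(pr_a)
    tokens_b = simple_tokenize(pr_b)

    # Reserve three positions for [CLS] and the two [SEP] markers, then trim
    # the longer text one token at a time until the pair fits.
    budget = max_len - 3
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()

    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment ids: 0 for [CLS], the first PR and its [SEP]; 1 for the second
    # PR and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "segment_ids": segment_ids,
        "attention_mask": attention_mask,
    }


# Hypothetical PR titles, used only to show the layout.
pair = encode_pr_pair("Fix typo in README", "Correct README typo")
print(pair["tokens"])
print(pair["segment_ids"])
```

In a full pipeline, the model output for such a packed pair would be passed on to a downstream classifier such as an MLP; that step is omitted here because it depends on model weights and training data that this sketch does not assume.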
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs13198-024-02361-4/MediaObjects/13198_2024_2361_Fig7_HTML.png)
Funding
This study has been funded by the Tunisian Young Researchers’ Encouragement Program (Ed. 2022) (22PEJC-D3P2).
Ethics declarations
Conflict of interest
Authors have no conflict of interest to declare.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Messaoud, M.B., Chekaya, R.B., Mkaouer, M.W. et al. PR-DupliChecker: detecting duplicate pull requests in Fork-based workflows. Int J Syst Assur Eng Manag 15, 3538–3550 (2024). https://doi.org/10.1007/s13198-024-02361-4
DOI: https://doi.org/10.1007/s13198-024-02361-4