PR-DupliChecker: detecting duplicate pull requests in Fork-based workflows

  • Original Article
International Journal of System Assurance Engineering and Management

Abstract

Pull requests (PRs) are a fundamental part of collaborative software development: they let developers propose changes to a codebase hosted on platforms such as GitHub, and they serve as a mechanism for peer review, enabling team members to assess proposed changes before they are merged into the main repository. Duplicate pull requests occur when multiple contributors submit similar or identical changes to the same repository. Such duplicates are problematic because they create redundancy, waste developers’ time, and complicate the review process. In this paper, we propose an approach based on a pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers), to automatically detect duplicate PRs in GitHub repositories. We build a dataset of 3328 labeled PRs collected from 26 GitHub repositories. This data is fed to a BERT model to obtain embeddings that represent the contextual relationships between the words used in pairs of pull requests. BERT’s classification outputs are then fed to a Multilayer Perceptron (MLP) classifier, which serves as our final duplicate pull request detector. Experiments show that BERT performs well, achieving an accuracy of 92% with the MLP classifier. In terms of accuracy, BERT’s word representation features yield an increase of 13% over the Siamese-BERT model, 17% over DC-CNN, and 23% over Word2Vec.
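To make the idea of feeding "pairs of pull requests" to BERT more concrete, the following is a minimal sketch of the standard BERT sentence-pair input layout ([CLS] A [SEP] B [SEP], with segment ids 0 and 1). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which PR fields are used or how inputs are truncated, so the `PullRequest` fields, the `build_pair_input` helper, the `max_len` value, and the whitespace tokenizer (a stand-in for BERT's WordPiece tokenizer) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PullRequest:
    # Hypothetical fields; the abstract does not list the PR attributes used.
    title: str
    description: str


def pr_text(pr: PullRequest) -> str:
    """Join title and description into one lower-cased string."""
    return f"{pr.title} {pr.description}".strip().lower()


def build_pair_input(
    a: PullRequest, b: PullRequest, max_len: int = 32
) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) in the BERT sentence-pair layout.

    Assumption: whitespace splitting stands in for WordPiece tokenization.
    """
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP]")
    tok_a = pr_text(a).split()
    tok_b = pr_text(b).split()
    # Three slots are reserved for the special tokens; trim the longer
    # side first so both PRs keep a comparable share of the budget.
    budget = max_len - 3
    while len(tok_a) + len(tok_b) > budget:
        longer = tok_a if len(tok_a) >= len(tok_b) else tok_b
        longer.pop()
    tokens = ["[CLS]"] + tok_a + ["[SEP]"] + tok_b + ["[SEP]"]
    # Segment 0 covers [CLS], PR A and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(tok_a) + 2) + [1] * (len(tok_b) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    pr_a = PullRequest("Fix typo in README", "")
    pr_b = PullRequest("Correct README typo", "small fix")
    tokens, segments = build_pair_input(pr_a, pr_b)
    print(tokens)
    print(segments)
```

In a full pipeline, a sequence of this shape would be converted to vocabulary ids and passed through a pre-trained BERT model, and the resulting pair representation would be handed to the downstream MLP classifier that labels the pair as duplicate or not.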

Notes

  1. https://github.com/rails/rails/pulls.

  2. https://github.com/whystar/MSR2018-DupPR/blob/master/dup_prs.md.

  3. https://github.com/kubernetes/kubernetes/pull/28966.

  4. https://github.com/kubernetes/kubernetes/pull/25342.

  5. https://github.com/kubernetes/kubernetes/.

Funding

This study has been funded by the Tunisian Young Researchers’ Encouragement Program (Ed. 2022) (22PEJC-D3P2).

Author information

Corresponding author

Correspondence to Mohamed Wiem Mkaouer.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Messaoud, M.B., Chekaya, R.B., Mkaouer, M.W. et al. PR-DupliChecker: detecting duplicate pull requests in Fork-based workflows. Int J Syst Assur Eng Manag 15, 3538–3550 (2024). https://doi.org/10.1007/s13198-024-02361-4

  • DOI: https://doi.org/10.1007/s13198-024-02361-4
