18 million links in commit messages: purpose, evolution, and decay

Empirical Software Engineering

Abstract

Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on “9.6 million links in source code comments” showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub repositories to investigate whether they suffer the same fate. Results show that referencing external resources is prevalent and that the most frequent domains other than github.com are Stack Overflow and Google Code. Similarly, links serve as source code context to commit messages, with inaccessible links being frequent. Although repeatedly referencing links is rare (4%), 14% of links that are prone to evolve become unavailable over time; e.g., tutorials or articles and software homepages. Furthermore, we find that 70% of the distinct links suffer from decay; the domains that occur the most frequently are related to Subversion repositories. We conclude that links in commits share the same fate as links in code, opening up avenues for future work.
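To make the kind of analysis summarized above more concrete, the following is a minimal sketch of how links could be pulled out of commit messages and tallied by domain. It is not the extraction pipeline used in the study: the regular expression, the function names `extract_links` and `domain_counts`, and the sample messages (including the Google Code project path) are assumptions chosen for illustration only.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Deliberately simple pattern (an assumption, not the study's extractor):
# an http(s) scheme followed by characters that rarely end a link.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def extract_links(message: str) -> list[str]:
    """Return the http(s) links found in a single commit message."""
    # Strip punctuation that usually belongs to the sentence, not the link.
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(message)]


def domain_counts(messages) -> Counter:
    """Count how often each host name is referenced across commit messages."""
    counts = Counter()
    for message in messages:
        for link in extract_links(message):
            host = urlsplit(link).hostname
            if host:
                counts[host.removeprefix("www.")] += 1
    return counts


# Made-up commit messages for illustration.
sample_messages = [
    "Fix iteration bug, see https://stackoverflow.com/q/1008019.",
    "Port helper from http://code.google.com/p/example/ (r42)",
    "Use extractor from https://github.com/sbaltes/git-log-extractor",
    "Refactor without any link",
]

for domain, count in domain_counts(sample_messages).most_common():
    print(domain, count)
```

A real pipeline would additionally need to deduplicate links and check each distinct link for availability before any decay rate could be reported; this sketch covers only the extraction and domain tally.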

Data Availability

The datasets generated during and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.7536500.

Notes

  1. https://doi.org/10.5281/zenodo.7536500

  2. MySQL database dump 2019-02-01 from http://ghtorrent.org/downloads.html.

  3. https://github.com/sbaltes/git-log-extractor

  4. https://archive.org/help/wayback_api.php

  5. https://github.com/sbaltes/wayback-machine-retriever

  6. https://cran.r-project.org/web/packages/arules/index.html

  7. https://www.surveysystem.com/sscalc.htm

  8. https://stackoverflow.com/q/1008019

References

  • Abdalkareem R, Mujahid S, Shihab E (2020) A machine learning approach to improve the detection of ci skip commits. IEEE Trans Softw Eng 47:2740–2754

  • Aghajani E, Nagy C, Vega-Márquez OL, Linares-Vásquez M, Moreno L, Bavota G, Lanza M (2019) Software documentation issues unveiled. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, pp 1199–1210

  • Agrawal R, Srikant R et al (1994) Fast algorithms for mining association rules. In: Proc. 20th int. conf. very large data bases, VLDB, Citeseer, vol 1215. pp 487–499

  • Alali A, Kagdi H, Maletic JI (2008) What’s a typical commit? a characterization of open source software repositories. In: 2008 16th IEEE international conference on program comprehension. IEEE, pp 182–191

  • Aniche M, Treude C, Steinmacher I, Wiese I, Pinto G, Storey MA, Gerosa MA (2018) How modern news aggregators help development communities shape and share knowledge. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, pp 499–510

  • Baltes S, Diehl S (2019) Usage and attribution of stack overflow code snippets in github projects. Empir Softw Eng 24(3):1259–1295

  • Baltes S, Dumani L, Treude C, Diehl S (2018) Sotorrent: reconstructing and analyzing the evolution of stack overflow posts. In: Proceedings of the 15th international conference on mining software repositories. pp 319–330

  • Baltes S, Treude C, Robillard MP (2022) Contextual documentation referencing on stack overflow. IEEE Trans Softw Eng 48(2):135–149. https://doi.org/10.1109/TSE.2020.2981898

  • Barrie JM, Presti DE (2000) Digital plagiarism-the web giveth and the web shall taketh. J Med Internet Res 2(1):e6

  • Buse RP, Weimer WR (2010) Automatically documenting program changes. In: Proceedings of the IEEE/ACM international conference on Automated software engineering. pp 33–42

  • Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in github: transparency and collaboration in an open software repository. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. pp 1277–1286

  • D’Ambros M, Lanza M, Robbes R (2010) Commit 2.0. In: Proceedings of the 1st Workshop on Web 2.0 for Software Engineering. pp 14–19

  • Dragan N, Collard ML, Hammad M, Maletic JI (2011) Using stereotypes to help characterize commits. In: 2011 27th IEEE International Conference on Software Maintenance (ICSM). IEEE, pp 520–523

  • Dyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: 2013 35th International Conference on Software Engineering (ICSE). IEEE, pp 422–431

  • Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378

  • Forte A, Kittur N, Larco V, Zhu H, Bruckman A, Kraut RE (2012) Coordination and beyond: social functions of groups in open content production. In: Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. pp 417–426

  • Fu Y, Yan M, Zhang X, Xu L, Yang D, Kymer JD (2015) Automated classification of software change messages by semi-supervised latent dirichlet allocation. Inf Softw Technol 57:369–377

  • Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05). IEEE, pp 113–122

  • Gómez C, Cleary B, Singer L (2013) A study of innovation diffusion through link sharing on stack overflow. In: 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, pp 81–84

  • Gousios G (2013) The ghtorrent dataset and tool suite. In: 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, pp 233–236

  • Hassan AE (2008) The road ahead for mining software repositories. In: 2008 Frontiers of Software Maintenance. IEEE, pp 48–57

  • Hata H, Treude C, Kula RG, Ishio T (2019) 9.6 million links in source code comments: Purpose, evolution, and decay. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, pp 1211–1221

  • Hata H, Novielli N, Baltes S, Kula RG, Treude C (2022) Github discussions: an exploratory study of early adoption. Empir Softw Eng 27(1):1–32

  • Huang Y, Jia N, Zhou HJ, Chen XP, Zheng ZB, Tang MD (2020) Learning human-written commit messages to document code changes. J Comput Sci Technol 35(6):1258–1277

  • Kehoe C, Pitkow J, Rogers J (1998) Gvu’s ninth www user survey report. Office of Technology Licensing, Georgia Tech Research Corp., Atlanta

  • Kittur A, Kraut RE (2010) Beyond wikipedia: coordination and conflict in online production groups. In: Proceedings of the 2010 ACM conference on Computer supported cooperative work. pp 215–224

  • Krasniqi R, Cleland-Huang J (2020) Enhancing source code refactoring detection with explanations from commit messages. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 512–516

  • Krejcie RV, Morgan DW (1970) Determining sample size for research activities. Educ Psychol Meas 30(3):607–610

  • Liu B, Zhang L, Jiang J, Wang L (2022) A method for identifying references between projects in github. Sci Comput Program 222:102858

  • Liu J, Xia X, Lo D, Zhang H, Zou Y, Hassan AE, Li S (2021) Broken external links on stack overflow. IEEE Trans Softw Eng 48:3242–3267

  • Liu J, Zhang H, Xia X, Lo D, Zou Y, Hassan AE, Li S (2022) An exploratory study on the repeatedly shared external links on stack overflow. Empir Softw Eng 27(1):1–32

  • Liu S, Gao C, Chen S, Yiu NL, Liu Y (2020) Atom: commit message generation based on abstract syntax tree and hybrid ranking. IEEE Trans Softw Eng 48:1800–1817

  • Maalej W, Happel HJ (2009) From work to word: how do software developers describe their work? In: 2009 6th IEEE International Working Conference on Mining Software Repositories. IEEE, pp 121–130

  • Maalej W, Happel HJ (2010) Can development work describe itself? In: 2010 7th IEEE working conference on mining software repositories (MSR 2010). IEEE, pp 191–200

  • Mockus A, Votta LG (2000) Identifying reasons for software changes using historic databases. In: ICSM. pp 120–130

  • Moreno L, Bavota G, Di Penta M, Oliveto R, Marcus A, Canfora G (2014) Automatic generation of release notes. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp 484–495

  • Movshovitz-Attias D, Movshovitz-Attias Y, Steenkiste P, Faloutsos C (2013) Analysis of the reputation system and user contributions on a question answering website: stackoverflow. In: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013). IEEE, pp 886–893

  • Murphy G (2009) Attacking information overload in software development. In: 2009 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, pp 4–4

  • Murphy J, Hashim NH, O’Connor P (2007) Take me back: validating the wayback machine. J Comput-Mediated Commun 13(1):60–75

  • Nagar Y (2012) What do you think? the structuring of an online community as a collective-sensemaking process. In: Proceedings of the ACM 2012 conference on computer supported cooperative work. pp 393–402

  • O’mahony S, Ferraro F (2007) The emergence of governance in an open source community. Acad Manag J 50(5):1079–1106

  • Rath M, Rendall J, Guo JL, Cleland-Huang J, Mäder P (2018) Traceability in the wild: automatically augmenting incomplete trace links. In: Proceedings of the 40th International Conference on Software Engineering. pp 834–845

  • Rebai S, Kessentini M, Alizadeh V, Sghaier OB, Kazman R (2020) Recommending refactorings via commit message analysis. Inf Softw Technol 126:106332

  • Santos EA, Hindle A (2016) Judging a commit by its cover. In: Proceedings of the 13th International Workshop on Mining Software Repositories-MSR, vol 16. pp 504–507

  • Sarwar MU, Zafar S, Mkaouer MW, Walia GS, Malik MZ (2020) Multi-label classification of commit messages using transfer learning. In: 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, pp 37–42

  • Schermann G, Brandtner M, Panichella S, Leitner P, Gall H (2015) Discovering loners and phantoms in commit and issue data. In: 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, pp 4–14

  • Sun Y, Wang Q, Yang Y (2017) Frlink: Improving the recovery of missing issue-commit links by revisiting file relevance. Inf Softw Technol 84:33–47

  • Tian Y, Zhang Y, Stol KJ, Jiang L, Liu H (2022) What makes a good commit message? In: Proceedings of the 44th International Conference on Software Engineering, pp 2389–2401. https://doi.org/10.1145/3510003.3510205

  • Vasilescu B, Filkov V, Serebrenik A (2013) Stackoverflow and github: Associations between software development and crowdsourced knowledge. In: 2013 International Conference on Social Computing. IEEE, pp 188–195

  • Vasilescu B, Serebrenik A, Devanbu P, Filkov V (2014) How social q&a sites are changing knowledge sharing in open source software communities. In: Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. pp 342–354

  • Viera A, Garrett J (2005) Understanding interobserver agreement: the kappa statistic. Family Med 37:360–363

  • Wang D, Xiao T, Thongtanunam P, Kula RG, Matsumoto K (2021) Understanding shared links and their intentions to meet information needs in modern code review. Empir Softw Eng 26(5):1–32

  • Wattanakriengkrai S, Chinthanet B, Hata H, Kula RG, Treude C, Guo J, Matsumoto K (2022) Github repositories with links to academic papers: public access, traceability, and evolution. J Syst Softw 183:111117

  • Wu J, He H, Xiao W, Gao K, Zhou M (2022) Demystifying software release note issues on github. In: 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). pp 602–613. https://doi.org/10.1145/3524610.3527919

  • Xiao T, Wang D, Mcintosh S, Hata H, Kula RG, Ishio T, Matsumoto K (2021) Characterizing and mitigating self-admitted technical debt in build systems. IEEE Trans Softw Eng 48:4214–4228

  • Xie R, Chen L, Ye W, Li Z, Hu T, Du D, Zhang S (2019) Deeplink: a code knowledge graph based deep learning approach for issue-commit link recovery. In: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 434–444

  • Xiong Y, Meng Z, Shen B, Yin W (2017) Mining developer behavior across github and stackoverflow. In: SEKE. pp 578–583

  • Ye D, Xing Z, Kapre N (2017) The structure and dynamics of knowledge network in domain-specific q&a sites: a case study of stack overflow. Empir Softw Eng 22(1):375–406

  • Zampetti F, Ponzanelli L, Bavota G, Mocci A, Di Penta M, Lanza M (2017) How developers document pull requests with external references. In: 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). IEEE, pp 23–33

  • Zhang Y, Yu Y, Wang H, Vasilescu B, Filkov V (2018) Within-ecosystem issue linking: a large-scale study of rails. In: Proceedings of the 7th International Workshop on Software Mining. pp 12–19

  • Zhang Y, Wu Y, Wang T, Wang H (2020) ilinker: a novel approach for issue knowledge acquisition in github projects. World Wide Web 23(3):1589–1619

  • Zhou Y, Sharma A (2017) Automated identification of security issues from commit messages and bug reports. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering. pp 914–919

Acknowledgements

This work was inspired by the International Workshop series on Dynamic Software Documentation, held at McGill’s Bellairs Research Institute, and was supported by JSPS KAKENHI under Grants JP18H04094, JP20K19774, JP20H05706, and JP22K11970.

Author information

Corresponding author

Correspondence to Tao Xiao.

Ethics declarations

Conflicts of interest

The authors declare that Sebastian Baltes, Hideaki Hata, Christoph Treude, and Raula Gaikovina Kula are members of the EMSE Editorial Board. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report.

Additional information

Communicated by: Sven Apel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xiao, T., Baltes, S., Hata, H. et al. 18 million links in commit messages: purpose, evolution, and decay. Empir Software Eng 28, 91 (2023). https://doi.org/10.1007/s10664-023-10325-8

  • DOI: https://doi.org/10.1007/s10664-023-10325-8
