Cross-project defect prediction via semantic and syntactic encoding

Empirical Software Engineering

Abstract

Cross-Project Defect Prediction (CPDP) is a promising research field that detects defects in projects with limited labeled data by using prediction models trained on projects with abundant labeled data. However, previous CPDP approaches based on the Abstract Syntax Tree (AST) often struggle to acquire semantic and syntactic information effectively, which limits their ability to combine the two productively. The issue arises mainly because many AST-based methods flatten the AST into a linear sequence, losing the hierarchical syntactic structure and structural information of the code. In addition, other AST-based methods traverse the tree-structured AST recursively, which is susceptible to vanishing gradients. To alleviate these concerns, we introduce a novel CPDP method named defect prediction via Semantic and Syntactic Encoding (SSE), which enhances Zhang's approach by encoding semantic and syntactic information while retaining and exploiting the AST structure. Specifically, we pre-train a language model on a large corpus to learn semantic information. Next, we present a new rule for splitting the AST into subtrees to avoid vanishing gradients. Then, the absolute paths from the root node to the leaf nodes are encoded as hierarchical syntactic information. Finally, we design an encoder that integrates the syntactic information into the semantic information and use a Bi-directional Long Short-Term Memory network to learn a representation of the entire tree for prediction. Experimental results on 12 benchmark projects show that SSE outperforms current state-of-the-art methods.
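
As a rough illustration only, not the authors' released implementation (see the Data Availability statement below), the root-to-leaf path extraction described above can be sketched with the tree-sitter binding cited in the Notes. The grammar repository location and build path in this sketch are assumptions made for the example.

```python
# Sketch: collect root-to-leaf node-type paths from a Java AST with tree-sitter.
# Assumes the tree-sitter-java grammar has been cloned to vendor/tree-sitter-java
# and is built once into build/langs.so (both paths are illustrative).
from tree_sitter import Language, Parser

Language.build_library("build/langs.so", ["vendor/tree-sitter-java"])
JAVA = Language("build/langs.so", "java")

parser = Parser()
parser.set_language(JAVA)

code = b"class A { int add(int x, int y) { return x + y; } }"
tree = parser.parse(code)

def root_to_leaf_paths(node, prefix=()):
    """Yield tuples of node types from the AST root down to each leaf."""
    path = prefix + (node.type,)
    if node.child_count == 0:  # leaf reached
        yield path
    else:
        for child in node.children:
            yield from root_to_leaf_paths(child, path)

for p in root_to_leaf_paths(tree.root_node):
    print(" -> ".join(p))
```

Each printed path records the chain of syntactic node types above one leaf token; the paper additionally splits the AST into subtrees and learns encodings of these paths, which this sketch does not attempt.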

Data Availability Statements

The PROMISE dataset we use in this work is available at https://doi.org/10.1145/1868328.1868342 and https://github.com/feiwww/PROMISE-backup. Our code and data are available at https://github.com/18ywchen2/SSE and https://zenodo.org/records/10932549.

Notes

  1. https://pypi.org/project/tree-sitter/0.20.1/

  2. https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT/clonedetection/parser

  3. https://huggingface.co/microsoft/graphcodebert-base/tree/main

  4. https://github.com/github/CodeSearchNet
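
For readers who want to reproduce the semantic-encoding step at a high level, the sketch below shows how the graphcodebert-base checkpoint from Note 3 can be loaded with the Hugging Face transformers library to obtain contextual embeddings for a code snippet. This is a minimal sketch under our own assumptions, not the paper's exact pre-training or fine-tuning pipeline.

```python
# Sketch: contextual code embeddings from microsoft/graphcodebert-base (Note 3).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
model.eval()

code = "public int add(int x, int y) { return x + y; }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per sub-token; how these vectors are pooled and
# combined with syntactic path encodings is decided by the downstream model.
token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)
print(token_embeddings.shape)
```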

References

  • Alon U, Brody S, Levy O, Yahav E (2018a) code2seq: Generating sequences from structured representations of code. arXiv:1808.01400

  • Alon U, Zilberstein M, Levy O, Yahav E (2018b) A general path-based representation for predicting program properties. ACM SIGPLAN Notices 53(4):404–419

  • Amasaki S, Takagi Y, Mizuno O, Kikuno T (2003) A bayesian belief network for assessing the likelihood of fault content. In: 14th International symposium on software reliability engineering, 2003. ISSRE 2003., IEEE, pp 215–226

  • Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166

  • Boetticher G (2007) The promise repository of empirical software engineering data. http://promisedata.org/repository

  • Cabral GG, Minku LL, Oliveira AL, Pessoa DA, Tabassum S (2023) An investigation of online and offline learning models for online just-in-time software defect prediction. Empirical Soft Eng 28(5):121

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chen J, Hu K, Yu Y, Chen Z, Xuan Q, Liu Y, Filkov V (2020) Software visualization and deep transfer learning for effective software defect prediction. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, pp 578–589

  • Chen X, Zhao Y, Wang Q, Yuan Z (2018) Multi: Multi-objective effort-aware just-in-time software defect prediction. Inf Soft Technol 93:1–13

  • Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Soft Eng 20(6):476–493

  • Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259

  • Dam HK, Pham T, Ng SW, Tran T, Grundy J, Ghose A, Kim T, Kim CJ (2018) A deep tree-based model for software defect prediction. arXiv:1802.00921

  • Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • Ding Z, Li H, Shang W, Chen THP (2022) Can pre-trained code embeddings improve model performance? revisiting the use of code embeddings in software engineering tasks. Empirical Soft Eng 27(3):63

  • Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Soft 81(5):649–660

  • Faiz Rb, Shaheen S, Sharaf M, Rauf HT (2023) Optimal feature selection through search-based optimizer in cross project. Electronics 12(3):514

  • Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D et al (2020) Codebert: A pre-trained model for programming and natural languages. arXiv:2002.08155

  • Fisher RA (1919) XV. The correlation between relatives on the supposition of mendelian inheritance. Earth Environ Sci Trans Royal Soc Edinburgh 52(2):399–433

  • Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S et al (2020) Graphcodebert: Pre-training code representations with data flow. arXiv:2009.08366

  • Guo L, Ma Y, Cukic B, Singh H (2004) Robust prediction of fault-proneness by random forests. In: 15th international symposium on software reliability engineering, IEEE, pp 417–428

  • Hadi MA, Fard FH (2023) Evaluating pre-trained models for user feedback analysis in software engineering: A study on classification of app-reviews. Empirical Soft Eng 28(4):88

  • Herbold S (2017) Comments on ScottKnottESD in response to "An empirical comparison of model validation techniques for defect prediction models". IEEE Trans Soft Eng 43(11):1091–1094

  • Herbold S, Trautsch A, Grabowski J (2018) A comparative study to benchmark cross-project defect prediction approaches. In: Proceedings of the 40th international conference on software engineering, p 1063

  • Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertainty, Fuzziness and Knowl-Based Syst 6(02):107–116

  • Huang J, Gretton A, Borgwardt K, Schölkopf B, Smola A (2006) Correcting sample selection bias by unlabeled data. Adv Neural Inf Process Syst 19

  • Huang Q, Ma L, Jiang S, Wu G, Song H, Jiang L, Zheng C (2022) A cross-project defect prediction method based on multi-adaptation and nuclear norm. IET Soft 16(2):200–213

  • Jiang S, Xu Y, Song H, Wu Q, Ng MK, Min H, Qiu S (2018) Multi-instance transfer metric learning by weighted distribution and consistent maximum likelihood estimation. Neurocomputing 321:49–60

  • Jiang S, Xu Y, Wang T, Yang H, Qiu S, Yu H, Song H (2019) Multi-label metric transfer learning jointly considering instance space and label space distribution divergence. IEEE Access 7:10362–10373

  • Kim S, Whitehead EJ, Zhang Y (2008) Classifying software changes: Clean or buggy? IEEE Trans Soft Eng 34(2):181–196

  • Kim S, Zhao J, Tian Y, Chandra S (2021) Code prediction by feeding trees to transformers. In: 2021 IEEE/ACM 43rd International conference on software engineering (ICSE), IEEE, pp 150–162

  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980

  • Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inf Soft Technol 58:388–402

  • Le P, Zuidema W (2016) Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive lstms. arXiv:1603.00423

  • Li J, He P, Zhu J, Lyu MR (2017) Software defect prediction via convolutional neural network. In: 2017 IEEE international conference on software quality, reliability and security (QRS), IEEE, pp 318–328

  • Lin C, Ouyang Z, Zhuang J, Chen J, Li H, Wu R (2021) Improving code summarization with block-wise abstract syntax tree splitting. In: 2021 IEEE/ACM 29th International conference on program comprehension (ICPC), IEEE, pp 184–195

  • Lin J, Lu L (2021) Semantic feature learning via dual sequences for defect prediction. IEEE Access 9:13112–13124

  • Liu F, Li G, Wei B, Xia X, Fu Z, Jin Z (2022) A unified multi-task learning model for ast-level and token-level code completion. Empirical Soft Eng 27(4):91

  • Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692

  • Long M, Cao Y, Cao Z, Wang J, Jordan MI (2018) Transferable representation learning with deep adaptation networks. IEEE Trans Pattern Anal Mach Intell 41(12):3071–3085

  • López JAH, Weyssow M, Cuadrado JS, Sahraoui H (2022) Ast-probe: Recovering abstract syntax trees from hidden representations of pre-trained language models. arXiv:2206.11719

  • Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Soft Technol 54(3):248–256

  • Malhotra R, Meena S (2023) Empirical validation of feature selection techniques for cross-project defect prediction. Int J Syst Assurance Eng Manag 1–13

  • McCabe TJ (1976) A complexity measure. IEEE Trans Soft Eng 2(4):308–320

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781

  • Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: 2013 35th international conference on software engineering (ICSE), IEEE, pp 382–391

  • Pan C, Lu M, Xu B (2021) An empirical study on software defect prediction using codebert model. Appl Sci 11(11):4793

  • Peng H, Li G, Wang W, Zhao Y, Jin Z (2021) Integrating tree path in transformer for code representation. Adv Neural Inf Process Syst 34:9343–9354

  • Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  • Qiu S, Huang H, Jiang W, Zhang F, Zhou W (2023) Defect prediction via tree-based encoding with hybrid granularity for software sustainability. IEEE Trans Sustainable Comput

  • Qiu S, Xu H, Deng J, Jiang S, Lu L (2019) Transfer convolutional neural network for cross-project defect prediction. Appl Sci 9(13):2660

  • Reena P, Binu R (2014) Software defect prediction system–decision tree algorithm with two level data pre-processing. Int J Eng Res & Technol (IJERT) 3(3)

  • Ryu D, Choi O, Baik J (2016) Value-cognitive boosting with a support vector machine for cross-project defect prediction. Empirical Softw Eng 21:43–71

  • Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681

  • Shepperd M, Bowes D, Hall T (2014) Researcher bias: The use of machine learning in software defect prediction. IEEE Trans Softw Eng 40(6):603–616

  • Song Q, Guo Y, Shepperd M (2018) A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans Soft Eng 45(12):1253–1269

  • Tabassum S, Minku LL, Feng D (2022) Cross-project online just-in-time software defect prediction. IEEE Trans Soft Eng 49(1):268–287

  • Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) An empirical comparison of model validation techniques for defect prediction models. IEEE Trans Soft Eng 43(1):1–18

  • Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empirical Soft Eng 14:540–578

  • Uddin MN, Li B, Ali Z, Kefalas P, Khan I, Zada I (2022) Software defect prediction employing bilstm and bert-based semantic feature. Soft Comput 26(16):7877–7891

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  • Wang S, Liu T, Nam J, Tan L (2018) Deep semantic feature learning for software defect prediction. IEEE Trans Soft Eng 46(12):1267–1293

  • Wang T, Li WH (2010) Naive bayes software defect prediction model. In: 2010 International conference on computational intelligence and software engineering, IEEE, pp 1–4

  • Wang W, Zhang K, Li G, Liu S, Jin Z, Liu Y (2022a) A tree-structured transformer for program representation learning. arXiv:2208.08643

  • Wang X, Wu Q, Zhang H, Lyu C, Jiang X, Zheng Z, Lyu L, Hu S (2022b) Heloc: Hierarchical contrastive learning of source code representation. In: Proceedings of the 30th IEEE/ACM international conference on program comprehension, pp 354–365

  • Wang Y, Li H (2021) Code completion by modeling flattened abstract syntax trees as graphs. Proceedings of the AAAI conference on artificial intelligence 35:14015–14023

  • Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4724–4732

  • Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in Statistics: Methodology and Distribution, Springer, pp 196–202

  • Wong WE, Li X, Laplante PA (2017) Be more familiar with our enemies and pave the way forward: A review of the roles bugs played in software failures. J Syst Soft 133:68–94

  • Wu B, Liang B, Zhang X (2022) Turn tree into graph: Automatic code review via simplified ast driven graph convolutional network. Knowl-Based Syst 252:109450

  • Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016) Hydra: Massively compositional model for cross-project defect prediction. IEEE Trans Soft Eng 42(10):977–998

  • Xu J, Ai J, Liu J, Shi T (2022) Acgdp: An augmented code graph-based system for software defect prediction. IEEE Trans Reliability 71(2):850–864

  • Xu J, Wang F, Ai J (2020) Defect prediction with semantics and context features of codes based on graph representation learning. IEEE Trans Reliability 70(2):613–625

  • Xu Z, Pang S, Zhang T, Luo XP, Liu J, Tang YT, Yu X, Xue L (2019) Cross project defect prediction via balanced distribution adaptation based transfer learning. J Comput Sci Technol 34:1039–1062

  • Yan J, Qi Y, Rao Q (2018) Lstm-based hierarchical denoising network for android malware detection. Sec Commun Net 2018:1–18

  • Yang J, Xiao G, Shen Y, Jiang W, Hu X, Zhang Y, Peng J (2021) A survey of knowledge enhanced pre-trained models. arXiv:2110.00269

  • Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International conference on software engineering (ICSE), IEEE, pp 783–794

  • Zhang T, Wu F, Katiyar A, Weinberger KQ, Artzi Y (2020) Revisiting few-sample bert fine-tuning. arXiv:2006.05987

  • Zhao K, Xu Z, Yan M, Xue L, Li W, Catolino G (2022) A compositional model for effort-aware just-in-time defect prediction on android apps. IET Software 16(3):259–278

  • Zhu K, Zhang N, Ying S, Zhu D (2020) Within-project and cross-project just-in-time defect prediction based on denoising autoencoder and convolutional neural network. IET Software 14(3):185–195

Author information

Corresponding author

Correspondence to Siyu Jiang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Andrea De Lucia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, S., Chen, Y., He, Z. et al. Cross-project defect prediction via semantic and syntactic encoding. Empir Software Eng 29, 80 (2024). https://doi.org/10.1007/s10664-024-10495-z
