
An effective approach to improve the performance of eCPDP (early cross-project defect prediction) via data-transformation and parameter optimization


Abstract

Cross-project defect prediction (CPDP) uses data from other, finished projects (i.e., source projects) to predict defects in the current working project. Transfer learning (TL) has mainly been applied to CPDP to improve prediction performance by alleviating the data-distribution discrepancy between projects. However, existing TL-based CPDP techniques are not applicable at the unit testing phase because they require the entire historical target-project data. As a result, they miss the chance to increase the product’s reliability in the early phase by applying the prediction results. The objective of the present study is to increase the product’s reliability in the early phase by proposing a novel TL-based CPDP technique applicable at the unit testing phase (i.e., eCPDP). We use singular value decomposition (SVD), which requires only source-project data for TL. eCPDP performs similarly to or better than eight state-of-the-art TL-based CPDP techniques on nine performance metrics over 24 projects. In conclusion, (1) we show that eCPDP is a CPDP model applicable at the unit testing phase, and (2) it can help practitioners find and fix defects earlier than other TL-based CPDP techniques.
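The abstract's central technical claim is that the SVD-based transfer step needs only source-project data, so the transformation can be fixed before any target-project history exists. A minimal sketch of that idea follows; it is not the authors' implementation, and the projection dimension k, the logistic-regression classifier, and the array names (X_src, y_src, X_tgt) are illustrative assumptions.

```python
# Hedged sketch: project source and target metrics onto the top-k right singular
# vectors computed from the SOURCE project alone, so no historical target data is
# needed when the transformation is fit (k and the classifier are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_svd_projection(X_source, k=10):
    """Learn a k-dimensional projection from the source project only."""
    mean = X_source.mean(axis=0)
    # Economy-size SVD: X_centered = U * S * Vt; rows of Vt span the feature space.
    _, _, Vt = np.linalg.svd(X_source - mean, full_matrices=False)
    return mean, Vt[:k].T                      # (feature mean, p-by-k projection)

def apply_svd_projection(X, mean, components):
    """Map any project's metrics (source or new target modules) into the shared space."""
    return (X - mean) @ components

# Usage under assumed arrays X_src, y_src (finished source project) and X_tgt
# (unlabelled modules of the project currently under unit testing):
# mean, comps = fit_svd_projection(X_src, k=10)
# clf = LogisticRegression(max_iter=1000).fit(apply_svd_projection(X_src, mean, comps), y_src)
# defect_prob = clf.predict_proba(apply_svd_projection(X_tgt, mean, comps))[:, 1]
```

Because the projection is derived entirely from the source side, it can be reused unchanged as new target modules arrive, which is what makes prediction feasible at the unit testing phase.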




Notes

  1. https://github.com/cadet6465/eCPDP

  2. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

  3. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

  4. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

  5. https://github.com/sherbold/autorank/issues

  6. https://docs.github.com/en/rest
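The title attributes the performance improvement to data transformation and parameter optimization, and Notes 2–4 above link the scikit-learn building blocks involved (PowerTransformer, GridSearchCV, SelectFromModel). The sketch below combines them in one plausible pipeline; the estimators, parameter grid, and scoring metric are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a transformation + feature-selection + tuning pipeline built from
# the components cited in Notes 2-4; estimator choices and the grid are assumptions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("transform", PowerTransformer()),   # Yeo-Johnson transform to reduce skew in metrics
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {                           # illustrative grid, not the paper's settings
    "transform__standardize": [True, False],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}

# The search is fit on the labelled source project only; the tuned pipeline is then
# applied to the unlabelled target-project modules at the unit testing phase.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
# search.fit(X_src, y_src)
# target_scores = search.best_estimator_.predict_proba(X_tgt)[:, 1]
```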


Acknowledgements

The authors thank the Editor-in-Chief and the anonymous reviewers for their thoughtful comments and suggestions.

Funding

This research was supported by the National Research Foundation of Korea (NRF-2020R1F1A1071888), the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program supervised by the Institute of Information & Communications Technology Planning & Evaluation (IITP-2021-2020-0-01795), and the National Research Foundation of Korea (NRF) funded by the Korean Government through the Ministry of Education under Grant (NRF-2022R1I1A3069233).

Author information


Contributions

All the authors contributed equally to this work.

Corresponding author

Correspondence to Jongmoon Baik.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kwon, S., Ryu, D. & Baik, J. An effective approach to improve the performance of eCPDP (early cross-project defect prediction) via data-transformation and parameter optimization. Software Qual J 31, 1009–1044 (2023). https://doi.org/10.1007/s11219-023-09624-6

