Stacked Generalization Architecture for Predicting Publisher Behaviour from Highly Imbalanced User-Click Data Set for Click Fraud Detection

New Generation Computing

Abstract

In online advertising, a change in a publisher’s actual status label with every generated click indicates suspicious publisher behaviour. Furthermore, only a small proportion of the clicks generated by publishers are invalid, which skews the class distribution of the dataset and poses a challenge for conventional classification methods, as they become biased towards the majority class. This suspicious behaviour, combined with the uneven class distribution, adversely affects classifier performance and increases model complexity. Thus, developing machine-learning methods capable of producing effective predictive models for detecting fraudulent publishers is pivotal. This paper proposes a novel stacked generalization framework comprising two stacked generalization architectures, one for resampling and one for classification. The framework uses generalizers to improve the learning model’s performance in two steps. First, the error rate of the algorithms is reduced in order to reduce the bias in the learning set. Second, the results obtained from the level-0 generalizers are fed as input to the level-1 generalizer, which combines their predictions into a stacked, integrated output to improve predictive performance. Extensive experiments are conducted on the FDMA 2012 user-click dataset using ten-fold cross-validation. The generalizability of the proposed architecture is assessed through experiments on eight other highly imbalanced benchmark datasets, with performance measured using average precision, recall, and F1-score. The results empirically demonstrate the superiority of the proposed architecture in predicting publisher behaviour and classifying publishers as legitimate or illegitimate.
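The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the generalizers or the resampling method, so the level-0 learners (decision tree, k-nearest neighbours, random forest), the level-1 logistic regression, the plain random oversampling that stands in for the paper's stacked resampling architecture, and the synthetic data that stands in for the FDMA 2012 dataset are all assumptions made for illustration. The helper names `oversample_minority`, `build_stack`, and `cross_validate_stack` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority rows until both classes have equal size.

    Assumption: a simple stand-in for the paper's resampling stage.
    """
    minority = np.bincount(y).argmin()
    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = X[y != minority], y[y != minority]
    X_up, y_up = resample(
        X_min, y_min, replace=True, n_samples=len(y_maj), random_state=seed
    )
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])


def build_stack(seed=0):
    """Level-0 generalizers feed their predictions to a level-1 generalizer."""
    level0 = [
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=seed)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=seed)),
    ]
    level1 = LogisticRegression(max_iter=1000)
    return StackingClassifier(estimators=level0, final_estimator=level1, cv=5)


def cross_validate_stack(X, y, n_splits=10, seed=0):
    """Ten-fold stratified CV; resampling touches the training folds only."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx], seed)
        model = build_stack(seed).fit(X_tr, y_tr)
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx],
            model.predict(X[test_idx]),
            average="binary",
            zero_division=0,
        )
        scores.append((p, r, f1))
    # Mean precision, recall, and F1 of the positive (minority) class.
    return np.mean(scores, axis=0)


# Synthetic imbalanced data (about 10% positives), not the FDMA 2012 data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
precision, recall, f1 = cross_validate_stack(X, y)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Resampling only the training folds keeps duplicated minority rows out of the held-out fold, so the reported precision, recall, and F1-score are not inflated by leakage.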

Author information

Corresponding author

Correspondence to Deepti Sisodia.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 51 KB)

About this article

Cite this article

Sisodia, D., Sisodia, D.S. Stacked Generalization Architecture for Predicting Publisher Behaviour from Highly Imbalanced User-Click Data Set for Click Fraud Detection. New Gener. Comput. 41, 581–606 (2023). https://doi.org/10.1007/s00354-023-00218-1

  • DOI: https://doi.org/10.1007/s00354-023-00218-1
