Stacked Generalization Architecture for Predicting Publisher Behaviour from Highly Imbalanced User-Click Data Set for Click Fraud Detection

Sisodia, Deepti; Sisodia, Dilip Singh

doi:10.1007/s00354-023-00218-1

Published: 29 May 2023

New Generation Computing, Volume 41, pages 581–606 (2023)


Abstract

In online advertising, a publisher's status label that changes with every generated click indicates suspicious behaviour. Furthermore, only a small proportion of the clicks generated by publishers are invalid, which skews the class distribution in the dataset and poses a challenge for conventional classification methods, as they become biased towards the majority class. This suspicious publisher behaviour, combined with the uneven class distribution, adversely affects classifier performance and increases model complexity. Developing machine-learning methods that produce effective predictive models for detecting fraudulent publishers is therefore pivotal. This paper proposes a novel stacked generalization framework comprising two stacked generalization architectures, one for resampling and one for classification. The framework uses generalizers to improve the learning model's performance in two steps. First, it reduces the error rate of the algorithms in order to reduce the bias in the learning set. Second, the outputs of the level-0 generalizers are fed as input to the level-1 generalizer, which combines their predictions to improve predictive performance. Extensive experiments are conducted on the FDMA 2012 user-click dataset using ten-fold cross-validation. To assess how well the proposed architecture generalizes, experiments are also performed on eight other highly imbalanced benchmark datasets, with performance measured using average precision, recall, and F1-score. The results empirically demonstrate the superiority of the proposed architecture in predicting publisher behaviour and classifying publishers as legitimate or illegitimate.
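
The two-level idea in the abstract (level-0 generalizers whose outputs feed a level-1 generalizer, evaluated with ten-fold cross-validation and precision, recall, and F1-score) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the choice of base learners is an assumption, and scikit-learn's `StackingClassifier` is used as a generic stacking tool. The paper's resampling architecture is not reproduced here; `class_weight="balanced"` serves only as a simple stand-in for handling class skew.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a skewed click data set (assumption: roughly 5% positives).
X, y = make_classification(
    n_samples=2000,
    n_features=12,
    n_informative=6,
    weights=[0.95, 0.05],
    random_state=0,
)
print(f"minority share: {np.mean(y):.3f}")

# Level-0 generalizers (hypothetical choice of base learners).
level0 = [
    ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)),
]

# Level-1 generalizer: trained on out-of-fold predictions of the level-0 models
# (the inner cv=5 split is handled by StackingClassifier itself).
stack = StackingClassifier(
    estimators=level0,
    final_estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    cv=5,
)

# Outer ten-fold stratified cross-validation, so every fold keeps the class ratio.
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(stack, X, y, cv=outer)

# Metrics for the minority (positive) class; zero_division=0 avoids warnings
# if a model never predicts the positive class.
scores = {
    "precision": precision_score(y, y_pred, zero_division=0),
    "recall": recall_score(y, y_pred, zero_division=0),
    "f1": f1_score(y, y_pred, zero_division=0),
}
print(scores)
```

Stratified folds matter in this setting because a plain random split can leave some folds with very few minority samples, which makes per-fold precision and recall unstable.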


Author information

Deepti Sisodia
Present address: Alliance University, Bangalore, India

Authors and Affiliations

National Institute of Technology, Raipur, India
Deepti Sisodia & Dilip Singh Sisodia

Corresponding author

Correspondence to Deepti Sisodia.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 51 KB)

About this article

Cite this article

Sisodia, D., Sisodia, D.S. Stacked Generalization Architecture for Predicting Publisher Behaviour from Highly Imbalanced User-Click Data Set for Click Fraud Detection. New Gener. Comput. 41, 581–606 (2023). https://doi.org/10.1007/s00354-023-00218-1

Received: 19 June 2021
Accepted: 04 May 2023
Published: 29 May 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s00354-023-00218-1
