Abstract
In online advertising, a publisher whose actual status label changes with every generated click is exhibiting suspicious behaviour. Furthermore, only a small proportion of the clicks generated by publishers are invalid, which skews the class distribution in the dataset and poses a challenge for conventional classification methods, as they become biased towards the majority class. This suspicious publisher behaviour, combined with an uneven class distribution ratio, adversely affects classifier performance and increases model complexity. Developing machine-learning methods capable of producing effective predictive models for detecting fraudulent publishers is therefore pivotal. This paper proposes a novel stacked generalization framework comprising two stacked generalization architectures, one for resampling and the second for classification. The framework uses generalizers to improve the learning model’s performance in two steps. First, it reduces the error rate of the algorithms in order to reduce the bias in the learning set. Second, the results obtained from the level-0 generalizers are fed as input to the level-1 generalizer, which combines their predictions into a stacked, integrated output to improve predictive performance. Extensive experiments are conducted on the FDMA 2012 user click dataset using ten-fold cross-validation. The generality of the proposed architecture is assessed through experiments on eight other highly imbalanced benchmark datasets, with performance measured using average precision, recall, and F1-score. The results empirically demonstrate the superiority of the proposed architecture in predicting publisher behaviour and classifying publishers as legitimate or illegitimate.
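To make the two-level idea in the abstract concrete, the following is a minimal sketch of stacked generalization on an imbalanced binary problem, evaluated with stratified ten-fold cross-validation and averaged precision, recall, and F1-score. It is not the authors' architecture: the synthetic data from `make_classification`, the choice of level-0 learners (random forest, k-nearest neighbour, decision tree), the logistic-regression level-1 learner, and the helper names `random_oversample`, `build_stack`, and `evaluate` are all assumptions made for illustration. In particular, plain random oversampling stands in for the paper's resampling stack, whose details the abstract does not give.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes have equal size.

    Assumption: a simple stand-in for the paper's resampling stage.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


def build_stack(seed=0):
    """Level-0 generalizers whose out-of-fold predictions feed a level-1 learner."""
    level0 = [
        ("rf", RandomForestClassifier(n_estimators=25, random_state=seed)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=seed)),
    ]
    return StackingClassifier(
        estimators=level0,
        final_estimator=LogisticRegression(max_iter=1000),  # level-1 generalizer
        cv=3,  # inner folds used to produce the level-0 predictions
    )


def evaluate(n_splits=10, seed=0):
    """Return mean (precision, recall, F1) of the minority class over the folds."""
    # Synthetic stand-in data: roughly 5% of samples belong to the positive class.
    X, y = make_classification(
        n_samples=600, n_features=10, n_informative=5,
        weights=[0.95, 0.05], random_state=seed,
    )
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        # Resample the training fold only, so the test fold keeps its skew.
        X_tr, y_tr = random_oversample(X[train_idx], y[train_idx], rng)
        model = build_stack(seed).fit(X_tr, y_tr)
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], model.predict(X[test_idx]),
            average="binary", zero_division=0,
        )
        scores.append((p, r, f1))
    return np.mean(scores, axis=0)


if __name__ == "__main__":
    precision, recall, f1 = evaluate()
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Resampling inside the cross-validation loop, rather than before it, keeps duplicated minority samples out of the test folds, so the reported scores are not inflated by leakage.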
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00354-023-00218-1/MediaObjects/354_2023_218_Fig7_HTML.png)
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any authors.
About this article
Cite this article
Sisodia, D., Sisodia, D.S. Stacked Generalization Architecture for Predicting Publisher Behaviour from Highly Imbalanced User-Click Data Set for Click Fraud Detection. New Gener. Comput. 41, 581–606 (2023). https://doi.org/10.1007/s00354-023-00218-1
DOI: https://doi.org/10.1007/s00354-023-00218-1