Abstract
Spam mail filtering is the classic problem of automatically recognizing incoming emails that are irrelevant to a user's context. This paper proposes a novel proxy server architecture for (i) collaboratively integrating useful features sent from personal email clients and (ii) improving the filtering performance of SMTP servers. Given a set of spam mails marked by multiple email users, the proxy server extracts two kinds of textual features: apriori terms and concept terms based on key phrases. More importantly, by taking into account the semantics of these features and the statistical associations among them, the proxy aggregates them into a hierarchical cluster structure. As a result, a spam ontology can be built and incrementally enriched. Email clients can then improve their spam filtering performance by referring to the semantic information in the ontology. To evaluate the proposed system, we collected a large number of spam mails within a single intranet environment. The system achieved a 17.4% lower filtering error rate than the individual email clients.
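The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a plain document-frequency threshold stands in for apriori term mining, Jaccard overlap of the mails containing each term stands in for the statistical association measure, and concept terms and semantics are left out entirely. The function names `frequent_terms` and `build_hierarchy`, the thresholds, and the sample mails are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_terms(mails, min_support):
    """Terms occurring in at least a min_support fraction of the mails.

    Assumption: a simple support threshold approximates apriori term mining.
    """
    counts = Counter(t for mail in mails for t in set(mail.lower().split()))
    return {t for t, c in counts.items() if c / len(mails) >= min_support}


def jaccard(a, b):
    """Jaccard overlap of two sets of mail indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hierarchy(mails, terms, threshold=0.5):
    """Greedy agglomerative clustering of terms by co-occurrence.

    Each cluster is (tree, postings): tree is a term or a nested tuple,
    postings is the set of mail indices containing any term in the tree.
    Merging stops once the best pair falls below the threshold.
    """
    docs = [set(m.lower().split()) for m in mails]
    clusters = [(t, {i for i, d in enumerate(docs) if t in d})
                for t in sorted(terms)]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]))
        if jaccard(clusters[i][1], clusters[j][1]) < threshold:
            break
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [tree for tree, _ in clusters]


# Hypothetical spam mails marked by two different users.
user_a = ["cheap pills online", "cheap pills now"]
user_b = ["win lottery prize", "win lottery now"]
mails = user_a + user_b

terms = frequent_terms(mails, min_support=0.5)
hierarchy = build_hierarchy(mails, terms, threshold=0.5)
```

In this toy setup, terms that always appear together in the marked mails end up in the same subtree, which loosely mirrors how a shared structure could be enriched as more users report spam.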
Cite this article
Pham, X.H., Lee, NH., Jung, J.J. et al. Collaborative spam filtering based on incremental ontology learning. Telecommun Syst 52, 693–700 (2013). https://doi.org/10.1007/s11235-011-9513-5