Abstract
Although efficiently identifying user access sessions in very large web logs is an essential data preparation step for higher-level web log mining, little attention has been paid to the algorithmic study of this problem. In this paper we consider two types of user access sessions: interval sessions and gap sessions. We design two efficient algorithms, one for finding each type of session, with the help of data structures that we propose. We present a theoretical analysis of the algorithms and prove that both have optimal time complexity as well as certain error-tolerant properties. We conduct an empirical performance analysis of the algorithms on web logs ranging from 100 megabytes to 500 megabytes. This analysis shows that the algorithms take only several seconds more than the baseline time, i.e., the time needed to read the web log once sequentially from disk into RAM, test whether each user access record is valid, and write each valid user access record back to disk. It also shows that our algorithms are substantially faster than sorting-based session finding algorithms. Finally, we present optimal algorithms for finding user access sessions from distributed web logs.
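The abstract does not define the two session types, so the following Python sketch rests on an assumed reading: in an interval session every access falls within `alpha` seconds of the session's first access, and in a gap session consecutive accesses are at most `beta` seconds apart. The record layout `(user, time, url)`, the function names, and the dictionary of open sessions are hypothetical choices made for illustration. This is not the paper's algorithm or its proposed data structures, only a single pass over a time-ordered log that avoids sorting.

```python
from collections import defaultdict

def _split(records, must_close):
    """One pass over time-ordered (user, time, url) records.

    must_close(session, t) decides whether the user's open session
    has to end before an access at time t is added.
    """
    open_sessions = {}             # user -> session currently being built
    finished = defaultdict(list)   # user -> list of completed sessions
    for user, t, url in records:
        current = open_sessions.get(user)
        if current is not None and must_close(current, t):
            finished[user].append(current)
            current = None
        if current is None:
            current = open_sessions[user] = []
        current.append((t, url))
    # Flush the sessions that are still open at the end of the log.
    for user, current in open_sessions.items():
        finished[user].append(current)
    return dict(finished)

def interval_sessions(records, alpha):
    # Assumed definition: close when t is more than alpha past the session start.
    return _split(records, lambda s, t: t - s[0][0] > alpha)

def gap_sessions(records, beta):
    # Assumed definition: close when t is more than beta past the last access.
    return _split(records, lambda s, t: t - s[-1][0] > beta)

# Hypothetical sample log, already in time order.
log = [("u1", 0, "/a"), ("u2", 5, "/x"), ("u1", 20, "/b"),
       ("u1", 40, "/c"), ("u1", 100, "/d")]
by_interval = interval_sessions(log, alpha=30)
by_gap = gap_sessions(log, beta=30)
```

With a threshold of 30 seconds, user `u1` ends up with three interval sessions but only two gap sessions, because the access at time 40 is within 30 seconds of the previous access yet more than 30 seconds after the session start.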
Cite this article
Chen, Z., Fu, A.WC. & Tong, F.CH. Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs. World Wide Web 6, 259–279 (2003). https://doi.org/10.1023/A:1024606901978
DOI: https://doi.org/10.1023/A:1024606901978