Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs

World Wide Web

Abstract

Although efficient identification of user access sessions from very large web logs is an unavoidable data preparation task for the success of higher-level web log mining, little attention has been paid to the algorithmic study of this problem. In this paper we consider two types of user access sessions: interval sessions and gap sessions. We design two efficient algorithms, one for each session type, with the help of some proposed structures. We present a theoretical analysis of the algorithms and prove that both have optimal time complexity as well as certain error-tolerant properties. We conduct an empirical performance analysis of the algorithms on web logs ranging from 100 megabytes to 500 megabytes. The empirical analysis shows that the algorithms take only several seconds more than the baseline time, i.e., the time needed to read the web log once sequentially from disk into RAM, test whether each user access record is valid, and write each valid user access record back to disk. The empirical analysis also shows that our algorithms are substantially faster than sorting-based session-finding algorithms. Finally, we present optimal algorithms for finding user access sessions from distributed web logs.
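
The abstract names the two session types without defining them. The following is a minimal sketch under assumed definitions: a gap session is closed when the time between two consecutive accesses by the same user exceeds a threshold, and an interval session is closed when an access falls more than a threshold after the first access of the session. These definitions, the record layout, and all function names are assumptions made for illustration. The code is a plain hash-map pass over a time-ordered log, not the authors' algorithms or their proposed structures.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, float, str]      # (user_id, timestamp_seconds, url), assumed layout
Session = List[Tuple[float, str]]    # accesses of one user, in time order


def _sessionize(
    records: Iterable[Record],
    must_split: Callable[[Session, float], bool],
) -> List[Tuple[str, Session]]:
    """Single pass over records that are assumed sorted by timestamp."""
    open_sessions: Dict[str, Session] = {}   # one open session per user
    finished: List[Tuple[str, Session]] = []
    for user, ts, url in records:
        current = open_sessions.get(user)
        if current is not None and must_split(current, ts):
            finished.append((user, current))  # close the old session
            current = None
        if current is None:
            current = []
            open_sessions[user] = current
        current.append((ts, url))
    # Whatever is still open at the end of the log is also a session.
    finished.extend(open_sessions.items())
    return finished


def gap_sessions(records: Iterable[Record], max_gap: float):
    """Assumed definition: split when the pause since the last access exceeds max_gap."""
    return _sessionize(records, lambda s, ts: ts - s[-1][0] > max_gap)


def interval_sessions(records: Iterable[Record], max_duration: float):
    """Assumed definition: split when the access is more than max_duration after the session start."""
    return _sessionize(records, lambda s, ts: ts - s[0][0] > max_duration)
```

Because each record is handled once with a constant expected number of hash-map operations, the sketch runs in time linear in the number of records when the log is already in chronological order, which avoids the sorting step the abstract compares against.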

Cite this article

Chen, Z., Fu, A.WC. & Tong, F.CH. Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs. World Wide Web 6, 259–279 (2003). https://doi.org/10.1023/A:1024606901978

  • DOI: https://doi.org/10.1023/A:1024606901978