Sketching for Big Data Recommender Systems Using Fast Pseudo-random Fingerprints

  • Conference paper
Automata, Languages, and Programming (ICALP 2013)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 7966))

Included in the following conference series:

Abstract

A key building block for collaborative filtering recommender systems is finding users with similar consumption patterns. Given access to the full data regarding the items consumed by each user, one can directly compute the similarity between any two users. However, for massive recommender systems such a naive approach requires a high running time and may be intractable in terms of the space required to store the full data. One way to overcome this is using sketching, a technique that represents massive datasets concisely, while still allowing calculating properties of these datasets. Sketching methods maintain very short fingerprints of the item sets of users, which allow approximately computing the similarity between sets of different users.

The state of the art sketch [22] has a very low space complexity, and a recent technique [14] shows how to exponentially speed up the computation time involved in building the fingerprints. Unfortunately, these methods are incompatible, forcing a choice between low running time or a small sketch size. We propose an alternative sketching approach, which achieves both a low space complexity similar to that of [22] and a low time complexity similar to [14]. We empirically evaluate our algorithm using the Netflix dataset. We analyze the running time and the sketch size of our approach and compare them to alternatives. Further, we show that in practice the accuracy achieved by our approach is even better than the accuracy guaranteed by the theoretical bounds, so it suffices to use even shorter fingerprints to obtain high quality results.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
EUR 29.95
Price includes VAT (France)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
EUR 42.79
Price includes VAT (France)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
EUR 52.74
Price includes VAT (France)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free ship** worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Aggarwal, C.C.: Data streams: models and algorithms. Springer-Verlag New York Inc. (2007)

    Google Scholar 

  2. Alon, N., Matias, Y., Szegedy, M.: The Space Complexity of Approximating the Frequency Moments. J. Computer and System Sciences 58(1), 137–147 (1999)

    Article  MathSciNet  MATH  Google Scholar 

  3. Bachrach, Y., Herbrich, R.: Fingerprinting Ratings for Collaborative Filtering — Theoretical and Empirical Analysis. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 25–36. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  4. Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  5. Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: IJCAI, Pasadena, California (July 2009)

    Google Scholar 

  6. Bennett, J., Lanning, S.: The netflix prize. In: KDD Cup and Workshop (2007)

    Google Scholar 

  7. Broder, A.Z.: On the resemblance and containment of documents. Sequences (1998)

    Google Scholar 

  8. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000)

    Article  MathSciNet  MATH  Google Scholar 

  9. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1), 58–75 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  10. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: VLDB (2005)

    Google Scholar 

  11. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW. ACM (2007)

    Google Scholar 

  12. Dasgupta, A., Kumar, R., Sarlos, T.: Fast locality-sensitive hashing. In: SIGKDD (2011)

    Google Scholar 

  13. Datar, M., Muthukrishnan, S.: Estimating rarity and similarity over data stream windows. In: Möhring, R., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–335. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  14. Feigenblat, G., Shiftan, A., Porat, E.: Exponential time improvement for min-wise based algorithms. In: SODA (2011)

    Google Scholar 

  15. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)

    Article  MathSciNet  MATH  Google Scholar 

  16. Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. Journal of Algorithms 38(1), 84–90 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  17. Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM) 53(3), 323 (2006)

    Article  MathSciNet  Google Scholar 

  18. Kane, D.M., Nelson, J., Porat, E., Woodruff, D.P.: Fast moment estimation in data streams in optimal space. In: STOC (2011)

    Google Scholar 

  19. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS, pp. 41–52. ACM (2010)

    Google Scholar 

  20. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28(1), 51–55 (2003)

    Article  Google Scholar 

  21. Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: a better Bloom filter. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 456–467. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  22. Li, P., Koenig, C.: b-Bit minwise hashing. In: WWW (2010)

    Google Scholar 

  23. Mulmuley, K.: Randomized geometric algorithms and pseudorandom generators. Algorithmica (1996)

    Google Scholar 

  24. Pǎtraşcu, M., Thorup, M.: On the k-Independence Required by Linear Probing and Minwise Independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  25. Pavan, A., Tirthapura, S.: Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing 37(2), 359–379 (2008)

    Article  MathSciNet  Google Scholar 

  26. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: Computer Supported Cooperative Work (1994)

    Google Scholar 

  27. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW (2001)

    Google Scholar 

  28. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009, 4 (2009)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bachrach, Y., Porat, E. (2013). Sketching for Big Data Recommender Systems Using Fast Pseudo-random Fingerprints. In: Fomin, F.V., Freivalds, R., Kwiatkowska, M., Peleg, D. (eds) Automata, Languages, and Programming. ICALP 2013. Lecture Notes in Computer Science, vol 7966. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39212-2_41

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-39212-2_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39211-5

  • Online ISBN: 978-3-642-39212-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Navigation