Sketching for Big Data Recommender Systems Using Fast Pseudo-random Fingerprints

Bachrach, Yoram; Porat, Ely

doi:10.1007/978-3-642-39212-2_41

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7966)

Included in the following conference series:

International Colloquium on Automata, Languages, and Programming

Abstract

A key building block for collaborative filtering recommender systems is finding users with similar consumption patterns. Given access to the full data regarding the items consumed by each user, one can directly compute the similarity between any two users. However, for massive recommender systems such a naive approach requires a high running time and may be intractable in terms of the space required to store the full data. One way to overcome this is sketching, a technique that represents massive datasets concisely while still allowing properties of these datasets to be computed. Sketching methods maintain very short fingerprints of the item sets of users, from which the similarity between the sets of different users can be computed approximately.
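To make the idea of a fingerprint concrete, here is a minimal sketch of classic min-wise hashing, which estimates the Jaccard similarity of two item sets from short fingerprints. It is an illustration only and not the algorithm proposed in this paper: the function names, the use of seeded BLAKE2b as a stand-in for random hash functions, and the default fingerprint length `k` are all assumptions made for the example.

```python
import hashlib


def _hash(seed: int, item: int) -> int:
    """Deterministic 64-bit hash of (seed, item); stands in for a random hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def fingerprint(items, k: int = 128) -> list:
    """Return k min-hash values, one per seeded hash function (hypothetical helper)."""
    items = list(items)
    if not items:
        raise ValueError("cannot fingerprint an empty item set")
    # For each of the k hash functions, keep only the smallest hash value.
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(fp_a, fp_b) -> float:
    """Fraction of matching positions; approximates |A ∩ B| / |A ∪ B|."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity, computed from the full item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    alice = set(range(0, 100))   # items consumed by one user
    bob = set(range(50, 150))    # items consumed by another user
    fa, fb = fingerprint(alice, k=256), fingerprint(bob, k=256)
    print("exact:", exact_jaccard(alice, bob))
    print("estimate:", estimate_jaccard(fa, fb))
```

Only the `k` stored minima are needed to compare two users, regardless of how many items each consumed. Note that this naive construction evaluates all `k` hash functions on every item, so building a fingerprint costs time proportional to `k` times the size of the item set.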

The state-of-the-art sketch [22] has a very low space complexity, and a recent technique [14] shows how to exponentially speed up the computation involved in building the fingerprints. Unfortunately, these methods are incompatible, forcing a choice between a low running time and a small sketch size. We propose an alternative sketching approach, which achieves both a low space complexity similar to that of [22] and a low time complexity similar to that of [14]. We empirically evaluate our algorithm using the Netflix dataset. We analyze the running time and the sketch size of our approach and compare them to alternatives. Further, we show that in practice the accuracy achieved by our approach is even better than the accuracy guaranteed by the theoretical bounds, so it suffices to use even shorter fingerprints to obtain high quality results.
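As a rough illustration of how a theoretical bound translates into a fingerprint length, the sketch below applies Hoeffding's inequality to a plain min-wise sketch whose k positions are treated as independent 0/1 match indicators: the estimate deviates from the true similarity by at least ε with probability at most 2·exp(−2kε²), so k ≥ ln(2/δ) / (2ε²) positions suffice for failure probability δ. This is the generic textbook bound, assumed here for illustration; the bounds derived in the paper itself are not shown on this page and may differ. The function name is hypothetical.

```python
import math


def fingerprint_length(epsilon: float, delta: float) -> int:
    """Smallest k with 2 * exp(-2 * k * epsilon**2) <= delta (Hoeffding bound).

    Assumes k independent 0/1 match indicators, as in a plain min-wise sketch.
    """
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


if __name__ == "__main__":
    # Positions needed for error below 0.1 with probability at least 0.95.
    print(fingerprint_length(epsilon=0.1, delta=0.05))
```

Because such a bound is a worst-case guarantee, the accuracy observed on real data can be better than it predicts, which is consistent with the abstract's remark that shorter fingerprints suffice in practice.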

References

[1] Aggarwal, C.C.: Data streams: models and algorithms. Springer-Verlag New York Inc. (2007)
[2] Alon, N., Matias, Y., Szegedy, M.: The Space Complexity of Approximating the Frequency Moments. J. Computer and System Sciences 58(1), 137–147 (1999)
[3] Bachrach, Y., Herbrich, R.: Fingerprinting Ratings for Collaborative Filtering — Theoretical and Empirical Analysis. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 25–36. Springer, Heidelberg (2010)
[4] Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)
[5] Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: IJCAI, Pasadena, California (July 2009)
[6] Bennett, J., Lanning, S.: The Netflix prize. In: KDD Cup and Workshop (2007)
[7] Broder, A.Z.: On the resemblance and containment of documents. Sequences (1998)
[8] Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000)
[9] Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1), 58–75 (2005)
[10] Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: VLDB (2005)
[11] Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW. ACM (2007)
[12] Dasgupta, A., Kumar, R., Sarlos, T.: Fast locality-sensitive hashing. In: SIGKDD (2011)
[13] Datar, M., Muthukrishnan, S.: Estimating rarity and similarity over data stream windows. In: Möhring, R., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–335. Springer, Heidelberg (2002)
[14] Feigenblat, G., Shiftan, A., Porat, E.: Exponential time improvement for min-wise based algorithms. In: SODA (2011)
[15] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)
[16] Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. Journal of Algorithms 38(1), 84–90 (2001)
[17] Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM) 53(3), 323 (2006)
[18] Kane, D.M., Nelson, J., Porat, E., Woodruff, D.P.: Fast moment estimation in data streams in optimal space. In: STOC (2011)
[19] Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS, pp. 41–52. ACM (2010)
[20] Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28(1), 51–55 (2003)
[21] Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: a better Bloom filter. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 456–467. Springer, Heidelberg (2006)
[22] Li, P., Koenig, C.: b-Bit minwise hashing. In: WWW (2010)
[23] Mulmuley, K.: Randomized geometric algorithms and pseudorandom generators. Algorithmica (1996)
[24] Pǎtraşcu, M., Thorup, M.: On the k-Independence Required by Linear Probing and Minwise Independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)
[25] Pavan, A., Tirthapura, S.: Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing 37(2), 359–379 (2008)
[26] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: Computer Supported Cooperative Work (1994)
[27] Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW (2001)
[28] Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009, 4 (2009)

Author information

Authors and Affiliations

Microsoft Research, Cambridge, UK
Yoram Bachrach
Bar-Ilan University, Ramat-Gan, Israel
Ely Porat

Editor information

Editors and Affiliations

Department of Informatics, University of Bergen, Postboks 7803, 5020, Bergen, Norway
Fedor V. Fomin
Faculty of Computing, University of Latvia, Raina bulv. 19, 1586, Riga, Latvia
Rūsiņš Freivalds
Department of Computer Science, Wolfson Building, Parks Road, University of Oxford, OX1 3QD, Oxford, UK
Marta Kwiatkowska
Faculty of Mathematics and Computer Science, Weizmann Institute of Science, POB 26, 76100, Rehovot, Israel
David Peleg

About this paper

Cite this paper

Bachrach, Y., Porat, E. (2013). Sketching for Big Data Recommender Systems Using Fast Pseudo-random Fingerprints. In: Fomin, F.V., Freivalds, R., Kwiatkowska, M., Peleg, D. (eds) Automata, Languages, and Programming. ICALP 2013. Lecture Notes in Computer Science, vol 7966. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39212-2_41

DOI: https://doi.org/10.1007/978-3-642-39212-2_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39211-5
Online ISBN: 978-3-642-39212-2
eBook Packages: Computer Science, Computer Science (R0)
