LOAD: LSH-Based \(\ell _0\)-Sampling over Stream Data with Near-Duplicates

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12457))

  • 1676 Accesses

Abstract

Massive amounts of stream data nowadays almost make any real-time analysis impossible. To overcome the challenge of processing this huge amount of data, previous works typically use sampling to extract representatives and conduct analysis on this sampled dataset. In this paper, we propose LOAD, a Locality-Sensitive Hashing (LSH) based \(\ell _0\)-sampling over stream data. Instead of having the same diameter for all dimensions, LOAD utilizes the dimension-specific diameters which could fit the distribution of groups better. Therefore, LOAD always generates a better representative identification result. To facilitate the real-time analysis, we further optimize LOAD by applying LSH. Since nearest items are hashed into the same bucket with high probability, hence distinguishing the representatives becomes lightning fast. Extensive experiments show that LOAD is not only more accurate than other state-of-the-art algorithms, but also faster by an order of magnitude.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free ship** worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    https://archive.ics.uci.edu/ml/datasets/seeds.

  2. 2.

    https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics.

References

  1. Chen, J., Zhang, Q.: Distinct sampling on streaming data with near-duplicates. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 369–382. ACM (2018)

    Google Scholar 

  2. Chen, D., Zhang, Q.: Streaming algorithms for robust distinct elements. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1433–1447. ACM (2016)

    Google Scholar 

  3. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)

    Google Scholar 

  4. Slaney, M., He, J., Lifshits, Y.: Optimal parameters for locality-sensitive hashing. Proc. IEEE 100(9), 2604–2623 (2012)

    Article  Google Scholar 

  5. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)

    Google Scholar 

  6. Mukherjee, S., Asnani, H., Lin, E., Kannan, S.: Clustergan: latent space clustering in generative adversarial networks. In: Proceedings of the AAAI Conference on Artificial Intelligence 33, 4610–4617 (2019)

    Google Scholar 

  7. Cormode, G., Firmani, D.: A unifying framework for l0-sampling algorithms. Distrib. Parallel Databases 32(3), 315–335 (2014)

    Article  Google Scholar 

  8. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Engineering 19(1), 1–16 (2006)

    Article  Google Scholar 

  9. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry (2005)

    Google Scholar 

  10. Gibbons, P.B., Tirthapura., S.: Estimating simple functions on the union of data streams. In: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 281–291. ACM (2001)

    Google Scholar 

  11. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634. Society for Industrial and Applied Mathematics (2002)

    Google Scholar 

  12. Chung, Y.-Y., Tirthapura, S.: Distinct random sampling from a distributed stream. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 532–541. IEEE (2015)

    Google Scholar 

  13. Ba, K.D., Indyk, P., Price, E., Woodruff, D.P.: Lower bounds for sparse recovery. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1190–1197. SIAM (2010)

    Google Scholar 

  14. Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for LP samplers, finding duplicates in streams, and related problems. In: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 49–58. ACM (2011)

    Google Scholar 

  15. Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985)

    Article  MathSciNet  Google Scholar 

  16. Beyer, K., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R.: On synopses for distinct-value estimation under multiset operations. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 199–210. ACM (2007)

    Google Scholar 

  17. Ganguly, S.: Counting distinct items over update streams. Theoret. Comput. Sci. 378(3), 211–222 (2007)

    Article  MathSciNet  Google Scholar 

  18. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 41–52. ACM (2010)

    Google Scholar 

  19. Zhang, Q.: Communication-efficient computation on distributed noisy datasets. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 313–322. ACM (2015)

    Google Scholar 

Download references

Acknowledgement

This research is supported by Chinese Scientific and Technical Innovation Project 2030 (No. 2018AAA0102100), National Natural Science Foundation of China (No. 61772289, U1936206). We thank the reviewers for their constructive comments. We also thank Jiecao Chen and Qin Zhang for their generous help.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yanlong Wen .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Lurong, D., Wen, Y., Zhang, J., Yuan, X. (2021). LOAD: LSH-Based \(\ell _0\)-Sampling over Stream Data with Near-Duplicates. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science(), vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-67658-2_27

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67657-5

  • Online ISBN: 978-3-030-67658-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Navigation