LOAD: LSH-Based $\ell_0$-Sampling over Stream Data with Near-Duplicates

Lurong, Dingzhu; Wen, Yanlong; Zhang, Jiangwei; Yuan, Xiaojie

doi:10.1007/978-3-030-67658-2_27

Dingzhu Lurong¹²,
Yanlong Wen¹²,
Jiangwei Zhang¹³ &
Xiaojie Yuan¹²

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12457)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

The massive volume of stream data today makes real-time analysis nearly impossible. To cope with this volume, previous work typically uses sampling to extract representatives and conducts analysis on the sampled dataset. In this paper, we propose LOAD, a Locality-Sensitive Hashing (LSH) based $\ell_0$-sampling algorithm over stream data. Instead of using the same diameter for all dimensions, LOAD uses dimension-specific diameters, which can fit the distribution of groups better. LOAD therefore identifies representatives more accurately. To facilitate real-time analysis, we further optimize LOAD by applying LSH: since nearby items are hashed into the same bucket with high probability, distinguishing the representatives becomes very fast. Extensive experiments show that LOAD is not only more accurate than other state-of-the-art algorithms, but also faster by an order of magnitude.
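The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a simple randomly shifted grid as the locality-sensitive hash, with one cell width per dimension standing in for the dimension-specific diameters, and a hash-threshold subsampling of cells in the style of classic distinct sampling. The class name `GridL0Sampler` and the parameters `widths`, `capacity`, and `seed` are hypothetical.

```python
import hashlib
import math
import random


class GridL0Sampler:
    """Toy distinct sampler: one representative per grid cell, cells subsampled by hash."""

    def __init__(self, widths, capacity=64, seed=0):
        rng = random.Random(seed)
        self.widths = list(widths)      # per-dimension cell widths (assumed parameter)
        self.offsets = [rng.uniform(0, w) for w in self.widths]  # random grid shift
        self.capacity = capacity        # maximum number of representatives kept
        self.salt = str(seed).encode()  # key for the cell hash
        self.level = 0                  # a cell is kept with probability 2**-level
        self.reps = {}                  # cell key -> first item seen in that cell

    def _cell(self, x):
        # Grid hash: points that are close in every dimension usually share a cell.
        return tuple(math.floor((xi + o) / w)
                     for xi, o, w in zip(x, self.offsets, self.widths))

    def _unit_hash(self, cell):
        # Deterministic hash of the cell key to a float in [0, 1).
        digest = hashlib.blake2b(repr(cell).encode(), key=self.salt,
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, x):
        cell = self._cell(x)
        if cell in self.reps:
            return  # treated as a near-duplicate of a stored representative
        if self._unit_hash(cell) >= 2.0 ** -self.level:
            return  # this cell is outside the current subsample
        self.reps[cell] = tuple(x)
        while len(self.reps) > self.capacity:
            # Too many representatives: halve the sampling rate and evict.
            self.level += 1
            thresh = 2.0 ** -self.level
            self.reps = {c: r for c, r in self.reps.items()
                         if self._unit_hash(c) < thresh}

    def sample(self, rng=random):
        # Uniform choice among surviving representatives; None if there are none.
        return rng.choice(list(self.reps.values())) if self.reps else None
```

In this sketch an item is compared only against the representative stored for its own cell, which is why the duplicate check costs a single dictionary lookup. A fixed grid can split a group of near-duplicates across neighbouring cells, so a practical scheme would need additional handling of cell boundaries that is omitted here.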

Acknowledgement

This research is supported by the Chinese Scientific and Technical Innovation Project 2030 (No. 2018AAA0102100) and the National Natural Science Foundation of China (Nos. 61772289, U1936206). We thank the reviewers for their constructive comments. We also thank Jiecao Chen and Qin Zhang for their generous help.

Author information

Authors and Affiliations

TKLNDST, College of Computer Science, Nankai University, Tianjin, China
Dingzhu Lurong, Yanlong Wen & Xiaojie Yuan
National University of Singapore, Singapore, Singapore
Jiangwei Zhang

Corresponding author

Correspondence to Yanlong Wen.

Editor information

Editors and Affiliations

Albert-Ludwigs-Universität, Freiburg, Germany
Frank Hutter
TU Darmstadt, Darmstadt, Germany
Kristian Kersting
Ghent University, Ghent, Belgium
Jefrey Lijffijt
Saarland University, Saarbrücken, Germany
Isabel Valera

Cite this paper

Lurong, D., Wen, Y., Zhang, J., Yuan, X. (2021). LOAD: LSH-Based $\ell_0$-Sampling over Stream Data with Near-Duplicates. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_27

DOI: https://doi.org/10.1007/978-3-030-67658-2_27
Published: 25 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67657-5
Online ISBN: 978-3-030-67658-2
eBook Packages: Computer Science, Computer Science (R0)
