Abstract
Massive amounts of stream data nowadays almost make any real-time analysis impossible. To overcome the challenge of processing this huge amount of data, previous works typically use sampling to extract representatives and conduct analysis on this sampled dataset. In this paper, we propose LOAD, a Locality-Sensitive Hashing (LSH) based \(\ell _0\)-sampling over stream data. Instead of having the same diameter for all dimensions, LOAD utilizes the dimension-specific diameters which could fit the distribution of groups better. Therefore, LOAD always generates a better representative identification result. To facilitate the real-time analysis, we further optimize LOAD by applying LSH. Since nearest items are hashed into the same bucket with high probability, hence distinguishing the representatives becomes lightning fast. Extensive experiments show that LOAD is not only more accurate than other state-of-the-art algorithms, but also faster by an order of magnitude.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Chen, J., Zhang, Q.: Distinct sampling on streaming data with near-duplicates. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 369–382. ACM (2018)
Chen, D., Zhang, Q.: Streaming algorithms for robust distinct elements. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1433–1447. ACM (2016)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)
Slaney, M., He, J., Lifshits, Y.: Optimal parameters for locality-sensitive hashing. Proc. IEEE 100(9), 2604–2623 (2012)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)
Mukherjee, S., Asnani, H., Lin, E., Kannan, S.: Clustergan: latent space clustering in generative adversarial networks. In: Proceedings of the AAAI Conference on Artificial Intelligence 33, 4610–4617 (2019)
Cormode, G., Firmani, D.: A unifying framework for l0-sampling algorithms. Distrib. Parallel Databases 32(3), 315–335 (2014)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Engineering 19(1), 1–16 (2006)
Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry (2005)
Gibbons, P.B., Tirthapura., S.: Estimating simple functions on the union of data streams. In: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 281–291. ACM (2001)
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634. Society for Industrial and Applied Mathematics (2002)
Chung, Y.-Y., Tirthapura, S.: Distinct random sampling from a distributed stream. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 532–541. IEEE (2015)
Ba, K.D., Indyk, P., Price, E., Woodruff, D.P.: Lower bounds for sparse recovery. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1190–1197. SIAM (2010)
Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for LP samplers, finding duplicates in streams, and related problems. In: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 49–58. ACM (2011)
Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985)
Beyer, K., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R.: On synopses for distinct-value estimation under multiset operations. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 199–210. ACM (2007)
Ganguly, S.: Counting distinct items over update streams. Theoret. Comput. Sci. 378(3), 211–222 (2007)
Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 41–52. ACM (2010)
Zhang, Q.: Communication-efficient computation on distributed noisy datasets. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 313–322. ACM (2015)
Acknowledgement
This research is supported by Chinese Scientific and Technical Innovation Project 2030 (No. 2018AAA0102100), National Natural Science Foundation of China (No. 61772289, U1936206). We thank the reviewers for their constructive comments. We also thank Jiecao Chen and Qin Zhang for their generous help.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Lurong, D., Wen, Y., Zhang, J., Yuan, X. (2021). LOAD: LSH-Based \(\ell _0\)-Sampling over Stream Data with Near-Duplicates. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science(), vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_27
Download citation
DOI: https://doi.org/10.1007/978-3-030-67658-2_27
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67657-5
Online ISBN: 978-3-030-67658-2
eBook Packages: Computer ScienceComputer Science (R0)