GPU Based Hash Segmentation Index for Fast T-overlap Query

  • Conference paper
  • First Online:
Data Science (ICPCSEE 2017)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 727))

  • 2570 Accesses

Abstract

T-overlap query is the basis of set similarity query and has been applied in many important fields. Most existing approaches employ a pruning-and-verification framework, thus in low efficiency. Modern GPU has much higher parallelism as well as memory bandwidth than CPU and can be used to accelerate T-overlap query. In this paper, we use hash segmentation to divide inverted lists into segments, then design an efficient inverted index called GHSII on GPU using hash segmentation. Based on GHSII, a new segmentation parallel T-overlap algorithm, GSPS, is proposed. GSPS uses segment at a time to scan segments and uses shared memory to decrease the number of accesses to device memory. Furthermore, an optimized algorithm called GSPS-TLLO using a heuristic query order is proposed to solve the problem of load imbalance. Experiments are carried out on two real datasets and the results show that GSPS-TLLO outperforms the state-of-the-art GPU parallel T-overlap algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
EUR 29.95
Price includes VAT (Germany)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
EUR 85.59
Price includes VAT (Germany)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
EUR 106.99
Price includes VAT (Germany)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free ship** worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    When there is no ambiguity, we also use j-SEGs to denote all the j-th segments of query Q related lists.

References

  1. Kawamoto, H., Kitamura, T.: Similarity of speaker individualities of sentence in ATR speech database set C. In: Proceedings of IEICE Technical Report Speech, pp. 33–34 (2013)

    Google Scholar 

  2. Hadjieleftheriou, M., Chandel, A., Koudas N., et al.: Fast indexes and algorithms for set similarity selection queries. In: Proceeding of IEEE 24th International Conference on Data Engineering (ICDE 2008), pp. 267–276 (2008)

    Google Scholar 

  3. He, B., Ke, F., Fang, R., et al.: Relational joins on graphics processors. In: Proceeding of ACM SIGMOD International Conference on Management of Data, pp. 511–524 (2007)

    Google Scholar 

  4. Shalom, S.A.A., Dash, M., Tue, M.: Efficient K-means clustering using accelerated graphics processors. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 166–175. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85836-2_16

    Chapter  Google Scholar 

  5. Punronen, S., Terziyan, V.: A similarity evaluation technique for data mining with an ensemble of classifiers. In: Proceeding of IEEE Computer Society, International Workshop on Database & Expert Systems Applications, pp. 1155–1160 (2000)

    Google Scholar 

  6. Kim, J., Vasardani, M., Winter, S.: Similarity matching for integrating spatial information extracted from place descriptions. Int. J. Geogr. Inf. Sci. 1–25 (2016)

    Google Scholar 

  7. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceeding of VLDB, pp. 918–929 (2006)

    Google Scholar 

  8. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceeding of ICDE (2006)

    Google Scholar 

  9. Lin, X., Wang, W.: Set and string similarity queries a survey. Chin. J. Comput. 34(10), 1853–1862 (2012)

    Article  Google Scholar 

  10. Deng, D., Li, G., Feng, J., et al.: A unified framework for approximate dictionary-based entity extraction. VLDB J. 24(1), 143–167 (2014)

    Article  Google Scholar 

  11. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceeding of SIGMOD 2004, pp. 743–754 (2004)

    Google Scholar 

  12. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: Proceeding of ICDE 2008, pp. 257–266 (2008)

    Google Scholar 

  13. Jia, L., **, J., Li, M., et al.: ETI: an efficient index for set similarity queries. Front. Comput. Sci. 6(6), 700–712 (2012)

    MathSciNet  Google Scholar 

  14. Li, M., Jia, L., You, J., et al.: Fast T-overlap query algorithms using graphics processor units and its applications in web data query. World Wide Web-internet Web Inf. Syst. 18(2), 1–17 (2013)

    Google Scholar 

  15. Tatikonda, S., Junqueira, F., Cambazoglu, B.B., et al.: On efficient posting list intersection with multicore processors. In: Proceeding of ACM SIGIR 2009, pp. 738–739 (2009)

    Google Scholar 

  16. Ding, B., Nig, A.: Fast set intersection in memory. In: Proceeding of the VLDB Endowment 2011, pp. 255–266 (2011)

    Google Scholar 

  17. Ao, N., Zhang, F., Wu, D., et al.: Efficient parallel lists intersection and index compression algorithms using graphics processing units. In: Proceeding of the VLDB Endowment, pp. 470–481 (2011)

    Google Scholar 

  18. Ding, S., He, J., Yan, H.: Using graphics processors for high-performance IR query processing. In: Proceeding of WWW, pp. 1213–1214 (2008)

    Google Scholar 

  19. Programming of shared memory GPUs shared memory systems. http://site.uottawa.ca/~mbolic/ceg4131/CUDA_Report.pdf. Accessed Jan 2016

  20. Wu, D., Zhang, F., Ao, N., et al.: Efficient lists intersection by CPU-GPU cooperative computing. In: Proceeding of 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pp. 1–8. IEEE (2010)

    Google Scholar 

  21. Pagh, R., Rodler, F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  22. Bay, S., Kibler, D., Pazzani, M., et al.: The UCI KDD archive of large data sets fordata mining research and experimentation. ACM SIGKDD Explor. Newsl. 2(2), 14–18 (2002)

    Google Scholar 

  23. Bayardo, R., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceeding of International Conference on World WideWeb 2007, pp. 71–81 (2007)

    Google Scholar 

  24. Broder, A.Z., Carmel, D., Herscovici, M., et al.: Efficient query evaluation using a two-level retrieval process. In: Proceeding of Twelfth International Conference on Information and Knowledge Management, pp. 426–434. ACM (2003)

    Google Scholar 

Download references

Acknowledgment

The research was supported by Grants from the National Natural Science Foundation of China (No. 61562054, 51467007, 61462050), and the Personnel Training Project of Yunnan Province (No. KKSY201603016).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mengjuan Li .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Jia, L., Zhang, Y., Li, M., Ding, J., You, J. (2017). GPU Based Hash Segmentation Index for Fast T-overlap Query. In: Zou, B., Li, M., Wang, H., Song, X., **e, W., Lu, Z. (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, vol 727. Springer, Singapore. https://doi.org/10.1007/978-981-10-6385-5_4

Download citation

  • DOI: https://doi.org/10.1007/978-981-10-6385-5_4

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6384-8

  • Online ISBN: 978-981-10-6385-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Navigation