Analysis and Design of Document Similarity Using BiLSTM and BERT

  • Conference paper
Advanced Communication and Intelligent Systems (ICACIS 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1921)

Abstract

In this paper, we propose a deep learning-based approach to measuring document similarity using Bidirectional Encoder Representations from Transformers (BERT), together with its implementation as an application programming interface (API). BERT has recently delivered significant improvements in natural language processing and is widely used in applications such as question answering and text classification. We trained and fine-tuned a BERT model on a large corpus of documents to measure document similarity. The proposed API receives two text arguments and returns the degree of similarity between them. On several benchmark datasets, we demonstrate that our approach outperforms conventional state-of-the-art similarity measures. Our experimental results show that the proposed method of measuring document similarity using BERT is effective and efficient, and that the API implementation can be used in a variety of real-world scenarios.
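
The interface described in the abstract (two text arguments in, one similarity score out) might look roughly like the minimal sketch below. It is not the authors' implementation: the function names `document_similarity`, `toy_encoder`, and `cosine` are hypothetical, scoring by cosine similarity over embeddings is an assumption, and `toy_encoder` is a plain bag-of-words stand-in used only so the example runs without downloading a model. In practice a fine-tuned BERT encoder would be passed in as the `encoder` argument.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

# Sparse vector: feature name -> weight.
Vector = Dict[str, float]


def toy_encoder(text: str) -> Vector:
    """Hypothetical stand-in for a BERT encoder (assumption, NOT BERT).

    Produces lowercase bag-of-words counts so the sketch is runnable
    without any model weights.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tok: float(n) for tok, n in Counter(tokens).items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(
    text_a: str,
    text_b: str,
    encoder: Callable[[str], Vector] = toy_encoder,
) -> float:
    """Take two texts and return a similarity score.

    The scoring rule (cosine over encoder outputs) is an assumption made
    for illustration; the paper may score pairs differently.
    """
    return cosine(encoder(text_a), encoder(text_b))


if __name__ == "__main__":
    score = document_similarity(
        "The contract was signed on Monday.",
        "On Monday the contract was signed.",
    )
    print(f"similarity = {score:.3f}")
```

Keeping the encoder as a parameter separates the API contract from the model behind it, so the same endpoint could be backed by a different encoder without changing callers.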

Author information

Corresponding author

Correspondence to Chintan Gaur.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gaur, C., Kumar, A., Das, S. (2023). Analysis and Design of Document Similarity Using BiLSTM and BERT. In: Shaw, R.N., Paprzycki, M., Ghosh, A. (eds) Advanced Communication and Intelligent Systems. ICACIS 2023. Communications in Computer and Information Science, vol 1921. Springer, Cham. https://doi.org/10.1007/978-3-031-45124-9_12

  • DOI: https://doi.org/10.1007/978-3-031-45124-9_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45123-2

  • Online ISBN: 978-3-031-45124-9

  • eBook Packages: Computer Science, Computer Science (R0)
