Abstract
In this paper, we propose a deep learning-based approach to measuring document similarity using Bidirectional Encoder Representations from Transformers (BERT), together with its implementation as an application programming interface (API). BERT has recently delivered significant improvements in natural language processing and is widely used in applications such as question answering and text classification. We trained and fine-tuned a BERT model on a large corpus of documents to measure document similarity. The proposed API receives two text arguments and returns the degree of similarity between them. On several benchmark datasets, our approach outperforms conventional state-of-the-art similarity measures. Our experimental results show that measuring document similarity with BERT behind an API is effective and efficient, and that the API implementation can be used in a variety of real-world scenarios.
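The interface described in the abstract (two texts in, one similarity score out) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the score is computed, so the names `document_similarity`, `toy_encode`, and `cosine` are hypothetical, and the word-count encoder is only a stand-in for a BERT model so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Placeholder encoder: lowercase word counts. This is NOT BERT;
    it only stands in for whatever model produces the representation."""
    return dict(Counter(re.findall(r"\w+", text.lower())))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(
    text_a: str,
    text_b: str,
    encode: Callable[[str], Vector] = toy_encode,
) -> float:
    """Hypothetical API entry point: two text arguments, one score.
    Swap `encode` for a real model-backed encoder in practice."""
    return cosine(encode(text_a), encode(text_b))


if __name__ == "__main__":
    print(document_similarity("A man plays a guitar.", "A man is playing guitar."))
```

Passing the encoder as a parameter keeps the two-argument contract fixed while leaving the scoring model replaceable, which is one plausible way to put a trained model behind such an API.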
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gaur, C., Kumar, A., Das, S. (2023). Analysis and Design of Document Similarity Using BiLSTM and BERT. In: Shaw, R.N., Paprzycki, M., Ghosh, A. (eds) Advanced Communication and Intelligent Systems. ICACIS 2023. Communications in Computer and Information Science, vol 1921. Springer, Cham. https://doi.org/10.1007/978-3-031-45124-9_12
DOI: https://doi.org/10.1007/978-3-031-45124-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45123-2
Online ISBN: 978-3-031-45124-9
eBook Packages: Computer Science, Computer Science (R0)