Abstract
Despite significant advances in Question-Answering (QA) systems based on Large Language Models (LLMs), the problem of imprecise answers leading to less informative responses persists. For building effective QA systems over open-domain datasets, and particularly content-specific datasets, dense passage retrieval combined with the two-stage retriever-reader model remains a rational choice. When applied in real-world systems, however, these approaches face challenges posed by limited computational resources and training data. To address the scarcity of training data, we propose fine-tuning the pretrained BERT-based encoder with masked language modeling before employing a dual-encoder architecture, an established and efficient technique. Additionally, we introduce a modified loss function for dual-encoder training that reduces memory usage during training without compromising system performance. This loss function is employed in a multi-stage training strategy, yielding improved retriever performance at each stage. To further augment the system's capabilities, we train a cross-encoder to construct a robust retriever for domain-specific datasets. Experiments validate the effectiveness of the proposed techniques, showing significant performance gains over the baseline models and underscoring their potential to advance the state of the art in open-domain question answering.
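The paper's modified memory-saving loss is not reproduced in the abstract; as background, a minimal sketch of the standard in-batch negative softmax loss commonly used for dual-encoder training (as in dense passage retrieval) is shown below. The function name and the toy embeddings are illustrative, not from the paper.

```python
import numpy as np

def in_batch_negative_loss(q_emb, p_emb):
    """Contrastive loss for dual-encoder training: each question's
    positive is the passage in the same row; all other passages in
    the batch serve as negatives (in-batch negatives)."""
    scores = q_emb @ p_emb.T                      # (B, B) similarity matrix
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the diagonal (question, gold passage) pairs
    return -np.mean(np.diag(log_probs))

# Toy batch: 3 questions and their gold passages in a 4-d embedding space
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
p = q + 0.01 * rng.normal(size=(3, 4))  # positives lie close to their questions
loss = in_batch_negative_loss(q, p)
```

Because the full (B, B) score matrix is held in memory, this loss grows quadratically with batch size, which is one motivation for memory-reducing variants such as the one proposed here.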
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Nguyen, Q.N., Le, H.T. (2023). Building an Efficient Retriever System with Limited Resources. In: Nghia, P.T., Thai, V.D., Thuy, N.T., Son, L.H., Huynh, VN. (eds) Advances in Information and Communication Technology. ICTA 2023. Lecture Notes in Networks and Systems, vol 847. Springer, Cham. https://doi.org/10.1007/978-3-031-49529-8_5
Print ISBN: 978-3-031-49528-1
Online ISBN: 978-3-031-49529-8
eBook Packages: Intelligent Technologies and Robotics (R0)