Building an Efficient Retriever System with Limited Resources

  • Conference paper
  • In: Advances in Information and Communication Technology (ICTA 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 847)


Abstract

Despite significant advances in Question-Answering (QA) systems based on Large Language Models (LLMs), the problem of imprecise answers leading to less informative responses persists. For building effective QA systems on open-domain datasets, particularly content-specific ones, dense passage retrieval and the two-stage retriever-reader model remain a sound choice. However, when applied in real-world systems, these approaches face challenges from limited computational resources and training data. To address the scarcity of training data, we propose fine-tuning the pretrained BERT-based encoder with masked language modeling, an established and efficient technique, before employing it in a dual-encoder architecture. Additionally, we introduce a modified loss function for dual-encoder training that reduces memory usage during training without compromising system performance. The new loss function is employed in a multi-stage training strategy, yielding improved retriever performance at each stage. To further strengthen the system, we train a cross-encoder to build a robust retriever for domain-specific datasets. Experiments validate the effectiveness of the proposed techniques, showing significant performance gains over the baseline models and underscoring their potential to advance the state of the art in open-domain question answering.
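The pipeline the abstract describes lends itself to short sketches. First, masked language modeling (MLM) on in-domain text before dual-encoder training: the sketch below uses the Hugging Face transformers Trainer; the PhoBERT checkpoint, corpus file name, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: domain-adaptive MLM fine-tuning of a BERT-based encoder
# prior to dual-encoder training. Checkpoint, corpus path, and
# hyperparameters are illustrative, not the authors' actual setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "vinai/phobert-base"  # assumed Vietnamese base encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# One passage per line; PhoBERT expects word-segmented text (e.g. via pyvi).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-adapted",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

Second, the dual-encoder objective. The abstract does not specify the memory-reducing modification, so the sketch below shows only the standard DPR-style in-batch-negatives loss that such modifications typically start from:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Standard DPR-style contrastive loss with in-batch negatives.

    q_emb: (B, d) question embeddings; p_emb: (B, d) embeddings of each
    question's positive passage. Every other passage in the batch acts as
    a negative. This is the common baseline, not the paper's
    memory-reduced variant, which the abstract does not detail.
    """
    scores = q_emb @ p_emb.T                       # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)         # diagonal entries are positives
```

Finally, cross-encoder reranking: question and passage are encoded jointly, so each pair receives a full-attention relevance score. The checkpoint and single-logit head below are assumptions for illustration; in practice the model would be fine-tuned on (question, passage) relevance labels.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative: a BERT-based classifier emitting one relevance logit per pair.
name = "vinai/phobert-base"  # placeholder; a fine-tuned cross-encoder in practice
tok = AutoTokenizer.from_pretrained(name)
ce = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

def rerank(question: str, passages: list[str]) -> list[str]:
    # Encode (question, passage) pairs jointly, unlike the dual encoder,
    # which embeds questions and passages independently.
    inputs = tok([question] * len(passages), passages,
                 truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        scores = ce(**inputs).logits.squeeze(-1)   # one score per pair
    order = scores.argsort(descending=True)
    return [passages[i] for i in order]
```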


Notes

  1. https://challenge.zalo.ai/portal/legal-text-retrieval.

  2. https://pypi.org/project/pyvi/.

  3. https://pypi.org/project/rank-bm25/ (combined with pyvi in the sketch after these notes).

  4. https://www.kaggle.com/.

  5. From the website: https://thuvienphapluat.vn/.
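Notes 2 and 3 point to the pyvi word segmenter and the rank-bm25 package; together they give the usual sparse baseline for Vietnamese retrieval. A minimal sketch, with an illustrative two-passage corpus rather than the actual dataset:

```python
from pyvi import ViTokenizer     # Vietnamese word segmentation (note 2)
from rank_bm25 import BM25Okapi  # BM25 lexical scoring (note 3)

corpus = [
    "Điều 1. Phạm vi điều chỉnh của luật này ...",  # illustrative passages
    "Điều 2. Đối tượng áp dụng của luật này ...",
]

# pyvi joins multi-syllable words with underscores; splitting on
# whitespace then yields word-level tokens suitable for BM25.
tokenized_corpus = [ViTokenizer.tokenize(doc).lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = ViTokenizer.tokenize("phạm vi điều chỉnh").lower().split()
print(bm25.get_scores(query))    # one BM25 score per passage
```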


Author information

Corresponding author

Correspondence to Huong Thanh Le.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen, Q.N., Le, H.T. (2023). Building an Efficient Retriever System with Limited Resources. In: Nghia, P.T., Thai, V.D., Thuy, N.T., Son, L.H., Huynh, V.N. (eds) Advances in Information and Communication Technology. ICTA 2023. Lecture Notes in Networks and Systems, vol 847. Springer, Cham. https://doi.org/10.1007/978-3-031-49529-8_5
