Studying the Variability of System Setting Effectiveness by Data Analytics and Visualization

  • Conference paper
  • In: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2019)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11696)


Abstract

Search engines differ in their modules and parameters; defining the optimal system setting is all the more challenging because of the complexity of the retrieval pipeline. The main goal of this study is to determine which system components and parameters matter most in a system setting, and thus which ones should be tuned first. We carry out an extensive analysis of 20,000 different system settings applied to three TREC ad-hoc collections. Our analysis zooms in and out of the data using various data analysis methods such as ANOVA, CART, and data visualization. We find that the query expansion model is the component whose choice changes system effectiveness most significantly, consistently across collections. Zooming in on the queries, we show that when only easy queries are considered, the most significant component becomes the retrieval model. The results of our study are directly reusable by system designers and for system tuning.
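
The abstract itself contains no code; as a rough illustration of the kind of component-level analysis it describes, the sketch below ranks system-setting components by how strongly they explain variation in average precision (AP) using a CART regression tree. The CSV file, column names, and component list are hypothetical placeholders, not the paper's actual data layout.

```python
# A minimal sketch (not the paper's code): rank system-setting components
# by how much AP variance they explain, using a CART regression tree.
# Assumes a hypothetical CSV with one row per (system setting, query) pair.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

runs = pd.read_csv("system_settings_ap.csv")          # hypothetical file
components = ["retrieval_model", "query_expansion",
              "expansion_docs", "expansion_terms"]    # hypothetical columns

# One-hot encode the categorical components so the tree can split on them.
X = pd.get_dummies(runs[components], columns=components)
y = runs["ap"]

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

def origin(col):
    # Map a one-hot column such as "query_expansion_KL" back to "query_expansion".
    return next(c for c in components if col.startswith(c + "_"))

# Sum the tree's importances back to the original components and rank them.
importance = (pd.Series(tree.feature_importances_, index=X.columns)
              .groupby(origin).sum()
              .sort_values(ascending=False))
print(importance)
```

A ranking in which the query-expansion component dominates would correspond to the abstract's finding that query expansion has the largest effect on effectiveness across collections.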

Notes

  1. https://trec.nist.gov/data.html.

  2. A system setting refers to an IR system configured with a retrieval model and an optional query expansion model with its parameters.

  3. http://trec.nist.gov/trec_eval/.

  4. http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html.

  5. We also calculated a two-way ANOVA considering the main and interaction effects of the query expansion (QE) and retrieval model (RMod) factors on AP; query expansion is consistently ranked first across the collections as well. A minimal sketch of such an analysis is given after these notes.

  6. Some combinations are not meaningful and thus were not used (e.g., using 5 documents in query expansion while the “expansion model” used is none).
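
As a concrete illustration of the two-way ANOVA mentioned in note 5, the following sketch tests the main and interaction effects of query expansion and retrieval model on AP with statsmodels. The CSV file and column names are hypothetical assumptions, not the paper's actual data.

```python
# A minimal sketch (not the paper's code) of a two-way ANOVA on AP with
# query expansion and retrieval model as factors, including their interaction.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

runs = pd.read_csv("system_settings_ap.csv")    # hypothetical file

# Main effects of QE and RMod plus their interaction on per-query AP.
model = smf.ols("ap ~ C(query_expansion) * C(retrieval_model)", data=runs).fit()
print(anova_lm(model, typ=2))    # F statistic and p-value per factor
```

The factor with the largest F statistic is the one whose levels account for the most AP variance; in the paper's analysis that factor is consistently query expansion.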

Author information

Correspondence to Md. Zia Ullah.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Déjean, S., Mothe, J., Ullah, M.Z. (2019). Studying the Variability of System Setting Effectiveness by Data Analytics and Visualization. In: Crestani, F., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol. 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_3

  • DOI: https://doi.org/10.1007/978-3-030-28577-7_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28576-0

  • Online ISBN: 978-3-030-28577-7

  • eBook Packages: Computer Science (R0)
