Abstract
Empirical methods are means of answering methodological questions of empirical sciences by statistical techniques. The methodological questions addressed in this book include the problems of validity, reliability, and significance. In the case of machine learning, these correspond to the questions of whether a model predicts what it purports to predict, whether a model's performance is consistent across replications, and whether a performance difference between two models is due to chance, respectively. The goal of this book is to answer these questions by concrete statistical tests that can be applied to assess the validity, reliability, and significance of data annotation and machine learning prediction in the fields of NLP and data science.

Our focus is on model-based empirical methods, in which data annotations and model predictions are treated as training data for interpretable probabilistic models from the well-understood families of generalized additive models (GAMs) and linear mixed effects models (LMEMs). Based on the interpretable parameters of the trained GAMs or LMEMs, the book presents model-based statistical tests, such as a validity test that detects circular features which circumvent learning. Furthermore, the book discusses a reliability coefficient that uses variance decomposition based on the random effect parameters of LMEMs. Finally, a significance test based on the likelihood ratio of nested LMEMs trained on the performance scores of two machine learning models is shown to naturally allow the inclusion of variations in meta-parameter settings into hypothesis testing, and further facilitates a refined system comparison conditional on properties of the input data.

This book can be used as an introduction to empirical methods for machine learning in general, with a special focus on applications in NLP and data science.
The book is self-contained, with an appendix on the mathematical background of GAMs and LMEMs, and with an accompanying webpage including R code to replicate the experiments presented in the book.
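To make the likelihood-ratio test of nested LMEMs concrete, the following is a minimal sketch in Python using `statsmodels` (the book's own replication code is in R, on its webpage, and may differ). It simulates per-test-set evaluation scores for two hypothetical systems, fits a null and a full mixed-effects model with a random intercept per test set, and compares the two fits with a chi-square likelihood-ratio test. All variable names and the simulated data are illustrative assumptions, not the book's examples.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulated evaluation scores: 2 systems x 10 test sets x 3 random seeds.
# Each test set gets its own difficulty offset, shared by both systems.
testset_effects = {f"t{i}": rng.normal(0, 0.03) for i in range(10)}
rows = []
for system, shift in [("baseline", 0.0), ("candidate", 0.02)]:
    for testset, effect in testset_effects.items():
        for seed in range(3):
            rows.append({
                "system": system,
                "testset": testset,
                "score": 0.70 + shift + effect + rng.normal(0, 0.01),
            })
df = pd.DataFrame(rows)

# Nested LMEMs: the null model has no fixed effect for "system";
# the full model adds it. Random intercepts per test set absorb
# dataset-specific variation. Both must be fit by maximum likelihood
# (reml=False), since REML likelihoods are not comparable across
# models with different fixed effects.
null = smf.mixedlm("score ~ 1", df, groups=df["testset"]).fit(reml=False)
full = smf.mixedlm("score ~ system", df, groups=df["testset"]).fit(reml=False)

# Likelihood-ratio statistic; one extra fixed-effect parameter.
lr = 2 * (full.llf - null.llf)
p_value = chi2.sf(lr, df=1)
print(f"LR = {lr:.2f}, p = {p_value:.4g}")
```

The random intercept per test set is what lets the comparison generalize across datasets rather than being tied to one fixed test set; in the same spirit, further random effects (e.g., for meta-parameter settings) could be added to fold such variation into the hypothesis test, as the abstract describes.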
Notes
- 1.
Clearly, this paradigm is pervasive in machine learning and artificial intelligence in general, for example, in the area of image processing that uses similar methods and exhibits similar problems as the area of natural language processing. We will frequently refer to examples from related areas, but keep our focus on running examples from the areas of NLP and medical data science.
- 2.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Riezler, S., Hagmann, M. (2024). Introduction. In: Validity, Reliability, and Significance. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-57065-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57064-3
Online ISBN: 978-3-031-57065-0