Abstract
Empirical methods are means of answering methodological questions of empirical sciences by statistical techniques. The methodological questions addressed in this book include the problems of validity, reliability, and significance. In the case of machine learning, these correspond to the questions of whether a model predicts what it purports to predict, whether a model's performance is consistent across replications, and whether a performance difference between two models is due to chance, respectively. The goal of this book is to answer these questions by concrete statistical tests that can be applied to assess the validity, reliability, and significance of data annotation and machine learning prediction in the fields of NLP and data science.

Our focus is on model-based empirical methods, in which data annotations and model predictions are treated as training data for interpretable probabilistic models from the well-understood families of generalized additive models (GAMs) and linear mixed effects models (LMEMs). Based on the interpretable parameters of the trained GAMs or LMEMs, the book presents model-based statistical tests, such as a validity test that detects circular features which circumvent learning. Furthermore, the book discusses a reliability coefficient that uses variance decomposition based on the random effect parameters of LMEMs. Finally, a significance test based on the likelihood ratio of nested LMEMs trained on the performance scores of two machine learning models is shown to naturally allow the inclusion of variations in meta-parameter settings into hypothesis testing, and further facilitates a refined system comparison conditional on properties of the input data.

This book can be used as an introduction to empirical methods for machine learning in general, with a special focus on applications in NLP and data science.
The book is self-contained, with an appendix on the mathematical background of GAMs and LMEMs, and with an accompanying webpage including R code to replicate the experiments presented in the book.
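To make the likelihood-ratio test of nested LMEMs concrete, the following is a minimal sketch in Python using `statsmodels` (the book's own replication code is in R, on its webpage, and may differ). It simulates per-test-set evaluation scores for two hypothetical systems, fits a null and a full mixed-effects model with a random intercept per test set, and compares the two fits with a chi-square likelihood-ratio test. All variable names and the simulated data are illustrative assumptions, not the book's examples.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulated evaluation scores: 2 systems x 10 test sets x 3 random seeds.
# Each test set gets its own difficulty offset, shared by both systems.
testset_effects = {f"t{i}": rng.normal(0, 0.03) for i in range(10)}
rows = []
for system, shift in [("baseline", 0.0), ("candidate", 0.02)]:
    for testset, effect in testset_effects.items():
        for seed in range(3):
            rows.append({
                "system": system,
                "testset": testset,
                "score": 0.70 + shift + effect + rng.normal(0, 0.01),
            })
df = pd.DataFrame(rows)

# Nested LMEMs: the null model has no fixed effect for "system";
# the full model adds it. Random intercepts per test set absorb
# dataset-specific variation. Both must be fit by maximum likelihood
# (reml=False), since REML likelihoods are not comparable across
# models with different fixed effects.
null = smf.mixedlm("score ~ 1", df, groups=df["testset"]).fit(reml=False)
full = smf.mixedlm("score ~ system", df, groups=df["testset"]).fit(reml=False)

# Likelihood-ratio statistic; one extra fixed-effect parameter.
lr = 2 * (full.llf - null.llf)
p_value = chi2.sf(lr, df=1)
print(f"LR = {lr:.2f}, p = {p_value:.4g}")
```

The random intercept per test set is what lets the comparison generalize across datasets rather than being tied to one fixed test set; in the same spirit, further random effects (e.g., for meta-parameter settings) could be added to fold such variation into the hypothesis test, as the abstract describes.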
Notes
- 1.
Clearly, this paradigm is pervasive in machine learning and artificial intelligence in general, for example, in the area of image processing that uses similar methods and exhibits similar problems as the area of natural language processing. We will frequently refer to examples from related areas, but keep our focus on running examples from the areas of NLP and medical data science.
- 2.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Riezler, S., Hagmann, M. (2024). Introduction. In: Validity, Reliability, and Significance. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-57065-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57064-3
Online ISBN: 978-3-031-57065-0