Abstract
The notion of validity of a prediction has an ill-defined status in NLP: it is not associated with a widely accepted evaluation measure in the way that, in classification, precision measures prediction quality and recall measures prediction quantity. The goal of this chapter is to give a clear definition of the concept of validity in NLP and data science that can be operationalized into methods for measuring validity and applied to general NLP and data science tasks.
Notes
- 1.
The defining criteria concern heart rate (>90 BPM), temperature (>38 \(^\circ \)C or <36 \(^\circ \)C), respiratory rate (>20 BPM), or white blood cell count (>12 or <4 thousand per microliter), measured in the last 2–8 hrs (Dellinger et al., 2013).
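The threshold checks in this note can be sketched as a small helper. This is an illustrative sketch only, not a clinical implementation; the convention that two or more criteria must be met is the standard SIRS rule and is an assumption not stated in the note itself.

```python
def sirs_criteria_met(heart_rate, temp_c, resp_rate, wbc_k_per_ul):
    """Count how many SIRS criteria are met.

    Thresholds follow Dellinger et al. (2013) as quoted in the note:
    heart rate > 90 BPM, temperature > 38 or < 36 degrees C,
    respiratory rate > 20 BPM, white blood cell count > 12 or < 4
    thousand per microliter.
    """
    criteria = [
        heart_rate > 90,
        temp_c > 38 or temp_c < 36,
        resp_rate > 20,
        wbc_k_per_ul > 12 or wbc_k_per_ul < 4,
    ]
    return sum(criteria)

# Illustrative patient with tachycardia and fever: two criteria are met.
print(sirs_criteria_met(heart_rate=110, temp_c=38.5, resp_rate=16, wbc_k_per_ul=8))  # prints 2
```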
- 2.
The measurements are taken for creatinine level and urine output, the Glasgow Coma Scale, bilirubin level, respiratory function, and thrombocyte count (Vincent et al., 1996).
- 3.
Balzer and Brendel (2019) and Balzer (1992) utilize a formalism that allows them to express all relevant concepts (even functions) in terms of tuples and sets. Essentially, the condition of disjointness of the function to be measured and the function given by the model means that the input measurements must be determinable without knowing the quantity that one wants to measure.
- 4.
Further and even stricter conditions on validity of measurement are possible and have been discussed in philosophy of science. For example, see Sneed (1971) and Stegmüller (1979, 1986) for a discussion of theoretical terms and possible circularity problems for fundamental measurement procedures. For a deeper discussion of statistical measurement procedures, see Balzer and Brendel (2019).
- 5.
A well-known example from the area of image processing is the (mis)use of copyright tags (Lapuschkin et al., 2019).
- 6.
A precise definition of the notion of interpretability is an open research problem that is outside the scope of this book. It involves issues ranging from the (non)concurvity of features (Amodio et al., 2014; Tomaschek et al., 2018) to human factors of intelligibility (Alvarez-Melis & Jaakkola, 2018; Doshi-Velez & Kim, 2017; Miller, 2019).
- 7.
- 8.
Clearly, invariance of correlations across different environments is only part of causality, and further conditions are necessary (Rosenfeld et al., 2021). Thus, we do not make any causality claims for our validity tests; instead, we take a practical approach in which computing descriptive statistics of the correlation coefficient for given features and labels across given domains replaces the notion of causality in Borsboom and Mellenbergh’s approach to construct validity.
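The per-domain correlation statistics described in this note can be sketched as follows. The data, the function name, and the domain labels are illustrative assumptions; the note only fixes the idea of computing the correlation coefficient between a feature and the labels separately within each domain and comparing the values.

```python
import numpy as np

def per_domain_correlations(feature, label, domain):
    """Pearson correlation of a feature vs. the label, computed per domain.

    Returns a dict mapping each domain name to its correlation
    coefficient; comparing these values across domains is the
    invariance check described in the note.
    """
    feature, label, domain = map(np.asarray, (feature, label, domain))
    return {
        d: float(np.corrcoef(feature[domain == d], label[domain == d])[0, 1])
        for d in np.unique(domain)
    }

# Illustrative data: the feature tracks the label in domain "a" but not in "b",
# so the correlation is not invariant across environments.
feat = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
lab  = [1.0, 2.0, 3.0, 4.0, 4.0, 1.0, 3.0, 2.0]
doms = ["a", "a", "a", "a", "b", "b", "b", "b"]
corrs = per_domain_correlations(feat, lab, doms)
```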
- 9.
Rescaling was performed by the min-max formula \(f(x) = \frac{x- \min }{\max - \min }\). Negations were computed by a regular expression extracting negation words, following https://www.nltk.org/_modules/nltk/sentiment/util.html.
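The rescaling formula and the negation extraction can be sketched in a few lines. The negation word list below is a small illustrative subset; the NLTK utility linked in the note uses a considerably longer pattern.

```python
import re

def min_max_rescale(values):
    """Rescale values to [0, 1] via f(x) = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Illustrative subset of negation words; see the NLTK sentiment utilities
# referenced in the note for the full regular expression.
NEGATION_RE = re.compile(r"\b(?:never|no|nothing|nowhere|not)\b", re.IGNORECASE)

def count_negations(sentence):
    """Count negation words extracted by the regular expression."""
    return len(NEGATION_RE.findall(sentence))

print(min_max_rescale([2.0, 4.0, 6.0]))   # prints [0.0, 0.5, 1.0]
print(count_negations("Never say no"))    # prints 2
```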
- 10.
415 sentence pairs were filtered out because of duplications or missing labels.
- 11.
For example, correlation in multi-class classification problems requires measures such as mutual information (Cover & Thomas, 1991), and even our natural language inference example used a special subcase of Pearson correlation called point-biserial correlation between continuous and dichotomous variables (Agresti, 2002).
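The point-biserial special case mentioned here can be illustrated directly: for a dichotomous variable coded 0/1 and a continuous variable, the point-biserial coefficient coincides with the ordinary Pearson correlation. The data below are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Illustrative data: a dichotomous label (e.g. entailed vs. not entailed)
# and a continuous feature (e.g. a word-overlap score).
label   = np.array([0, 0, 0, 1, 1, 1])
feature = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.9])

r_pb, _ = pointbiserialr(label, feature)
r_pe, _ = pearsonr(label, feature)

# Point-biserial correlation is Pearson correlation with 0/1 coding.
assert abs(r_pb - r_pe) < 1e-9
print(round(float(r_pb), 3))  # prints 0.965
```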
- 12.
In the simplest form, the degrees of freedom of a model are calculated as the number of tuneable parameters. For example, a GAM for \(n=1, \ldots ,N\) data points, modeling feature shapes for each of \(k=1, \ldots , p\) input features with cubic splines of \(d_k\) parameters per feature, together with a smoothness penalty for each feature, adds up to \((N \times \sum _{k=1}^p d_k) + p\) degrees of freedom. For the notion of effective degrees of freedom and its computation, see Appendix A.1.
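The count in this note can be written out as a one-line helper. This is an illustrative sketch that follows the formula \((N \times \sum _{k=1}^p d_k) + p\) verbatim; the example numbers are assumptions.

```python
def gam_degrees_of_freedom(n_points, spline_params_per_feature):
    """Degrees of freedom of a GAM per the note's count:
    (N * sum of spline parameters d_k over the p features) + p,
    where the final p accounts for one smoothness penalty per feature."""
    p = len(spline_params_per_feature)
    return n_points * sum(spline_params_per_feature) + p

# Example: N = 100 data points, p = 3 features with cubic splines of
# d_k = 4 parameters each -> 100 * 12 + 3 = 1203.
print(gam_degrees_of_freedom(100, [4, 4, 4]))  # prints 1203
```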
- 13.
The feedforward neural network was implemented in PyTorch (https://pytorch.org). It consists of 7 layers, with an ascending, then descending number of neurons per layer, and a tanh activation function. It was trained for regression using PyTorch’s SGD optimizer, with batch size 64, learning rate 0.01, without dropout, for 5 epochs. All other optimizer settings are the default values of PyTorch’s SGD optimizer.
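For concreteness, an architecture of this kind can be sketched as below. The note does not specify the exact layer widths or the input dimension, so the values used here (input size 10, widths 32 up to 128 and back down to 16) are illustrative assumptions; only the layer count, tanh activations, SGD optimizer with learning rate 0.01, and batch size 64 follow the note, and a single training step stands in for the 5 epochs of training.

```python
import torch
import torch.nn as nn

def build_regression_net(input_dim=10, widths=(32, 64, 128, 64, 32, 16)):
    """7-layer feedforward net: widths ascend, then descend, with tanh
    activations after each hidden layer and a single regression output."""
    layers, prev = [], input_dim
    for w in widths:                   # 6 hidden layers ...
        layers += [nn.Linear(prev, w), nn.Tanh()]
        prev = w
    layers.append(nn.Linear(prev, 1))  # ... plus the output layer = 7 layers
    return nn.Sequential(*layers)

model = build_regression_net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # defaults otherwise
loss_fn = nn.MSELoss()

# One illustrative SGD step on a random batch of size 64.
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```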
- 14.
For the binary classification data, we use a GAM that assumes a binomial response variable and a logistic link function.
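In symbols, such a GAM models the log-odds of the positive class as an additive combination of smooth per-feature shapes \(f_k\), a standard formulation for a binomial response with logistic link (Hastie & Tibshirani, 1990):

```latex
% Binomial GAM with logistic link function:
% the log-odds of the positive class are additive in the feature shapes f_k.
\log \frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)}
  = \beta_0 + \sum_{k=1}^{p} f_k(x_k)
```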
- 15.
The feedforward neural network was implemented in PyTorch (https://pytorch.org). It consists of 7 layers, with an ascending, then descending number of neurons per layer, and a ReLU activation function (Glorot et al., 2011). It was trained for regression using PyTorch’s SGD optimizer, with batch size 64, learning rate 0.01, and a dropout rate of 0.2 in hidden layers, for 5 epochs. All other optimizer settings are the default values of PyTorch’s SGD optimizer.
- 16.
Minor differences in meta-parameter settings to the model trained for liver SOFA prediction include a smaller batch size of 32 and a dropout rate of 0.
References
Agarwal, R., Melnick, L., Frosst, N., Zhang, X., Lengerich, B., Caruana, R., & Hinton, G. (2021a). Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems. Virtual. Available from: https://openreview.net/forum?id=wHkKTW2wrmm
Agrawal, A., Batra, D., Parikh, D., & Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Available from: https://openaccess.thecvf.com/content_cvpr_2018/papers/Agrawal_Dont_Just_Assume_CVPR_2018_paper.pdf
Agresti, A. (2002). Categorical data analysis. Wiley. Available from: https://doi.org/10.1002/0471249688
Alvarez-Melis, D., & Jaakkola, T. S. (2018). Towards robust interpretability with self-explaining neural networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS). Available from: https://proceedings.neurips.cc/paper_files/paper/2018/file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf
Amodio, S., Aria, M., & D’Ambrosio, A. (2014). On concurvity in nonlinear and nonparametric regression models. Statistica, 1, 85–98. Available from: http://dx.doi.org/10.6092/issn.1973-2201/4599
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv:1907.02893. Available from: https://doi.org/10.48550/arXiv.1907.02893
Balzer, W. (1992). The structuralist view of measurement: an extension of received measurement theories. In C. Savage & P. Ehrlich (Eds.), Philosophical and foundational issues in measurement theory (pp. 93–117). Erlbaum. Available from: http://dx.doi.org/10.4324/9780203772256-10
Balzer, W., & Brendel, K. R. (2019). Theorie der Wissenschaften. Springer. Available from: http://dx.doi.org/10.1007/978-3-658-21222-3
Borsboom, D. (2005). Measuring the mind. Conceptual issues in contemporary psychometrics. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511490026
Borsboom, D., & Mellenbergh, G. J. (2007). Test validity in cognitive assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive Diagnostic Assessment for Education. Theory and Applications (pp. 85–115). Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511611186.004
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. Available from: http://dx.doi.org/10.1037/0033-295x.111.4.1061
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World-Wide Web Conference (WWW 1998). Available from: http://dx.doi.org/10.1016/s0169-7552(98)00110-x
Chapelle, O., & Chang, Y. (2011). Yahoo learning to rank challenge overview. In Proceedings of the Yahoo Learning to Rank Challenge. Available from: https://proceedings.mlr.press/v14/chapelle11a.html
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS). Available from: https://proceedings.neurips.cc/paper_files/paper/2016/file/7c9d0b1f96aebd7b5eca8c3edaa1
Clark, C., Yatskar, M., & Zettlemoyer, L. (2019). Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Available from: http://dx.doi.org/10.18653/v1/d19-1418
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2461–2505. Available from: https://www.jmlr.org/papers/v12/collobert11a.html
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley. Available from: http://dx.doi.org/10.1002/0471200611
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. Available from: http://dx.doi.org/10.1037/h0040957
de Stoppelaar, S. F., van’t Veer, C., & van der Poll, T. (2014). The role of platelets in sepsis. Thrombosis and Haemostasis, 112(4), 666–667. Available from: https://doi.org/10.1160/th14-02-0126
Dellinger, R., Levy, M., Rhodes, A., Annane, D., Gerlach, H., Opal, S. M., Sevransky, J., Sprung, C., Douglas, I., Jaeschke, R., Osborn, T. M., Nunnally, M., Townsend, S., Reinhart, K., Kleinpell, R., Angus, D., Deutschman, C., Machado, F., Rubenfeld, G., Webb, S., Beale, R., Vincent, J., & Moreno, R. (2013). Surviving sepsis campaign: International guidelines for management of severe sepsis and septic shock: 2012. Critical Care Medicine, 41(2), 580–637. Available from: http://dx.doi.org/10.1097/CCM.0b013e31827e83af
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL:HLT). Available from: https://aclanthology.org/N19-1423.pdf
Ding, Y., Liu, Y., Luan, H., & Sun, M. (2017). Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P17-1106
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608. Available from: https://doi.org/10.48550/arXiv.1702.08608
Dyagilev, K., & Saria, S. (2016). Learning (predictive) risk scores in the presence of censoring due to interventions. Machine Learning, 20(3), 323–348. Available from: http://dx.doi.org/10.1007/s10994-015-5527-7
Gitelman, L. (2013). Raw data. Is an oxymoron. MIT Press. Available from: https://doi.org/10.7551/mitpress/9302.001.0001
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). Available from: https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
Gorman, K., & Bedrick, S. (2019). We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P19-1267
Graf, E. & Azzopardi, L. (2008). A methodology for building a patent test collection for prior art search. In Proceedings of the 2nd International Workshop on Evaluating Information Access (EVIA) (pp. 60–71). Available from: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/EVIA2008/11-EVIA2008-GrafE.pdf
Guo, Y., & Gomes, C. (2009). Ranking structured documents: A large margin based approach for patent prior art search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’09) (pp. 1058–1064). Available from: https://www.ijcai.org/Proceedings/09/Papers/179.pdf
Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., & Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Available from: http://dx.doi.org/10.18653/v1/n18-2017
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–318. Available from: http://dx.doi.org/10.1214/ss/1177013604
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman and Hall. Available from: https://www.routledge.com/Generalized-Additive-Models/Hastie-Tibshirani/p/book/9780412343902
Heckman, N. E. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society B, 48(2), 244–248. Available from: http://dx.doi.org/10.1111/j.2517-6161.1986.tb01407.x
Henry, K. E., Hager, D. N., Pronovost, P. J., & Saria, S. (2015). A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(229), 1–9. Available from: http://dx.doi.org/10.1126/scitranslmed.aab3719
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR). Available from: https://openreview.net/forum?id=Sy2fzU9gl
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop. Available from: https://doi.org/10.48550/arXiv.1503.02531
Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence. Basic Books. Available from: http://dx.doi.org/10.1037/10034-000
Jia, R., & Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21. Available from: http://dx.doi.org/10.7551/mitpress/12274.003.0037
Kaufmann, S., Rosset, S., & Perlich, C. (2011). Leakage in data mining: Formulation, detection, and avoidance. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD). Available from: http://dx.doi.org/10.1145/2020408.2020496
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Available from: http://dx.doi.org/10.3115/v1/d14-1181
Kim, Y., & Rush, A. M. (2016). Sequence-level knowledge distillation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Available from: http://dx.doi.org/10.18653/v1/d16-1139
Kim, B., Kim, H., Kim, K., Kim, S., & Kim, J. (2019b). Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Available from: http://dx.doi.org/10.1109/cvpr.2019.00922
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement. Academic. Available from: http://dx.doi.org/10.2307/3172791
Kuwa, T., Schamoni, S., & Riezler, S. (2020). Embedding meta-textual information for improved learning to rank. In The 28th International Conference on Computational Linguistics (COLING). Available from: http://dx.doi.org/10.18653/v1/2020.coling-main.487
Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., & Müller, K. (2019). Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1), 1–8. Available from: http://dx.doi.org/10.1038/s41467-019-08987-4
Larsen, R. J., & Marx, M. L. (2012). Mathematical statistics and its applications (5th ed.). Prentice Hall. Available from: https://doi.org/10.1080/00031305.2011.645758
Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML). Available from: http://proceedings.mlr.press/v97/locatello19a.html
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley. Available from: https://www.infoagepub.com/products/Statistical-Theories-of-Mental-Test-Scores
Magdy, W., & Jones, G. J. F. (2010). Applying the KISS principle for the CLEF-IP 2010 prior art candidate patent search task. In Proceedings of the CLEF 2010 Workshop. Available from: http://ceur-ws.org/Vol-1176/CLEF2010wn-CLEF-IP-MagdyEt2010.pdf
Mahdabi, P., & Crestani, F. (2014). Query-driven mining of citation networks for patent citation retrieval and recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). Available from: http://dx.doi.org/10.1145/2661829.2661899
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511809071
Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge. Available from: http://dx.doi.org/10.4324/9780203501207
McCoy, T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/p19-1334
McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). Chapman and Hall. Available from: http://dx.doi.org/10.1201/9780203753736
Michell, J. (2004). Measurement in psychology. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511490040
Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL:HLT). Available from: https://aclanthology.org/N13-1090
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38. Available from: https://www.sciencedirect.com/science/article/pii/S0004370218305988
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Yang, B., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T., Nakashole, N., Platanios, E., Ritter, A., Samadi, M., Settles, B., Wang, R., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., & Welling, J. (2018). Never-ending learning. Communication ACM, 61(5), 103–115. Available from: https://doi.org/10.1145/3191513
Narens, L. (1985). Abstract measurement theory. Cambridge University Press. Available from: https://mitpress.mit.edu/9780262140379/abstract-measurement-theory/
Nemati, S., Holder, A., Razmi, F., Stanley, M. D., Clifford, G. D., & Buchman, T. G. (2018). An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical Care Medicine, 46(4), 547–553. Available from: http://dx.doi.org/10.1097/ccm.0000000000002936
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online. Available from: http://dx.doi.org/10.18653/v1/2020.acl-main.441
Niven, T., & Kao, H.-Y. (2019). Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P19-1459
Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press. Available from: https://doi.org/10.1017/CBO9780511803161
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B, 78(5), 947–1012. Available from: https://www.jstor.org/stable/44682904
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: Foundations and learning algorithms. MIT Press. Available from: https://mitpress.mit.edu/9780262037310/elements-of-causal-inference/
Piroi, F., & Tait, J. (2010). CLEF-IP 2010: Retrieval experiments in the intellectual property domain. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010). Available from: http://www.ifs.tuwien.ac.at/~clef-ip/pubs/CLEF-IP-2010-IRF-TR-2010-00005.pdf
Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. Available from: http://dx.doi.org/10.18653/v1/S18-2023
Qin, T., Liu, T.-Y., Xu, J., & Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval Journal, 13(4), 346–374. Available from: https://doi.org/10.1007/s10791-009-9123-y
Reyna, M. A., Josef, C. S., Jeter, R., Shashikumar, S. P., Westover, M. B., Nemati, S., Clifford, G. D., & Sharma, A. (2019). Early prediction of sepsis from clinical data: The physionet/computing in cardiology challenge 2019. Critical Care Medicine, 48(2), 210–217. Available from: https://doi.org/10.1097/CCM.0000000000004145
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the predictions of any classifier. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD). Available from: http://dx.doi.org/10.1145/2939672.2939778
Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. Available from: http://dx.doi.org/10.1561/1500000019
Rosenfeld, E., Ravikumar, P., & Risteski, A. (2021). The risks of invariant risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR). Virtual. Available from: https://openreview.net/forum?id=BbNIbVPJ-42
Rosset, S., Perlich, C., Swirszcz, G., Melville, P., & Liu, Y. (2009). Medical data mining: insights from winning two competitions. Data Mining and Knowledge Discovery, 20, 439–468. Available from: https://doi.org/10.1007/s10618-009-0158-x
Rudd, K. E., Johnson, S. C., Agesa, K. M., Shackelford, K. A., Tsoi, D., Kievlan, D. R., Colombara, D. V., Ikuta, K. S., Kissoon, N., Finfer, S., Fleischmann-Struzek, C., Machado, F. R., Reinhart, K. K., Rowan, K., Seymour, C. W., Watson, R. S., West, T. E., Marinho, F., Hay, S. I., Lozano, R., Lopez, A. D., Angus, D. C., Murray, C. J. L., & Naghavi, M. (2020). Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the global burden of disease study. The Lancet, 395(10219), 200–211. Available from: https://doi.org/10.1016/S0140-6736(19)32989-7
Schamoni, S., & Riezler, S. (2015). Combining orthogonal information in large-scale cross-language information retrieval. In Proceedings of the 38th Annual ACM SIGIR Conference. Available from: http://dx.doi.org/10.1145/2766462.2767805
Schamoni, S., Lindner, H. A., Schneider-Lindner, V., Thiel, M., & Riezler, S. (2019). Leveraging implicit expert knowledge for non-circular machine learning in sepsis prediction. Journal of Artificial Intelligence in Medicine, 100, 1–9. Available from: https://doi.org/10.1016/j.artmed.2019.101725
Schlegel, V., Nenadic, G., & Batista-Navarro, R. (2020). Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models. arXiv:2005.14709. Available from: https://doi.org/10.48550/arXiv.2005.14709
Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., & Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634. Available from: https://doi.org/10.1109/JPROC.2021.3058954
Seymour, C. W., Liu, V. X., Iwashyna, T. J., Brunkhorst, F. M., Rea, T. D., Scherag, A., Rubenfeld, G., Kahn, J. M., Shankar-Hari, M., Singer, M., Deutschman, C. S., Escobar, G. J., & Angus, D. C. (2016). Assessment of clinical criteria for sepsis for the third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8), 762–774. Available from: https://doi.org/10.1001/jama.2016.0288
Singer, M., Deutschman, C. S., & Seymour, C. W. (2016). The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8), 801–810. Available from: https://doi.org/10.1001/jama.2016.0287
Sneed, J. D. (1971). The logical structure of mathematical physics. D. Reidel. Available from: https://doi.org/10.1007/978-94-010-3066-3
Søgaard, A., Ebert, S., Bastings, J., & Filippova, K. (2021). We need to talk about random splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online. Available from: http://dx.doi.org/10.18653/v1/2021.eacl-main.156
Stegmüller, W. (1979). The structuralist view of theories. A possible analogue of the Bourbaki programme in physical science. Springer. Available from: http://dx.doi.org/10.1007/978-3-642-95360-6
Stegmüller, W. (1986). Probleme und Resultate der Wissenschaftstheorie und Analytischen Philosophie. Band II: Theorie und Erfahrung. Zweiter Teilband: Theorienstrukturen und Theoriendynamik (2nd ed.). Springer. Available from: https://doi.org/10.1007/978-3-642-61671-6
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. Available from: http://dx.doi.org/10.1126/science.103.2684.677
Tan, S., Caruana, R., Hooker, G., & Lou, Y. (2018). Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of AIES. Available from: http://dx.doi.org/10.1145/3278721.3278725
Tomaschek, F., Hendrix, P., & Baayen, R. H. (2018). Strategies for addressing collinearity in multivariate linguistic data. Journal of Phonetics, 71, 249–267. Available from: http://dx.doi.org/10.1016/j.wocn.2018.09.004
Vincent, J., Moreno, R., Takala, J., Willatts, S., Mendonça, A. D., Bruining, H., Reinhart, C., Suter, P., & Thijs, L. (1996). The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine, 22(7), 707–710. Available from: https://doi.org/10.1007/BF01709751
Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL:HLT). Available from: http://dx.doi.org/10.18653/v1/N18-1101
Wood, S. N. (2017). Generalized additive models. An introduction with R (2nd ed.). Chapman & Hall/CRC. Available from: https://doi.org/10.1201/9781315370279
Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to information retrieval. In Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval (SIGIR). Available from: https://doi.org/10.1145/383952.384019
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Riezler, S., Hagmann, M. (2024). Validity. In: Validity, Reliability, and Significance. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-57065-0_2
Print ISBN: 978-3-031-57064-3
Online ISBN: 978-3-031-57065-0