Validity

Chapter in Validity, Reliability, and Significance

Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)

Abstract

The notion of validity of a prediction has an ill-defined status in NLP: unlike prediction quality and prediction quantity in classification, which are associated with the widely accepted measures of precision and recall, validity lacks a standard evaluation measure. The goal of this chapter is to give a clear definition of the concept of validity in NLP and data science, which can then be operationalized into methods for measuring validity and applied to general NLP and data science tasks.

Notes

  1. The defining criteria concern heart rate (>90 BPM), temperature (>38 \(^\circ \)C or <36 \(^\circ \)C), respiratory rate (>20 BPM), or white blood cell count (>12 or <4 thousand per microliter), measured in the last 2–8 hours (Dellinger et al., 2013).
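As an illustration, the four criteria above can be checked programmatically. This is a hypothetical sketch (the function and parameter names are invented here), not code from the chapter:

```python
# Hypothetical sketch of the SIRS-style criteria listed above; the
# thresholds follow Dellinger et al. (2013) as cited in the note.

def sirs_flags(heart_rate, temperature_c, respiratory_rate, wbc_thousands_per_ul):
    """Return which of the four defining criteria are met."""
    return {
        "heart_rate": heart_rate > 90,                             # > 90 BPM
        "temperature": temperature_c > 38 or temperature_c < 36,   # degrees Celsius
        "respiratory_rate": respiratory_rate > 20,                 # > 20 BPM
        "white_blood_cells": wbc_thousands_per_ul > 12 or wbc_thousands_per_ul < 4,
    }

flags = sirs_flags(heart_rate=95, temperature_c=37.0,
                   respiratory_rate=22, wbc_thousands_per_ul=8.0)
print(sum(flags.values()))  # number of criteria met; here 2
```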

  2. The measurements are taken for creatinine level and urine output, Glasgow Coma Scale, bilirubin level, respiratory level, and thrombocytes level (Vincent et al., 1996).

  3. Balzer and Brendel (2019) and Balzer (1992) utilize a formalism that allows them to express all relevant concepts (even functions) in terms of tuples and sets. Essentially, the condition of disjointness of the function to be measured and the function given by the model means that the input measurements must be determinable without knowing the quantity that one wants to measure.

  4. Further and even stricter conditions on validity of measurement are possible and have been discussed in philosophy of science. For example, see Sneed (1971) and Stegmüller (1979, 1986) for a discussion of theoretical terms and possible circularity problems for fundamental measurement procedures. For a deeper discussion of statistical measurement procedures, see Balzer and Brendel (2019).

  5. A well-known example from the area of image processing is the (mis)use of copyright tags as predictive features (Lapuschkin et al., 2019).

  6. A precise definition of the notion of interpretability is an open research problem that is outside the scope of this book. It involves issues ranging from the (non)concurvity of features (Amodio et al., 2014; Tomaschek et al., 2018) to human factors of intelligibility (Alvarez-Melis & Jaakkola, 2018; Doshi-Velez & Kim, 2017; Miller, 2019).

  7. In a similar way, factorized latent representations (Chen et al., 2016; Higgins et al., 2017; Locatello et al., 2019) have to be mapped to interpretable concepts when used as explanatory factors in image processing.

  8. Clearly, invariance of correlations across different environments is only part of causality, and further conditions are necessary (Rosenfeld et al., 2021). Thus, we do not make any causality claims on our validity tests, but instead we take a practical approach where computing the descriptive statistics of the correlation coefficient for given features and labels across given domains replaces the notion of causality in Borsboom and Mellenbergh's approach to construct validity.

  9. Rescaling was performed by the min-max formula \(f(x) = \frac{x - \min}{\max - \min}\). Negations were computed by a regular expression extracting negation words, following https://www.nltk.org/_modules/nltk/sentiment/util.html.
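The rescaling formula and a regular-expression-based negation count can be sketched as follows. The pattern is an illustrative simplification in the spirit of nltk.sentiment.util, not NLTK's exact expression:

```python
import re

def min_max_rescale(values):
    """Min-max rescaling f(x) = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Illustrative negation pattern: common negation words and n't contractions.
NEGATION_RE = re.compile(r"\b(?:no|not|never|none|nobody|nothing|\w+n't)\b",
                         re.IGNORECASE)

def count_negations(sentence):
    return len(NEGATION_RE.findall(sentence))

print(min_max_rescale([2.0, 4.0, 6.0]))                   # [0.0, 0.5, 1.0]
print(count_negations("I don't think this is not good"))  # 2
```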

  10. 415 sentence pairs were filtered out because of duplicates or missing labels.

  11. For example, correlation in multi-class classification problems requires measures such as mutual information (Cover & Thomas, 1991), and even our natural language inference example used a special subcase of Pearson correlation called point-biserial correlation between continuous and dichotomous variables (Agresti, 2002).
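Since the point-biserial correlation between a 0/1-coded dichotomous variable and a continuous variable equals their Pearson correlation, a minimal self-contained sketch (with invented example data) is:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; with 0/1-coded xs this is exactly
    the point-biserial correlation between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Dichotomous label (e.g. entailed / not entailed) vs. a continuous feature:
labels = [0, 0, 0, 1, 1, 1]
feature = [1.0, 2.0, 1.5, 3.0, 3.5, 4.0]
print(round(pearson(labels, feature), 3))  # ≈ 0.926
```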

  12. In the simplest form, the degrees of freedom of a model are calculated as the number of tunable parameters. For example, a GAM for \(n=1, \ldots, N\) data points, modeling feature shapes for each of \(k=1, \ldots, p\) input features with cubic splines of \(d_k\) parameters per feature, together with a smoothness penalty for each feature, adds up to \((N \times \sum_{k=1}^p d_k) + p\) degrees of freedom. For the notion of effective degrees of freedom and its computation, see Appendix A.1.
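The parameter count stated in this note can be reproduced in a few lines; the example numbers below are invented for illustration:

```python
# Degrees of freedom of a GAM as counted in the note:
# (N x sum of the d_k spline parameters) plus one smoothness penalty
# per input feature. This follows the note's formula verbatim.
def gam_dof(n_points, spline_params):
    """n_points: N; spline_params: list of d_k, one entry per input feature."""
    p = len(spline_params)
    return n_points * sum(spline_params) + p

print(gam_dof(n_points=100, spline_params=[4, 4, 4]))  # 100 * 12 + 3 = 1203
```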

  13. The feedforward neural network was implemented in PyTorch (https://pytorch.org). It consists of 7 layers, with an ascending, then descending number of neurons per layer, and a tanh activation function. It was trained for regression using PyTorch's SGD optimizer with batch size 64 and learning rate 0.01, without dropout, for 5 epochs. All other optimizer settings are the default values of PyTorch's SGD optimizer.
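A sketch of such a network in PyTorch, in which the concrete layer widths are assumptions; only the layer count, activation, optimizer, learning rate, batch size, and regression objective come from the note:

```python
import torch
import torch.nn as nn

# 7 linear layers with an ascending-then-descending number of neurons
# and tanh activations between them; widths are illustrative guesses.
widths = [1, 8, 16, 32, 16, 8, 4, 1]
layers = []
for i in range(7):
    layers.append(nn.Linear(widths[i], widths[i + 1]))
    if i < 6:
        layers.append(nn.Tanh())
model = nn.Sequential(*layers)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr from the note
loss_fn = nn.MSELoss()                                    # regression loss

x = torch.randn(64, 1)                 # batch size 64, as stated
loss = loss_fn(model(x), torch.randn(64, 1))
loss.backward()                        # one illustrative training step
optimizer.step()
```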

  14. For the binary classification data, we use a GAM that assumes a binomial response variable and a logistic link function.

  15. The feedforward neural network was implemented in PyTorch (https://pytorch.org). It consists of 7 layers, with an ascending, then descending number of neurons per layer, and a ReLU activation function (Glorot et al., 2011). It was trained for regression using PyTorch's SGD optimizer with batch size 64, learning rate 0.01, and a dropout rate of 0.2 in hidden layers, for 5 epochs. All other optimizer settings are the default values of PyTorch's SGD optimizer.

  16. Meta-parameter settings differ only slightly from the model trained for liver SOFA prediction: a smaller batch size of 32 and a dropout rate of 0.

References

  • Agarwal, R., Melnick, L., Frosst, N., Zhang, X., Lengerich, B., Caruana, R., & Hinton, G. (2021a). Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems. Virtual. Available from: https://openreview.net/forum?id=wHkKTW2wrmm

  • Agrawal, A., Batra, D., Parikh, D., & Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Available from: https://openaccess.thecvf.com/content_cvpr_2018/papers/Agrawal_Dont_Just_Assume_CVPR_2018_paper.pdf

  • Agresti, A. (2002). Categorical data analysis. Wiley. Available from: https://doi.org/10.1002/0471249688

  • Alvarez-Melis, D., & Jaakkola, T. S. (2018). Towards robust interpretability with self-explaining neural networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS). Available from: https://proceedings.neurips.cc/paper_files/paper/2018/file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf

  • Amodio, S., Aria, M., & D’Ambrosio, A. (2014). On concurvity in nonlinear and nonparametric regression models. Statistica, 1, 85–98. Available from: http://dx.doi.org/10.6092/issn.1973-2201/4599

  • Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv:1907.02893. Available from: https://doi.org/10.48550/arXiv.1907.02893

  • Balzer, W. (1992). The structuralist view of measurement: an extension of received measurement theories. In C. Savage & P. Ehrlich (Eds.), Philosophical and foundational issues in measurement theory (pp. 93–117). Erlbaum. Available from: http://dx.doi.org/10.4324/9780203772256-10

  • Balzer, W., & Brendel, K. R. (2019). Theorie der Wissenschaften. Springer. Available from: http://dx.doi.org/10.1007/978-3-658-21222-3

  • Borsboom, D. (2005). Measuring the mind. Conceptual issues in contemporary psychometrics. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511490026

  • Borsboom, D., & Mellenbergh, G. J. (2007). Test validity in cognitive assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive Diagnostic Assessment for Education. Theory and Applications (pp. 85–115). Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511611186.004

  • Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. Available from: http://dx.doi.org/10.1037/0033-295x.111.4.1061

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World-Wide Web Conference (WWW 1998). Available from: http://dx.doi.org/10.1016/s0169-7552(98)00110-x

  • Chapelle, O., & Chang, Y. (2011). Yahoo learning to rank challenge overview. In Proceedings of the Yahoo Learning to Rank Challenge. Available from: https://proceedings.mlr.press/v14/chapelle11a.html

  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS). Available from: https://proceedings.neurips.cc/paper_files/paper/2016/file/7c9d0b1f96aebd7b5eca8c3edaa1

  • Clark, C., Yatskar, M., & Zettlemoyer, L. (2019). Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Available from: http://dx.doi.org/10.18653/v1/d19-1418

  • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537. Available from: https://www.jmlr.org/papers/v12/collobert11a.html

  • Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley. Available from: http://dx.doi.org/10.1002/0471200611.

  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. Available from: http://dx.doi.org/10.1037/h0040957

  • de Stoppelaar, S. F., van’t Veer, C., & van der Poll, T. (2014). The role of platelets in sepsis. Thrombosis and Haemostasis, 112(4), 666–667. Available from: https://doi.org/10.1160/th14-02-0126

  • Dellinger, R., Levy, M., Rhodes, A., Annane, D., Gerlach, H., Opal, S. M., Sevransky, J., Sprung, C., Douglas, I., Jaeschke, R., Osborn, T. M., Nunnally, M., Townsend, S., Reinhart, K., Kleinpell, R., Angus, D., Deutschman, C., Machado, F., Rubenfeld, G., Webb, S., Beale, R., Vincent, J., & Moreno, R. (2013). Surviving sepsis campaign: International guidelines for management of severe sepsis and septic shock: 2012. Critical Care Medicine, 41(2), 580–637. Available from: http://dx.doi.org/10.1097/CCM.0b013e31827e83af

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Available from: https://aclanthology.org/N19-1423.pdf

  • Ding, Y., Liu, Y., Luan, H., & Sun, M. (2017). Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P17-1106

  • Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608. Available from: https://doi.org/10.48550/arXiv.1702.08608

  • Dyagilev, K., & Saria, S. (2016). Learning (predictive) risk scores in the presence of censoring due to interventions. Machine Learning, 20(3), 323–348. Available from: http://dx.doi.org/10.1007/s10994-015-5527-7

  • Gitelman, L. (2013). Raw data. Is an oxymoron. MIT Press. Available from: https://doi.org/10.7551/mitpress/9302.001.0001

  • Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). Available from: https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf

  • Gorman, K., & Bedrick, S. (2019). We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P19-1267

  • Graf, E. & Azzopardi, L. (2008). A methodology for building a patent test collection for prior art search. In Proceedings of the 2nd International Workshop on Evaluating Information Access (EVIA) (pp. 60–71). Available from: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings7/pdf/EVIA2008/11-EVIA2008-GrafE.pdf

  • Guo, Y., & Gomes, C. (2009). Ranking structured documents: A large margin based approach for patent prior art search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’09) (pp. 1058–1064). Available from: https://www.ijcai.org/Proceedings/09/Papers/179.pdf

  • Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., & Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Available from: http://dx.doi.org/10.18653/v1/n18-2017

  • Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–318. Available from: http://dx.doi.org/10.1214/ss/1177013604

  • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman and Hall. Available from: https://www.routledge.com/Generalized-Additive-Models/Hastie-Tibshirani/p/book/9780412343902

  • Heckman, N. E. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society B, 48(2), 244–248. Available from: http://dx.doi.org/10.1111/j.2517-6161.1986.tb01407.x

  • Henry, K. E., Hager, D. N., Pronovost, P. J., & Saria, S. (2015). A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(229), 1–9. Available from: http://dx.doi.org/10.1126/scitranslmed.aab3719

  • Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR). Available from: https://openreview.net/forum?id=Sy2fzU9gl

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. In NIPS deep learning workshop. Available from: https://doi.org/10.48550/arXiv.1503.02531

  • Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence. Basic Books. Available from: http://dx.doi.org/10.1037/10034-000

  • Jia, R., & Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

  • Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21. Available from: http://dx.doi.org/10.7551/mitpress/12274.003.0037

  • Kaufmann, S., Rosset, S., & Perlich, C. (2011). Leakage in data mining: Formulation, detection, and avoidance. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD). Available from: http://dx.doi.org/10.1145/2020408.2020496

  • Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Available from: http://dx.doi.org/10.3115/v1/d14-1181

  • Kim, Y., & Rush, A. M. (2016). Sequence-level knowledge distillation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Available from: http://dx.doi.org/10.18653/v1/d16-1139

  • Kim, B., Kim, H., Kim, K., Kim, S., & Kim, J. (2019b). Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Available from: http://dx.doi.org/10.1109/cvpr.2019.00922

  • Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement. Academic. Available from: http://dx.doi.org/10.2307/3172791

  • Kuwa, T., Schamoni, S., & Riezler, S. (2020). Embedding meta-textual information for improved learning to rank. In The 28th International Conference on Computational Linguistics (COLING). Available from: http://dx.doi.org/10.18653/v1/2020.coling-main.487

  • Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., & Müller, K. (2019). Unmasking clever hans predictors and assessing what machines really learn. Nature Communications, 10(1), 1–8. Available from: http://dx.doi.org/10.1038/s41467-019-08987-4

  • Larsen, R. J., & Marx, M. L. (2012). Mathematical statistics and its applications (5th ed.). Prentice Hall. Available from: https://doi.org/10.1080/00031305.2011.645758

  • Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML). Available from: http://proceedings.mlr.press/v97/locatello19a.html

  • Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley. Available from: https://www.infoagepub.com/products/Statistical-Theories-of-Mental-Test-Scores

  • Magdy, W., & Jones, G. J. F. (2010). Applying the KISS principle for the CLEF-IP 2010 prior art candidate patent search task. In Proceedings of the CLEF 2010 Workshop. Available from: http://ceur-ws.org/Vol-1176/CLEF2010wn-CLEF-IP-MagdyEt2010.pdf

  • Mahdabi, P., & Crestani, F. (2014). Query-driven mining of citation networks for patent citation retrieval and recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). Available from: http://dx.doi.org/10.1145/2661829.2661899

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511809071

  • Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge. Available from: http://dx.doi.org/10.4324/9780203501207

  • McCoy, T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/p19-1334

  • McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). Chapman and Hall. Available from: http://dx.doi.org/10.1201/9780203753736

  • Michell, J. (2004). Measurement in psychology. Cambridge University Press. Available from: http://dx.doi.org/10.1017/cbo9780511490040

  • Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Available from: https://aclanthology.org/N13-1090

  • Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38. Available from: https://www.sciencedirect.com/science/article/pii/S0004370218305988

  • Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Yang, B., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T., Nakashole, N., Platanios, E., Ritter, A., Samadi, M., Settles, B., Wang, R., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., & Welling, J. (2018). Never-ending learning. Communications of the ACM, 61(5), 103–115. Available from: https://doi.org/10.1145/3191513

  • Narens, L. (1985). Abstract measurement theory. Cambridge University Press. Available from: https://mitpress.mit.edu/9780262140379/abstract-measurement-theory/

  • Nemati, S., Holder, A., Razmi, F., Stanley, M. D., Clifford, G. D., & Buchman, T. G. (2018). An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical Care Medicine, 46(4), 547–553. Available from: http://dx.doi.org/10.1097/ccm.0000000000002936

  • Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online. Available from: http://dx.doi.org/10.18653/v1/2020.acl-main.441

  • Niven, T., & Kao, H.-Y. (2019). Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: http://dx.doi.org/10.18653/v1/P19-1459

  • Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press. Available from: https://doi.org/10.1017/CBO9780511803161

  • Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B, 78(5), 947–1012. Available from: https://www.jstor.org/stable/44682904

  • Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: Foundations and learning algorithms. MIT Press. Available from: https://mitpress.mit.edu/9780262037310/elements-of-causal-inference/

  • Piroi, F., & Tait, J. (2010). CLEF-IP 2010: Retrieval experiments in the intellectual property domain. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010). Available from: http://www.ifs.tuwien.ac.at/~clef-ip/pubs/CLEF-IP-2010-IRF-TR-2010-00005.pdf

  • Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. Available from: http://dx.doi.org/10.18653/v1/S18-2023

  • Qin, T., Liu, T.-Y., Xu, J., & Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval Journal, 13(4), 346–374. Available from: https://doi.org/10.1007/s10791-009-9123-y

  • Reyna, M. A., Josef, C. S., Jeter, R., Shashikumar, S. P., Westover, M. B., Nemati, S., Clifford, G. D., & Sharma, A. (2019). Early prediction of sepsis from clinical data: The physionet/computing in cardiology challenge 2019. Critical Care Medicine, 48(2), 210–217. Available from: https://doi.org/10.1097/CCM.0000000000004145

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the predictions of any classifier. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD). Available from: http://dx.doi.org/10.1145/2939672.2939778

  • Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. Available from: http://dx.doi.org/10.1561/1500000019

  • Rosenfeld, E., Ravikumar, P., & Risteski, A. (2021). The risks of invariant risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR). Virtual. Available from: https://openreview.net/forum?id=BbNIbVPJ-42

  • Rosset, S., Perlich, C., Swirszcz, G., Melville, P., & Liu, Y. (2009). Medical data mining: insights from winning two competitions. Data Mining and Knowledge Discovery, 20, 439–468. Available from: https://doi.org/10.1007/s10618-009-0158-x

  • Rudd, K. E., Johnson, S. C., Agesa, K. M., Shackelford, K. A., Tsoi, D., Kievlan, D. R., Colombara, D. V., Ikuta, K. S., Kissoon, N., Finfer, S., Fleischmann-Struzek, C., Machado, F. R., Reinhart, K. K., Rowan, K., Seymour, C. W., Watson, R. S., West, T. E., Marinho, F., Hay, S. I., Lozano, R., Lopez, A. D., Angus, D. C., Murray, C. J. L., & Naghavi, M. (2020). Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the global burden of disease study. The Lancet, 395(10219), 200–211. Available from: https://doi.org/10.1016/S0140-6736(19)32989-7

  • Schamoni, S., & Riezler, S. (2015). Combining orthogonal information in large-scale cross-language information retrieval. In Proceedings of the 38th Annual ACM SIGIR Conference. Available from: http://dx.doi.org/10.1145/2766462.2767805

  • Schamoni, S., Lindner, H. A., Schneider-Lindner, V., Thiel, M., & Riezler, S. (2019). Leveraging implicit expert knowledge for non-circular machine learning in sepsis prediction. Journal of Artificial Intelligence in Medicine, 100, 1–9. Available from: https://doi.org/10.1016/j.artmed.2019.101725

  • Schlegel, V., Nenadic, G., & Batista-Navarro, R. (2020). Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models. arXiv:2005.14709. Available from: https://doi.org/10.48550/arXiv.2005.14709

  • Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., & Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634. Available from: https://doi.org/10.1109/JPROC.2021.3058954

  • Seymour, C. W., Liu, V. X., Iwashyna, T. J., Brunkhorst, F. M., Rea, T. D., Scherag, A., Rubenfeld, G., Kahn, J. M., Shankar-Hari, M., Singer, M., Deutschman, C. S., Escobar, G. J., & Angus, D. C. (2016). Assessment of clinical criteria for sepsis for the third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8), 762–774. Available from: https://doi.org/10.1001/jama.2016.0288

  • Singer, M., Deutschman, C. S., & Seymour, C. W. (2016). The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8), 801–810. Available from: https://doi.org/10.1001/jama.2016.0287

  • Sneed, J. D. (1971). The logical structure of mathematical physics. D. Reidel. Available from: https://doi.org/10.1007/978-94-010-3066-3

  • Søgaard, A., Ebert, S., Bastings, J., & Filippova, K. (2021). We need to talk about random splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online. Available from: http://dx.doi.org/10.18653/v1/2021.eacl-main.156

  • Stegmüller, W. (1979). The structuralist view of theories. A possible analogue of the Bourbaki programme in physical science. Springer. Available from: http://dx.doi.org/10.1007/978-3-642-95360-6

  • Stegmüller, W. (1986). Probleme und Resultate der Wissenschaftstheorie und Analytischen Philosophie. Band II: Theorie und Erfahrung. Zweiter Teilband: Theorienstrukturen und Theoriendynamik (2nd ed.). Springer. Available from: https://doi.org/10.1007/978-3-642-61671-6

  • Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. Available from: http://dx.doi.org/10.1126/science.103.2684.677

  • Tan, S., Caruana, R., Hooker, G., & Lou, Y. (2018). Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of AIES. Available from: http://dx.doi.org/10.1145/3278721.3278725

  • Tomaschek, F., Hendrix, P., & Baayen, R. H. (2018). Strategies for addressing collinearity in multivariate linguistic data. Journal of Phonetics, 71, 249–267. Available from: http://dx.doi.org/10.1016/j.wocn.2018.09.004

  • Vincent, J., Moreno, R., Takala, J., Willatts, S., Mendonça, A. D., Bruining, H., Reinhart, C., Suter, P., & Thijs, L. (1996). The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine, 22(7), 707–710. Available from: https://doi.org/10.1007/BF01709751

  • Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Available from: http://dx.doi.org/10.18653/v1/N18-1101

  • Wood, S. N. (2017). Generalized additive models. An introduction with R (2nd ed.). Chapman & Hall/CRC. Available from: https://doi.org/10.1201/9781315370279

  • Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to information retrieval. In Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval (SIGIR). Available from: https://doi.org/10.1145/383952.384019

Author information

Corresponding author

Correspondence to Stefan Riezler.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Riezler, S., Hagmann, M. (2024). Validity. In: Validity, Reliability, and Significance. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-57065-0_2

  • DOI: https://doi.org/10.1007/978-3-031-57065-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57064-3

  • Online ISBN: 978-3-031-57065-0
