Abstract
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint—a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
Main
Automatic image processing with ML is gaining increasing traction in biological and medical imaging research and practice. Research has predominantly focused on the development of new image processing algorithms. The critical issue of reliable and objective performance assessment of these algorithms, however, remains largely unexplored. Algorithm performance in image processing is commonly assessed with validation metrics (not to be confused with distance metrics in the pure mathematical sense) that should serve as proxies for the domain interest. In consequence, the impact of validation metrics cannot be overstated. First, they are the basis for deciding on the practical (for example, clinical) suitability of a method and are thus a key component for translation into biomedical practice. In fact, validation that is not conducted according to relevant metrics could be one major reason why many artificial intelligence (AI) developments in medical imaging fail to reach clinical practice1,2. In other words, the numbers presented in journals and conference proceedings do not reflect how successful a system will be when applied in practice. Second, metrics guide scientific progress in the field; flawed metric use can lead to entirely futile resource investment and infeasible research directions while obscuring true scientific advancements.
Despite the importance of metrics, an increasing body of work shows that the metrics used in common practice often do not adequately reflect the underlying biomedical problems, diminishing the validity of the investigated methods3,4,5. Detailed, category-specific guidance is provided in Supplementary Note 2.2 for ImLC, Supplementary Note 2.3 for SemS, Supplementary Note 2.4 for ObD and Supplementary Note 2.5 for InS.
The fingerprint comprises a set of items, each of which represents a specific property of the problem, is either binary or categorical, and must be instantiated by the user. Besides the problem category, the fingerprint comprises domain interest-related, target structure-related, dataset-related and algorithm output-related properties. A comprehensive version of the fingerprints for all problem categories can be found in Figs. SN 2.7–2.9 (ImLC), SN 2.10–2.11 (SemS), SN 2.12–2.14 (ObD) and SN 2.15–2.17 (InS). Pred, prediction; ref, reference.
Based on the problem fingerprint, the user is then, in a transparent and understandable manner, guided through the process of selecting an appropriate set of metrics while being made aware of potential pitfalls related to the specific characteristics of the underlying biomedical problem. The Metrics Reloaded framework currently supports problems in which categorical target variables are to be predicted based on a given n-dimensional input image (possibly enhanced with context information) at pixel, object or image level (Fig. 4). It thus supports problems that can be assigned to one of the following four problem categories: image-level classification (ImLC; image level), ObD (object level), semantic segmentation (SemS; pixel level) or instance segmentation (InS; pixel level). Designed to be imaging modality independent, Metrics Reloaded is suitable for application in various image analysis domains, even beyond the field of biomedicine.
The framework considers problems in which categorical target variables are to be predicted at image, object and/or pixel level, resulting (from top to bottom) in ImLC, ObD, InS or SemS problems. These problem categories are relevant across modalities (here CT, microscopy and endoscopy) and application domains. From left to right: annotation of benign and malignant lesions in CT images59, different cell types in microscopy images60 and medical instruments in laparoscopy images61. Left, reproduced with permission from ref. 59, American Association of Physicists in Medicine; center, reproduced with permission from ref. 60, Springer Nature Limited; right, reproduced with permission from ref. 61, Springer Nature Limited. RBC, red blood cell; WBC, white blood cell.
Here, we present the key contributions of our work in detail, namely (1) the Metrics Reloaded framework for problem-aware metric selection along with the key findings and design decisions that guided its development (Fig. 2), (2) the application of the framework to common biomedical use cases, showcasing its broad applicability (selection shown in Fig. 5) and (3) the open online tool that has been implemented to improve the user experience with our framework.
From top to bottom: (1) Image classification for the examples of sperm motility classification62 and disease classification in dermoscopic images1. The problem category is a fingerprint item itself.
In the following, we refer to all fingerprint items with the notation FPX.Y, where Y is a numerical identifier, and the index X represents one of the following families:
FP1 – Problem category refers to the problem category generated by S1 (Extended Data Fig. 1).
FP2 – Domain interest-related properties reflect user preferences and are highly dependent on the target application. A semantic image segmentation that serves as the foundation for radiotherapy planning, for example, would require exact contours (FP2.1 – particular importance of structure boundaries = TRUE). On the other hand, for a cell segmentation problem that serves as prerequisite for cell tracking, the object centers may be much more important (FP2.3 – particular importance of structure center(line) = TRUE). Both problems could be tackled with identical network architectures, but the validation metrics should be different.
FP3 – Target structure-related properties represent inherent properties of target structure(s) (if any), such as the size, size variability and the shape. Here, the term target structures can refer to any object/structure of interest, such as cells, vessels, medical instruments or tumors.
FP4 – Dataset-related properties capture properties inherent to the provided data to which the metric is applied. They primarily relate to class prevalences, uncertainties of the reference annotations and whether the data structure is hierarchical.
FP5 – Algorithm output-related properties encode properties of the output, such as the availability of predicted class scores.
Note that not all properties are relevant for all problem categories. For example, the shape and size of target structures are highly relevant for segmentation problems but irrelevant for image classification problems. The complete problem category-specific fingerprints are provided in Supplementary Note 1.3.
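To make the notion of a fingerprint concrete, a minimal sketch of such a structured representation is given below. The field names are illustrative shorthands for the items described above (FP1, FP2.1, FP2.3, FP2.6, FP2.7, FP3.1, FP4.1/2, FP4.3.1 and FP5.1); they are our own abbreviations, not the official schema of the Metrics Reloaded tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProblemFingerprint:
    # FP1: problem category ("ImLC", "ObD", "SemS" or "InS")
    problem_category: str
    # FP2: domain interest-related properties
    boundaries_particularly_important: bool = False   # FP2.1
    centers_particularly_important: bool = False      # FP2.3
    decision_rule_available: bool = False             # FP2.6
    calibration_assessment_requested: bool = False    # FP2.7
    # FP3: target structure-related properties
    structures_consistently_small: Optional[bool] = None  # FP3.1
    # FP4: dataset-related properties
    high_class_imbalance: Optional[bool] = None       # FP4.1/FP4.2
    noisy_reference: Optional[bool] = None            # FP4.3.1
    # FP5: algorithm output-related properties
    class_scores_available: bool = False              # FP5.1

# Example: cell segmentation as a prerequisite for cell tracking (see FP2 above)
fingerprint = ProblemFingerprint(
    problem_category="InS",
    centers_particularly_important=True,
)
```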
Metrics Reloaded addresses all three types of metric pitfalls
Metrics Reloaded was designed to address all three types of metric pitfalls identified in ref. 19 and illustrated in Fig. 1a. More specifically, each of the three steps shown in Fig. 2 addresses one type of pitfall:
Step 1 – Fingerprinting. A user should begin by reading the general instructions of the recommendation framework (Supplementary Note 1.1). Next, the user should convert the driving biomedical problem to a problem fingerprint. This step not only is a prerequisite for applying the framework across application domains and classification scales, but also specifically addresses the inappropriate choice of the problem category via the integrated category mapping. Once the user’s domain knowledge has been encapsulated in the problem fingerprint, the actual metric selection is conducted according to a domain-agnostic and modality-agnostic process.
Step 2 – Metric selection. A Delphi process yielded the Metrics Reloaded pool of reference-based validation metrics (Fig. SN 2.1). Notably, this pool contains metrics that are currently not widely known in some biomedical image analysis communities. A prominent example is the Net Benefit (NB)24 metric, popular in clinical prediction tasks and designed to determine whether basing decisions on a method would do more good than harm. A diagnostic test, for example, may lead to early identification and treatment of a disease, but typically will also cause a number of individuals without disease to be subjected to unnecessary further interventions. NB allows the consideration of such trade-offs by putting benefits and harms on the same scale so that they can be directly compared. Another example is the expected cost (EC) metric25, which can be seen as a generalization of accuracy with many desirable added features but is not well known in the biomedical image analysis communities. For mathematically closely related metrics in the pool, decision guides support the final choice (Supplementary Note 2.7). For example, the intersection over union (IoU) and the DSC are mathematically closely related; the concrete choice typically boils down to a simple user or community preference.
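For orientation, both metrics can be written compactly. Following refs. 24 and 25, with N test cases, decision threshold p_t (the predicted probability at which the expected benefit of intervening equals the expected harm) and c_{ij} denoting the user-defined cost of predicting class j when class i is true:

$$\mathrm{NB}=\frac{\mathrm{TP}}{N}-\frac{\mathrm{FP}}{N}\cdot\frac{p_t}{1-p_t},\qquad \mathrm{EC}=\sum_{i=1}^{C}P(y=i)\sum_{j\neq i}c_{ij}\,P(\hat{y}=j\mid y=i).$$

With unit costs and the empirical class frequencies as priors, EC reduces to 1 − accuracy, which illustrates its role as a generalization of accuracy.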
Figure 2 along with the corresponding subprocesses S1–S9 (Extended Data Figs. 1–9) captures the core contribution of this paper, namely the consensus recommendation of the Metrics Reloaded consortium according to the final Delphi process. For all ten components, the required Delphi consensus threshold (>75% agreement) was met. Disagreement ranged from 0% to 7% for Fig. 2 and S1–S9, and each remaining point of criticism was raised by only a single person. The following paragraphs summarize the four different colored paths through step 2 (metric selection) of the recommendation tree (Fig. 2) for the task of selecting reference-based metrics from the Metrics Reloaded pool of common metrics. More comprehensive textual descriptions can be found in Supplementary Note 2.
Image-level classification
ImLC is conceptually the most straightforward problem category, as the task is simply to assign one of multiple possible labels to an entire image (Supplementary Note 2.2). The validation metrics are designed to measure two key properties: discrimination and calibration.
Discrimination refers to the ability of a classifier to discriminate between two or more classes. It can be assessed with counting metrics that operate on the cardinalities of a fixed confusion matrix (that is, the true/false positives/negatives in the binary classification case). Prominent examples are sensitivity, specificity or F1 score for binary settings and Matthews correlation coefficient (MCC) for multi-class settings. Converting predicted class scores to a fixed confusion matrix (in the binary case by setting a potentially arbitrary cutoff) can, however, be regarded as problematic in the context of performance assessment. Calibration, in turn, refers to the agreement between the predicted class scores and the observed frequencies of the respective outcomes. Based on these considerations, we recommend the following process for ImLC problems (Fig. 2 and Supplementary Note 2.2); a code sketch illustrating the recommended metric families is given after the list:
1. Select multi-class metric (if any): Multi-class metrics have the unique advantage of capturing the performance of an algorithm for all classes in a single value. With the ability to take into account all entries of the multi-class confusion matrix, they provide a holistic measure of performance without the need for customized class-aggregation schemes. We recommend using a multi-class metric if a decision rule applied to the predicted class scores is available (FP2.6). In certain use cases, especially in the presence of ordinal data, there is an unequal severity of class confusions (FP2.5.2), meaning that different costs should be applied to different misclassifications reflected by the confusion matrix. In such cases, we generally recommend EC as a metric. Otherwise, depending on the specific scenario, accuracy, balanced accuracy (BA) and MCC may be viable alternatives. The concrete choice of metric depends primarily on the prevalences (frequencies) of classes in the provided validation set and the target population (FP4.1/2), as detailed in subprocess S2 (Extended Data Fig. 2) and the corresponding textual description in Supplementary Note 2.2.
As multi-class metrics do not allow for class-specific analyses and can thus hide poor performance on individual classes, we recommend an additional validation with per-class counting metrics (optional) and multi-threshold metrics (always recommended).
2. Select per-class counting metric (if any): If a decision rule applied to the predicted class scores is available (FP2.6), a per-class counting metric, such as the Fβ score, should be selected. Each class of interest is assessed separately, preferably in a ‘one-versus-rest’ fashion. The choice depends primarily on the decision rule (FP2.6) and the distribution of classes (FP4.2). Details can be found in subprocess S3 for selecting per-class counting metrics (Extended Data Fig. 3).
3. Select multi-threshold metric (if any): Counting metrics reduce the potentially complex output of a classifier (the continuous class scores) to a single value (the predicted class), such that they can work with a fixed confusion matrix. To compensate for this loss of information and obtain a more comprehensive picture of a classifier’s discriminatory performance, multi-threshold metrics work with a dynamic confusion matrix reflecting a range of possible thresholds applied to the predicted class scores. While we recommend the popular, well-interpretable and prevalence-independent AUROC as the default multi-threshold metric for classification, average precision can be more suitable in the case of high class imbalance because it incorporates predictive values, as detailed in subprocess S4 for selecting multi-threshold metrics (Extended Data Fig. 4).
4. Select calibration metric (if any): If calibration assessment is requested (FP2.7), one or multiple calibration metrics should be added to the metric pool, as detailed in subprocess S5 for selecting calibration metrics (Extended Data Fig. 5).
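As an illustration of how these metric families complement each other, consider the following minimal sketch using scikit-learn. The variable names, the toy data and the 0.5 cutoff are our assumptions for a binary example, not part of the framework itself.

```python
# Sketch of the recommended ImLC metric families for a binary toy problem.
import numpy as np
from sklearn.metrics import (matthews_corrcoef, balanced_accuracy_score,
                             fbeta_score, roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 1, 1, 1, 0])                 # reference labels
y_score = np.array([0.1, 0.4, 0.8, 0.3, 0.9, 0.2])    # predicted class scores
y_pred = (y_score >= 0.5).astype(int)                 # decision rule (FP2.6)

# 1. Multi-class metrics: holistic performance in a single value
print(balanced_accuracy_score(y_true, y_pred))        # BA
print(matthews_corrcoef(y_true, y_pred))              # MCC

# 2. Per-class counting metric (one-versus-rest)
print(fbeta_score(y_true, y_pred, beta=1, pos_label=1))  # F-beta score

# 3. Multi-threshold metrics: independent of a single fixed cutoff
print(roc_auc_score(y_true, y_score))                 # AUROC
print(average_precision_score(y_true, y_score))       # average precision
```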
Semantic segmentation
In SemS, classification occurs at pixel level. However, it is not advisable to simply apply the standard classification metrics to the entire collection of pixels in a dataset, for two reasons. Firstly, pixels of the same image are highly correlated. Hence, to respect the hierarchical data structure, metric values should first be computed per image and then be aggregated over the set of images. Note in this context that the commonly used DSC is mathematically identical to the popular F1 score applied at pixel level (a worked identity is given after the list below). Secondly, in segmentation problems, the user typically has an inherent interest in the boundaries, centers or volumes of structures (FP2.1, FP2.2 and FP2.3). The family of boundary-based metrics (a subset of distance-based metrics) therefore requires the extraction of structure boundaries from the binary segmentation masks as a foundation for segmentation assessment. Based on these considerations and given all the complementary strengths and weaknesses of common segmentation metrics27, we recommend the following process for segmentation problems (Fig. 2 and Supplementary Note 2.3):
1. Select overlap-based metric (if any): In segmentation problems, counting metrics such as the DSC or IoU measure the overlap between the reference annotation and the algorithm prediction. As they can be considered the de facto standard for assessing segmentation quality and are well interpretable, we recommend using them by default unless the target structures are consistently small relative to the grid size (FP3.1) and the reference may be noisy (FP4.3.1). Depending on the specific properties of the problems, we recommend the DSC or IoU (default recommendation), the Fβ score (preferred when there is a preference for either false positive (FP) or false negative (FN)) or the centerline Dice similarity coefficient (clDice; for tubular structures). Details can be found in subprocess S6 for selecting overlap-based metrics (Extended Data Fig. 6).
2. Select boundary-based metric (if any): Key weaknesses of overlap-based metrics include shape unawareness and limitations when dealing with small structures or high size variability27. Our general recommendation is therefore to complement an overlap-based metric with a boundary-based metric. If annotation imprecisions should be compensated for (FP2.5.7), our default recommendation is the normalized surface distance (NSD). Otherwise, the fundamental user preference guiding metric selection is whether errors should be penalized by existence or distance (FP2.5.6), as detailed in subprocess S7 for selecting boundary-based metrics (Extended Data Fig. 7).
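For completeness, the pixel-level identity of DSC and F1 score noted above follows directly from the confusion-matrix view: with reference mask A and predicted mask B, |A ∩ B| = TP, |A| = TP + FN and |B| = TP + FP, so that

$$\mathrm{DSC}(A,B)=\frac{2\,|A\cap B|}{|A|+|B|}=\frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}=F_1.$$

Like the F1 score, the DSC thus ignores TN, that is, correctly predicted background pixels.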
Object detection
ObD problems differ from segmentation problems in several key features with respect to metric selection. Firstly, they involve distinguishing different instances of the same class and thus require locating objects and assigning them to the corresponding reference objects. Secondly, the granularity of localization is comparatively rough, which is why no boundary-based metrics are required (otherwise the problem would be phrased as an InS problem). Finally, and crucially important from a mathematical perspective, the absence of true negatives (TNs) in ObD problems renders many popular classification metrics (for example, accuracy, specificity and AUROC) invalid. In binary problems, for example, suitable counting metrics can only be based on three of the four entries of the confusion matrix. Based on these considerations and taking into account all the complementary strengths and weaknesses of existing metrics27, we propose the following steps for ObD problems (Fig. 2 and Supplementary Note 2.4); a matching-and-counting sketch follows the list:
1. Select localization criterion: An essential part of the validation is to decide whether a prediction matches a reference object. To this end, (1) the location of both the reference objects and the predicted objects must be adequately represented (for example, by masks, bounding boxes or center points), and (2) a metric for deciding on a match (for example, mask IoU) must be chosen. As detailed in subprocess S8 for selecting the localization criterion (Extended Data Fig. 8), our recommendation considers both the granularity of the provided reference (FP4.4) and the required granularity of the localization (FP2.4).
2. Select assignment strategy: As the localization does not necessarily lead to unambiguous matchings, an assignment strategy needs to be chosen to resolve potential ambiguities that occurred during localization. As detailed in subprocess S9 for selecting the assignment strategy (Extended Data Fig. 9), the recommended strategy depends on the availability of continuous class scores (FP5.1) as well as on whether double assignments should be punished (FP2.5.8).
3. Select classification metric(s) (if any): Once objects have been located and assigned to reference objects, generation of a confusion matrix (without TN) is possible. The final step therefore simply comprises choosing suitable classification metrics for validation. Several subfields of biomedical image analysis have converged on choosing solely a counting metric, such as the Fβ score, as the primary metric in ObD problems. We follow this recommendation when no continuous class scores are available for the detected objects (FP5.1). Otherwise, we disagree with the practice of basing performance assessment solely on a single, potentially suboptimal cutoff on the continuous class scores. Instead, we follow the recommendations for ImLC and propose complementing a counting metric (subprocess S3; Extended Data Fig. 3) with a multi-threshold metric (subprocess S4; Extended Data Fig. 4) to obtain a more holistic picture of performance. As the multi-threshold metric, we recommend average precision or the free-response receiver operating characteristic (FROC) score, depending on whether an easy interpretation (FROC score) or a standardized metric (average precision) is preferred. The choice of per-class counting metric depends primarily on the decision rule (FP2.6).
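To make the three steps concrete, the following minimal sketch localizes predictions via bounding-box IoU, resolves assignment ambiguities greedily by predicted score and derives a counting metric that requires no TN. It is our simplified illustration, assuming axis-aligned bounding boxes, and not the consortium's reference implementation.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def match(preds, refs, scores, thresh=0.5):
    """Greedily assign predictions (highest score first) to unmatched
    reference objects; returns (TP, FP, FN) counts."""
    order = sorted(range(len(preds)), key=lambda i: -scores[i])
    matched, tp = set(), 0
    for i in order:
        best = max((j for j in range(len(refs)) if j not in matched),
                   key=lambda j: iou(preds[i], refs[j]), default=None)
        if best is not None and iou(preds[i], refs[best]) >= thresh:
            matched.add(best)
            tp += 1
    return tp, len(preds) - tp, len(refs) - tp

tp, fp, fn = match(preds=[(0, 0, 10, 10), (20, 20, 30, 30)],
                   refs=[(1, 1, 11, 11)], scores=[0.9, 0.4])
f1 = 2 * tp / (2 * tp + fp + fn)  # counting metric without TN
```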
Note that the previous description implicitly assumed single-class problems, but generalization to multi-class problems is straightforward by applying the validation for each class. It is further worth mentioning that metric application is not trivial in ObD problems as the number of objects in an image may be extremely small, or even zero, compared to the number of pixels in an image. Special considerations with respect to aggregation must therefore be made (Supplementary Note 2.4).
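As a minimal illustration of this point (our example, not a prescribed aggregation scheme), pooling the confusion-matrix counts over the dataset avoids undefined per-image values when an image contains no reference objects and no predictions:

```python
# (TP, FP, FN) counts per image; the second image contains no objects at all,
# so a per-image F-beta score would be undefined (0/0) for it.
per_image_counts = [(3, 1, 0), (0, 0, 0), (2, 0, 1)]
tp, fp, fn = (sum(counts) for counts in zip(*per_image_counts))
f1_pooled = 2 * tp / (2 * tp + fp + fn)  # 10/12, approximately 0.83
```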
Instance segmentation
InS combines the tasks of ObD and SemS. Thus, the pitfalls and recommendations for InS problems are closely related to those for segmentation and ObD27. This is directly reflected in our metric selection process (Fig. 2 and Supplementary Note 2.5):
1. Select ObD metric(s): To overcome problems related to instance unawareness (Fig. 1a), we recommend selecting a set of detection metrics to explicitly measure detection performance. To this end, we recommend almost the same process as for ObD, with two exceptions. Firstly, given the fine granularity of both the output and the reference annotation, our recommendation for the localization strategy differs, as detailed in subprocess S8 (Extended Data Fig. 8). Secondly, as depicted in S3 (Extended Data Fig. 3), we recommend panoptic quality29 as an alternative to the Fβ score. This metric is especially suited for InS, as it combines the assessment of overall detection performance and segmentation quality of successfully matched (true positive (TP)) instances in a single score (see the formula after this list).
2. Select segmentation metric(s) (if any): In a second step, metrics to explicitly assess the segmentation quality of the TP instances may be selected. Here, we follow the exact same process as in SemS (subprocesses S6 and S7; Extended Data Figs. 6 and 7). The primary difference is that the segmentation metrics are applied per instance.
Importantly, the development process of the Metrics Reloaded framework was designed such that the pitfalls identified in the sister publication of this work19 are comprehensively addressed. Table 1 makes the recommendations and design decisions corresponding to specific pitfalls explicit.
Once common reference-based metrics have been selected and, where necessary, complemented by application-specific metrics, the user proceeds with the application of the metrics to the given problem.
Step 3 – Metric application. Although the application of a metric to a given dataset may appear straightforward, numerous pitfalls can occur19. We therefore complement the metric selection with dedicated recommendations on metric application (Extended Data Table 1 and Supplementary Methods). Importantly, Metrics Reloaded comprehensively addresses all pitfalls related to metric selection (Table 1) and application (Extended Data Table 1) that were identified in this work’s sister publication19.
Metrics Reloaded is the result of a 2.5-year long process involving numerous workshops, surveys and expert group meetings. Many controversial debates were conducted during this time. Even deciding on the exact scope of the paper was anything but trivial. Our consortium eventually agreed on focusing on biomedical classification problems with categorical reference data and thus exploiting synergies across classification scales. Generating and handling fuzzy reference data (for example, from multiple observers) is a topic of its own31,32 and was decided to be out of scope for this work. Furthermore, the inclusion of calibration metrics in addition to discrimination metrics was originally not intended because calibration is a complex topic, and the corresponding field is relatively young and currently highly dynamic. This decision was reversed due to high demand from the community, expressed through crowdsourced feedback on the framework.
Extensive discussions also evolved around the inclusion criteria for metrics, considering the trade-off between established (potentially flawed) and new (not yet stress-tested) metrics. Our strategy for arriving at the Metrics Reloaded recommendations balanced this trade-off by using common metrics as a starting point and making adaptations where needed. For example, weighted Cohen’s kappa, originally designed for assessing inter-rater agreement, is the state-of-the-art metric used in the medical imaging community when handling ordinal data. Unlike other common multi-class metrics, such as (balanced) accuracy or MCC, it allows the user to specify different costs for different class confusions, thereby addressing the ordinal rating. However, our consortium deemed the (not widely known) metric EC generally more appropriate due to its favorable mathematical properties. Importantly, our framework does not intend to impose recommendations or act as a ‘black box’; instead, it enables users to make educated decisions while considering ambiguities and trade-offs that may occur. This is reflected by our use of decision guides (Supplementary Note 2.7), which actively involve users in the decision-making process (for the example above, for instance, see DG2.1).
An important further challenge that our consortium faced was how best to provide recommendations when multiple questions are asked of a single given dataset. For example, a clinician’s ultimate interest may lie in assessing whether tumor progress has occurred in a patient. While this would be phrased as an ImLC task (given two images as input), an interesting surrogate task would be a segmentation task that assesses the quality of tumor delineation and provides explainability for the results. Metrics Reloaded addresses the general challenge of multiple different driving biomedical questions corresponding to one dataset pragmatically by generating a recommendation separately for each question. The same holds true for multi-label problems, for example, when multiple different types of abnormalities potentially co-occur in the same image/patient.
Another key challenge we faced was the validation of our framework, owing to the lack of ground truth ‘best metrics’ for a given use case. Our solution builds upon three pillars. Firstly, we adopted established consensus-building approaches utilized for developing widely used guidelines such as CONSORT21, TRIPOD22 or STARD23. Secondly, we challenged our initial recommendation framework by acquiring feedback via a social media campaign. Finally, we instantiated the final framework for a range of different biological and medical use cases. Our approach showcases the benefit of crowdsourcing as a means of expanding the horizon beyond the knowledge peculiar to specific scientific communities. The most prominent change effected in response to the social media feedback was the inclusion of the aforementioned EC, a powerful metric from the speech recognition community. Furthermore, upon popular demand, we added recommendations on assessing the calibration of model outputs, now captured by subprocess S5 (Extended Data Fig. 5).
After many highly controversial debates, the consortium ultimately converged on a consensus recommendation, as indicated by the high agreement in the final Delphi process (median agreement with the subprocesses: 93%). While some subprocesses (S1, S7 and S8) were unanimously agreed on without a single negative vote, several issues were raised by individual researchers. While most of them were minor (for example, concerning wording), a major debate revolved around calibration metrics. Some members, for example, questioned the value of stand-alone calibration metrics altogether. This view is rooted in the widespread misconception that the predicted class scores of a well-calibrated model express the true posterior probability of an input belonging to a certain class33—for example, a patient’s risk for a certain condition based on an image. As this is not the case, several researchers argued for basing calibration assessment solely on proper scoring rules (such as the Brier score), which assess the quality of the posteriors better than the stand-alone calibration metrics. We have addressed all these considerations in our recommendation framework, including a detailed rationale for our recommendations (Supplementary Note 2.6).
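For reference, the Brier score for a binary problem with predicted class scores s_n and reference labels y_n ∈ {0, 1} is the mean squared difference $\mathrm{BS}=\frac{1}{N}\sum_{n=1}^{N}(s_n-y_n)^2$. As a strictly proper scoring rule, its expectation is minimized only by the true posterior probabilities, which is what makes it attractive for the holistic assessment discussed above.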
While we believe our framework covers the vast majority of biomedical image analysis use cases, suggesting a comprehensive set of metrics for every possible biomedical problem is beyond its scope. The focus of our framework lies in correcting poor practices related to the selection of common metrics. In some use cases, however, common reference-based metrics may—as a matter of principle—be unsuitable, and application-specific metrics may be required instead. A prominent example is InS problems in which the matching of reference and predicted instances is infeasible, causing overlap-based localization criteria to fail. Metrics such as the Rand index34 and variation of information35 address this issue by avoiding one-to-one correspondence between predicted and reference instances. To make our framework applicable to such specific use cases, we integrated the step of choosing application-specific metrics in the main workflow (Fig. 2). Examples of such application-specific metrics can be found in related work36,37.
Metrics Reloaded primarily provides guidance for the selection of metrics that measure some notion of the ‘correctness’ of an algorithm’s predictions on a set of test cases. It should be noted that holistic algorithm performance assessment also includes other aspects. One of them is robustness. For example, the accuracy of an algorithm for detecting disease in medical scans should ideally be the same across different hospitals that may use different acquisition protocols or scanners from different manufacturers. Recent work, however, shows that even the exact same models with nearly identical test set performance in terms of predictive accuracy may behave very differently on data from different distributions38.
Reliability is another important algorithmic property to be taken into account during validation. A reliable algorithm should have the ability to communicate its confidence and raise a flag when the uncertainty is high and the prediction should be discarded39. For calibrated models, this can be achieved via the predicted class scores, although other methods based on dedicated model outputs trained to express the confidence or on density estimation techniques are similarly popular. Importantly, an algorithm with reliable uncertainty estimates or increased robustness to distribution shift might not always be the best performing in terms of predictive performance40. For safe use of classification systems in practice, careful balancing of the trade-off between robustness and reliability on the one hand and accuracy on the other might be necessary.
So far, Metrics Reloaded focuses on common reference-based methods that compare model outputs to corresponding reference annotations. We made this design choice due to our hypothesis that reference-based metrics can be chosen in a modality-agnostic and application-agnostic manner using the concept of problem fingerprinting. As indicated by the step of choosing potential non-reference-based metrics (Fig. 2), however, it should be noted that validation and evaluation of algorithms should go far beyond purely technical performance41,42. In this context, Jannin introduced the global concept of ‘responsible research’ to encompass all possible high-level assessment aspects of a digital technology43, including environmental, ethical, economic, social and societal aspects. For example, there are increasing efforts specifically devoted to the estimation of energy consumption and greenhouse gas emission of ML algorithms44,45,46. For these considerations, we refer the reader to available tools such as the Green Algorithms calculator47 or Carbontracker48.
It must further be noted that while Metrics Reloaded places a focus on the selection of metrics, adequate application is also important. Detailed failure case analysis49 and performance assessment on relevant subgroups, for example, have been highlighted as critical components for better understanding when and where an algorithm may fail50,51. Given that learning-based algorithms rely on the availability of historical datasets for training, there is a real risk that any existing biases in the data may be picked up and replicated or even exacerbated when an algorithm makes predictions52,53. This is of particular concern in the context of systemic biases in healthcare, such as the scarcity of representative data from underserved populations and often higher error rates in diagnostic labels in particular subgroups54,55. Relevant meta information such as patient demographics, including biological sex and ethnicity, needs to be accessible for the test sets such that potentially disparate performance across subgroups can be detected56. Here, it is important to make use of adequate aggregations over the validation metrics as disparities in minority groups might otherwise be missed.
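As a minimal illustration (the column names and values are hypothetical), per-subgroup aggregation of a per-image metric can reveal disparities that a global average hides:

```python
import pandas as pd

# Per-image DSC values with (hypothetical) subgroup annotations.
results = pd.DataFrame({
    "dsc": [0.91, 0.88, 0.62, 0.93, 0.65],
    "sex": ["f", "m", "f", "m", "f"],
    "ethnicity": ["A", "A", "B", "A", "B"],
})

print(results["dsc"].mean())                                # 0.798: looks fine overall
print(results.groupby(["sex", "ethnicity"])["dsc"].mean())  # reveals weak subgroup B
```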
Finally, it must be noted that our framework addresses metric choice in the context of technical validation of biomedical algorithms. For translation of an algorithm into, for example, clinical routine, this validation may be followed by a (clinical) validation step assessing its performance compared to conventional, non-algorithm-based care according to patient-related outcome measures, such as overall survival57.
A key remaining challenge for Metrics Reloaded is its dissemination such that it substantially contributes to raising the quality of biomedical imaging research. To encourage widespread adherence to new standards, entry barriers should be as low as possible. While the framework with its vast number of subprocesses may seem very complex at first, it is important to note that from a user perspective only a fraction of the framework is relevant for a given task, making the framework more tangible. This is notably illustrated by the Metrics Reloaded online tool, which substantially simplifies the metric selection procedure. As is common in scientific guideline and recommendation development, we intend to regularly update our framework to reflect current developments in the field, such as the inclusion of new metrics or biomedical use cases. This is intended to include an expansion of the framework’s scope to further problem categories, such as regression and reconstruction. To accommodate future developments in a fast and efficient manner, we envision our consortium building consensus through accelerated Delphi rounds organized by the Metrics Reloaded core team. Once consensus is obtained, changes will be implemented in both the framework and online tool and highlighted so that users can easily identify changes to the previous version, which will ensure full transparency and comparability of results. In this way, we envision the Metrics Reloaded framework and online tool as a dynamic resource reliably reflecting the current state of the art at any given time point in the future, for years to come18.
Of note, while the provided recommendations originate from the biomedical image analysis community, many aspects generalize to imaging research as a whole. Particularly, the recommendations derived for individual fingerprints (for example, implications of class imbalance) hold across domains, although it is possible that for different domains the existing fingerprints would need to be complemented by further features that this community is not aware of.
In conclusion, the Metrics Reloaded framework provides biomedical image analysis researchers with systematic guidance on choosing validation metrics across different imaging tasks in a problem-aware manner. Through its reliance on methodology that can be generalized, we envision the Metrics Reloaded framework to spark a scientific debate and hopefully lead to similar efforts being undertaken in other areas of imaging research, thereby raising research quality on a much larger scale than originally anticipated. In this context, our framework and the process by which it was developed could serve as a blueprint for broader efforts aimed at providing reliable recommendations and enforcing adherence to good practices in imaging research.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
No data were used in this study.
Code availability
We provide reference implementations for all Metrics Reloaded metrics within the MONAI open-source framework. They are accessible at https://github.com/Project-MONAI/MetricsReloaded/.
References
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Shah, N. H., Milstein, A. & Bagley, S. C. Making machine learning models clinically useful. JAMA 322, 1351–1352 (2019).
Correia, P. & Pereira, F. Video object relevance metrics for overall segmentation quality evaluation. EURASIP J. Adv. Signal Process. 2006, 082195 (2006).
Gooding, M. J. et al. Comparative evaluation of autocontouring in clinical practice: a practical method using the turing test. Med. Phys. 45, 5105–5115 (2018).
Honauer, K., Maier-Hein, L. & Kondermann, D. The HCI stereo metrics: geometry-aware performance analysis of stereo algorithms. In Proceedings of the IEEE International Conference on Computer Vision 2120–2128 (2015).
Kofler, F. et al. Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient. Preprint at arXiv https://doi.org/10.48550/arXiv.2103.06205 (2021).
Konukoglu, E., Glocker, B., Ye, D. H., Criminisi, A. & Pohl, K. M. Discriminative segmentation-based evaluation through shape dissimilarity. IEEE Trans. Med. Imaging 31, 2278–2289 (2012).
Maier-Hein, L. et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat. Commun. 9, 5217 (2018). With this comprehensive analysis of biomedical image analysis competitions (challenges), the authors initiated a shift in how such challenges are designed, performed and reported in the biomedical domain. Its concepts and guidelines have been adopted by reputed organizations such as MICCAI.
Margolin, R., Zelnik-Manor, L. & Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 248–255 (2014).
Tran, T. N. et al. Sources of performance variability in deep learning-based polyp detection. Int. J. Comput. Assist. Radiol. Surg. 18, 1311–1322 (2023).
Vaassen, F. et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys. Imaging Radiat. Oncol. 13, 1–6 (2020).
Chenouard, N. et al. Objective comparison of particle tracking methods. Nat. Methods 11, 281–289 (2014).
Sage, D. et al. Quantitative evaluation of software packages for single-molecule localization microscopy. Nat. Methods 12, 717–724 (2015).
Ulman, V. et al. An objective comparison of cell-tracking algorithms. Nat. Methods 14, 1141–1152 (2017).
Carass, A. et al. Evaluating white matter lesion segmentations with refined Sørensen-Dice analysis. Sci. Rep. 10, 8242 (2020).
Jäger, P. F. Challenges and opportunities of end-to-end learning in medical image classification. PhD thesis, Karlsruher Institut für Technologie (2020).
Brown, B. B. Delphi Process: A Methodology Used for the Elicitation of Opinions of Experts. Technical report (The RAND Corporation, 1968).
Nasa, P., Jain, R. & Juneja, D. Delphi methodology in healthcare research: how to decide its appropriateness. World J. Methodol. 11, 116–129 (2021).
Reinke, A. et al. Understanding metric-related pitfalls in image analysis validation. Nat. Methods https://doi.org/10.1038/s41592-023-02150-0 (2023). Sister publication jointly submitted with this work.
Reinke, A. et al. How to exploit weaknesses in biomedical challenge design and organization. In International Conference on Medical Image Computing and Computer-Assisted Intervention (eds. A. F. Frangi et al.) 388–395 (Springer, 2018).
Schulz, K. F., Altman, D. G., Moher, D. & CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann. Intern. Med. 152, 726–732 (2010).
Moons, K. G. M. et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann. Intern. Med. 162, W1–W73 (2015).
Bossuyt, P. M. et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Ann. Intern. Med. 138, 40–44 (2003).
Vickers, A. J., Van Calster, B. & Steyerberg, E. W. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352, i6 (2016).
van Leeuwen, D. A. & Brümmer, N. An introduction to application-independent evaluation of speaker recognition systems. In Speaker Classification I (ed. Müller, C.) 330–353 (Springer, 2007).
Ferrer, L. Analysis and comparison of classification metrics. Preprint at arXiv https://doi.org/10.48550/arXiv.2209.05355 (2022). The document discusses common performance metrics used in machine learning classification and introduces the EC metric. It compares these metrics and argues that EC is superior due to its generality, simplicity and intuitive nature. Additionally, it highlights the potential of EC in measuring calibration and optimal decision-making using class posteriors.
Reinke, A. et al. Common limitations of image processing metrics: a picture story. Preprint at arXiv https://doi.org/10.48550/arXiv.2104.05642 (2021).
Gruber, S. & Buettner, F. Better uncertainty calibration via proper scores for classification and beyond. Adv. Neural Inform. Process Syst. 35, 8618–8632 (2022).
Kirillov, A., He, K., Girshick, R., Rother, C. & Dollár, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 9404–9413 (2019).
Wiesenfarth, M. et al. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci. Rep. 11, 2369 (2021).
Liu, X. et al. Baseline photos and confident annotation improve automated detection of cutaneous graft-versus-host disease. Clin. Hematol. Int. 3, 108–115 (2021).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15, 29 (2015). The paper discusses the importance of effective metrics for evaluating the accuracy of 3D medical image segmentation algorithms. The authors analyze existing metrics, propose a selection methodology, and develop a tool to aid researchers in choosing appropriate evaluation metrics based on the specific characteristics of the segmentation task.
Perez-Lebel, A., Le Morvan, M. & Varoquaux, G. Beyond calibration: estimating the grouping loss of modern neural networks. Preprint at arXiv https://doi.org/10.48550/arXiv.2210.16315 (2023).
Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
Meilă, M. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines 173–187 (Springer, 2003).
Côté, M. A. et al. Tractometer: towards validation of tractography pipelines. Med. Image Anal. https://doi.org/10.1016/j.media.2013.03.009 (2013).
Ellis, D. G., Alvarez, C. M. & Aizenberg, M. R. Qualitative criteria for feasible cranial implant designs. In Cranial Implant Design Challenge 8–18 (Springer, 2021).
D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 10237–10297 (2022).
Schulam, P. & Saria, S. Can you trust this prediction? Auditing pointwise reliability after learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (eds. Chaudhuri, K. & Sugiyama, M.) Vol. 89, 1022–1031 (PMLR, 2019).
Jaeger, P. F., Lüth, C. T., Klein, L. & Bungert, T. J. A call to reflect on evaluation practices for failure detection in image classification. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.15259 (2023).
Université de Montréal. The Declaration - Montreal Responsible AI. https://declarationmontreal-iaresponsable.com/ (2017).
The Institute for Ethical AI and Machine Learning. https://ethical.institute/principles.html (2018); accessed 21 May 2022.
Jannin, P. Towards responsible research in digital technology for health care. Preprint at arXiv https://doi.org/10.48550/arXiv.2110.09255 (2021).
Lacoste, A., Luccioni, A., Schmidt, V. & Dandres, T. Quantifying the carbon emissions of machine learning. Preprint at arXiv https://arxiv.org/abs/1910.09700 (2019).
Patterson, D. et al. Carbon emissions and large neural network training. Preprint at arXiv https://doi.org/10.48550/arXiv.2104.10350 (2021).
Strubell, E., Ganesh, A. & McCallum, A. Energy and policy considerations for deep learning in NLP. Preprint at arXiv https://doi.org/10.48550/arXiv.1906.02243 (2019).
Lannelongue, L., Grealey, J. & Inouye, M. Green algorithms: quantifying the carbon footprint of computation. Adv. Sci. 8, 2100707 (2021).
Anthony, L. F. W., Kanding, B. & Selvan, R. Carbontracker: tracking and predicting the carbon footprint of training deep learning models. Preprint at arXiv https://doi.org/10.48550/arXiv.2007.03051 (2020).
Roß, T. et al. Beyond rankings: learning (more) from algorithm validation. Med. Image Anal. 86, 102765 (2023).
Char, D. S., Shah, N. H. & Magnus, D. Implementing machine learning in health care - addressing ethical challenges. N. Engl. J. Med. 378, 981–983 (2018).
Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. Proc. ACM Conf. Health Inference Learn. 2020, 151–159 (2020).
Adamson, A. S. & Smith, A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 154, 1247–1248 (2018).
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
Ibrahim, H., Liu, X., Zariffa, N., Morris, A. D. & Denniston, A. K. Health data poverty: an assailable barrier to equitable digital health care. Lancet Digit. Health 3, e260–e265 (2021).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
McCradden, M. D. et al. A research ethics framework for the clinical translation of healthcare machine learning. Am. J. Bioeth. 22, 8–22 (2022).
Park, S. H. et al. Methods for clinical evaluation of artificial intelligence algorithms for medical diagnosis. Radiology https://doi.org/10.1148/radiol.220182 (2023).
Usatine, R. & Manci, R. Dermoscopedia https://dermoscopedia.org/File:DF_chinese_dms.JPG (2021).
Armato, S. G. III et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38, 915–931 (2011).
Ljosa, V., Sokolnicki, K. L. & Carpenter, A. E. Annotated high-throughput microscopy image sets for validation. Nat. Methods 9, 637 (2012).
Maier-Hein, L. et al. Heidelberg colorectal data set for surgical data science in the sensor operating room. Sci. Data 8, 101 (2021).
Haugen, T. B. et al. VISEM: a multimodal video dataset of human spermatozoa. In Proceedings of the 10th ACM Multimedia Systems Conference 261–266 (2019).
Codella, N. et al. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). Preprint at arXiv https://doi.org/10.48550/arXiv.1902.03368 (2019).
Targosz, A., Przystałka, P., Wiaderkiewicz, R. & Mrugacz, G. Semantic segmentation of human oocyte images using deep neural networks. Biomed. Eng. Online 20, 40 (2021).
Antonelli, M. et al. The medical segmentation decathlon. Nat. Commun. 13, 4128 (2022).
Simpson, A. L. et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. Preprint at arXiv https://doi.org/10.48550/arXiv.1902.09063 (2019).
Nagao, Y., Sakamoto, M., Chinen, T., Okada, Y. & Takao, D. Robust classification of cell cycle phase and biological feature extraction by image-based deep learning. Mol. Biol. Cell 31, 1346–1354 (2020).
Zhang, Y. et al. DeepPhagy: a deep learning framework for quantitatively measuring autophagy activity in Saccharomyces cerevisiae. Autophagy 16, 626–640 (2020).
Commowick, O. et al. Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Sci. Rep. 8, 13650 (2018).
Kofler, F. et al. blob loss: instance imbalance aware loss functions for semantic segmentation. In International Conference on Information Processing in Medical Imaging 755–767 (Springer Nature Switzerland, 2023).
Mais, L., Hirsch, P. & Kainmueller, D. PatchPerPix for instance segmentation. In European Conference on Computer Vision 288–304 (Springer, 2020).
Meissner, G. et al. A searchable image resource of Drosophila GAL4-driver expression patterns with single neuron resolution. eLife 12, e80660 (2023).
Tirian, L. & Dickson, B. J. The VT GAL4, LexA and split-GAL4 driver line collections for targeted expression in the Drosophila nervous system. Preprint at bioRxiv https://doi.org/10.1101/198648 (2017).
Brümmer, N. & Du Preez, J. Application-independent evaluation of speaker detection. Comput. Speech Lang. 20, 230–275 (2006).
Acknowledgements
This work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Incubator (HI), the MICCAI Special Interest Group on biomedical image analysis challenges and the benchmarking working group of the MONAI initiative. It received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 101002198, NEURAL SPICING). It was further supported in part by the Intramural Research Program of the National Institutes of Health (NIH) Clinical Center as well as by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the NIH, under award numbers NCI:U01CA242871, NCI:U24CA279629 and NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH. T.A. acknowledges the Canada Institute for Advanced Research (CIFAR) AI Chairs program and the Natural Sciences and Engineering Research Council of Canada. F.B. was co-funded by the European Union (ERC, TAIPO, 101088594). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the ERC. Neither the European Union nor the granting authority can be held responsible for them. V.C. acknowledges funding from the Novo Nordisk Foundation (NNF21OC0068816) and Independent Research Council Denmark (1134-00017B). B.A.C. was supported by NIH grant P41 GM135019 and grant 2020-225720 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. G.S.C. was supported by Cancer Research UK (program grant no. C49297/A27294). M.M.H. is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-05134). A. Karargyris is supported by French State Funds managed by the ‘Agence Nationale de la Recherche (ANR)’ - ‘Investissements d’Avenir’ (Investments for the Future), grant ANR-10-IAHU-02 (IHU Strasbourg). M.K. was supported by the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018129). T.K. was supported in part by 4UH3-CA225021-03, 1U24CA180924-01A1, 3U24CA215109-02 and 1UG3-CA225-021-01 grants from the NIH. G.L. receives research funding from the Dutch Research Council, the Dutch Cancer Association, HealthHolland, the ERC, the European Union and the Innovative Medicine Initiative. C.H.S. is supported by an Alzheimer’s Society Junior Fellowship (AS-JF-17-011). M.R. is supported by Innosuisse (grant no. 31274.1) and the Swiss National Science Foundation (grant no. 205320_212939). R.M.S. is supported by the Intramural Research Program of the NIH Clinical Center. A.T. acknowledges support from the Academy of Finland (Profi6 336449 funding program), University of Oulu strategic funding, the Finnish Foundation for Cardiovascular Research, Wellbeing Services County of North Ostrobothnia (VTR project K62716) and the Terttu foundation. S.A.T. acknowledges the support of Canon Medical and the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme (grant RCSRF1819\8\25). We thank N. Sautter, P. Vieten and T. Adler for proposing the name for the project. We thank P. Bankhead, F. Hamprecht, H. Kenngott, D. Moher and B. Stieltjes for fruitful discussions on the framework. We thank S. Steger for the data protection supervision and A. Trotter for the hosting of the surveys. We thank L. Mais for instantiating the use case for InS of neurons from the fruit fly in 3D multicolor light microscopy images. We further thank the Janelia FlyLight Project Team for providing us with example images for this use case. We thank the following people for testing the metric mappings, reviewing the recommendations and performing metric-centric testing: T. Adler, C. Bender, A. B. Qasim, K. Dreher, N. Holzwarth, M. Hübner, D. Michael, L.-R. Müller, M. Rees, T. Rix, M. Schellenberg, S. Seidlitz, J. Sellner, A. Srivastava, F. Wolf, A. E. Yamlahi, S. D. Almeida, M. Baumgartner, D. Bounias, T. Bungert, M. Fischer, L. Klein, G. Köhler, B. Kovács, C. Lueth, T. Norajitra, C. Ulrich, T. Wald, I. Alekseenko, X. Liu, A. Marheim Storås and V. Thambawita. We thank the following people for taking our social media community survey and providing helpful feedback for improving the framework: Y. Akemi, R. Anteby, C. Arthurs, P. De Backer, H. Badgery, M. Baugh, J. Bernal, D. Bounias, F. C. Kitamura, J. Carse, C. Chen, I. Flipse, N. Gaggion, C. González, P. M. Gordaliza, T. Horeman, L. Joskowicz, A. Jose, A. Kamath, B. Kelly, Y. Kirchhoff, L. A. Kobelke, L. Krämer, M. Krendel, J. LaMaster, T. de Lange, J. L. Lavanchy, J. Li, C. Lüth, L. Mais, A. Marheim Storås, V. Nath, C. Scannell, C. Pape, M. P. Schijven, A. Selvanetti, B. S. Fadida, R. Staff, J. Tan, E. Tkaczyk, R. T. Calumby, A. Vlontzos, W. Zhang, C. Zhao and J. Zhu.
Author information
Contributions
L.M.-H. initiated and led the study, was a member of the Delphi core team, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops, tested the online toolkit, and organized the social media campaign. A.R. initiated and led the study, was a member of the Delphi core team, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops, tested the metric map**s and the online toolkit, organized the social media campaign, and designed all figures. P.F.J. initiated and led the study, was a member of the Delphi core team, led the ObD and InS expert group, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops, tested the metric map**s and the online toolkit, organized the social media campaign, and participated in surveys. P.G. led the ImLC expert group, was a member of the extended Delphi core team, wrote and reviewed the manuscript, prepared the BPMN diagrams, tested the online toolkit, and participated in surveys and workshops. M.D.T. was a member of the extended Delphi core team and wrote and reviewed the manuscript. F.B. led the calibration expert group, reviewed the manuscript, and participated in surveys. E.C. led the cross-topic expert group, was a member of the extended Delphi core team, and reviewed the manuscript. B.G. led the cross-topic expert group and was an active member of the SemS expert group, reviewed the manuscript, and participated in surveys and workshops. F.I. led the SemS expert group, reviewed the manuscript, tested the online toolkit, and participated in surveys and workshops. J.K. led the biomedical expert group, reviewed the manuscript, and participated in surveys and workshops. M.K. led the ObD and InS expert group, reviewed the manuscript, and participated in surveys and workshops. M.R. led the SemS expert group, reviewed the manuscript, and participated in surveys and workshops. M.A.R. led the ImLC expert group, reviewed the manuscript, tested the metric map**s, and participated in surveys and workshops. M.W. co-led the cross-topic expert group. A.E.K. implemented the online toolkit and was a member of the extended Delphi core team. C.H.S. implemented the reference implementations of all metrics in Python, was an active member of the ObD and InS expert group, reviewed the manuscript, and participated in surveys workshops. M.B. was a member of the extended Delphi core team, was an active member of the ObD and InS expert group, wrote and reviewed the manuscript, tested the metric map**s and the online toolkit, and participated in surveys and workshops. M.E. was a member of the extended Delphi core team, prepared the BPMN diagrams, reviewed the document, assisted in survey preparation, tested the metric map**s and the online toolkit, and participated in surveys. D.H.-N. was a member of the extended Delphi core team and prepared all surveys. T.R. was a member of the extended Delphi core team, was an active member of the ObD and InS expert group, wrote and reviewed the document, assisted in survey preparation, tested the metric map**s and the online toolkit, and participated in surveys and workshops. L.A. reviewed the manuscript and participated in surveys and workshops. M.A. was an active member of the SemS expert group and participated in surveys and workshops. T.A. was an active member of the ObD and InS expert group, tested the metric map**s, reviewed the manuscript, and participated in surveys and workshops. S.B. 
co-led the SemS expert group, reviewed the manuscript, and participated in surveys and workshops. A.B. was an active member of the biomedical and cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. M.B.B. triggered changes in the framework by responding to the public questionnaire, reviewed the manuscript, and participated in surveys. M.J.C. was an active member of the ImLC expert group and participated in surveys and workshops. V.C. was an active member of the ImLC and cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. B.A.C. was an active member of the ObD and InS expert group, tested the metric mappings, reviewed the manuscript, and participated in surveys and workshops. K.F. was an active member of the biomedical and cross-topic expert groups and participated in surveys and workshops. L.F. triggered changes in the framework by responding to the public questionnaire, was an active member of the calibration expert group, reviewed the manuscript, and participated in surveys. A.G. triggered changes in the framework by responding to the public questionnaire, was an active member of the calibration expert group, reviewed the manuscript, and participated in surveys. B.v.G. participated in surveys and workshops. R.H. triggered changes in the framework by responding to the public questionnaire and participated in surveys. D.A.H. was an active member of the biomedical and cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. M.M.H. was an active member of the ImLC expert group, reviewed the manuscript, and participated in surveys and workshops. M.H. co-led the biomedical expert group, was an active member of the cross-topic expert group, reviewed the manuscript, and participated in surveys and workshops. P.J. co-led the cross-topic expert group, was an active member of the ObD and InS expert group, reviewed the manuscript, and participated in surveys and workshops. C.E.K. was an active member of the biomedical expert group, reviewed the manuscript, and participated in surveys and workshops. D.K. triggered changes in the framework by responding to the public questionnaire and participated in surveys. B.K. triggered changes in the framework by responding to the public questionnaire, reviewed the manuscript, and participated in surveys. F.K. triggered changes in the framework by responding to the public questionnaire and participated in surveys. A.K.-S. was a member of the extended Delphi core team and was an active member of the cross-topic expert group. A.K. was an active member of the biomedical expert group, reviewed the manuscript, and participated in surveys and workshops. B.A.L. was an active member of the SemS expert group and participated in surveys and workshops. G.L. was an active member of the ImLC expert group, reviewed the manuscript, and participated in surveys and workshops. A.M. was an active member of the biomedical and SemS expert groups and participated in surveys and workshops. K.M.-H. was an active member of the SemS expert group, reviewed the manuscript, and participated in surveys and workshops. E.M. was an active member of the ImLC expert group, reviewed the manuscript, and participated in surveys. B.M. participated in surveys and workshops. K.G.M.M. was an active member of the cross-topic expert group, reviewed the manuscript, and participated in surveys and workshops. H.M.
was an active member of the ImLC expert group, tested the metric mappings, reviewed the manuscript, and participated in surveys and workshops. B.N. was an active member of the ObD and InS expert group, tested the metric mappings, reviewed the manuscript, and participated in surveys. N. Rieke was an active member of the SemS expert group and participated in surveys and workshops. R.M.S. was an active member of the ObD and InS, the biomedical and the cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. A.A.T. co-led the SemS expert group and participated in surveys and workshops. A.T. was an active member of the calibration expert group, reviewed the manuscript, and participated in surveys. S.A.T. was an active member of the ObD and InS expert group, tested the metric mappings, reviewed the manuscript, and participated in surveys and workshops. B.v.C. was an active member of the cross-topic expert group and participated in surveys. G.V. was an active member of the ImLC and cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. G.S.C., A. Karthikesalingam, T.K., A.L.M., P.M., F.N., J.P., N. Rajpoot, J.S.-R., C.I.S., S.S. and M.v.S. served on the expert Delphi panel and participated in workshops and surveys.
Corresponding authors
Ethics declarations
Competing interests
We declare the following competing interests: Under terms of employment, M.B.B. is entitled to stock options in Mona.health, a KU Leuven spinoff. F.B. is an employee of Siemens AG. F.B. reports funding from Merck. B.v.G. is a shareholder of Thirona. B.G. was an employee of HeartFlow and Kheiron Medical Technologies. M.M.H. received an Nvidia GPU grant. B.K. is a consultant for ThinkSono. G.L. is on the advisory board of Canon Healthcare IT and is a shareholder of Aiosyn BV. N. Rieke is an employee of NVIDIA. J.S.-R. reports funding from GSK, Pfizer and Sanofi and fees from Travere Therapeutics, Stadapharm, Astex Therapeutics, Pfizer and Grunenthal. R.M.S. receives patent royalties from iCAD, ScanMed, Philips, Translation Holdings and PingAn; the laboratory of R.M.S. received research support from PingAn through a Cooperative Research and Development Agreement. S.A.T. receives financial support from Canon Medical Research Europe. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Pingkun Yan for their contribution to the peer review of this work. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Subprocess S1 for selecting a problem category.
The Category Mapping maps a given research problem to the appropriate problem category with the goal of grouping problems by similarity of validation. The leaf nodes represent the categories: image-level classification, object detection, instance segmentation, or semantic segmentation. FP2.1 refers to fingerprint 2.1 (see Fig. SN 1.10). An overview of the symbols used in the process diagram is provided in Fig. SN 5.1.
Extended Data Fig. 2 Subprocess S2 for selecting multi-class metrics (if any).
Applies to: image-level classification (ImLC). If class imbalance is present but no compensation for it is requested, the ‘No’ branch should be followed. Decision guides are provided in Supplementary Note 2.7.1. A detailed description of the subprocess is given in Supplementary Note 2.2.
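For intuition on why the imbalance branch matters, consider the following minimal Python sketch using scikit-learn. The data and the degenerate model are synthetic and purely illustrative (this is not part of the framework's reference implementations): plain accuracy rewards always predicting the majority class, whereas balanced accuracy, the average of per-class recall, exposes the failure.

```python
# Minimal sketch: accuracy vs balanced accuracy under class imbalance.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)
# 95% of images belong to class 0 (imbalanced reference labels).
y_true = rng.choice([0, 1], size=1000, p=[0.95, 0.05])
y_pred = np.zeros_like(y_true)  # degenerate model: always predicts class 0

print(accuracy_score(y_true, y_pred))           # ~0.95, looks strong
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, reveals the failure
```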
Extended Data Fig. 3 Subprocess S3 for selecting a per-class counting metric (if any).
Applies to: image-level classification (ImLC), object detection (ObD), and instance segmentation (InS). Decision guides are provided in Supplementary Note 2.7.2. A detailed description of the subprocess is given in Supplementary Notes 2.2, 2.4, and 2.5.
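For reference, the per-class counting metrics that this subprocess chooses between can all be derived from the entries of a per-class (one-versus-rest) confusion matrix. The sketch below uses the standard textbook formulas; it is illustrative only, not the framework's reference implementation.

```python
# Minimal sketch: common per-class counting metrics from a 2x2 confusion matrix.
import numpy as np

def counting_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)      # recall / true-positive rate
    specificity = tn / (tn + fp)      # true-negative rate
    ppv = tp / (tp + fp)              # precision / positive predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)  # F1 score
    mcc_num = tp * tn - fp * fn       # Matthews correlation coefficient
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "f1": f1, "mcc": mcc_num / mcc_den}

print(counting_metrics(tp=40, fp=10, fn=5, tn=945))
```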
Extended Data Fig. 4 Subprocess S4 for selecting a multi-threshold metric (if any).
Applies to: image-level classification (ImLC), object detection (ObD), and instance segmentation (InS). Decision guides are provided in Supplementary Note 2.7.3. A detailed description of the subprocess is given in Supplementary Notes 2.2, 2.4, and 2.5.
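In contrast to counting metrics, multi-threshold metrics integrate performance over all decision thresholds applied to continuous class scores. A minimal sketch with scikit-learn, on synthetic scores for illustration only:

```python
# Minimal sketch: two multi-threshold metrics, AUROC and average precision.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
# Toy scores loosely correlated with the reference labels.
y_score = y_true * 0.4 + rng.normal(0.3, 0.25, size=500)

print("AUROC:", roc_auc_score(y_true, y_score))            # area under the ROC curve
print("AP:   ", average_precision_score(y_true, y_score))  # average precision
```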
Extended Data Fig. 5 Subprocess S5 for selecting a calibration metric (if any).
Applies to: image-level classification (ImLC). Decision guides are provided in Supplementary Note 2.7.4. A detailed description of the subprocess is given in Supplementary Note 2.6. Further suggested calibration metrics include the calibration loss74, calibration slope46, Expected Calibration Index (ECI)24 and Observed:Expected ratio (O:E ratio)49.
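As an illustration of what such metrics quantify, the sketch below computes a binned expected calibration error (ECE; one common variant, binning choices differ across implementations) and the O:E ratio named in the caption, on synthetic data that is perfectly calibrated by construction. It is illustrative only and not the framework's reference implementation.

```python
# Minimal sketch: binned expected calibration error and observed:expected ratio.
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])  # bin index per prediction
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # |observed event rate - mean predicted probability|, weighted by bin size
            err += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return err

def oe_ratio(y_true, y_prob):
    return y_true.sum() / y_prob.sum()  # observed events / expected events

rng = np.random.default_rng(2)
y_prob = rng.uniform(0, 1, 10000)
y_true = (rng.uniform(0, 1, 10000) < y_prob).astype(int)  # calibrated by construction
print("ECE:", ece(y_true, y_prob), "O:E:", oe_ratio(y_true, y_prob))  # ~0 and ~1
```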
Extended Data Fig. 6 Subprocess S6 for selecting overlap-based segmentation metrics (if any).
Applies to: semantic segmentation (SemS) and instance segmentation (InS). Decision guides are provided in Supplementary Note 2.7.5. A detailed description of the subprocess is given in Supplementary Notes 2.3 and 2.5.
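The two canonical overlap-based metrics are the Dice similarity coefficient (DSC) and intersection over union (IoU). A minimal sketch on binary masks (illustrative only; it assumes non-empty masks, and real implementations must handle empty-mask edge cases):

```python
# Minimal sketch: overlap-based segmentation metrics on binary masks.
import numpy as np

def dice(pred, ref):
    inter = np.logical_and(pred, ref).sum()
    return 2 * inter / (pred.sum() + ref.sum())    # Dice similarity coefficient

def iou(pred, ref):
    inter = np.logical_and(pred, ref).sum()
    return inter / np.logical_or(pred, ref).sum()  # intersection over union

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
ref = np.zeros((64, 64), bool); ref[15:45, 15:45] = True
print("DSC:", dice(pred, ref), "IoU:", iou(pred, ref))
```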
Extended Data Fig. 7 Subprocess S7 for selecting a boundary-based segmentation metric (if any).
Applies to: semantic segmentation (SemS) and instance segmentation (InS). Decision guides are provided in Supplementary Note 2.7.6. A detailed description of the subprocess is given in Supplementary Notes 2.3 and 2.5.
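Boundary-based metrics compare contours rather than overlapping regions. The sketch below computes the symmetric Hausdorff distance between two binary masks via Euclidean distance transforms (SciPy); percentile variants such as HD95 and the normalized surface distance build on the same machinery. Illustrative only, not the framework's reference implementation.

```python
# Minimal sketch: symmetric Hausdorff distance between binary masks.
import numpy as np
from scipy import ndimage

def boundary(mask):
    return mask & ~ndimage.binary_erosion(mask)  # pixels on the object boundary

def hausdorff(pred, ref):
    bp, br = boundary(pred), boundary(ref)
    # Distance of every pixel to the nearest boundary pixel of the other mask.
    dt_ref = ndimage.distance_transform_edt(~br)
    dt_pred = ndimage.distance_transform_edt(~bp)
    return max(dt_ref[bp].max(), dt_pred[br].max())

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
ref = np.zeros((64, 64), bool); ref[15:45, 15:45] = True
print("Hausdorff distance:", hausdorff(pred, ref))
```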
Extended Data Fig. 8 Subprocess S8 for selecting the localization criterion.
Applies to: object detection (ObD) and instance segmentation (InS). Definitions of the localization criteria can be found in ref. 19. Decision guides are provided in Supplementary Note 2.7.7. A detailed description of the subprocess is given in Supplementary Notes 2.4 and 2.5.
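One common localization criterion is the IoU of axis-aligned bounding boxes, compared against a threshold to decide whether a prediction localizes a reference object; other criteria covered in ref. 19 (for example, center-point distance or mask IoU) follow the same pattern. A minimal, self-contained sketch for illustration:

```python
# Minimal sketch: box IoU as a localization criterion.
def box_iou(a, b):
    """Boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, ref = (10, 10, 40, 40), (15, 15, 45, 45)
print("IoU:", box_iou(pred, ref), "hit at IoU>0.5:", box_iou(pred, ref) > 0.5)
```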
Extended Data Fig. 9 Subprocess S9 for selecting the assignment strategy.
Applies to: object detection (ObD) and instance segmentation (InS). Assignment strategies are defined in ref. 19. Decision guides are provided in Supplementary Note 2.7.8. A detailed description of the subprocess is given in Supplementary Notes 2.4 and 2.5.
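One widely used assignment strategy is greedy matching by confidence: predictions are processed in order of decreasing score and matched to the best still-unmatched reference object that satisfies the localization criterion; leftover predictions count as false positives and leftover references as false negatives. The sketch below is illustrative only; ref. 19 defines the full set of strategies (including optimal Hungarian matching).

```python
# Minimal sketch: greedy prediction-to-reference assignment by confidence.
def greedy_assign(pred_scores, iou_matrix, thr=0.5):
    """iou_matrix[i][j]: IoU of prediction i with reference object j."""
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    matched_refs, matches = set(), []
    for i in order:
        candidates = [(iou_matrix[i][j], j) for j in range(len(iou_matrix[i]))
                      if j not in matched_refs and iou_matrix[i][j] >= thr]
        if candidates:
            _, j = max(candidates)  # best unmatched reference above threshold
            matched_refs.add(j)
            matches.append((i, j))
    return matches  # unmatched predictions are FPs, unmatched references FNs

iou = [[0.8, 0.1], [0.6, 0.2], [0.0, 0.7]]
print(greedy_assign([0.9, 0.8, 0.7], iou))  # -> [(0, 0), (2, 1)]
```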
Supplementary information
Supplementary Information
Supplementary Methods and Notes 1–5
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Maier-Hein, L., Reinke, A., Godau, P. et al. Metrics reloaded: recommendations for image analysis validation. Nat Methods 21, 195–212 (2024). https://doi.org/10.1038/s41592-023-02151-z
This article is cited by
- Where imaging and metrics meet. Nature Methods (2024)
- Predicting non-muscle invasive bladder cancer outcomes using artificial intelligence: a systematic review using APPRAISE-AI. npj Digital Medicine (2024)
- Test-time augmentation with synthetic data addresses distribution shifts in spectral imaging. International Journal of Computer Assisted Radiology and Surgery (2024)
- The intelligent imaging revolution: artificial intelligence in MRI and MRS acquisition and reconstruction. Magnetic Resonance Materials in Physics, Biology and Medicine (2024)
- AI powered road network prediction with fused low-resolution satellite imagery and GPS trajectory. Earth Science Informatics (2024)