Abstract
The integration of artificial intelligence (AI) in medical image interpretation requires effective collaboration between clinicians and AI algorithms. Although previous studies demonstrated the potential of AI assistance in improving overall clinician performance, the individual impact on clinicians remains unclear. This large-scale study examined the heterogeneous effects of AI assistance on 140 radiologists across 15 chest X-ray diagnostic tasks and identified predictors of these effects. Surprisingly, conventional experience-based factors, such as years of experience, subspecialty and familiarity with AI tools, fail to reliably predict the impact of AI assistance. Additionally, lower-performing radiologists do not consistently benefit more from AI assistance, challenging prevailing assumptions. Instead, we found that the occurrence of AI errors strongly influences treatment outcomes, with inaccurate AI predictions adversely affecting radiologist performance on the aggregate of all pathologies and on half of the individual pathologies investigated. Our findings highlight the importance of personalized approaches to clinician–AI collaboration and the importance of accurate AI models. By understanding the factors that shape the effectiveness of AI assistance, this study provides valuable insights for targeted implementation of AI, enabling maximum benefits for individual clinicians in clinical practice.
Main
The integration of artificial intelligence (AI) into medical image interpretation has shown great potential for improving diagnostic accuracy and efficiency32 or generate nuanced radiology reports33,34,35,36,37, rather than probabilities alone, may allow radiologists to potentially extract value from inaccurate AI predictions. In addition, we emphasize that these findings between AI accuracy and treatment effect are the result of many factors simultaneously at play, including the ground truth probability, the radiologist’s predicted probability and how radiologists interpret and use AI assistance, which can all be correlated with AI’s predicted probability. Therefore, these findings should not be extrapolated for defining the cognitive mechanism in which AI assistance helps or hurts radiologists. Further research with explicit control of the potential factors is necessary to understand that underlying mechanism29.
Our study has several limitations that should be acknowledged. First, the randomization of treatment conditions in the experiment, although necessary to eliminate confounding factors, prevented the analysis of temporal trends in radiologists’ response to AI assistance. We were unable to assess whether radiologists improved in incorporating AI predictions over time as they encountered more patient cases. Future research should aim to investigate these evolving dynamics between radiologists and AI. Second, the AI assistance available to radiologists contained only predicted probabilities and did not include additional explanations, such as localization of pathologies, which could help radiologists more accurately interpret and, therefore, make better use of the available AI predictions. Designers of AI systems should investigate the optimal types of explanations to present and the mode of presentation while staying cautious of the increased cognitive burden that this additional information can bring. Another limitation is the lack of exploration into the impact of task granularity. The AI model generated predictions for 15 individual pathologies, some of which were interconnected and represented different levels of detail. For instance, airspace opacity encompasses pathologies such as atelectasis, edema and consolidation. Understanding the relationships between higher-level and lower-level pathologies would be valuable in future studies. Furthermore, due to the simultaneous presentation of all 15 AI predictions, it was challenging to isolate the effect of AI assistance on individual pathologies. The influence of AI predictions on one pathology could potentially affect the radiologists’ response to AI predictions on other pathologies, especially when they are interrelated. 
Additionally, because we provided actual AI predictions on patient cases to radiologists, it was also difficult to eliminate the confounding factor of the patient case when studying the relationship between the accuracy of AI predictions and the radiologist’s treatment effect. Future work may control for the influence of the patient case by providing artificially set predictions to radiologists.
In conclusion, our study underscores the need for individualized approaches that are aware of clinician heterogeneity, high-quality AI models and comprehensive assessments of multiple factors to optimize the implementation of AI assistance in clinical medicine. Collaboration between clinicians and AI developers, focusing on personalized strategies and continuous improvement of AI models, will be essential for achieving the full potential of clinician–AI collaboration in healthcare.
Methods
This research complied with all relevant ethical regulations. The study that produced the AI assistance dataset29 used in this study was determined by the Massachusetts Institute of Technology (MIT) Committee on the Use of Humans as Experimental Subjects to be exempt through exempt determination E-2953.
Dataset specification
This study used 324 retrospective patient cases from Stanford University’s healthcare system containing chest X-rays and clinical histories, which include patients’ indication, vitals and labs. We analyzed data collected from a total of 140 radiologists participating in two experiment designs. The non-repeated-measure design included 107 radiologists (Supplementary Fig. 1). Each radiologist read 60 patient cases across four subsequences that each contained 15 cases. Each subsequence corresponded to one of four treatment conditions: with AI assistance and with clinical histories; with AI assistance and without clinical histories; without AI assistance and with clinical histories; and without AI assistance and without clinical histories. The four subsequences and associated treatment conditions were organized in a random order. The 60 patient cases were randomly selected and randomly assigned to one of the treatment conditions. This design included across-subject and within-subject variations in the treatment conditions; it did not allow within-case-subject comparisons because each radiologist encountered a case only once38. Order effects were mitigated by the randomization of treatment conditions. The repeated-measure design included 33 radiologists (Supplementary Fig. 2). Each radiologist read a total of 60 patient cases, each under all four treatment conditions, producing a total of 240 diagnoses. The radiologist completed the experiment in four sessions and read the same 60 randomly selected patient cases in each session, distributed across the treatment arms. In each session, 15 cases were read in each treatment arm in batches of five cases. Treatments were randomly ordered. As a result, the radiologist read each patient case under a different treatment condition in each of the four sessions. 
There was a 2-week washout period15,39,40 between every session to minimize order effects of radiologists reading the same case multiple times. This design included across-subject and within-subject variations as well as across-case-radiologist and within-case-radiologist variations in treatment conditions. Order effects were mitigated by the randomization of treatment conditions. No enrichment was applied to the data collection process. We combined data from both experiment designs from the clinical history conditions. Further details about the data collection process are available in a separate study29, which focuses on establishing a Bayesian framework for defining optimal human–AI collaboration and characterizing actual radiologist behavior in incorporating AI assistance. The study was determined exempt by the MIT Committee on the Use of Humans as Experimental Subjects through exempt determination E-2953.
There are 15 pathologies with corresponding AI predictions: abnormal, airspace opacity, atelectasis, bacterial/lobar pneumonia, cardiomediastinal abnormality, cardiomegaly, consolidation, edema, lesion, pleural effusion, pleural other, pneumothorax, rib fracture, shoulder fracture and support device hardware. These pathologies, the interrelations among these pathologies and additional pathologies without AI predictions can be visualized in a hierarchical structure in Supplementary Fig. B.1. Radiologists were asked to familiarize themselves with the hierarchy before starting, had access to the figure throughout the experiment and had to provide predictions for pathologies following this hierarchy. This aimed to maximize clarity on the specific pathologies referenced in the experiment. When radiologists received AI assistance, they were simultaneously presented with the AI predictions for these 15 pathologies along with the patient’s chest X-ray and, if applicable, their clinical history. The AI predictions were presented in the form of prediction probabilities on a 0–100 scale. The AI predictions were generated by the CheXpert model8, which is a DenseNet121 (ref. 41)-based model for chest X-rays that has been shown to perform similarly to board-certified radiologists. The model generated a single prediction for fracture that was used as the AI prediction for both rib fracture and shoulder fracture. Authors of the CheXpert model8 decided on the 14 pathologies (with a single prediction for fracture) based on the prevalence of observations in radiology reports in the CheXpert dataset and clinical relevance, conforming to the Fleischner Society’s recommended glossary42 whenever applicable. Among the pathologies, they included ‘Pneumonia’ (corresponding to ‘bacterial/lobar pneumonia’) to indicate the diagnosis of primary infection and ‘No Finding’ (corresponding to ‘abnormal’) to indicate the absence of all pathologies. 
These pathologies were set in the creation of the CheXpert labeler8, which has been applied to generate labels for reports in the CheXpert dataset and MIMIC-CXR43, which are among the largest chest X-ray datasets publicly available.
The ground truth probabilities for a patient case were determined by averaging the continuous predicted probabilities of five board-certified radiologists from Mount Sinai Hospital with at least 10 years of experience and chest radiology as a subspecialty on a 0–100 scale. For instance, if the predicted probabilities of the five board-certified radiologists are 91, 92, 92, 100 and 100, respectively, the ground truth probability is 95. The prevalence of the pathologies based on a ground truth probability threshold of 50 of a pathology being present is shown in Supplementary Table 1.
The participating radiologists represent a diverse set of institutions recruited through two means. Their primary affiliations include large, medium and small clinical settings and non-clinical settings. Additionally, some radiologists are affiliated with an academic hospital, whereas others are not. Radiologists in the non-repeated-measure design were recruited from teleradiology companies. Radiologists in the repeated-measure design were recruited from the Vinmec health system in Vietnam. Details about the participating radiologists and recruitment process can be found in Supplementary Note | Participant recruitment and affiliation.
The experiment interface and instructions presented to participating radiologists can be found in Supplementary Note | Experiment interface and instructions. Before entering the experiment, radiologists were instructed to walk through the experiment instructions, the hierarchy of pathological findings, basic information and performance of the AI model, video demonstration of the experiment interface and examples, consent clauses, comprehension check questions, information on bonus payment that incentivizes effort and practice patient cases covering four treatment conditions and showing example AI predictions from the AI model used in the experiment.
Sex and gender statistics of the participating radiologists and patient cases are available in Supplementary Tables 39 and 40, respectively. Sex and gender were not considered in the original data collection procedures. Disaggregated information about sex and gender at the individual level was collected in the separate study and will be made available29.
Empirical Bayes for individual heterogeneity
We used the empirical Bayes method30 to shrink the raw mean heterogeneous treatment effects and performance metrics of individual radiologists measured on the dataset toward the grand mean to ameliorate overestimating heterogeneity due to sampling error. The values include AI’s treatment effects on error, sensitivity and specificity and performance metrics on unassisted error, sensitivity and specificity.
Assume that \({t}_{r}\) is radiologist \(r\)’s true mean treatment effect from AI assistance or any metric of interest. We observe
\[\widetilde{t}_{r}={t}_{r}+{\eta }_{r},\]
which differs from \({t}_{r}\) by \({\eta }_{r}\). We use a normal distribution as the prior distribution over the metric of interest. The mean of the prior distribution can be computed as
\[\hat{E}\left[{t}_{r}\right]=\frac{1}{n}\sum_{r=1}^{n}\widetilde{t}_{r},\]
the mean of the observed mean metric of interest of the \(n\) radiologists. The variance of the prior distribution can be computed as
\[\widehat{Var}\left({t}_{r}\right)=\widehat{Var}\left(\widetilde{t}_{r}\right)-\hat{E}\left[{\eta }_{r}^{2}\right],\]
the variance of the observed mean metric of interest of radiologists minus the estimated \(E\left[{\eta }_{r}^{2}\right]\). We can estimate \(E\left[{\eta }_{r}^{2}\right]\) with
\[\hat{E}\left[{\eta }_{r}^{2}\right]=\frac{1}{n}\sum_{r=1}^{n}s.e.{\left(\widetilde{t}_{r}\right)}^{2}.\]
Denote the estimated mean and variance of the prior distribution as \({\mu }_{0}\) and \({\sigma }_{0}^{2}\). We can compute the mean of the posterior distribution for radiologist \(r\) as
\[\frac{{\sigma }_{r}^{2}{\mu }_{0}+{\sigma }_{0}^{2}{\mu }_{r}}{{\sigma }_{0}^{2}+{\sigma }_{r}^{2}},\]
where \({\mu }_{r}=\widetilde{t}_{r}\) and \({\sigma }_{r}=s.e.\left(\widetilde{t}_{r}\right)\); we can compute the variance of the posterior as
\[\frac{{\sigma }_{0}^{2}{\sigma }_{r}^{2}}{{\sigma }_{0}^{2}+{\sigma }_{r}^{2}},\]
where \({\sigma }_{r}=s.e.\left(\widetilde{t}_{r}\right)\). The updated mean of the posterior distribution is the radiologist’s metric of interest after shrinkage.
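The shrinkage step can be illustrated with a minimal NumPy sketch. It is not the study's released code: the function name `empirical_bayes_shrink`, the use of the sample variance with `ddof=1`, and the flooring of the prior variance at zero are assumptions made for illustration, and the inputs are made-up numbers. The sketch also assumes all standard errors are strictly positive.

```python
import numpy as np

def empirical_bayes_shrink(obs_means, std_errs):
    """Shrink per-radiologist estimates toward the grand mean (normal prior).

    obs_means: observed mean metric for each radiologist
    std_errs:  standard error of each observed mean (assumed > 0)
    """
    obs_means = np.asarray(obs_means, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)

    mu0 = obs_means.mean()                  # prior mean: grand mean
    noise_var = np.mean(std_errs ** 2)      # estimate of E[eta_r^2]
    # Prior variance: observed variance minus sampling noise.
    # Flooring at zero is an assumption, not stated in the text.
    sigma0_sq = max(obs_means.var(ddof=1) - noise_var, 0.0)

    sigma_r_sq = std_errs ** 2
    denom = sigma0_sq + sigma_r_sq
    post_mean = (sigma_r_sq * mu0 + sigma0_sq * obs_means) / denom
    post_var = sigma0_sq * sigma_r_sq / denom
    return post_mean, post_var

# Hypothetical treatment effects for four radiologists.
obs = np.array([-2.0, 0.0, 1.0, 5.0])
se = np.array([1.0, 1.0, 1.0, 1.0])
post_mean, post_var = empirical_bayes_shrink(obs, se)
```

Each posterior mean lies between the radiologist's raw estimate and the grand mean, and radiologists with larger standard errors are pulled more strongly toward the grand mean.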
For the analysis on treatment effects on absolute error, we focus on high-prevalence pathologies with prevalence greater than 10%, because radiologists’ baseline performance without AI assistance is generally highly accurate on low-prevalence pathologies, where they correctly predict that a pathology is not present, and, as a result, there is little variation in radiologists’ errors. This is especially true when computing each individual radiologist’s treatment effect. When there is zero variance in the performance of a radiologist under a treatment condition, the associated standard error estimate is zero, making it impossible to perform inference on this radiologist’s treatment effect.
Combined characteristics model for splitting radiologists into subgroups
The combined characteristics model was fitted on a training set of half of the radiologists (n = 68) to predict treatment effects of the test set of the remaining half (n = 68). The treatment effect predictions on the test set were used as the combined characteristics score for splitting the test set radiologists into binary subgroups (based on whether a particular radiologist’s combined characteristics score was smaller than or equal to the median treatment effect of radiologists computed from all available reads). Then, the same procedure was repeated after flipping the training set and test set radiologists to split the other set of radiologists into binary subgroups. The experience-based characteristics of radiologists in the randomly split training set and test set were balanced: one set contained 27 radiologists with less than or equal to 6 years of experience and 41 radiologists with more than 6 years of experience, and the other set contained 41 and 27, respectively. One set contained 47 radiologists who did not specialize in thoracic radiology and 21 radiologists who did, and the other set contained 54 and 14 radiologists, respectively. One set contained 32 radiologists without experience with AI tools and 36 radiologists with experience, and the other set contained 31 and 37, respectively.
Treatment effect models
To compute a radiologist’s observed mean treatment effect and the corresponding standard errors and the overall treatment effect of AI assistance across subgroups, we built a linear regression model with the following formulation using the statsmodels library: error ∼ 1 + C(treatment). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the treatment effect of AI assistance for both non-repeated-measure and repeated-measure data.
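As an illustration of the formulation error ∼ 1 + C(treatment), the following sketch fits the model with the statsmodels formula interface on synthetic reads. The data frame, its column values and the effect size are invented for this example and do not come from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
# Synthetic reads: treatment = 1 means the read was made with AI assistance.
reads = pd.DataFrame({"treatment": rng.integers(0, 2, size=n)})
# Synthetic absolute errors; assistance lowers the mean error by 2 points.
noise = rng.normal(0.0, 5.0, size=n)
reads["error"] = (20.0 - 2.0 * reads["treatment"] + noise).clip(lower=0.0)

fit = smf.ols("error ~ 1 + C(treatment)", data=reads).fit()
# Parameters are ordered as [intercept, treatment indicator].
# The second one is the mean assisted error minus the mean unassisted error,
# so a negative value means lower error with assistance.
effect = fit.params.iloc[1]
effect_se = fit.bse.iloc[1]
```

With a single binary regressor and an intercept, the coefficient equals the difference between the two group means, which is why the same formulation applies to reads pooled from either experiment design.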
Subgroup-specific treatment effect models
For the analyses on experience-based radiologist characteristics and AI error, we computed the treatment effects of subgroups split based on the predictor of interest by building a linear regression model with the following formulation using the statsmodels library: error ∼ 1 + C(subgroup) + C(treatment):C(subgroup). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; subgroup refers to an indicator of the subgroup that the radiologist is split into; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the subgroup-specific treatment effect of AI assistance for both non-repeated-measure data and repeated-measure data.
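The subgroup-specific formulation can be sketched in the same way. In the example below, the two subgroups, their effect sizes and the column names are hypothetical. Because the formula omits a main effect for treatment, each interaction coefficient is read here as the treatment effect within one subgroup.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
reads = pd.DataFrame({
    "subgroup": rng.integers(0, 2, size=n),    # hypothetical binary split
    "treatment": rng.integers(0, 2, size=n),   # 1 = with AI assistance
})
# Synthetic errors: effect of -1 in subgroup 0 and -4 in subgroup 1.
true_effect = np.where(reads["subgroup"] == 0, -1.0, -4.0)
reads["error"] = (20.0 + 3.0 * reads["subgroup"]
                  + true_effect * reads["treatment"]
                  + rng.normal(0.0, 5.0, size=n))

formula = "error ~ 1 + C(subgroup) + C(treatment):C(subgroup)"
fit = smf.ols(formula, data=reads).fit()
# Interaction terms are the only parameter names containing ":".
subgroup_effects = fit.params[[name for name in fit.params.index if ":" in name]]
```

A Wald test of equality between the two interaction coefficients would then assess whether the subgroups differ in treatment effect.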
Cluster-robust standard errors
To account for correlations of observations within patient cases and radiologists, we computed cluster-robust standard errors that are two-way clustered at the patient case and radiologist level for all inferences unless otherwise specified44,45. With the statsmodels library’s ordinary least squares (OLS) class, we used a clustered covariance estimator as the type of robust sandwich estimator and defined two-way groups based on identifiers of the patient cases and radiologists. The approach assumes that regression model errors are independent across clusters defined by the patient cases and radiologists and adjusts for correlations within clusters.
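A minimal sketch of two-way clustering with the statsmodels OLS class is shown below, assuming integer identifiers for patient cases and radiologists. The synthetic data, the column names `case_id` and `radiologist_id`, and the effect sizes are illustrative assumptions. Passing a two-column integer array as `groups` requests two-way clustering.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_cases, n_rads = 30, 20
# One synthetic read per (case, radiologist) pair.
reads = pd.MultiIndex.from_product(
    [range(n_cases), range(n_rads)], names=["case_id", "radiologist_id"]
).to_frame(index=False)
case_effect = rng.normal(0.0, 3.0, n_cases)   # shared within a patient case
rad_effect = rng.normal(0.0, 3.0, n_rads)     # shared within a radiologist
reads["treatment"] = rng.integers(0, 2, len(reads))
reads["error"] = (20.0 - 2.0 * reads["treatment"]
                  + case_effect[reads["case_id"].to_numpy()]
                  + rad_effect[reads["radiologist_id"].to_numpy()]
                  + rng.normal(0.0, 5.0, len(reads)))

model = smf.ols("error ~ 1 + C(treatment)", data=reads)
naive = model.fit()
# Two-column integer array: first column cases, second column radiologists.
groups = reads[["case_id", "radiologist_id"]].to_numpy()
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": groups})
```

The point estimates are identical in both fits; only the covariance matrix, and therefore the standard errors and test statistics, changes.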
Reversion to the mean
The reversion to the mean effect and the mechanism of split sampling in avoiding reversion to the mean are explained in the following derivation:
Suppose that \({u}_{i,r}^{* }\) and \({a}_{i,r}^{* }\) are the true unassisted and assisted diagnostic error of radiologist \(r\) on patient case i. Suppose that we measure \({u}_{i,r}={u}_{i,r}^{* }+{e}_{i,r}^{u}\) and \({a}_{i,r}={a}_{i,r}^{* }+{e}_{i,r}^{a}\) where \({e}_{i,r}^{u}\) and \({e}_{i,r}^{a}\) are measurement errors. Assume that the measurement errors are independent of \({u}_{i,r}^{* }\) and \({a}_{i,r}^{* }\).
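To study the relationship between unassisted error and treatment effect, we intend to build the following linear regression model:
\[{u}_{r}^{*}-{a}_{r}^{*}={\beta }{u}_{r}^{*}+{e}_{r}^{*},\]
where the error is independent of the independent variable, and \({u}_{r}^{*}\) and \({a}_{r}^{*}\) are the mean unassisted and assisted performance of radiologist \(r\). Here, the moment condition
\[E\left[{e}_{r}^{*}\times {u}_{r}^{*}\right]=0\]
is as desired. This univariate regression estimates the true value of \({\beta }\), which is defined as
\[{\beta }=\frac{E\left[\left({u}_{r}^{*}-{a}_{r}^{*}\right){u}_{r}^{*}\right]}{E\left[{\left({u}_{r}^{*}\right)}^{2}\right]}.\]
However, because we have access only to noisy measurements \({u}_{r}\) and \({a}_{r}\), consider instead an approach that builds the model
\[{u}_{r}-{a}_{r}={\beta }{u}_{r}+{e}_{r}\]
and assumes the moment condition
\[E\left[{e}_{r}\times {u}_{r}\right]=0. \tag{11}\]
This linear regression model using noisy measurements instead generates the following estimate of \({\beta }\) (taking the measurement errors to be mean zero and independent of each other):
\[\hat{\beta }=\frac{E\left[\left({u}_{r}-{a}_{r}\right){u}_{r}\right]}{E\left[{u}_{r}^{2}\right]}=\frac{E\left[\left({u}_{r}^{*}-{a}_{r}^{*}\right){u}_{r}^{*}\right]+Var\left({e}_{r}^{u}\right)}{E\left[{\left({u}_{r}^{*}\right)}^{2}\right]+Var\left({e}_{r}^{u}\right)},\]
which is incorrect because of the additional \(Var\left({e}_{r}^{u}\right)\) terms in the numerator and the denominator. The additional term in the denominator represents attenuation bias, which we address in detail in a later subsection. The term in the numerator represents the reversion to the mean issue, which we now discuss in further detail.
As the equation shows, the bias caused by reversion to the mean is positive. This term exists because the moment condition \(E\left[{e}_{r}\times {u}_{r}\right]=0\), equation (11), is not valid at the true value of \({\beta }\) as shown in the following derivation:
\[E\left[{e}_{r}\times {u}_{r}\right]=E\left[\left({e}_{r}^{*}+\left(1-{\beta }\right){e}_{r}^{u}-{e}_{r}^{a}\right)\left({u}_{r}^{*}+{e}_{r}^{u}\right)\right]=\left(1-{\beta }\right)Var\left({e}_{r}^{u}\right),\]
which is non-zero unless \({\beta }=1\).
Split sampling solves this bias by using separate patient cases for computing unassisted error and treatment effect. A simple construction of split sampling is to use a separate case i for computing the treatment effect and to use the remaining cases to compute unassisted error. With this construction, we obtain the following estimate of \({\beta }\):
\[\hat{\beta }=\frac{E\left[\left({u}_{i,r}-{a}_{i,r}\right){u}_{\ne i,r}\right]}{E\left[{u}_{\ne i,r}^{2}\right]},\]
where \({u}_{i,r}\) is the unassisted performance on case i for radiologist \(r\), and \({u}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i. If the errors on each case used to compute \({u}_{r}^{*}\) and \({a}_{r}^{*}\) are independent, the estimate of \({\beta }\) is equal to
\[\frac{E\left[\left({u}_{r}^{*}-{a}_{r}^{*}\right){u}_{r}^{*}\right]}{E\left[{\left({u}_{r}^{*}\right)}^{2}\right]+Var\left({e}_{\ne i,r}^{u}\right)},\]
where \({e}_{\ne i,r}^{u}\) is the measurement error of \({u}_{\ne i,r}\).
The remaining discrepancy in the denominator again represents attenuation bias and is addressed in a later subsection.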
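The reversion to the mean issue and the split sampling remedy can be illustrated with a small simulation. This sketch is not drawn from the study data: the number of radiologists, the noise levels and the constant true treatment effect are assumptions, and the slope is computed with an intercept (covariance over variance). Because the true treatment effect is unrelated to true unassisted error, the true slope is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
R, N = 20000, 15                       # simulated radiologists, cases per arm
u_true = rng.normal(20.0, 3.0, R)      # true unassisted error per radiologist
a_true = u_true - 2.0                  # constant true effect, so true slope = 0
u = u_true[:, None] + rng.normal(0.0, 10.0, (R, N))   # noisy unassisted reads
a = a_true[:, None] + rng.normal(0.0, 10.0, (R, N))   # noisy assisted reads

def slope(x, y):
    """OLS slope of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Naive: the same unassisted reads enter both the regressor and the outcome.
u_bar, a_bar = u.mean(axis=1), a.mean(axis=1)
naive_slope = slope(u_bar, u_bar - a_bar)

# Split sampling: case 0 gives the treatment effect, the rest the regressor.
u_rest = u[:, 1:].mean(axis=1)
split_slope = slope(u_rest, u[:, 0] - a[:, 0])
```

In this setup the naive slope is pushed well above zero by the shared measurement error, whereas the split-sample slope stays close to zero.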
Data efficient split sampling construction
To study unassisted error as a predictor of treatment effect, we built a linear regression model with the following formulation using the statsmodels library: treatment effect ∼ 1 + unassisted error. We designed the following split sampling construction to maximize data efficiency when computing the independent and dependent variables in the linear regression.
Let i index a patient case and \(r\) index a radiologist. Assume that a radiologist reads \({N}_{u}\) cases unassisted and \({N}_{a}\) cases assisted. Recall that the unassisted and assisted cases are disjoint for the non-repeated-measure data; they overlap exactly for the repeated-measure data.
For the non-repeated-measure design, we adopt the following construction:
\[{u}_{i,r}-{a}_{r}\sim 1+{x}_{\ne i,r},\]
where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\) and \({a}_{r}=\frac{1}{{N}_{a}}{\sum }_{k}{a}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i; \({u}_{{i},{r}}\) is the unassisted performance on case i for radiologist \(r\); and \({a}_{r}\) is the mean assisted performance on all assisted cases for radiologist \(r\).
For the repeated-measure design, we adopt the following construction:
\[{u}_{i,r}-{a}_{i,r}\sim 1+{x}_{\ne i,r},\]
where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all cases other than i; \({u}_{i,r}\) is the unassisted performance on case i for radiologist \(r\); and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).
To study unassisted error as a predictor of assisted error, we built a linear regression model with the following formulation using the statsmodels library: assisted error ∼ 1 + unassisted error. We designed the following split sampling construction that maximizes data efficiency when computing the independent and dependent variables in the linear regression.
For the non-repeated-measure design, we adopt the following construction:
\[{a}_{i,r}\sim 1+{x}_{r},\]
where \({x}_{r}=\frac{1}{{N}_{u}}{\sum }_{k}\,{u}_{k,r}\). Here, \({x}_{r}\) is the mean unassisted performance computed on all unassisted cases, and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).
For the repeated-measure design, we adopt the following construction:
\[{a}_{i,r}\sim 1+{x}_{\ne i,r},\]
where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).
The constructions above again emphasize the necessity for split sampling. Without split sampling, the mean unassisted performance, which is the independent variable of the linear regression, will be correlated with the error terms due to overlapping patient cases, leading to a bias in the regression.
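The leave-one-out constructions for the treatment effect regression can be sketched as follows. The helper names and the rectangular array layout (one row per radiologist, one column per case, with equal case counts) are simplifying assumptions for illustration.

```python
import numpy as np

def leave_one_out_mean(u):
    """For each entry u[r, i], the mean of row r over all cases other than i."""
    n_cases = u.shape[1]
    return (u.sum(axis=1, keepdims=True) - u) / (n_cases - 1)

def split_sample_variables(u, a, repeated):
    """Return (x, y) for the regression treatment effect ~ 1 + unassisted error.

    u, a: arrays of shape (radiologists, cases) with unassisted and assisted
    errors. For the repeated-measure design, column i is the same patient
    case in both arrays; for the non-repeated design, the cases are disjoint.
    """
    x = leave_one_out_mean(u)
    if repeated:
        y = u - a                                  # u_{i,r} - a_{i,r}
    else:
        y = u - a.mean(axis=1, keepdims=True)      # u_{i,r} - a_r
    return x.ravel(), y.ravel()

# One hypothetical radiologist with three cases per arm.
u = np.array([[10.0, 20.0, 30.0]])
a = np.array([[5.0, 10.0, 15.0]])
x_rep, y_rep = split_sample_variables(u, a, repeated=True)
x_non, y_non = split_sample_variables(u, a, repeated=False)
```

Every unassisted case serves once as the held-out case, so no reads are discarded, while the regressor for case i never includes the read on case i.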
Adjustment for attenuation bias
We adjusted for attenuation bias for the split sampling linear regression formulations.
We want to estimate regressions of the form
\[{Y}_{r}={\beta }_{0}+{\beta }_{1}E\left[{x}_{r}\right]+{\varepsilon }_{r},\]
where \({Y}_{r}\) is an outcome for radiologist \(r\) and \(E\left[{x}_{r}\right]\) is radiologist \(r\)ʼs average unassisted performance. We observe
\[\widetilde{x}_{r}=E\left[{x}_{r}\right]+{\eta }_{r},\]
where \({\eta }_{r}=\frac{1}{{N}_{r}}\sum_{i}{x}_{ir}-E\left[{x}_{r}\right]\) and \(E\left[{\eta }_{r}{x}_{r}\right]=0\) and \(E\left[{\eta }_{r}{\varepsilon }_{r}\right]=0\), which are justified by independent and identically distributed (i.i.d.) sampling of cases and split sampling, respectively.
Using observations from the experiment, we estimate the following regression:
\[{Y}_{r}={\gamma }_{0}+{\gamma }_{1}\widetilde{x}_{r}+{\nu }_{r}.\]
Recall that
\[{\gamma }_{1}={\lambda }{\beta }_{1},\]
where \({\lambda }=\frac{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]}{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]+E\left[{\eta }_{r}^{2}\right]}\) and \({\beta }_{1}=\frac{E\left[\left({x}_{r}-E\left[{x}_{r}\right]\right)\left({Y}_{r}-E\left[{Y}_{r}\right]\right)\right]}{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]}\). We can estimate \({\lambda }\) using a plug-in estimator for each term in the data: (1)
\[\hat{E}\left[{\eta }_{r}^{2}\right]=\frac{1}{n}\sum_{r=1}^{n}s.e.{\left(\widetilde{x}_{r}\right)}^{2},\]
where \(s.e.\left(\widetilde{x}_{r}\right)\) is the standard error of the mean estimator for radiologist \(r\) and \(n\) is the number of radiologists. (2)
\[E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]=E\left[{\left(\widetilde{x}_{r}-E\left[\widetilde{x}_{r}\right]\right)}^{2}\right]-E\left[{\eta }_{r}^{2}\right],\]
which can be estimated by taking the difference between the variance of the observed \(\widetilde{x}_{r}\)’s and the estimated \(E\left[{\eta }_{r}^{2}\right]\). The denominator of \({\lambda }\) is effectively \(E\left[{\left(\widetilde{x}_{r}-E\left[\widetilde{x}_{r}\right]\right)}^{2}\right]\).
Finally, with \(\hat{\lambda }\), we can estimate \({\beta }_{1}\) using the estimator
\[{\hat{\beta }}_{1}=\frac{{\hat{\gamma }}_{1}}{\hat{\lambda }}.\]
For inference, notice that \(\sqrt{n}\left({\hat{\gamma }}_{1}-{\gamma }_{1}\right){\to }^{d}N\left(0,{\sigma }_{\gamma }^{2}\right)\) and \(\hat{\lambda }{\to }^{p}\,{\lambda }\). By Slutsky’s theorem, we know that
\[\sqrt{n}\left(\frac{{\hat{\gamma }}_{1}}{\hat{\lambda }}-\frac{{\gamma }_{1}}{{\lambda }}\right){\to }^{d}N\left(0,\frac{{\sigma }_{\gamma }^{2}}{{\lambda }^{2}}\right).\]
Therefore, we divide the standard errors of \({\hat{\gamma }}_{1}\) by \(\hat{\lambda }\) to obtain the standard errors of \({\hat{\beta }}_{1}\).
This concludes the adjustment for attenuation bias for the slope term.
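The slope adjustment can be sketched in a few lines of NumPy. The function name, the simulated values and the use of sample moments with `ddof=1` are assumptions made for illustration. The sketch also presumes that the estimated \(\hat{\lambda }\) is positive and that the outcome is computed on separate cases, so that the measurement error in the regressor is independent of the regression error.

```python
import numpy as np

def attenuation_adjusted_slope(x_obs, se_x, y):
    """Return (beta1_hat, gamma1_hat, lambda_hat) for a noisy regressor.

    x_obs: observed per-radiologist means of the regressor
    se_x:  standard error of each observed mean
    y:     outcome per radiologist
    """
    x_obs, se_x, y = (np.asarray(v, dtype=float) for v in (x_obs, se_x, y))
    noise_var = np.mean(se_x ** 2)           # estimate of E[eta_r^2]
    total_var = np.var(x_obs, ddof=1)        # variance of the observed means
    lam = (total_var - noise_var) / total_var
    gamma1 = np.cov(x_obs, y)[0, 1] / total_var   # unadjusted OLS slope
    return gamma1 / lam, gamma1, lam

# Simulated check: the true slope is 3 and the regressor carries unit noise.
rng = np.random.default_rng(0)
n = 5000
x_true = rng.normal(0.0, 2.0, n)
se_x = np.full(n, 1.0)
x_obs = x_true + rng.normal(0.0, se_x)
y = 1.0 + 3.0 * x_true + rng.normal(0.0, 1.0, n)
beta1, gamma1, lam = attenuation_adjusted_slope(x_obs, se_x, y)
```

In this simulation the unadjusted slope is attenuated toward zero by a factor close to 0.8, and dividing by \(\hat{\lambda }\) recovers a value near the true slope. The standard error of the unadjusted slope would be divided by the same factor.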
Statistical testing
To determine the amount of heterogeneity between subgroups of radiologists receiving lower versus higher treatment effects, we ran an unpaired t-test between the two subgroups of treatment effects computed using the empirical Bayes method. We used the Wald test to test regression coefficients against the null hypothesis of joint equality among treatment effects of different subgroups to determine if there is a statistically significant difference among subgroups split based on the predictor of interest. We also used the Wald test to test regression coefficients against the null hypothesis of zero to determine in a continuous analysis if the independent variable, namely unassisted error, is a predictor of the dependent variable, namely treatment effect or assisted error. We used the Benjamini–Hochberg procedure to correct for multiple hypothesis testing over 15 individual pathologies. For the analysis on treatment effect on AUROC between subgroups determined by AI error (Supplementary Table 34), we conducted an F-test to determine whether there is a statistically significant difference between treatment effects on AUROC in different bins. Specifically, we used the number of reads that fall into each bin as the group size. We used the grand mean AUROC and group AUROCs along with group sizes to compute the sum of squares between; we used the estimated standard error of each group AUROC along with the group size to compute the sum of squares within (error).
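For reference, the Benjamini–Hochberg adjustment can be written in a few lines of NumPy. This is a generic sketch of the procedure rather than the study's code; the function name, the significance level of 0.05 and the example P values are assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted P values and a boolean rejection mask."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest P value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted, adjusted <= alpha

# Hypothetical P values for four pathologies.
adjusted, reject = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In the study, the correction would be applied across the P values of the 15 individual pathologies for a given analysis.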
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The 324 patient cases from Stanford University’s healthcare system were used under licensing. They are available at https://stanfordaimi.azurewebsites.net/datasets/5194008e-61cf-4083-9896-3d4bd8bf8b0b, conditioned on a Stanford University data research use agreement. The AI predictions used in the experiment were generated by the CheXpert model trained on the CheXpert dataset8, which is publicly available. The clinician–AI collaboration dataset is available at https://osf.io/z7apq/ upon request for access at the Open Science Framework dataset page.
Code availability
Code for the analysis is available at https://doi.org/10.5281/zenodo.10467492 (ref. 46). Data analysis was conducted using Python 3.9.7, libraries statsmodels 0.13.5 and scipy 1.10.1; and R 4.1.3 and libraries MRMCaov 0.3.0 and auctestr 1.0.0.
References
Rajpurkar, P. et al. CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. Preprint at arXiv https://doi.org/10.48550/arXiv.1711.05225 (2017).
Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 15, e1002686 (2018).
Novikov, A. A. et al. Fully convolutional architectures for multiclass segmentation in chest radiographs. IEEE Trans. Med. Imaging 37, 1865–1876 (2018).
Majkowska, A. et al. Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431 (2020).
Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
Yala, A. et al. Multi-institutional validation of a mammography-based breast cancer risk model. J. Clin. Oncol. 40, 1732–1740 (2022).
Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. in Proc. of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence 590–597 (2019).
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Ghassemi, M. et al. A review of challenges and opportunities in machine learning for health. AMIA Summits Transl. Sci. Proc. 2020, 191–200 (2020).
Norden, J. G. & Shah, N. R. What AI in health care can learn from the long road to autonomous vehicles. NEJM Catalyst https://catalyst.nejm.org/doi/full/10.1056/CAT.21.0458 (2022).
Rajpurkar, P. & Lungren, M. P. The current and future state of AI interpretation of medical images. N. Engl. J. Med. 388, 1981–1990 (2023).
Chi, E. A. et al. Development and validation of an artificial intelligence system to optimize clinician review of patient records. JAMA Netw. Open 4, e2117391 (2021).
Seah, J. C. Y. et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit. Health 3, e496–e506 (2021).
Frazer, H. M. L. et al. AI integration improves breast cancer screening in a real-world, retrospective cohort study. Preprint at medRxiv https://doi.org/10.1101/2022.11.23.22282646 (2022).
Lu, Z. et al. Assessment of the role of artificial intelligence in the association between time of day and colonoscopy quality. JAMA Netw. Open 6, e2253840 (2023).
Mozannar, H. et al. Who Should Predict? Exact Algorithms For Learning to Defer to Humans. in International Conference on Artificial Intelligence and Statistics 10520–10545 (PMLR, 2023).
Dvijotham, K. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians (CoDoC). Nat. Med. 29, 1814–1820 (2023).
Fogliato, R. et al. Who goes first? Influences of human–AI workflow on decision making in clinical imaging. in FAccT’22: Proc. of the 2022 ACM Conference on Fairness, Accountability, and Transparency https://doi.org/10.1145/3531146.3533193 (Association for Computing Machinery, 2022).
Ahn, J. S. et al. Association of artificial intelligence–aided chest radiograph interpretation with reader performance and efficiency. JAMA Netw. Open 5, e2229289 (2022).
Farzaneh, N., Ansari, S., Lee, E., Ward, K. R. & Sjoding, M. W. Collaborative strategies for deploying artificial intelligence to complement physician diagnoses of acute respiratory distress syndrome. NPJ Digit. Med. 6, 62 (2023).
Zheng, X. et al. A deep learning model and human–machine fusion for prediction of EBV-associated gastric cancer from histopathology. Nat. Commun. 13, 2790 (2022).
Gaube, S. et al. Non-task expert physicians benefit from correct explainable AI advice when reviewing X-rays. Sci. Rep. 13, 1383 (2023).
Jones, C. M. et al. Assessment of the effect of a comprehensive chest radiograph deep learning model on radiologist reports and patient outcomes: a real-world observational study. BMJ Open 11, e052902 (2021).
Tschandl, P. et al. Human–computer collaboration for skin cancer recognition. Nat. Med. 26, 1229–1234 (2020).
Reverberi, C. et al. Experimental evidence of effective human–AI collaboration in medical decision-making. Sci. Rep. 12, 14952 (2022).
Dratsch, T. et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307, e222176 (2023).
Agarwal, N., Moehring, A., Rajpurkar, P. & Salz, T. Combining human expertise with artificial intelligence: experimental evidence from radiology. National Bureau of Economic Research. Working paper 31422. https://doi.org/10.3386/w31422 (2023).
Carlin, B. P. & Louis, T. A. Empirical Bayes: past, present and future. J. Am. Stat. Assoc. 95, 1286–1289 (2000).
Stigler, S. M. Regression towards the mean, historically considered. Stat. Methods Med. Res. 6, 103–114 (1997).
Saporta, A. et al. Benchmarking saliency methods for chest X-ray interpretation. Nat. Mach. Intell. 4, 867–878 (2022).
Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer. in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1439–1449 (Association for Computational Linguistics, 2020).
Miura, Y., Zhang, Y., Tsai, E. B., Langlotz, C. P. & Jurafsky, D. Improving factual completeness and consistency of image-to-text radiology report generation. in Proc. of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 5288–5304 (Association for Computational Linguistics, 2021).
Endo, M., Krishnan, R., Krishna, V., Ng, A. Y. & Rajpurkar, P. Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. in Proc. of Machine Learning for Health 209–219 (PMLR, 2021).
Yan, A. et al. Weakly supervised contrastive learning for chest X-ray report generation. Preprint at arXiv https://doi.org/10.48550/arXiv.2109.12242 (2021).
Nicolson, A., Dowling, J. & Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 144, 102633 (2023).
Charness, G., Gneezy, U. & Kuhn, M. A. Experimental methods: between-subject and within-subject design. J. Econ. Behav. Organ. 81, 1–8 (2012).
Pacilè, S. et al. Improving breast cancer detection accuracy of mammography with the concurrent use of an artificial intelligence tool. Radiol. Artif. Intell. 2, e190208 (2020).
Conant, E. F. et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol. Artif. Intell. 1, e180096 (2019).
Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition https://doi.org/10.1109/CVPR.2017.243 (IEEE, 2017).
Hansell, D. M. et al. Fleischner Society: glossary of terms for thoracic imaging. Radiology 246, 697–722 (2008).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Cameron, A. C. & Miller, D. L. A practitioner’s guide to cluster-robust inference. J. Hum. Resour. 50, 317–372 (2015).
Angrist, J. D. & Pischke, J.-S. Mostly Harmless Econometrics: An Empiricist’s Companion (Princeton Univ. Press, 2009).
Yu, F. et al. Effects of AI assistance on radiologists: code release. https://doi.org/10.5281/zenodo.10467492 (2024).
Acknowledgements
The authors acknowledge support from the Alfred P. Sloan Foundation (2022-17182, N.A.), the J-PAL Healthcare Delivery Initiative and the MIT School of Humanities, Arts, and Social Sciences (SHASS).
Author information
Contributions
T.S., N.A. and P.R. conceived the study. F.Y. and A.M. planned and executed the data analysis. F.Y., A.M., O.B., T.S., N.A. and P.R. contributed to the interpretation of findings. F.Y. and O.B. drafted the manuscript. All authors provided critical feedback and substantially contributed to the revision of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Jarrel Seah, Michael Sjoding and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Individual heterogeneity in treatment effects.
a, b, Individual heterogeneity in treatment effects of 140 radiologists as determined using the empirical Bayes method on (a) all pathologies aggregated and (b) high-prevalence pathology labels (pathology labels with greater than 10% prevalence). The curve is the kernel density estimate (KDE).
Extended Data Fig. 2 Individual heterogeneity in unassisted error.
a, b, Individual heterogeneity in unassisted error of 140 radiologists as determined using the empirical Bayes method on (a) all pathologies aggregated and (b) high-prevalence pathology labels (pathology labels with greater than 10% prevalence). The curve is the kernel density estimate (KDE).
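The captions for Extended Data Figs. 1 and 2 refer to per-radiologist estimates obtained with the empirical Bayes method. As a rough illustration of the general idea only, the sketch below shrinks noisy per-radiologist estimates toward the grand mean under a simple normal-normal model with a method-of-moments variance estimate. The model choice, the function name `empirical_bayes_shrink` and the toy numbers are assumptions made for this example; they are not taken from the authors' released code, and the study's actual specification is described in Methods.

```python
from statistics import mean, variance


def empirical_bayes_shrink(estimates, std_errors):
    """Shrink noisy per-unit estimates toward their grand mean.

    Assumes a normal-normal model (an illustrative assumption):
    each observed estimate = true effect + noise with known standard error.
    """
    grand_mean = mean(estimates)
    # Method of moments: observed variance ~ signal variance + mean noise variance.
    noise_var = mean([se ** 2 for se in std_errors])
    signal_var = max(variance(estimates) - noise_var, 0.0)

    shrunk = []
    for est, se in zip(estimates, std_errors):
        # Weight on the unit's own estimate; 0 means full shrinkage to the mean.
        weight = signal_var / (signal_var + se ** 2) if signal_var > 0 else 0.0
        shrunk.append(grand_mean + weight * (est - grand_mean))
    return shrunk


# Toy data (hypothetical): three estimated treatment effects with equal noise.
toy_estimates = [1.0, 3.0, 5.0]
toy_std_errors = [1.0, 1.0, 1.0]
toy_shrunk = empirical_bayes_shrink(toy_estimates, toy_std_errors)
```

In this toy case the extreme estimates move partway toward the mean of 3.0, and noisier inputs would move further. This is why the distribution of shrunken estimates is narrower than the distribution of raw per-radiologist estimates.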
Extended Data Fig. 3 Individual heterogeneity in treatment effects on sensitivity, sensitivities, treatment effects on specificity, and specificities.
a–d, Individual heterogeneity in (a) improvement in sensitivity, (b) sensitivity, (c) improvement in specificity, and (d) specificity of 140 radiologists as determined using the empirical Bayes method on all pathologies aggregated. The curve is the kernel density estimate (KDE).
Extended Data Fig. 4 Conventional radiologist characteristics as indicators for treatment effect on individual pathologies.
a, Difference in treatment effects of subgroups of radiologists on high-prevalence pathology labels (pathology labels with greater than 10% prevalence). The difference is computed between lower and higher improvement subgroups. The error bars show 95% confidence intervals. The differences between subgroups are statistically significant on high-prevalence pathology labels, with B-H adjusted P < 0.001 in each case (abnormal B-H adjusted P = 1.66e-29, airspace opacity B-H adjusted P = 7.20e-29, atelectasis B-H adjusted P = 1.10e-30, cardiomediastinal abnormality B-H adjusted P = 4.85e-30, support device hardware B-H adjusted P = 1.57e-30). A two-sided, unpaired t-test between the two subgroups of treatment effects was conducted. The difference is -4.194 (95% CI: -4.753 to -3.636) for abnormal, -1.465 (95% CI: -1.664 to -1.266) for airspace opacity, -1.766 (95% CI: -1.991 to -1.541) for atelectasis, -1.571 (95% CI: -1.777 to -1.365) for cardiomediastinal abnormality, and -3.150 (95% CI: -3.552 to -2.748) for support device hardware. The analysis uses the 136 radiologists with available survey data. b, Difference in treatment effects of subgroups of radiologists based on combined characteristics of years of experience, subspecialty in thoracic radiology and experience with AI tools on held-out test sets of radiologists. The difference is computed between lower and higher predicted improvement subgroups. The error bars show 95% confidence intervals. n.s. indicates no statistical significance (B-H adjusted P > 0.05). The Wald test was used to test regression coefficients that estimate treatment effects against the null hypothesis of joint equality among treatment effects of different subgroups. Details of the statistical models are available in Methods. There are 136 radiologists with available survey data on the three characteristics.
c-e, Difference in treatment effects of subgroups of radiologists based on (c) years of experience, (d) subspecialty in thoracic radiology, and (e) experience with AI tools on 15 individual pathologies. The difference is computed between (c) subgroups of fewer versus more years of experience, (d) subgroups without versus with subspecialty in thoracic radiology, and (e) subgroups without versus with experience using AI tools. The error bars show 95% confidence intervals. n.s. indicates no statistical significance (B-H adjusted P > 0.05). The same statistical test as in b was used. There are 136 radiologists with available survey data.
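The "B-H adjusted P" values in this caption refer to the Benjamini-Hochberg correction for multiple comparisons. The sketch below shows the standard step-up adjustment on a handful of made-up P values. The function name `benjamini_hochberg` and the input numbers are hypothetical and chosen only for illustration; this is not the authors' analysis code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down to the smallest. Each raw value is
    # scaled by m / rank, and the running minimum keeps the adjusted values
    # monotone in the raw ordering and capped at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values from four separate comparisons.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)

# A comparison would be marked "n.s." when its adjusted P exceeds 0.05.
not_significant = [p > 0.05 for p in adjusted_p]
```

With these toy inputs, only the smallest raw P value stays at or below 0.05 after adjustment, which shows how a comparison that looks significant in isolation can be labeled n.s. once several tests are considered together.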
Extended Data Fig. 5 Conventional radiologist characteristics as indicators for treatment effect on AUROC on individual pathologies.
a, Difference in treatment effects on AUROC of subgroups of radiologists based on combined characteristics of years of experience, subspecialty in thoracic radiology and experience with AI tools on held-out test sets of radiologists. The difference is computed between lower and higher predicted improvement subgroups. The error bars show 95% confidence intervals. n.s. indicates no statistical significance (B-H adjusted P > 0.05). The difference is 0.034 (95% CI: -0.017 to 0.842) for abnormal and -0.023 (95% CI: -0.082 to 0.035) for airspace opacity. The Wald test was used to test regression coefficients that estimate treatment effects against the null hypothesis of joint equality among treatment effects of different subgroups. Details of the statistical models are available in Methods. The analysis uses the 136 radiologists with available survey data. b-d, Difference in treatment effects of subgroups of radiologists based on (b) years of experience, (c) subspecialty in thoracic radiology, and (d) experience with AI tools on 2 individual pathologies on which the AUROC analysis could be computed. The difference is computed between (b) subgroups of fewer versus more years of experience, (c) subgroups without versus with subspecialty in thoracic radiology, and (d) subgroups without versus with experience using AI tools. The error bars show 95% confidence intervals. n.s. indicates no statistical significance (B-H adjusted P > 0.05). The same statistical test as in a was used. There are 136 radiologists with available survey data.
Supplementary information
Supplementary Information
Supplementary Tables 1–40, Supplementary Figs. 1 and 2, Supplementary Notes ‘Statistical modeling for AUROC analysis’ | ‘Participant recruitment and affiliation’ (contains Supplementary Tables A.1–4) | ‘Experiment interface and instructions’ (contains Supplementary Figs. B.1–7) and Supplementary References
Supplementary Video
Experiment instructions video presented to participating radiologists
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yu, F., Moehring, A., Banerjee, O. et al. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat Med 30, 837–849 (2024). https://doi.org/10.1038/s41591-024-02850-w
DOI: https://doi.org/10.1038/s41591-024-02850-w
- Springer Nature America, Inc.