Main

The integration of artificial intelligence (AI) into medical image interpretation has shown great potential for improving diagnostic accuracy and efficiency32 or generate nuanced radiology reports33,34,35,36,37, rather than probabilities alone, may allow radiologists to potentially extract value from inaccurate AI predictions. In addition, we emphasize that these findings between AI accuracy and treatment effect are the result of many factors simultaneously at play, including the ground truth probability, the radiologist’s predicted probability and how radiologists interpret and use AI assistance, which can all be correlated with AI’s predicted probability. Therefore, these findings should not be extrapolated for defining the cognitive mechanism in which AI assistance helps or hurts radiologists. Further research with explicit control of the potential factors is necessary to understand that underlying mechanism29.

Our study has several limitations that should be acknowledged. First, the randomization of treatment conditions in the experiment, although necessary to eliminate confounding factors, prevented the analysis of temporal trends in radiologists’ response to AI assistance. We were unable to assess whether radiologists improved in incorporating AI predictions over time as they encountered more patient cases. Future research should aim to investigate these evolving dynamics between radiologists and AI. Second, the AI assistance available to radiologists contained only predicted probabilities and did not include additional explanations, such as localization of pathologies, which could help radiologists more accurately interpret and, therefore, make better use of the available AI predictions. Designers of AI systems should investigate the optimal types of explanations to present and the mode of presentation while staying cautious of the increased cognitive burden that this additional information can bring. Another limitation is the lack of exploration into the impact of task granularity. The AI model generated predictions for 15 individual pathologies, some of which were interconnected and represented different levels of detail. For instance, airspace opacity encompasses pathologies such as atelectasis, edema and consolidation. Understanding the relationships between higher-level and lower-level pathologies would be valuable in future studies. Furthermore, due to the simultaneous presentation of all 15 AI predictions, it was challenging to isolate the effect of AI assistance on individual pathologies. The influence of AI predictions on one pathology could potentially affect the radiologists’ response to AI predictions on other pathologies, especially when they are interrelated. 
Additionally, because we provided actual AI predictions on patient cases to radiologists, it was also difficult to eliminate the confounding factor of the patient case when studying the relationship between the accuracy of AI predictions and the radiologist’s treatment effect. Future work may control for the influence of the patient case by providing artificially set predictions to radiologists.

In conclusion, our study underscores the need for individualized approaches that are aware of clinician heterogeneity, high-quality AI models and comprehensive assessments of multiple factors to optimize the implementation of AI assistance in clinical medicine. Collaboration between clinicians and AI developers, focusing on personalized strategies and continuous improvement of AI models, will be essential for achieving the full potential of clinician–AI collaboration in healthcare.

Methods

This research complied with all relevant ethical regulations. The study that produced the AI assistance dataset29 used in this study was determined by the Massachusetts Institute of Technology (MIT) Committee on the Use of Humans as Experimental Subjects to be exempt through exempt determination E-2953.

Dataset specification

This study used 324 retrospective patient cases from Stanford University’s healthcare system containing chest X-rays and clinical histories, which include patients’ indication, vitals and labs. We analyzed data collected from a total of 140 radiologists participating in two experiment designs. The non-repeated-measure design included 107 radiologists (Supplementary Fig. 1). Each radiologist read 60 patient cases across four subsequences of 15 cases each. Each subsequence corresponded to one of four treatment conditions: with AI assistance and with clinical histories; with AI assistance and without clinical histories; without AI assistance and with clinical histories; and without AI assistance and without clinical histories. The four subsequences and associated treatment conditions were organized in a random order. The 60 patient cases were randomly selected and randomly assigned to one of the treatment conditions. This design included across-subject and within-subject variations in the treatment conditions; it did not allow within-case-subject comparisons because each radiologist encountered a given case only once38. Order effects were mitigated by the randomization of treatment conditions. The repeated-measure design included 33 radiologists (Supplementary Fig. 2). Each radiologist read the same 60 randomly selected patient cases under each of the four treatment conditions, producing a total of 240 diagnoses. The radiologist completed the experiment in four sessions. In each session, 15 cases were read in each treatment arm in batches of five cases. Treatments were randomly ordered. As a result, the radiologist read each patient case under a different treatment condition in each of the four sessions. 
There was a 2-week washout period15,39,40 between every session to minimize order effects of radiologists reading the same case multiple times. This design included across-subject and within-subject variations as well as across-case-radiologist and within-case-radiologist variations in treatment conditions. Order effects were mitigated by the randomization of treatment conditions. No enrichment was applied to the data collection process. We combined data from both experiment designs from the clinical history conditions. Further details about the data collection process are available in a separate study29, which focuses on establishing a Bayesian framework for defining optimal human–AI collaboration and characterizing actual radiologist behavior in incorporating AI assistance. The study was determined exempt by the MIT Committee on the Use of Humans as Experimental Subjects through exempt determination E-2953.

There are 15 pathologies with corresponding AI predictions: abnormal, airspace opacity, atelectasis, bacterial/lobar pneumonia, cardiomediastinal abnormality, cardiomegaly, consolidation, edema, lesion, pleural effusion, pleural other, pneumothorax, rib fracture, shoulder fracture and support device hardware. These pathologies, the interrelations among these pathologies and additional pathologies without AI predictions can be visualized in a hierarchical structure in Supplementary Fig. B.1. Radiologists were asked to familiarize themselves with the hierarchy before starting, had access to the figure throughout the experiment and had to provide predictions for pathologies following this hierarchy. This aimed to maximize clarity on the specific pathologies referenced in the experiment. When radiologists received AI assistance, they were simultaneously presented with the AI predictions for these 15 pathologies along with the patient’s chest X-ray and, if applicable, their clinical history. The AI predictions were presented in the form of prediction probabilities on a 0–100 scale. The AI predictions were generated by the CheXpert model8, which is a DenseNet121 (ref. 41)-based model for chest X-rays that has been shown to perform similarly to board-certified radiologists. The model generated a single prediction for fracture that was used as the AI prediction for both rib fracture and shoulder fracture. Authors of the CheXpert model8 decided on the 14 pathologies (with a single prediction for fracture) based on the prevalence of observations in radiology reports in the CheXpert dataset and clinical relevance, conforming to the Fleischner Society’s recommended glossary42 whenever applicable. Among the pathologies, they included ‘Pneumonia’ (corresponding to ‘bacterial/lobar pneumonia’) to indicate the diagnosis of primary infection and ‘No Finding’ (corresponding to ‘abnormal’) to indicate the absence of all pathologies. 
These pathologies were set in the creation of the CheXpert labeler8, which has been applied to generate labels for reports in the CheXpert dataset and MIMIC-CXR43, which are among the largest chest X-ray datasets publicly available.

The ground truth probabilities for a patient case were determined by averaging the continuous predicted probabilities of five board-certified radiologists from Mount Sinai Hospital with at least 10 years of experience and chest radiology as a subspecialty on a 0–100 scale. For instance, if the predicted probabilities of the five board-certified radiologists are 91, 92, 92, 100 and 100, respectively, the ground truth probability is 95. The prevalence of the pathologies based on a ground truth probability threshold of 50 of a pathology being present is shown in Supplementary Table 1.

The participating radiologists represent a diverse set of institutions recruited through two means. Their primary affiliations include large, medium and small clinical settings and non-clinical settings. Additionally, some radiologists are affiliated with an academic hospital, whereas others are not. Radiologists in the non-repeated-measure design were recruited from teleradiology companies. Radiologists in the repeated-measure design were recruited from the Vinmec health system in Vietnam. Details about the participating radiologists and recruitment process can be found in Supplementary Note | Participant recruitment and affiliation.

The experiment interface and instructions presented to participating radiologists can be found in Supplementary Note | Experiment interface and instructions. Before entering the experiment, radiologists were instructed to walk through the experiment instructions, the hierarchy of pathological findings, basic information and performance of the AI model, video demonstration of the experiment interface and examples, consent clauses, comprehension check questions, information on bonus payment that incentivizes effort and practice patient cases covering four treatment conditions and showing example AI predictions from the AI model used in the experiment.

Sex and gender statistics of the participating radiologists and patient cases are available in Supplementary Tables 39 and 40, respectively. Sex and gender were not considered in the original data collection procedures. Disaggregated information about sex and gender at the individual level was collected in the separate study and will be made available29.

Empirical Bayes for individual heterogeneity

We used the empirical Bayes method30 to shrink the raw mean heterogeneous treatment effects and performance metrics of individual radiologists measured on the dataset toward the grand mean to ameliorate overestimating heterogeneity due to sampling error. The values include AI’s treatment effects on error, sensitivity and specificity and performance metrics on unassisted error, sensitivity and specificity.

Assume that \({t}_{r}\) is radiologist r’s true mean treatment effect from AI assistance or any metric of interest. We observe

$$\tilde{t}_{r}={t}_{r}+{{{\eta }}}_{r}$$
(1)

which differs from \({t}_{r}\) by \({{{\eta }}}_{r}\). We use a normal distribution as the prior distribution over the metric of interest. The mean of the prior distribution can be computed as

$$E\left[\tilde{t}_{r}\right]=E\left[{t}_{r}\right],$$
(2)

the mean of the observed mean metric of interest of radiologists. The variance of the prior distribution can be computed as

$$E\left[{\Big({t}_{r}-E\left[{t}_{r}\right]\Big)}^{2}\right]=E\left[{\left(\tilde{t}_{r}-E\left[\tilde{t}_{r}\right]\right)}^{2}\right]-E\left[{{{\eta }}}_{r}^{2}\right],$$
(3)

the variance of the observed mean metric of interest of radiologists minus the estimated \(E\left[{{{\eta }}}_{r}^{2}\right]\). We can estimate \(E\left[{{{\eta }}}_{r}^{2}\right]\) with

$$E\left[{{{\eta }}}_{r}^{2}\right]=E\left[{\left(\frac{1}{{N}_{r}}\mathop{\sum }\limits_{i}{t}_{{ir}}-E\left[{t}_{{ir}}\right]\right)}^{2}\right]=E\left[\frac{{\sum }_{i}{\left({t}_{{ir}}-E\left[{t}_{{ir}}\right]\right)}^{2}}{{N}_{r}}\right]=E\left[s.e.{\left(\tilde{t}_{r}\right)}^{2}\right].$$
(4)

Denote the estimated mean and variance of the prior distribution as \({{\rm{\mu }}}_{0}\) and \({{\rm{\sigma }}}_{0}^{2}\). We can compute the mean of the posterior distribution for radiologist \(r\) as

$$\frac{{{\rm{\sigma }}}_{r}^{2}{{\rm{\mu }}}_{0}+{{\rm{\sigma }}}_{0}^{2}{{\rm{\mu }}}_{r}}{{{\rm{\sigma }}}_{0}^{2}+{{\rm{\sigma }}}_{r}^{2}}$$
(5)

where \({{\rm{\mu }}}_{r}=\widetilde{{t}}_{r}\) and \({{\rm{\sigma }}}_{r}=s.e.\left(\widetilde{{t}}_{r}\right)\); we can compute the variance of the posterior as

$$\frac{{{\rm{\sigma }}}_{0}^{2}{{\rm{\sigma }}}_{r}^{2}}{{{\rm{\sigma }}}_{0}^{2}+{{\rm{\sigma }}}_{r}^{2}}$$
(6)

where \({{\rm{\sigma }}}_{r}=s.e.\left(\widetilde{{t}}_{r}\right)\). The updated mean of the posterior distribution is the radiologist’s metric of interest after shrinkage.
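The shrinkage in equations (2) to (6) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the analysis code used in the study: the function name `empirical_bayes_shrink`, the use of the population variance (`ddof=0`) for the observed spread, and the floor of the prior variance at zero are assumptions made here, and every standard error is assumed to be strictly positive.

```python
import numpy as np

def empirical_bayes_shrink(t_obs, se_obs):
    """Shrink per-radiologist estimates toward the grand mean.

    t_obs:  observed mean metric per radiologist (t-tilde_r)
    se_obs: standard error of each observed mean, assumed > 0
    """
    t_obs = np.asarray(t_obs, dtype=float)
    se_obs = np.asarray(se_obs, dtype=float)

    mu0 = t_obs.mean()                  # prior mean, equation (2)
    noise_var = np.mean(se_obs ** 2)    # estimate of E[eta_r^2], equation (4)
    # Prior variance, equation (3). The floor at zero is an assumption of
    # this sketch; it guards against a negative variance estimate.
    sigma0_sq = max(t_obs.var() - noise_var, 0.0)

    sigma_r_sq = se_obs ** 2
    denom = sigma0_sq + sigma_r_sq
    post_mean = (sigma_r_sq * mu0 + sigma0_sq * t_obs) / denom  # equation (5)
    post_var = sigma0_sq * sigma_r_sq / denom                   # equation (6)
    return post_mean, post_var

# Hypothetical treatment effects for three radiologists, each with s.e. = 1
post_mean, post_var = empirical_bayes_shrink([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(post_mean)  # each estimate is pulled toward the grand mean of 2
print(post_var)
```

In this toy input the observed variance is 8/3 and the noise variance is 1, so each estimate keeps a fraction 0.625 of its distance from the grand mean.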

For the analysis on treatment effects on absolute error, we focus on high-prevalence pathologies with prevalence greater than 10%, because radiologists’ baseline performance without AI assistance is generally highly accurate on low-prevalence pathologies, where they correctly predict that a pathology is not present, and, as a result, there is little variation in radiologists’ errors. This is especially true when computing each individual radiologist’s treatment effect. When there is zero variance in the performance of a radiologist under a treatment condition, the associated standard error estimate is zero, making it impossible to perform inference on this radiologist’s treatment effect.

Combined characteristics model for splitting radiologists into subgroups

The combined characteristics model was fitted on a training set of half of the radiologists (n = 68) to predict treatment effects of the test set of the remaining half (n = 68). The treatment effect predictions on the test set were used as the combined characteristics score for splitting the test set radiologists into binary subgroups (based on whether a particular radiologist’s combined characteristics score was smaller than or equal to the median treatment effect of radiologists computed from all available reads). Then, the same procedure was repeated after flipping the training set and test set radiologists to split the other set of radiologists into binary subgroups. The experience-based characteristics of radiologists in the randomly split training set and test set were balanced: one set contained 27 radiologists with less than or equal to 6 years of experience and 41 radiologists with more than 6 years of experience, and the other set contained 41 and 27, respectively. One set contained 47 radiologists who did not specialize in thoracic radiology and 21 radiologists who did, and the other set contained 54 and 14 radiologists, respectively. One set contained 32 radiologists without experience with AI tools and 36 radiologists with experience, and the other set contained 31 and 37, respectively.

Treatment effect models

To compute a radiologist’s observed mean treatment effect and the corresponding standard errors and the overall treatment effect of AI assistance across subgroups, we built a linear regression model with the following formulation using the statsmodels library: error ∼ 1 + C(treatment). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the treatment effect of AI assistance for both non-repeated-measure and repeated-measure data.

Subgroup-specific treatment effect models

For the analyses on experience-based radiologist characteristics and AI error, we computed the treatment effects of subgroups split based on the predictor of interest by building a linear regression model with the following formulation using the statsmodels library: error ∼ 1 + C(subgroup) + C(treatment):C(subgroup). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; subgroup refers to an indicator of the subgroup that the radiologist is split into; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the subgroup-specific treatment effect of AI assistance for both non-repeated-measure data and repeated-measure data.
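As an illustration of the subgroup-specific formulation, the sketch below fits the stated formula with the statsmodels formula interface on synthetic data. The column values, the subgroup labels and the simulated errors are hypothetical. In this saturated model, each interaction coefficient equals the difference in mean error between assisted and unassisted reads within one subgroup, which the last lines cross-check.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic reads (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "subgroup": rng.choice(["low_experience", "high_experience"], size=n),
    "treatment": rng.integers(0, 2, size=n),  # 1 = read made with AI assistance
})
df["error"] = rng.uniform(0, 30, size=n)      # absolute error on a 0-100 scale

fit = smf.ols(
    "error ~ 1 + C(subgroup) + C(treatment):C(subgroup)", data=df
).fit()

# The interaction terms carry one treatment effect per subgroup
interaction_names = [name for name in fit.params.index if ":" in name]
subgroup_effects = fit.params[interaction_names]
print(subgroup_effects)

# Cross-check against within-subgroup differences in mean error
cell_means = df.groupby(["subgroup", "treatment"])["error"].mean().unstack("treatment")
diff_in_means = cell_means[1] - cell_means[0]
print(diff_in_means)
```

The standard errors reported by this plain fit are not the two-way clustered ones used for inference in the study.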

Cluster-robust standard errors

To account for correlations of observations within patient cases and radiologists, we computed cluster-robust standard errors that are two-way clustered at the patient case and radiologist level for all inferences unless otherwise specified44,45. With the statsmodels library’s ordinary least squares (OLS) class, we used a clustered covariance estimator as the type of robust sandwich estimator and defined two-way groups based on identifiers of the patient cases and radiologists. The approach assumes that regression model errors are independent across clusters defined by the patient cases and radiologists and adjusts for correlations within clusters.
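One way to request two-way clustered standard errors from statsmodels is sketched below on synthetic data, using the treatment effect formulation error ∼ 1 + C(treatment). The exact arguments used in the study are not given in the text, so the call shown here (`cov_type="cluster"` with a two-column integer array of group codes) is an assumption about a reasonable setup. The identifiers `case_id` and `radiologist_id` and the simulated effects are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic crossed design: every radiologist reads every case (hypothetical)
rng = np.random.default_rng(1)
n_radiologists, n_cases = 30, 40
df = pd.DataFrame(
    [(r, c) for r in range(n_radiologists) for c in range(n_cases)],
    columns=["radiologist_id", "case_id"],
)
df["treatment"] = rng.integers(0, 2, size=len(df))  # 1 = with AI assistance

# Shared radiologist and case effects induce within-cluster correlation
rad_effect = rng.normal(0, 3, size=n_radiologists)
case_effect = rng.normal(0, 5, size=n_cases)
df["error"] = (
    20
    - 2 * df["treatment"]
    + rad_effect[df["radiologist_id"].to_numpy()]
    + case_effect[df["case_id"].to_numpy()]
    + rng.normal(0, 4, size=len(df))
)

# Two integer-coded cluster columns: patient case and radiologist
groups = np.column_stack([
    pd.factorize(df["case_id"])[0],
    pd.factorize(df["radiologist_id"])[0],
])

model = smf.ols("error ~ 1 + C(treatment)", data=df)
fit_plain = model.fit()
fit_clustered = model.fit(cov_type="cluster", cov_kwds={"groups": groups})

print(fit_plain.bse)      # conventional standard errors
print(fit_clustered.bse)  # two-way cluster-robust standard errors
```

Clustering changes only the covariance estimate; the coefficient estimates are identical in both fits.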

Reversion to the mean

The reversion to the mean effect and the mechanism of split sampling in avoiding reversion to the mean are explained in the following derivation:

Suppose that \({u}_{i,r}^{* }\) and \({a}_{i,r}^{* }\) are the true unassisted and assisted diagnostic error of radiologist \(r\) on patient case i. Suppose that we measure \({u}_{i,r}={u}_{i,r}^{* }+{e}_{i,r}^{u}\) and \({a}_{i,r}={a}_{i,r}^{* }+{e}_{i,r}^{a}\) where \({e}_{i,r}^{u}\) and \({e}_{i,r}^{a}\) are measurement errors. Assume that the measurement errors are independent of \({u}_{i,r}^{* }\) and \({a}_{i,r}^{* }\).

To study the relationship between unassisted error and treatment effect, we intend to build the following linear regression model:

$${u}_{r}^{* }-{a}_{r}^{* }={{\beta }}{u}_{r}^{* }+{e}_{r}^{* }$$
(7)

where the error is independent of the independent variable, and \({u}_{r}^{* }\) and \({a}_{r}^{* }\) are the mean unassisted and assisted performance of radiologist \(r\). Here, the moment condition

$$E\left[{e}_{i,r}^{* }\times {u}_{i,r}^{* }\right]=0$$
(8)

is as desired. This univariate regression estimates the true value of \({{\beta }}\), which is defined as

$$\frac{{\rm{Cov}}({{\rm{u}}}_{{\rm{r}}}^{\ast }-{{\rm{a}}}_{{\rm{r}}}^{\ast },\,{{\rm{u}}}_{{\rm{r}}}^{\ast })}{{\rm{Var}}({{\rm{u}}}_{{\rm{r}}}^{\ast })}$$
(9)

However, because we have access only to noisy measurements \({u}_{r}\) and \({a}_{r}\), consider instead an approach that builds the model

$${u}_{r}-{a}_{r}={{\beta }}{u}_{r}+{e}_{r}$$
(10)

and assumes the moment condition

$$E\left[{e}_{r}\times {u}_{r}\right]=0.$$
(11)

This linear regression model using noisy measurements instead generates the following estimate of \({{\beta }}\):

$$\frac{{Cov}\left({u}_{r}-{a}_{r},{u}_{r}\right)}{{Var}\left({u}_{r}\right)}=\frac{{Cov}\left({u}_{r}^{* }-{a}_{r}^{* },{u}_{r}^{* }\right)+{Var}\left({e}_{r}^{u}\right)}{{Var}\left({u}_{r}^{* }\right)+{Var}\left({e}_{r}^{u}\right)}$$
(12)

which is incorrect because of the additional \({Var}\left({e}_{r}^{u}\right)\) terms in the numerator and the denominator. The additional term in the denominator represents attenuation bias, which we address in detail in a later subsection. The term in the numerator represents the reversion to the mean issue, which we now discuss in further detail.

As the equation shows, the bias caused by reversion to the mean is positive. This term exists because the moment condition \(E\left[{e}_{r}\times {u}_{r}\right]=0\), equation (11), is not valid at the true value of \({{\beta }}\) as shown in the following derivation:

$$\begin{array}{c}E\left[\left({u}_{r}-{a}_{r}-{{\beta }}{u}_{r}\right)\times {u}_{r}\right]=E\left[\left(\left(1-{{\beta }}\right){u}_{r}-{a}_{r}\right)\times {u}_{r}\right]\\ \begin{array}{c}=E\left[\left(\left(1-{{\beta }}\right)\left({u}_{r}^{* }+{e}_{r}^{u}\right)-\left({a}_{r}^{* }+{e}_{r}^{a}\right)\right)\times {u}_{r}\right]\\ \begin{array}{c}=E\left[\left(\left(\left(1-{{\beta }}\right){u}_{r}^{* }-{a}_{r}^{* }\right)+\left(1-{{\beta }}\right){e}_{r}^{u}-{e}_{r}^{a}\right)\times {u}_{r}\right]\\ \begin{array}{c}=E\left[\left({e}_{r}^{* }+\left(1-{{\beta }}\right){e}_{r}^{u}-{e}_{r}^{a}\right)\times {u}_{r}\right]\\ \begin{array}{c}=\left(1-{{\beta }}\right)E\left[{e}_{r}^{u}\times {u}_{r}\right]\\ =\left(1-{{\beta }}\right){Var}\left({e}_{r}^{u}\right)\ne 0.\end{array}\end{array}\end{array}\end{array}\end{array}$$

Split sampling solves this bias by using separate patient cases for computing unassisted error and treatment effect. A simple construction of split sampling is to use a separate case i for computing the treatment effect and using the remaining cases to compute unassisted error. With this construction, we obtain the following estimate of \({{\beta }}\):

$$\frac{{Cov}\left({u}_{i,r}-{a}_{i,r},{u}_{\ne i,r}\right)}{{Var}\left({u}_{\ne i,r}\right)}$$
(13)

where \({u}_{i,r}\) is the unassisted performance on case i for radiologist \(r\), and \({u}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i. If the errors on each case used to compute \({u}_{r}^{* }\) and \({a}_{r}^{* }\) are independent, the estimate of \({{\beta }}\) is equal to

$$\frac{{Cov}\left({u}_{r}^{* }-{a}_{r}^{* },{u}_{r}^{* }\right)}{{Var}\left({u}_{\ne i,r}\right)}$$
(14)

The remaining discrepancy in the denominator again represents attenuation bias and is addressed in a later subsection.
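A small simulation can make the two estimates concrete. The sketch below is an illustration under assumed values, not a reproduction of the study: it sets the true slope to 0.5, draws centered true errors with unit variance and gives each radiologist two unassisted and two assisted reads with independent unit-variance noise. Under these assumptions, equation (12) implies a naive estimate near (0.5 + 0.5)/(1 + 0.5) = 2/3, and equation (14) implies a split-sample estimate near 0.5/2 = 0.25, which is attenuated but free of the reversion-to-the-mean term.

```python
import numpy as np

rng = np.random.default_rng(0)
n_radiologists, beta_true = 20000, 0.5

# True (centered) unassisted error, and assisted error with u* - a* = beta * u*
u_star = rng.normal(0.0, 1.0, size=n_radiologists)
a_star = (1.0 - beta_true) * u_star

# Two unassisted and two assisted reads per radiologist with independent noise
u_reads = u_star[:, None] + rng.normal(0.0, 1.0, size=(n_radiologists, 2))
a_reads = a_star[:, None] + rng.normal(0.0, 1.0, size=(n_radiologists, 2))

def slope(y, x):
    """Univariate OLS slope, Cov(y, x) / Var(x)."""
    return np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# Naive: the same unassisted reads enter both the outcome and the predictor
u_bar, a_bar = u_reads.mean(axis=1), a_reads.mean(axis=1)
beta_naive = slope(u_bar - a_bar, u_bar)

# Split sampling: read 0 gives the treatment effect, read 1 gives the predictor
beta_split = slope(u_reads[:, 0] - a_reads[:, 0], u_reads[:, 1])

# In this simulation the attenuation factor is known: Var(u*) / Var(u_1) = 1/2
beta_split_adjusted = beta_split / 0.5

print(beta_naive, beta_split, beta_split_adjusted)
```

Dividing the split-sample estimate by the known attenuation factor should bring it close to the true slope, which anticipates the adjustment for attenuation bias.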

Data efficient split sampling construction

To study unassisted error as a predictor of treatment effect, we built a linear regression model with the following formulation using the statsmodels library: treatment effect ∼ 1+unassisted error. We designed the following split sampling construction to maximize data efficiency when computing the independent and dependent variables in the linear regression.

Let i index a patient case and \(r\) index a radiologist. Assume that a radiologist reads \({N}_{u}\) cases unassisted and \({N}_{a}\) cases assisted. Recall that the unassisted and assisted cases are disjoint for the non-repeated-measure data; they overlap exactly for the repeated-measure data.

For the non-repeated-measure design, we adopt the following construction:

$${u}_{i,r}-{a}_{r}={{\beta }}{x}_{\ne i,r}+{{\rm{\varepsilon }}}_{{u}_{i,r}}+{{\rm{\varepsilon }}}_{{a}_{r}}$$
(15)

where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\) and \({a}_{r}=\frac{1}{{N}_{a}}{\sum }_{k}{a}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i; \({u}_{{i},{r}}\) is the unassisted performance on case i for radiologist \(r\); and \({a}_{r}\) is the mean assisted performance on all assisted cases for radiologist \(r\).

For the repeated-measure design, we adopt the following construction:

$${u}_{i,r}-{a}_{i,r}={{\beta }}{x}_{\ne i,r}+{{\rm{\varepsilon }}}_{{u}_{i,r}}+{{\rm{\varepsilon }}}_{{a}_{i,r}}$$
(16)

where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all cases other than i; \({u}_{i,r}\) is the unassisted performance on case i for radiologist \(r\); and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).

To study unassisted error as a predictor of assisted error, we built a linear regression model with the following formulation using the statsmodels library: assisted error ∼ 1+unassisted error. We designed the following split sampling construction that maximizes data efficiency when computing the independent and dependent variables in the linear regression.

For the non-repeated-measure design, we adopt the following construction:

$${a}_{i,r}={{\beta }}{x}_{r}+{{\rm{\varepsilon }}}_{i,r}$$
(17)

where \({x}_{r}=\frac{1}{{N}_{u}}{\sum }_{k}\,{u}_{k,r}\). Here, \({x}_{r}\) is the mean unassisted performance computed on all unassisted cases, and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).

For the repeated-measure design, we adopt the following construction:

$${a}_{i,r}={{\beta }}{x}_{\ne i,r}+{{\rm{\varepsilon }}}_{i,r}$$
(18)

where \({x}_{\ne i,r}=\frac{1}{{N}_{u}-1}{\sum }_{k\ne i}{u}_{k,r}\). Here, \({x}_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than i and \({a}_{i,r}\) is the assisted performance on case i for radiologist \(r\).

The constructions above again emphasize the necessity for split sampling. Without split sampling, the mean unassisted performance, which is the independent variable of the linear regression, will be correlated with the error terms due to overlapping patient cases, leading to a bias in the regression.
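The leave-one-out predictor \({x}_{\ne i,r}\) and the dependent variables of equations (15) and (16) can be computed for one radiologist as in the following sketch. The function name `leave_one_out_means` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def leave_one_out_means(u):
    """For each case i, return the mean of all entries of u except u[i]."""
    u = np.asarray(u, dtype=float)
    return (u.sum() - u) / (len(u) - 1)

# Hypothetical unassisted absolute errors of one radiologist on four cases
u_r = np.array([10.0, 20.0, 30.0, 40.0])
x_loo = leave_one_out_means(u_r)  # predictor x_{≠i,r}, one value per case

# Non-repeated-measure design, equation (15): the assisted cases are disjoint
# from the unassisted ones, so the outcome uses the mean assisted error a_r
a_other_cases = np.array([12.0, 18.0])
y_nonrepeated = u_r - a_other_cases.mean()

# Repeated-measure design, equation (16): the same cases are read with
# assistance, so the outcome is the per-case difference u_{i,r} - a_{i,r}
a_same_cases = np.array([8.0, 18.0, 33.0, 36.0])
y_repeated = u_r - a_same_cases

print(x_loo)
print(y_nonrepeated)
print(y_repeated)
```

Because case i never enters its own predictor, the noise in \({u}_{i,r}\) cannot appear on both sides of the regression.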

Adjustment for attenuation bias

We adjusted for attenuation bias for the split sampling linear regression formulations.

We want to estimate regressions of the form

$${Y}_{r}={{{\beta }}}_{0}+{{{\beta }}}_{1}E\left[{x}_{r}\right]+{{\rm{\varepsilon }}}_{r}$$
(19)

where \({Y}_{r}\) is an outcome for radiologist \(r\) and \(E\left[{x}_{r}\right]\) is radiologist \(r\)ʼs average unassisted performance. We observe

$$\widetilde{{x}}_{r}=\frac{1}{{N}_{r}}\mathop{\sum }\limits_{i}{x}_{{ir}}=E\left[{x}_{r}\right]+{{{\eta }}}_{r}$$
(20)

where \({{{\eta }}}_{r}=\frac{1}{{N}_{r}}\mathop{\sum }\limits_{i}{x}_{{ir}}-E\left[{x}_{r}\right]\) and \(E\left[{{{\eta }}}_{r}{x}_{r}\right]=0\) and \(E\left[{{{\eta }}}_{r}{{\rm{\varepsilon }}}_{r}\right]=0\), which are justified by independent and identically distributed (i.i.d.) sampling of cases and split sampling, respectively.

Using observations from the experiment, we estimate the following regression:

$${Y}_{r}={{\rm{\gamma }}}_{0}+{{\rm{\gamma }}}_{1}\tilde{x}_{r}+{{\rm{\varepsilon }}}_{r}$$
(21)

Recall that

$$\begin{array}{rcl}{{\hat{\rm{\gamma }}}_{1}}{\to }^{p}\frac{E\left[\left({x}_{r}+{{{\eta }}}_{r}-E\left[{x}_{r}\right]\right)\left({Y}_{r}-E\left[{Y}_{r}\right]\right)\right]}{E\left[{\left({x}_{r}+{{{\eta }}}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]} =\\ \frac{E\left[\left({x}_{r}-E\left[{x}_{r}\right]\right)\left({Y}_{r}-E\left[{Y}_{r}\right]\right)\right]}{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]+E\left[{{{\eta }}}_{r}^{2}\right]}={{{\beta }}}_{1}{\rm{\lambda }}\end{array}$$
(22)

where \({\rm{\lambda }}=\frac{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]}{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]+E\left[{{{\eta }}}_{r}^{2}\right]}\) and \({{{\beta }}}_{1}=\frac{E\left[\left({x}_{r}-E\left[{x}_{r}\right]\right)\left({Y}_{r}-E\left[{Y}_{r}\right]\right)\right]}{E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]}\). We can estimate \({\rm{\lambda }}\) using a plug-in estimator for each term in the data: (1)

$$\begin{array}{rcl}E\left[{{{\eta }}}_{r}^{2}\right]=E\left[{\left(\frac{1}{{N}_{r}}\mathop{\sum }\limits_{i}{x}_{{ir}}-E\left[{x}_{{ir}}\right]\right)}^{2}\right]\\=E\left[\frac{{\sum }_{i}{\left({x}_{{ir}}-E\left[{x}_{{ir}}\right]\right)}^{2}}{{N}_{r}}\right]=E\left[s.e.{\left(\tilde{x}_{r}\right)}^{2}\right].\end{array}$$
(23)

This is the standard error of the mean estimator. (2)

$$E\left[{\left({x}_{r}-E\left[{x}_{r}\right]\right)}^{2}\right]=E\left[{\left(\tilde{x}_{r}-E\left[\tilde{x}_{r}\right]\right)}^{2}\right]-E\left[{{{\eta }}}_{r}^{2}\right],$$
(24)

which can be estimated by taking the difference between the variance of the observed \(\widetilde{{x}}_{r}\)’s and the estimated \(E\left[{{{\eta }}}_{r}^{2}\right]\). The denominator of \({\rm{\lambda }}\) is effectively \(E\left[{\left(\tilde{x}_{r}-E\left[\tilde{x}_{r}\right]\right)}^{2}\right]\).

Finally, with \(\hat{{\rm{\lambda }}}\), we can estimate \({{{\beta }}}_{1}\) using the estimator

$${{{\hat{\beta }}}_{1}}={{\hat{\rm{\gamma }}}_{1}}/{\hat{\rm{\lambda }}}.$$
(25)

For inference, notice that \(\sqrt{n}\left({{\hat{\rm{\gamma }}}_{1}}-{{\rm{\gamma }}}_{1}\right){\to }^{d}N\left(0,{{\rm{\sigma }}}_{{\rm{\gamma }}}^{2}\right)\) and \(\hat{{\rm{\lambda }}}{\to }^{p}\,{\rm{\lambda }}\). By Slutsky’s theorem, we know that

$$\sqrt{n}\frac{\left({{\hat{\rm{\gamma }}}_{1}}-{{\rm{\gamma }}}_{1}\right)}{\hat{{\rm{\lambda }}}}{\to }^{d}N\left(0,\frac{{{\rm{\sigma }}}_{{\rm{\gamma }}}^{2}}{{{\rm{\lambda }}}^{2}}\right).$$
(26)

Therefore, we divide the standard errors of \({{\hat{\rm{\gamma }}}_{1}}\) by \(\hat{{\rm{\lambda }}}\) to obtain the standard errors of \({{{\hat{\beta }}}_{1}}\).

This concludes the adjustment for attenuation bias for the slope term.
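The plug-in adjustment of equations (23) to (25) can be sketched as follows. The function name `attenuation_adjust` and the inputs are hypothetical, and the sketch assumes that the population variance (`ddof=0`) is used for the observed spread and that the estimated \({\rm{\lambda }}\) is strictly positive.

```python
import numpy as np

def attenuation_adjust(gamma1_hat, gamma1_se, x_obs, x_se):
    """Rescale a slope and its standard error by the estimated lambda.

    gamma1_hat, gamma1_se: slope and standard error from the noisy regression
    x_obs: observed mean unassisted performance per radiologist (x-tilde_r)
    x_se:  standard error of each observed mean
    """
    x_obs = np.asarray(x_obs, dtype=float)
    x_se = np.asarray(x_se, dtype=float)

    noise_var = np.mean(x_se ** 2)      # E[eta_r^2], equation (23)
    total_var = np.var(x_obs)           # variance of observed means
    signal_var = total_var - noise_var  # equation (24)
    lam = signal_var / total_var        # attenuation factor lambda

    beta1_hat = gamma1_hat / lam        # equation (25)
    beta1_se = gamma1_se / lam          # standard error rescaled by lambda
    return beta1_hat, beta1_se, lam

# Hypothetical inputs: four radiologists, each observed mean with s.e. = 5
beta1_hat, beta1_se, lam = attenuation_adjust(
    gamma1_hat=0.4, gamma1_se=0.08,
    x_obs=[10.0, 20.0, 30.0, 40.0], x_se=[5.0, 5.0, 5.0, 5.0],
)
print(beta1_hat, beta1_se, lam)
```

Here the observed variance is 125 and the noise variance is 25, so the estimated attenuation factor is 0.8 and both the slope and its standard error are scaled up by 1/0.8.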

Statistical testing

To determine the amount of heterogeneity between subgroups of radiologists receiving lower versus higher treatment effects, we ran an unpaired t-test between the two subgroups of treatment effects computed using the empirical Bayes method. We used the Wald test to test regression coefficients against the null hypothesis of joint equality among treatment effects of different subgroups to determine if there is a statistically significant difference among subgroups split based on the predictor of interest. We also used the Wald test to test regression coefficients against the null hypothesis of zero to determine in a continuous analysis if the independent variable, namely unassisted error, is a predictor of the dependent variable, namely treatment effect or assisted error. We used the Benjamini–Hochberg procedure to correct for multiple hypothesis testing over 15 individual pathologies. For the analysis on treatment effect on AUROC between subgroups determined by AI error (Supplementary Table 34), we conducted an F-test to determine whether there is a statistically significant difference between treatment effects on AUROC in different bins. Specifically, we used the number of reads that fall into each bin as the group size. We used the grand mean AUROC and group AUROCs along with group sizes to compute the sum of squares between; we used the estimated standard error of each group AUROC along with the group size to compute the sum of squares within (error).
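For reference, the Benjamini–Hochberg step-up procedure can be written in plain Python as below. This is a generic sketch of the procedure, not the code used in the study; the function name `benjamini_hochberg`, the level `alpha=0.05` and the example P values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses sorted by ascending P value
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical P values for five pathologies
p_values = [0.01, 0.045, 0.025, 0.005, 0.5]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

With these inputs the ranked thresholds are 0.01, 0.02, 0.03, 0.04 and 0.05, so the three smallest P values are rejected and the remaining two are not.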

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.