Main

Deep learning has enabled the automation of complex medical image interpretation tasks, such as disease diagnosis, often matching or exceeding the performance of medical experts25,26,27. However, these approaches typically depend on explicit labels, produced either by manual annotation or by automatic labellers developed for a specific domain. In contrast, our method classifies pathologies without requiring the domain-specific development of an automatic labeller. The self-supervised method has the potential to alleviate the labelling bottleneck in the machine-learning pipeline for a range of medical-imaging tasks by leveraging easily accessible unstructured text data without domain-specific pre-processing efforts17. As a result, the self-supervised method opens promising avenues for approaches and applications in the medical-imaging domain, where narrative reports that describe imaging findings are common.

One notable finding is the ability of the self-supervised method to predict differential diagnoses and radiographic findings with high accuracy on a dataset collected in a country different from that of the training dataset19. Generalization to datasets from vastly different distributions has been one of the primary challenges for the deployment of medical artificial intelligence28,29. Despite the challenges of generalization described in previous works, the self-supervised method achieves an AUC of at least 0.900 on 6, and of at least 0.700 on 38, of the 57 radiographic findings with n > 50 in the PadChest test dataset (n = 39,053) (Fig. 3). We speculate that the self-supervised model generalizes better because of its ability to leverage unstructured text data, which contains more diverse radiographic information that could be applicable to other datasets. Additionally, we note that we might expect improved performance if we used alternative labels instead of the raw clinical findings in PadChest. Ultimately, the results demonstrate that the self-supervised method can generalize well on a different data distribution without having seen any explicitly labelled pathologies from PadChest during training30.

Biases may have affected the training of the self-supervised method. For example, if a pathology is never mentioned in the reports, then the method cannot be expected to predict that pathology with high accuracy during zero-shot evaluation. Furthermore, the model’s ability to predict a pathology may depend on the terminology used in the training reports. For instance, if several reports describe a condition such as atelectasis, but do not explicitly use the term, then the method may not perform well when queried with the phrase ‘has atelectasis’31. Thus, the method’s ability to predict pathologies is limited to scenarios mentioned in the text reports, and may perform less well when there are a variety of ways to describe the same pathology. To address these potential biases, we provide the model with hundreds of thousands of image–text pair samples (n = 377,110) during training, encompassing a wide variety of writing styles and descriptions of pathologies17. By validating the method on the CheXpert and PadChest datasets, which were collected at different hospitals from the one used in the training of the model, we show that site-specific biases are not inhibiting the method’s ability to predict clinically relevant pathologies with high accuracy.

This work has a few limitations. First, the self-supervised method still requires repeatedly querying performance on a labelled validation set for hyperparameter selection and to determine condition-specific probability thresholds when calculating MCC and F1 statistics. Second, the self-supervised method is currently limited to classifying image data; however, medical datasets often combine different imaging modalities, can incorporate non-imaging data from electronic health records or other sources, or can be a time series. For instance, magnetic resonance imaging and computed tomography produce three-dimensional data that have been used to train other machine-learning pipelines32,33,34. On the same note, it would be of interest to apply the method to other tasks in which medical data are paired with some form of unstructured text. For instance, the self-supervised method could leverage the availability of pathology reports that describe diagnoses such as cancer present in histopathology scans26,35,36. Lastly, future work should develop approaches to scale this method to larger image sizes to better classify smaller pathologies37,38,39,40,41,42,43,44,45.

In summary, we have designed a self-supervised method using contrastive learning that detects the presence of multiple pathologies in chest X-ray images. The self-supervised method builds on the use of image–text pairings of chest X-rays and radiology reports in ConVIRT, as well as on the multi-class zero-shot classification of natural images in Contrastive Language-Image Pre-training (CLIP) to enable the application of zero-shot approaches to medical-image interpretation. The self-supervised method matches radiologist-level performance on a chest X-ray classification task for multiple pathologies that the model was not explicitly trained to classify (Fig. 2 and Table 1). The results highlight the potential of deep-learning models to leverage large amounts of unlabelled data for a broad range of medical-image-interpretation tasks, and thereby may reduce the reliance on labelled datasets and decrease clinical-workflow inefficiencies resulting from large-scale labelling efforts.

Methods

Datasets

Training

The self-supervised method was trained on the MIMIC-CXR dataset, a publicly available dataset of chest radiographs with radiology text reports. The MIMIC-CXR dataset contains 377,110 images corresponding to 227,835 radiographic studies17. When a radiographic study contained more than one chest X-ray image, the image in the anteroposterior/posteroanterior view was chosen for training. Each radiographic study comes with a corresponding free-text radiology report, a summary written by radiologists of their findings. Each full radiology report consists of multiple sections: examination, indication, impression, findings, technique and comparison.

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest X-rays of 65,240 patients from Stanford Hospital8. The dataset is labelled for the presence of 14 different conditions: atelectasis, cardiomegaly, consolidation, oedema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, no finding, pleural effusion, pleural other, pneumonia, pneumothorax and support devices. These labels are obtained from the agreement of five board-certified radiologists. Additionally, the dataset consists of free-text radiology reports that are associated with each chest X-ray image. The CheXpert validation dataset is used to tune condition-specific probability thresholds that convert the self-supervised model’s probabilities for the five CheXpert competition conditions into predictions for a given chest X-ray image. We conduct this analysis by running inference with the self-supervised model to obtain probability values of each condition being present for all chest X-ray images. Condition-specific probability thresholds are then determined by choosing the probability values that result in the best MCC for each condition on the CheXpert validation dataset. The CheXpert validation dataset has no overlap with the CheXpert test dataset used for evaluation.

Evaluation

The self-supervised method was evaluated on two external datasets: the CheXpert test dataset and PadChest. The CheXpert test dataset is a collection of chest X-rays that is commonly used to evaluate the performance of models on chest X-ray interpretation tasks14,31. We evaluate the model on the entire CheXpert test dataset, consisting of 500 chest X-ray images labelled for the presence of 14 different conditions8. The CheXpert test dataset is used to calculate both the self-supervised model’s area under the receiver operating characteristic curve (AUROC) and its MCC for each of the five CheXpert competition conditions. Additionally, the test set contains predictions from three board-certified radiologists on full-resolution images, against which we compare the performance of the model.

The PadChest dataset is a public dataset that contains 160,868 chest X-ray images labelled with 174 different radiographic findings and 19 differential diagnoses19. Twenty-seven per cent of the labels were provided by board-certified radiologists, and the rest were obtained by using a recurrent neural network with attention trained on the radiology reports. For evaluation purposes, only the 39,053 examples annotated by board-certified radiologists were used. These examples were then used to calculate the self-supervised model’s AUROC for each of the different conditions described above.

Pre-processing

Each of the 377,110 chest X-rays in the MIMIC-CXR dataset was resized to 224 × 224 pixels and zero-padded before training. Each image was then normalized using the sample mean and standard deviation of the training dataset.
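For illustration, the following is a minimal pre-processing sketch using PIL and torchvision; it is not the authors’ released code. The greyscale conversion, the aspect-preserving resize before zero-padding, and the MEAN/STD values are assumptions (the true values are the sample statistics of the MIMIC-CXR training set).

```python
# Minimal pre-processing sketch (illustrative, not the authors' exact pipeline).
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as F

MEAN, STD = 0.5, 0.25  # placeholders for the MIMIC-CXR training-set statistics

def preprocess(path, size=224):
    img = Image.open(path).convert("L")                # greyscale; channels can be replicated to 3 for the ViT
    scale = size / max(img.size)                       # shrink the longer side to 224, preserving aspect ratio
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    w, h = img.size
    img = F.pad(img, (0, 0, size - w, size - h), fill=0)  # zero-pad to 224 x 224
    x = T.ToTensor()(img)                              # tensor in [0, 1], shape [1, 224, 224]
    return T.Normalize([MEAN], [STD])(x)               # normalize with dataset mean/std
```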

Text from the radiology reports was tokenized using the byte pair encoding (BPE) procedure with a vocabulary size of 49,408. For text that exceeded the maximum token sequence length of the given architecture, we truncated the token sequence to the first (context length − 2) tokens. The remaining two positions were reserved for the [SOS] and [EOS] tokens at the beginning and end of the sequence, respectively.
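The truncation scheme can be sketched as follows, assuming the open-source CLIP SimpleTokenizer (whose BPE vocabulary also has 49,408 entries); the tokenize_report helper and the zero-padding to a fixed length are illustrative, not the authors’ exact implementation.

```python
# Sketch of BPE tokenization with truncation to (context_length - 2) tokens.
import torch
from clip.simple_tokenizer import SimpleTokenizer

_tokenizer = SimpleTokenizer()
SOT = _tokenizer.encoder["<|startoftext|>"]  # [SOS]
EOT = _tokenizer.encoder["<|endoftext|>"]    # [EOS]

def tokenize_report(text, context_length=77):
    bpe = _tokenizer.encode(text)[: context_length - 2]  # keep the first (context_length - 2) tokens
    tokens = [SOT] + bpe + [EOT]                          # two slots reserved for [SOS]/[EOS]
    out = torch.zeros(context_length, dtype=torch.long)   # zero-pad to the fixed context length
    out[: len(tokens)] = torch.tensor(tokens)
    return out
```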

Architecture

The model architecture consists of a Vision Transformer, ViT-B/32, for the image encoder and a Transformer for the text encoder. We use a pre-trained Vision Transformer that accepts images of resolution 224 × 224. The text encoder Transformer has a base size of 63 million parameters, 12 layers and a width of 512 with 8 attention heads. The Transformer operates on a lower-cased byte pair encoding representation of the text and uses a maximum token sequence length of 77. We use the same initialization scheme used in CLIP15.
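For orientation, a minimal sketch of instantiating these encoders via OpenAI’s open-source clip package follows; the package choice and the placeholder inputs are assumptions, and the authors’ training code may differ.

```python
# Sketch of loading a ViT-B/32 image encoder and the paired text Transformer.
import clip
import torch

device = "cpu"  # or "cuda"
# ViT-B/32 image encoder (224 x 224 inputs) paired with the 12-layer,
# width-512, 8-head text Transformer with a 77-token context.
model, clip_preprocess = clip.load("ViT-B/32", device=device, jit=False)

image = torch.zeros(1, 3, 224, 224, device=device)                 # placeholder image batch
tokens = clip.tokenize(["no acute cardiopulmonary process"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)                     # [1, 512] image embedding
    text_features = model.encode_text(tokens)                      # [1, 512] text embedding
```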

Implementation of the method

Model pre-training

The self-supervised model consists of an image encoder and a text encoder that we jointly train on the MIMIC-CXR training dataset17. We utilize the impressions section of each text report, since it contains a concise summary of the entire report. We contrast this with a previous self-supervised method, ConVIRT, which selects a random sentence from the full-length radiology report for each image14. Although their proposed method can extract some signal, random text input selection introduces unnecessary stochasticity that could lead to inconsistencies in training. To address this, we consistently select the text from the impressions section.

Training

We initialized the self-supervised model using the ViT-B/32 and Transformer architectures with pre-trained weights from OpenAI’s CLIP model15. When training on the impressions section, we keep the maximum context length of 77 tokens as given in the CLIP architecture. We demonstrated that the pre-trained weights of the CLIP architecture, learned from natural images, can be leveraged to train a zero-shot model for a domain-specific medical task.

To prepare the data for training, all images from the MIMIC-CXR dataset are stored in a single HDF5 file. We performed a hyperparameter sweep over the batch size and the learning rate using the CheXpert validation dataset. We compute the validation mean AUC over the five CheXpert competition pathologies after every 1,000 training batches, and save a model checkpoint whenever the model outperforms the previous best model during training. The validation mean AUCs of these checkpoints are used to select models for ensembling. The best model uses stochastic gradient descent for optimization with a learning rate of 0.0001 and momentum of 0.9. The best model has a batch size of 64 and is trained for four epochs. We train the model by maximizing the cosine similarity between the image and text embeddings of all valid image–report pairs in the batch while minimizing the cosine similarity between the embeddings of incorrect pairings in the batch. The method’s training procedure closely follows the implementation of CLIP15.
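A minimal sketch of this CLIP-style contrastive objective is shown below; it is a simplified re-implementation rather than the authors’ exact code, and the fixed temperature value is a placeholder (CLIP learns the temperature during training).

```python
# Symmetric contrastive (CLIP-style) loss over a batch of image-report pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # [batch, batch] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)       # matching pairs lie on the diagonal
    # Cross-entropy in both directions pulls correct image-report pairs
    # together and pushes incorrect pairings in the batch apart.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```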

Softmax evaluation technique for multi-label classification

To evaluate the zero-shot performance of the model on the multi-label classification task, we used a positive–negative softmax evaluation procedure on each of the diseases. In contrast to CLIP, the proposed procedure allows us to normalize with respect to the negated version of the same disease classification instead of naively normalizing across the diseases to obtain probabilities from the logits15. The latter approach is less reasonable in this context since a single image may have multiple associated labels.

We define the procedure as follows. First, we compute logits with a positive prompt (for example, ‘atelectasis’) and a negative prompt (for example, ‘no atelectasis’). Then, we compute the softmax over the positive and negative logits. Lastly, we keep the softmax probability of the positive logit as the probability that the disease is present in the chest X-ray.
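The following sketch illustrates the positive–negative softmax evaluation with CLIP-style encoders; the function and variable names (zero_shot_probs, tokenize, LABELS) are illustrative, and tokenize stands for a tokenizer such as clip.tokenize.

```python
# Positive-negative softmax evaluation: one probability per condition.
import torch
import torch.nn.functional as F

LABELS = ["atelectasis", "cardiomegaly", "consolidation", "edema", "pleural effusion"]

@torch.no_grad()
def zero_shot_probs(model, tokenize, image):                       # image: [1, 3, 224, 224]
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    probs = {}
    for label in LABELS:
        # Positive prompt '<label>' and negative prompt 'no <label>'.
        tokens = tokenize([label, f"no {label}"]).to(image.device)
        txt_emb = F.normalize(model.encode_text(tokens), dim=-1)
        logits = img_emb @ txt_emb.t()                             # [1, 2]: positive vs negative logit
        # Softmax over the positive/negative pair; keep the positive probability.
        probs[label] = logits.softmax(dim=-1)[0, 0].item()
    return probs
```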

Ensembling

We ensemble the top-ten model checkpoints sorted by mean AUC over the five CheXpert pathologies on the validation dataset. The probability outputs of the ensemble are computed by taking the average of the probability outputs of each model. The probabilities are averaged after softmax evaluation. These probabilities are then used for model evaluation through AUC and for prediction tasks using condition thresholds generated from the validation dataset.
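As a sketch, ensembling reduces to averaging the per-condition probabilities of the selected checkpoints after softmax evaluation; this reuses the illustrative zero_shot_probs helper from the sketch above and is not the authors’ exact code.

```python
# Average the post-softmax probabilities of the top checkpoints per condition.
import numpy as np

def ensemble_probs(models, tokenize, image):
    # Each model yields a dict {condition: probability}; average per condition.
    all_probs = [zero_shot_probs(m, tokenize, image) for m in models]
    return {c: float(np.mean([p[c] for p in all_probs])) for c in all_probs[0]}
```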

Knowledge-distillation procedure

To allow for the use of the CLIP pre-trained model on full radiology reports to evaluate zero-shot performance on auxiliary tasks such as sex prediction, we use a knowledge-distillation procedure. This procedure is required as the pre-trained text encoder from the CLIP model has a context length of only 77 tokens, which is not long enough for an entire radiology report. We use the pre-trained model to train a model with a context length of 512, long enough to encompass 98% of radiology reports. In this method, the text encoder of the best-performing model trained only on impressions is used as a teacher for the text encoder of a student model. To train the student, we compute the mean squared error between the logits of the two encoders, then backpropagate across the student architecture. Once the student text encoder is trained, we replace the uninitialized image encoder in the student model with the image encoder of the teacher model. Then, the student model is contrastively trained on the MIMIC-CXR chest X-ray and full-text report pairs.
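A simplified sketch of one distillation step is given below. The encoder interfaces and the exact inputs passed to teacher and student (a 77-token truncated report for the teacher, the longer 512-token report for the student) are assumptions; only the use of a mean-squared-error loss between the two encoders’ outputs, backpropagated through the student, follows the description above.

```python
# One knowledge-distillation step for the long-context student text encoder.
import torch
import torch.nn.functional as F

def distillation_step(teacher_text_encoder, student_text_encoder, optimizer,
                      teacher_tokens, student_tokens):
    # teacher_tokens: report truncated to 77 tokens (assumption);
    # student_tokens: the same report tokenized up to 512 tokens (assumption).
    with torch.no_grad():
        target = teacher_text_encoder(teacher_tokens)   # teacher output, kept frozen
    pred = student_text_encoder(student_tokens)         # student output (context length 512)
    loss = F.mse_loss(pred, target)                     # mean squared error between the two encoders
    optimizer.zero_grad()
    loss.backward()                                     # backpropagate through the student only
    optimizer.step()
    return loss.item()
```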

Prompt-engineering methods

We run experiments using the labels present in the test set as prompts, creating ‘<label>’ and ‘no <label>’ as the positive and negative prompts, respectively, for the softmax evaluation procedure.

Statistical analysis

AUROC

We collect AUROC results from both the CheXpert test dataset (500 samples) and the PadChest dataset (39,053 samples) using the self-supervised model’s predictions. The AUROC and MCC results for the five clinically relevant pathologies on the CheXpert test dataset are presented in Table 1. Table 2 consists of the mean AUROC of these five pathologies on the CheXpert test dataset along with self-supervised and supervised comparisons. The supervised DAM method is included as a comparison and is currently the state of the art on the CheXpert dataset. An additional supervised baseline, a DenseNet121 trained on the CheXpert dataset, is included as a comparison since DenseNet121 is commonly used in self-supervised approaches. Current top-performing label-efficient approaches, ConVIRT, MedAug and MoCo-CXR, are included as self-supervised comparisons.

MCC and F1 score

To obtain the MCC, we first run inference on the CheXpert test set using our softmax evaluation technique to obtain probability values for the 14 different conditions on each of the 500 chest X-ray images. The probabilities are then transformed into positive/negative predictions using the probability thresholds computed by optimizing the MCC over the validation dataset. The condition-based MCC scores are then calculated from these predictions. We compute the F1 score similarly, using the same thresholds as for the MCC.
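A minimal sketch of the threshold selection and evaluation with scikit-learn follows; the grid of candidate thresholds is an assumption rather than the authors’ exact search procedure.

```python
# Pick per-condition thresholds on the validation set by maximizing MCC,
# then evaluate MCC and F1 on the test set with the same thresholds.
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def pick_threshold(val_probs, val_labels):
    candidates = np.linspace(0.0, 1.0, 101)                      # assumed threshold grid
    scores = [matthews_corrcoef(val_labels, val_probs >= t) for t in candidates]
    return candidates[int(np.argmax(scores))]

def evaluate_condition(test_probs, test_labels, threshold):
    preds = test_probs >= threshold                              # positive/negative predictions
    return (matthews_corrcoef(test_labels, preds),               # MCC on the test set
            f1_score(test_labels, preds))                        # F1 with the same threshold
```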

Confidence intervals

We use the non-parametric bootstrap to generate confidence intervals: random samples of size n (equal to the size of the original dataset) are repeatedly sampled 1,000 times from the original dataset with replacement. We then estimate the AUROC, F1 and MCC metrics (or their difference for two methods) using each bootstrap sample. We derive confidence intervals from the relative frequency distribution of the estimates over the re-samples, using the interval between the 100 × (α/2) and 100 × (1 − α/2) percentiles; we pick α = 0.05.
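The procedure corresponds to the following sketch, in which metric_fn stands for any of the AUROC, F1 or MCC computations (or a difference between two methods); the function name and fixed random seed are illustrative.

```python
# Non-parametric bootstrap confidence interval for a scalar metric.
import numpy as np

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample n indices with replacement
        estimates.append(metric_fn(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper                                  # 95% CI for alpha = 0.05
```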

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.