Main

Deep learning has enabled the automation of complex medical image interpretation tasks, such as disease diagnosis, often matching or exceeding the performance of medical experts25,26,27. However, these approaches typically depend on explicit labels, produced either by manual annotation or by automatic labellers developed for a specific domain. In contrast, our method classifies pathologies without requiring the domain-specific development of an automatic labeller. The self-supervised method has the potential to alleviate the labelling bottleneck in the machine-learning pipeline for a range of medical-imaging tasks by leveraging easily accessible unstructured text data without domain-specific pre-processing efforts17. As a result, the self-supervised method opens promising avenues for approaches and applications in the medical-imaging domain, where narrative reports that describe imaging findings are common.

One notable finding is the ability of the self-supervised method to predict differential diagnoses and radiographic findings with high accuracy on a dataset collected in a country different from that of the training dataset19. Generalization to datasets from vastly different distributions has been one of the primary challenges for the deployment of medical artificial intelligence28,29. Despite the challenges of generalization described in previous works, the self-supervised method achieves an AUC of at least 0.900 on 6, and of at least 0.700 on 38, of the 57 radiographic findings with n > 50 in the PadChest test dataset (n = 39,053) (Fig. 3). We speculate that the self-supervised model generalizes better because of its ability to leverage unstructured text data, which contains more diverse radiographic information that could be applicable to other datasets. Additionally, we note that we might expect improved performance if we used alternative labels instead of the raw clinical findings in PadChest. Ultimately, the results demonstrate that the self-supervised method can generalize well on a different data distribution without having seen any explicitly labelled pathologies from PadChest during training30.

Biases may have affected the training of the self-supervised method. For example, if a pathology is never mentioned in the reports, then the method cannot be expected to predict that pathology with high accuracy during zero-shot evaluation. Furthermore, the model’s ability to predict a pathology may depend on the terminology used in the training reports. For instance, if several reports describe a condition such as atelectasis, but do not explicitly use the term, then the method may not perform well when queried with the phrase ‘has atelectasis’31. Thus, the method’s ability to predict pathologies is limited to scenarios mentioned in the text reports, and may perform less well when there are a variety of ways to describe the same pathology. To address these potential biases, we provide the model with hundreds of thousands of image–text pair samples (n = 377,110) during training, encompassing a wide variety of writing styles and descriptions of pathologies17. By validating the method on the CheXpert and PadChest datasets, which were collected at different hospitals from the one used in the training of the model, we show that site-specific biases are not inhibiting the method’s ability to predict clinically relevant pathologies with high accuracy.

This work has a few limitations. First, the self-supervised method still requires repeatedly querying performance on a labelled validation set for hyperparameter selection and to determine condition-specific probability thresholds when calculating MCC and F1 statistics. Second, the self-supervised method is currently limited to classifying image data; however, medical datasets often combine different imaging modalities, can incorporate non-imaging data from electronic health records or other sources, or can be a time series. For instance, magnetic resonance imaging and computed tomography produce three-dimensional data that have been used to train other machine-learning pipelines32,33,34. On the same note, it would be of interest to apply the method to other tasks in which medical data are paired with some form of unstructured text. For instance, the self-supervised method could leverage the availability of pathology reports that describe diagnoses such as cancer present in histopathology scans26,35,36. Lastly, future work should develop approaches to scale this method to larger image sizes to better classify smaller pathologies37,38,39,40,41,42,43,44,45.

In summary, we have designed a self-supervised method using contrastive learning that detects the presence of multiple pathologies in chest X-ray images. The self-supervised method builds on the use of image–text pairings of chest X-rays and radiology reports in ConVIRT, as well as on the multi-class zero-shot classification of natural images in Contrastive Language-Image Pre-training (CLIP) to enable the application of zero-shot approaches to medical-image interpretation. The self-supervised method matches radiologist-level performance on a chest X-ray classification task for multiple pathologies that the model was not explicitly trained to classify (Fig. 2 and Table 1). The results highlight the potential of deep-learning models to leverage large amounts of unlabelled data for a broad range of medical-image-interpretation tasks, and thereby may reduce the reliance on labelled datasets and decrease clinical-workflow inefficiencies resulting from large-scale labelling efforts.

Methods

Datasets

Training

The self-supervised method was trained on the MIMIC-CXR dataset, a publicly available dataset of chest radiographs with radiology text reports. The MIMIC-CXR dataset contains 377,110 images corresponding to 227,835 radiographic studies17. When a radiographic study contained more than one chest X-ray image, the image in the anteroposterior/posteroanterior view was chosen for training. Each radiographic study comes with a corresponding free-text radiology report, a summary written by radiologists of their findings. Each full radiology report consists of multiple sections: examination, indication, impression, findings, technique and comparison.

CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest X-rays of 65,240 patients from Stanford Hospital8. The dataset is labelled for the presence of 14 different conditions: atelectasis, cardiomegaly, consolidation, oedema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, no finding, pleural effusion, pleural other, pneumonia, pneumothorax and support devices. These labels are obtained from the agreement of five board-certified radiologists. Additionally, the dataset consists of free-text radiology reports that are associated with each chest X-ray image. The CheXpert validation dataset is used to tune condition-specific probability thresholds that convert the self-supervised model’s probabilities for the five CheXpert competition conditions into predictions for a given chest X-ray image. We conduct this analysis by running inference with the self-supervised model to obtain probability values of each condition being present for all chest X-ray images. Condition-specific probability thresholds are then determined by choosing the probability values that result in the best MCC for each condition on the CheXpert validation dataset. The CheXpert validation dataset has no overlap with the CheXpert test dataset used for evaluation.

Evaluation

The self-supervised method was evaluated on two external datasets: the CheXpert test dataset and PadChest. The CheXpert test dataset is a collection of chest X-rays that is commonly used to evaluate the performance of models on chest X-ray interpretation tasks14,31. We evaluate the model on the entire CheXpert test dataset, consisting of 500 chest X-ray images labelled for the presence of 14 different conditions8. The CheXpert test dataset is used to calculate both the self-supervised model’s area under the receiver operating characteristic curve (AUROC) and its MCC for each of the five CheXpert competition conditions. Additionally, the test set contains predictions from three board-certified radiologists on full-resolution images, against which we compare the performance of the model.

The PadChest dataset is a public dataset that contains 160,868 chest X-ray images labelled with 174 different radiographic findings and 19 differential diagnoses19. Twenty-seven per cent of the labels were provided by board-certified radiologists, and the rest were obtained by using a recurrent neural network with attention trained on the radiology reports. For evaluation purposes, only the 39,053 examples annotated by board-certified radiologists were used. These examples were then used to calculate the self-supervised model’s AUROC for each of the different conditions described above.

Pre-processing

Each of the 377,110 chest X-rays in the MIMIC-CXR dataset was resized to 224 × 224 pixels and zero-padded before training. Each image was then normalized using the sample mean and standard deviation of the training dataset.
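For illustration, the following is a minimal pre-processing sketch using PIL and torchvision; it is not the authors’ released code. The greyscale conversion, the aspect-preserving resize before zero-padding, and the MEAN/STD values are assumptions (the true values are the sample statistics of the MIMIC-CXR training set).

```python
# Minimal pre-processing sketch (illustrative, not the authors' exact pipeline).
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as F

MEAN, STD = 0.5, 0.25  # placeholders for the MIMIC-CXR training-set statistics

def preprocess(path, size=224):
    img = Image.open(path).convert("L")                # greyscale; channels can be replicated to 3 for the ViT
    scale = size / max(img.size)                       # shrink the longer side to 224, preserving aspect ratio
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    w, h = img.size
    img = F.pad(img, (0, 0, size - w, size - h), fill=0)  # zero-pad to 224 x 224
    x = T.ToTensor()(img)                              # tensor in [0, 1], shape [1, 224, 224]
    return T.Normalize([MEAN], [STD])(x)               # normalize with dataset mean/std
```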

Text from the radiology reports was tokenized using the byte pair encoding (BPE) procedure with a vocabulary size of 49,408. For text that exceeded the maximum token sequence length of the given architecture, we truncated the token sequence to the first (context length − 2) tokens. The remaining two positions were reserved for the [SOS] and [EOS] tokens at the beginning and end of the sequence, respectively.
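The truncation scheme can be sketched as follows, assuming the open-source CLIP SimpleTokenizer (whose BPE vocabulary also has 49,408 entries); the tokenize_report helper and the zero-padding to a fixed length are illustrative, not the authors’ exact implementation.

```python
# Sketch of BPE tokenization with truncation to (context_length - 2) tokens.
import torch
from clip.simple_tokenizer import SimpleTokenizer

_tokenizer = SimpleTokenizer()
SOT = _tokenizer.encoder["<|startoftext|>"]  # [SOS]
EOT = _tokenizer.encoder["<|endoftext|>"]    # [EOS]

def tokenize_report(text, context_length=77):
    bpe = _tokenizer.encode(text)[: context_length - 2]  # keep the first (context_length - 2) tokens
    tokens = [SOT] + bpe + [EOT]                          # two slots reserved for [SOS]/[EOS]
    out = torch.zeros(context_length, dtype=torch.long)   # zero-pad to the fixed context length
    out[: len(tokens)] = torch.tensor(tokens)
    return out
```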

Architecture

The model architecture consists of a Vision Transformer, ViT-B/32, for the image encoder and a Transformer for the text encoder. We use a pre-trained Vision Transformer that accepts images of resolution 224 × 224. The text encoder Transformer has a base size of 63 million parameters, 12 layers and a width of 512 with 8 attention heads. The Transformer operates on a lower-cased byte pair encoding representation of the text and uses a maximum token sequence length of 77. We use the same initialization scheme used in CLIP15.
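For orientation, a minimal sketch of instantiating these encoders via OpenAI’s open-source clip package follows; the package choice and the placeholder inputs are assumptions, and the authors’ training code may differ.

```python
# Sketch of loading a ViT-B/32 image encoder and the paired text Transformer.
import clip
import torch

device = "cpu"  # or "cuda"
# ViT-B/32 image encoder (224 x 224 inputs) paired with the 12-layer,
# width-512, 8-head text Transformer with a 77-token context.
model, clip_preprocess = clip.load("ViT-B/32", device=device, jit=False)

image = torch.zeros(1, 3, 224, 224, device=device)                 # placeholder image batch
tokens = clip.tokenize(["no acute cardiopulmonary process"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)                     # [1, 512] image embedding
    text_features = model.encode_text(tokens)                      # [1, 512] text embedding
```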

Implementation of the method

Model pre-training

The self-supervised model consists of an image encoder and a text encoder that we jointly train on the MIMIC-CXR training dataset17. We utilize the impressions section of each text report, since it contains a concise summary of the entire report. We contrast this with a previous self-supervised method, ConVIRT, which selects a random sentence from the full-length radiology report for each image14. Although their proposed method can extract some signal, random text input selection introduces unnecessary stochasticity that could lead to inconsistencies in training. To address this, we consistently select the text from the impressions section.

Training

We initialized the self-supervised model using the ViT-B/32 and Transformer architectures with pre-trained weights from OpenAI’s CLIP model15. When training on the impressions section, we keep the maximum context length of 77 tokens as given in the CLIP architecture. We demonstrated that the pre-trained weights of the CLIP architecture, learned from natural images, can be leveraged to train a zero-shot model for a domain-specific medical task.

To prepare the data for training, all images from the MIMIC-CXR dataset are stored in a single HDF5 file. We performed a hyperparameter sweep over the batch size and the learning rate using the CheXpert validation dataset. We compute the validation mean AUC over the five CheXpert competition pathologies after every 1,000 training batches, and save a model checkpoint whenever the model outperforms the previous best model during training. The validation mean AUCs of these checkpoints are used to select models for ensembling. The best model uses stochastic gradient descent for optimization with a learning rate of 0.0001 and momentum of 0.9. The best model has a batch size of 64 and is trained for four epochs. We train the model by maximizing the cosine similarity between the image and text embeddings of all valid image–report pairs in the batch while minimizing the cosine similarity between the embeddings of incorrect pairings in the batch. The method’s training procedure closely follows the implementation of CLIP15.
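A minimal sketch of this CLIP-style contrastive objective is shown below; it is a simplified re-implementation rather than the authors’ exact code, and the fixed temperature value is a placeholder (CLIP learns the temperature during training).

```python
# Symmetric contrastive (CLIP-style) loss over a batch of image-report pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # [batch, batch] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)       # matching pairs lie on the diagonal
    # Cross-entropy in both directions pulls correct image-report pairs
    # together and pushes incorrect pairings in the batch apart.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```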

Softmax evaluation technique for multi-label classification

To evaluate the zero-shot performance of the model on the multi-label classification task, we used a positive–negative softmax evaluation procedure on each of the diseases. In contrast to CLIP, the proposed procedure allows us to normalize with respect to the negated version of the same disease classification instead of naively normalizing across the diseases to obtain probabilities from the logits15. The latter approach is less reasonable in this context since a single image may have multiple associated labels.

We define the procedure as follows. First, we compute logits with a positive prompt (for example, ‘atelectasis’) and a negative prompt (for example, ‘no atelectasis’). Then, we compute the softmax over the positive and negative logits. Lastly, we keep the softmax probability of the positive logit as the probability that the disease is present in the chest X-ray.
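The following sketch illustrates the positive–negative softmax evaluation with CLIP-style encoders; the function and variable names (zero_shot_probs, tokenize, LABELS) are illustrative, and tokenize stands for a tokenizer such as clip.tokenize.

```python
# Positive-negative softmax evaluation: one probability per condition.
import torch
import torch.nn.functional as F

LABELS = ["atelectasis", "cardiomegaly", "consolidation", "edema", "pleural effusion"]

@torch.no_grad()
def zero_shot_probs(model, tokenize, image):                       # image: [1, 3, 224, 224]
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    probs = {}
    for label in LABELS:
        # Positive prompt '<label>' and negative prompt 'no <label>'.
        tokens = tokenize([label, f"no {label}"]).to(image.device)
        txt_emb = F.normalize(model.encode_text(tokens), dim=-1)
        logits = img_emb @ txt_emb.t()                             # [1, 2]: positive vs negative logit
        # Softmax over the positive/negative pair; keep the positive probability.
        probs[label] = logits.softmax(dim=-1)[0, 0].item()
    return probs
```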

Ensembling

We ensemble the top-ten model checkpoints sorted by mean AUC over the five CheXpert pathologies on the validation dataset. The probability outputs of the ensemble are computed by taking the average of the probability outputs of each model. The probabilities are averaged after softmax evaluation. These probabilities are then used for model evaluation through AUC and for prediction tasks using condition thresholds generated from the validation dataset.
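As a sketch, ensembling reduces to averaging the per-condition probabilities of the selected checkpoints after softmax evaluation; this reuses the illustrative zero_shot_probs helper from the sketch above and is not the authors’ exact code.

```python
# Average the post-softmax probabilities of the top checkpoints per condition.
import numpy as np

def ensemble_probs(models, tokenize, image):
    # Each model yields a dict {condition: probability}; average per condition.
    all_probs = [zero_shot_probs(m, tokenize, image) for m in models]
    return {c: float(np.mean([p[c] for p in all_probs])) for c in all_probs[0]}
```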

Knowledge-distillation procedure

To allow for the use of the CLIP pre-trained model on full radiology reports to evaluate zero-shot performance on auxiliary tasks such as sex prediction, we use a knowledge-distillation procedure. This procedure is required as the pre-trained text encoder from the CLIP model has a context length of only 77 tokens, which is not long enough for an entire radiology report. We use the pre-trained model to train a model with a context length of 512, long enough to encompass 98% of radiology reports. In this method, the text encoder of the best-performing model trained only on impressions is used as a teacher for the text encoder of a student model. To train the student, we compute the mean squared error between the logits of the two encoders, then backpropagate across the student architecture. Once the student text encoder is trained, we replace the uninitialized image encoder in the student model with the image encoder of the teacher model. Then, the student model is contrastively trained on the MIMIC-CXR chest X-ray and full-text report pairs.
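A simplified sketch of one distillation step is given below. The encoder interfaces and the exact inputs passed to teacher and student (a 77-token truncated report for the teacher, the longer 512-token report for the student) are assumptions; only the use of a mean-squared-error loss between the two encoders’ outputs, backpropagated through the student, follows the description above.

```python
# One knowledge-distillation step for the long-context student text encoder.
import torch
import torch.nn.functional as F

def distillation_step(teacher_text_encoder, student_text_encoder, optimizer,
                      teacher_tokens, student_tokens):
    # teacher_tokens: report truncated to 77 tokens (assumption);
    # student_tokens: the same report tokenized up to 512 tokens (assumption).
    with torch.no_grad():
        target = teacher_text_encoder(teacher_tokens)   # teacher output, kept frozen
    pred = student_text_encoder(student_tokens)         # student output (context length 512)
    loss = F.mse_loss(pred, target)                     # mean squared error between the two encoders
    optimizer.zero_grad()
    loss.backward()                                     # backpropagate through the student only
    optimizer.step()
    return loss.item()
```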

Prompt-engineering methods

We run experiments using the labels present in the test set as prompts, creating ‘<label>’ and ‘no <label>’ as the positive and negative prompts, respectively, for the softmax evaluation procedure.

Statistical analysis

AUROC

We collect AUROC results from both the CheXpert test dataset (500 samples) and the PadChest dataset (39,053 samples) using the self-supervised model’s predictions. The AUROC and MCC results for the five clinically relevant pathologies on the CheXpert test dataset are presented in Table 1. Table 2 consists of the mean AUROC of these five pathologies on the CheXpert test dataset along with self-supervised and supervised comparisons. The supervised DAM method is included as a comparison and is currently the state of the art on the CheXpert dataset. An additional supervised baseline, a DenseNet121 trained on the CheXpert dataset, is included as a comparison since DenseNet121 is commonly used in self-supervised approaches. Current top-performing label-efficient approaches, ConVIRT, MedAug and MoCo-CXR, are included as self-supervised comparisons.

MCC and F1 score

To obtain the MCC, we first run inference on the CheXpert test set using our softmax evaluation technique to obtain probability values for the 14 different conditions on each of the 500 chest X-ray images. The probabilities are then transformed into positive/negative predictions using the probability thresholds computed by optimizing the MCC over the validation dataset. The condition-based MCC scores are then calculated from these predictions. We compute the F1 score similarly, using the same thresholds as for the MCC.
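A minimal sketch of the threshold selection and evaluation with scikit-learn follows; the grid of candidate thresholds is an assumption rather than the authors’ exact search procedure.

```python
# Pick per-condition thresholds on the validation set by maximizing MCC,
# then evaluate MCC and F1 on the test set with the same thresholds.
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def pick_threshold(val_probs, val_labels):
    candidates = np.linspace(0.0, 1.0, 101)                      # assumed threshold grid
    scores = [matthews_corrcoef(val_labels, val_probs >= t) for t in candidates]
    return candidates[int(np.argmax(scores))]

def evaluate_condition(test_probs, test_labels, threshold):
    preds = test_probs >= threshold                              # positive/negative predictions
    return (matthews_corrcoef(test_labels, preds),               # MCC on the test set
            f1_score(test_labels, preds))                        # F1 with the same threshold
```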

Confidence intervals

We use the non-parametric bootstrap to generate confidence intervals: random samples of size n (equal to the size of the original dataset) are repeatedly sampled 1,000 times from the original dataset with replacement. We then estimate the AUROC, F1 and MCC metrics (or their difference for two methods) using each bootstrap sample. We derive confidence intervals from the relative frequency distribution of the estimates over the re-samples, using the interval between the 100 × (α/2) and 100 × (1 − α/2) percentiles; we pick α = 0.05.
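The procedure corresponds to the following sketch, in which metric_fn stands for any of the AUROC, F1 or MCC computations (or a difference between two methods); the function name and fixed random seed are illustrative.

```python
# Non-parametric bootstrap confidence interval for a scalar metric.
import numpy as np

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample n indices with replacement
        estimates.append(metric_fn(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper                                  # 95% CI for alpha = 0.05
```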

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.