Introduction

Bilateral blindness was estimated to be present in 9.4 million people with glaucoma in 2010, and this number is expected to rise to 11.2 million people in 20201. Glaucoma is an irreversible disease and the second most common cause of blindness worldwide1. Early diagnosis of glaucoma is hugely important for preventing blindness. In glaucoma, morphological changes at the optic disc occur in typical patterns2. Evaluation of the optic nerve head (ONH) and retinal nerve fiber layer (RNFL) around the optic disc is very important for accurate and early diagnosis of glaucoma since structural changes may precede measurable visual field (VF) loss3. With the development of imaging devices, such as optical coherence tomography (OCT)4, the Heidelberg Retina Tomograph (HRT, Heidelberg Engineering GmbH, Heidelberg, Germany) and scanning laser polarimetry (GDx: Carl Zeiss Meditec, Dublin, CA), it is possible to measure glaucomatous structural changes quantitatively and in great detail. However, a considerable limitation of these ‘high-tech’ imaging devices is that they are usually available only at specialist eye clinics or hospitals; consequently, glaucoma sufferers – who have not visited these facilities – can persist without a diagnosis for many years. Furthermore, these imaging devices are usually unavailable in poorer nations.

The two-dimensional fundus photograph is a basic ophthalmological screening tool. In Japan and many other countries, the screening of ophthalmological diseases, including glaucoma, is based on expert interpretation of two-dimensional fundus photographs, and more high-tech imaging devices are generally not used. Thus, two-dimensional fundus photography remains a key instrument to prevent blindness through early detection of glaucoma. One of the problems with such fundus photography is that diagnosis is currently based on subjective judgement. Nonetheless, optic disc morphologic information quantified from fundus photographs are highly correlated with structural measurements obtained with HRT and GDx5.

The development of deep learning methods represents a revolutionary advance in imaging recognition research6. Deep learning methods are similar to artificial neural networks, which process information via interconnected neurons, however, deep learning methods have many ‘hidden layers’ which become computable in conjunction with a feature extractor. The feature extractor transforms raw data into a suitable feature vector, which can identify patterns in the input7. The purpose of the current study was to develop a deep residual learning algorithm to screen for glaucoma from fundus photographs, and to validate its diagnostic performance using an independent dataset.

A recent study suggested the usefulness of applying a deep learning method to diagnose glaucoma8,9, however, it used a simple convolutional neural network (CNN), whereas more powerful deep learning methods, such as the Deep Residual Learning for Image Recognition (ResNet)Diagnosis by Residents in Ophthalmology

The fundus photographs of the testing dataset were reviewed by Residents in Ophthalmology (A: first year in Ophthalmology residency, B: third year in Ophthalmology residency, C: fourth year in Ophthalmology residency). All of the fundus photographs were reviewed, masking other clinical information. Each Resident made a diagnosis independent to other Residents.

Statistical analysis

Three-fold cross validation was performed; the training dataset was divided into three equally sized subsets and the deep learning algorithm was trained using two of the three arms, diagnostic accuracy was then calculated in the remaining arm. The process was iterated three times so that each of the three arms was used as a validation dataset once.

Independent validation was carried out using the testing dataset. The deep learning algorithm was built using all data in the training dataset and the area under the receiver operating characteristic curve (AROC) was calculated. AROCs were also calculated separately for the non-highly myopic and highly myopic eyes (i.e., between the G and N groups and between the mG and mN groups). Sensitivity was calculated at specificity equal to 95% in all of the analyses. AROCs were also obtained based on the diagnoses of glaucoma from the three Residents in Ophthalmology.

As a further comparison, AROC values were calculated using: (i) a CNN with 16 layers, similar to VGG1624, (ii) a support vector machine25 and (iii) a Random Forest26. The details of each method follow;

  1. i)

    CNN with 16 layers, similar to VGG16.

  2. ii)

    Support vector machine: Radial Basis Function, Penalty parameter = 1.0.

  3. iii)

    Random Forest: number of trees = 10,000, criterion = Gini index, minimum number of samples required to split an internal node = 2, The minimum number of samples required to be at a leaf node = 1.

All statistical analyses were carried out using the statistical programming language Python (ver. 2.7.9, Python Software Foundation, Beaverton, US). AROCs were compared using DeLong’s method27. Benjamini’s method28 was used to correct P values for the problem of multiple testing.

Result

The testing dataset consisted of (i) 33 eyes of 33 non-highly myopic glaucoma patients (G group), (ii) 28 eyes of 28 highly myopic glaucoma patients (mG group), (iii) 27 eyes of 27 non-highly myopic normative subjects (N group) and (iv) 22 eyes of 22 highly myopic normative subjects (mN group). Demographic data of the subjects are summarized in Table 1.

Table 1 Demographics of subjects in testing dataset.

The AROC values obtained with the internal verification are shown in Table 2. The diagnostic accuracy varied between 94.2 and 96.0%.

Table 2 AROC values obtained with internal three-fold cross validation.

Figure 1 shows the structure of the deep learning algorithm (ResNet) with various parameters (Table 3). Figure 2 shows the receiver operating characteristic curve obtained with all data in the testing dataset. The AROC with ResNet was 96.5 (95% confidence interval [CI]: 93.5 to 99.6)%. The AROCs obtained from the three Residents in Ophthalmology are also shown; their AROCs were A: 72.6 (95% CI: 64.1 to 81.1), B: 87.7 (95% CI: 82.3 to 93.2) and C: 91.2 (95% CI: 85.9 to 96.5)%. The AROC with ResNet was significantly larger than the AROCs of the Residents in Ophthalmology A and B (both p < 0.001, DeLong’s method with adjustment for multiple comparisons). There was not a significant difference between the AROC of ResNet and the AROC of the Resident in Ophthalmology C (p = 0.077).

Figure 1
figure 1

The deep residual learning algorithm to diagnose glaucoma using fundus photography. In ResNet, after two instances of convolution and batch normalization, the input is added to the raw output. (a) Shows the scheme of the classifier. The network is highly influenced by ResNet, which has skip** connections in each residual block to promote efficient training of deeper layers. This network has 18 convolutional layers in total. (b) shows the detailed explanation of residual blocks. In the case of (a), the shape of input and output will be the same. On the other hand, (b) doubles the number of channels with the second convolution, while width and height are halved with max pooling. When adding the input to output, half of the filters added are zero-padded so that the shapes match. ResNet: residual network.

Table 3 Parameters used in ResNet.
Figure 2
figure 2

External validation: Receiver operating characteristic curve. The receiver operating characteristic curve obtained in the testing dataset (N = 110). The AROC with ResNet was 96.52 (95% confidence interval [CI]: 95.6 to 98.7)%. The AROC values of the three Residents in Ophthalmology were: 72.6 (95% CI: 64.1 to 81.1)%, 87.7 (95% CI: 82.3 to 93.2)%, and 91.2 (95% CI: 85.9 to 96.47)%, which were significantly smaller than that of ResNet. P values were obtained by comparing the AUC with ResNet and those of Residents in Ophthalmology A, B, C (DeLong’s method with adjustment for multiple comparisons). AROC: area under the receiver operating characteristic curve. ResNet: residual network.

Figure 3 shows the receiver operating characteristic curve obtained with the G and N groups in the testing dataset. The AROC with ResNet was 97.1 (95% confidence interval [CI]: 93.3 to 100.0)%. The AROCs obtained from the three Residents in Ophthalmology are also shown; their AROCs were A: 77.4 (95% CI: 67.0 to 87.9) and B: 84.9 (95% CI: 76.9 to 92.8)% (p values: < 0.001 and 0.0013, respectively), but not with C: 93.7 (95% CI: 86.8 to 99.8)% (p = 0.15). The AROC with ResNet was significantly larger than the AROCs of the Residents in Ophthalmology A and B (p = 0.0014 and 0.0026, respectively). There was not a significant difference between the AROC of ResNet and the AROC of the Resident in Ophthalmology C (p = 0.29).

Figure 3
figure 3

Receiver operating characteristic curve obtained between G and N groups in testing dataset. The receiver operating characteristic curve obtained between G and N groups in the testing dataset. The AROC with ResNet was 97.1 (95% confidence interval [CI]: 93.3 to 100.0)%. The AROC values of the three Residents in Ophthalmology were: 77.4 (95% CI: 67.0 to 87.9)%, 84.9 (95% CI: 76.9 to 92.8)%, and 93.7 (95% CI: 86.8 to 99.8)%. P values were obtained by comparing the AUC with ResNet and those of Residents in Ophthalmology A, B, C (DeLong’s method with adjustment for multiple comparisons). AROC: area under the receiver operating characteristic curve. ResNet: residual network, G: non-highly myopic glaucoma patients and N: non-highly myopic normative subjects.

Figure 4 shows the receiver operating characteristic curve obtained with the mG and mN groups in the testing dataset. The AROC with ResNet was 96.4 (95% CI: 92.0 to 100.0)%. The AROCs obtained from the three Residents in Ophthalmology are also shown; A: 66.6 (95% CI: 53.4 to 79.7), B: 91.2 (95% CI: 83.9 to 98.3), C: 88.8 (95% CI: 80.3 to 97.3)%. The AROC with ResNet was significantly larger than the AROC of the Resident in Ophthalmology A (p < 0.001). There was not a significant difference between the AROC with ResNet and the AROCs of the Residents in Ophthalmology B and C (p = 0.10 and 0.072, respectively).

Figure 4
figure 4

Receiver operating characteristic curve obtained between mG and mN groups in testing dataset. The receiver operating characteristic curve was obtained between mG and mN groups in the testing dataset. The AROC with ResNet was 96.4 (95% CI: 92.0 to 100.0)%. The AROC values of the three Residents in Ophthalmology were 66.6 (95% CI: 53.4 to 79.9)%, 91.2 (95% CI: 83.9 to 98.3)%, and 88.8 (95% CI: 80.3 to 97.3)%. P values were obtained by comparing the AUC with ResNet and those of Residents in Ophthalmology A, B, C (DeLong’s method with adjustment for multiple comparisons). AROC: area under the receiver operating characteristic curve. ResNet: residual network. mG: highly myopic glaucoma patients and N: highly myopic normative subjects.

The AROC values associated with the other algorithms are shown in Table 4; these varied between 66.6 (95% CI: 53.4 to 79.7)% with the Support Vector Machine and 91.2 (95% CI: 83.5 to 99.0)% with a CNN with 16 layers, similar to VGG16.

Table 4 AROC values obtained with other models used to diagnose glaucoma.

Discussion

A deep residual learning algorithm to screen for glaucoma from fundus photographs was developed using a training dataset that consisted of 1,364 eyes with open angle glaucoma and 1,768 eyes of normative subjects. The diagnostic performance of this algorithm was validated using independent testing datasets. The AROC of the deep residual learning algorithm was 96.5% with all eyes, 97.1% between the G and N groups, and 96.4% between the mG and mN groups. These AROC values tended to be significantly larger than those from Residents in Ophthalmology. We also investigated the diagnostic performance of other deep learning models and machine learning methods, however, these algorithms resulted in much lower AROC values (see Table 4). As a scientific merit, the current results have shown the modern powerful deep learning method of the ResNet10 enabled an accurate diagnosis of glaucoma from fundus photographs. Diagnosing glaucoma in highly myopic eyes is a challenging task, because of the morphological difference from those of non-highly myopic eyes15,16, however the current results suggested the constructed algorithm had a high diagnostic power in such eyes.

Deep learning methods to diagnose disease from fundus photographs have been reported previously. Gulshan et al. developed a CNN, trained with 128,175 fundus photographs, to detect diabetic retinopathy29. The AROC of this algorithm was 99%. Takahashi et al. applied the GoogLeNet model to the same diagnostic problem and achieved 81% accuracy30. The task of glaucoma detection may be more challenging than the diagnosis of diabetic retinopathy, since diagnosis of diabetic retinopathy is based on abnormal retinal features, such as hemorrhage, microaneurysm and exudates, whereas the diagnosis of glaucoma relies on the estimation of subtle changes in the shape of the optic disc. Such alterations are better assessed using stereo-photographs, but two-dimensional fundus photographs were used in the current study, making the task even more challenging. Nonetheless, the deep residual learning model achieved very good discrimination with an AROC between 96.4 and 97.1%. It is difficult to directly compare the diagnostic performance of the current deep learning algorithm with those in recent studies8,9 since the diagnosis of glaucoma depends on many factors, including the stage of glaucoma and refractive status of the eye. In the current study, the ResNet algorithm was trained using a much smaller number of eyes (1,364 color fundus photographs with glaucomatous indications and 1,768 color fundus photographs without glaucomatous features) than in previous studies (approximately 120,000 and 40,000 fundus photographs), however, its diagnostic performance was equal to, or superior to, Residents in Ophthalmology, both in non-highly myopic and highly myopic eyes. It should be noted that high myopia was the top reason for false negative classification in8, despite the large size of the training dataset (approximately 40,000 fundus photographs). In contrast, our results suggested an accurate diagnosis can be obtained in highly myopic eyes.

The merits of applying machine learning methods to diagnose glaucoma have been widely reported. We previously applied the Random Forests method to diagnose glaucoma based on OCT measurements. As a result, the AROC to discriminate glaucomatous (from early to advanced glaucoma cases) from normative eyes was 98.5%31. We also reported that the AROC of this approach was 93.0% when discriminating early stage glaucoma patients and normative eyes32. Following great successes of deep learning methods for discrimination tasks in various fields, the application of these methods have just begun in the field of glaucoma. Indeed we have very recently reported the merit of applying deep learning methods to predict visual field sensitivity from OCT33. Although a more detailed investigation of glaucomatous retinal damage can be made using OCT, compared to fundus photography, the potential impact of the current algorithm as a screening tool cannot be exaggerated; fundus photography is commonly used at screening centers, opticians and internal medicine clinics.

The ResNet deep network (18 convolutional layers were used in the current study) enables the extraction of complex and detailed features from images. To avoid the model identifying features from other parts of the retina, outside the optic disc, images were cropped before training the model. This also reduced the model learning duration time. However, other glaucomatous findings might be observed outside the optic disc on fundus photographs, such as nerve fiber layer defects and optic disc hemorrhage. A future study should investigate whether considering this information improves diagnostic accuracy. Furthermore, it is possible that a deep learning algorithm trained on stereoscopic fundus photographs or OCT images offers better discrimination than the one built here on two-dimensional fundus photographs. The principal purpose of the current study was to build an automated screening tool for glaucoma with high discriminatory power, which could be used in the majority of screening facilities. The significant disadvantage of a screening tool based on stereoscopic fundus photography or OCT is that its use would be very limited since these technologies are equipped in only a limited number of ophthalmological facilities, while two-dimensional fundus photography is much more widely available. Nonetheless, this does not deny the value of a screening tool based on OCT or stereoscopic fundus photography, and a future study should investigate this possibility. A limitation of the current study concerns photograph selection, photographs with features that could interfere with an expert diagnosis of glaucoma were omitted. Excluding these images means further testing of the algorithm is essential to measure its performance as a screening tool in a “real world” setting. However, identifying low quality images is usually much easier than automatically screening for glaucoma.

In conclusion, a deep residual learning algorithm was developed to automatically screen for glaucoma in fundus photographs. The algorithm had a high diagnostic ability in non-highly myopic and highly myopic eyes.