Introduction

Stargardt disease (STGD1, OMIM #248200) is an inherited retinal disorder that causes progressive vision loss. The hallmark of STGD1 is the accumulation of bisretinoid fluorophores in the photoreceptor outer segments and lipofuscin-like fluorophores in the retinal pigment epithelium (RPE) due to the impaired flippase function of a retina-specific transmembrane protein, the adenosine triphosphate-binding cassette subfamily A member 4 (ABCA4)1. Several investigational drugs have been developed to inhibit or block the biochemical pathways that lead to the formation of bisretinoid fluorophores and their conversion to toxic lipofuscin-like fluorophores in the RPE2,3,4,5,6. Some clinical trials have used fundus autofluorescence (FAF) imaging to measure fluorophore concentration and distribution in the retina as an endpoint for assessing disease progression rate and the therapeutic efficacy of these new agents7,8,9.

The earliest stage of STGD1 is characterized by an increase in the background FAF signal due to widespread accumulation of lipofuscin-like fluorophores within the RPE layer10. A recent multimodal imaging study showed that the formation of discrete fleck lesions, characterized by yellow pisciform deposits that spread centrifugally and circumferentially around the fovea, corresponds to clusters of degenerating photoreceptor cells8. These flecks are intensely hyperautofluorescent, with the FAF signal arising from photoreceptor bisretinoid fluorophores, which makes them particularly prominent on FAF imaging even before they are clearly visible on clinical fundoscopy11,12. Over a period of years, these hyperautofluorescent flecks evolve through their own natural life cycle of enlargement, confluence and then regression into dark regions on FAF imaging due to death of the RPE9. Currently, FAF is used as a primary endpoint in clinical trials of novel treatments for STGD1 (ClinicalTrials.gov identifiers NCT01736592, NCT03772665). The key parameter in FAF analysis is the quantification of the area in which the autofluorescence signal is lacking due to the death of RPE cells7,9. However, a proportion of patients with childhood-onset or late-onset disease cannot be monitored in this way as they present prior to RPE loss. At this pre-atrophic stage, FAF has been used to quantify the signal intensity of the background diffuse autofluorescence (also known as quantitative autofluorescence, qAF, available on the Heidelberg Spectralis, Heidelberg Engineering, Heidelberg, Germany) as a biomarker of RPE lipofuscin-like fluorophore accumulation13. However, qAF is not widely available and is no longer compatible with the most recent upgrade of the Heidelberg device software. Furthermore, patients are generally not diagnosed until flecks are visible, and by this stage quantification of AF can be confounded by the number and size of these flecks, as they are highly hyperautofluorescent. Hence, tracking the evolution of fleck lesions on FAF, individually or collectively over time, warrants further investigation as an alternative approach to monitoring STGD1 disease progression. Automated segmentation of the central atrophic region on FAF imaging has been described in retinitis pigmentosa14. To the best of our knowledge, only one study has described a thresholding algorithm for detecting flecks in STGD1 fundus images15; however, that algorithm was developed to identify, but not quantify, retinal flecks.

Herein, we used a set of manually segmented FAF images (ground truth, current gold standard) from patients with genetically confirmed STGD1 to develop a deep learning model for automated segmentation of flecks. We then compared the performance of the deep learning algorithm to manual segmentation by analyzing the same sets of longitudinal images using both approaches.

Results

Patient demographics

Forty-seven FAF images from 24 subjects (12 males, 12 females) from 19 families with a molecular diagnosis of STGD1 were utilized in the study (Table 1). The mean (range) age at which the first FAF images were obtained was 48 (12–89) years. Two patients (Patients 15 and 19) were of Asian descent while the remaining twenty-two were Caucasian. The median (range) age of onset of symptoms (or first sign of retinal lesion) was 33 (9–89) years. Seven (29%) patients had Fishman grade of 1–2 (i.e. no macular atrophy) and the remaining 17 (71%) had Fishman grade of 3 (8 with flecks localized to the posterior pole and 9 with flecks extending beyond the posterior pole). None had Fishman grade of 4.

Table 1 Summary of patient demographics, clinical data and genetic diagnoses.

Deep learning training and optimization procedure

Amongst the 47 FAF images, 31 had discrete, well-defined pisciform lesions and 16 had diffusely speckled lesions without pisciform lesions (Table 2, Fig. 1). Manual segmentation marked a total of 4833 fleck lesions across the 47 images, with a total lesion area of 88.2 mm². FAF images with a discrete, large pisciform classification had an average of 109 (range 10–249) fleck lesions with a mean total hyperautofluorescent area of 2.35 (range 0.29–5.55) mm², while those with a diffusely speckled FAF classification had an average of 92 (range 5–202) lesions with a mean total area of 0.96 (range 0.05–2.04) mm².

Table 2 Characteristics of FAF images used in deep learning training.
Figure 1

Manual and deep learning segmentation of Stargardt FAF images. (a) An example of a raw FAF image with discrete large pisciform lesions with (b) the outline of the hyperautofluorescent flecks manually marked. (c) An image mask of the fleck outline was generated for image analysis. (d) CLAHE transformation applied to the raw image in panel (a) followed by (e) fleck marking via deep learning. (f) Image mask of fleck outlines from deep learning was generated for image analysis. The dice score between the manual and deep learning segmentations is shown in the bottom left corner. (g–i) Manual and (j–l) deep learning segmentation of a FAF image with a diffusely speckled FAF pattern, as per panels (a–f), with the dice score shown in the bottom left corner.

Of the 47 manually segmented FAF images, 37 were selected as the training and validation sets for deep learning and 10 were used as independent testing images. The images were graded according to their clinical appearance, and the senior author (FKC) ensured that the distribution of clinical appearance was even across the training, validation and testing sets. Fifty image patches (512 × 512 pixels) were generated from each of the 37 FAF images (1850 patches in total); of these, 1750 were used for training and 100 for validation. The dice score improved and the validation loss decreased with training, and both appeared to plateau from the 100th epoch onwards.

Bland–Altman analyses16 (Fig. 2) between manual and deep learning segmentation were performed in the 10 FAF images (4 with diffuse speckled and 6 with discrete fleck lesions) that underwent testing. Three concentric circles (10°, 20°, 30° diameter) centered at the fovea were placed on each image, which partitioned each image into a central disc of 10° diameter and two hollowed-out rings of 10°–20° and 20°–30° diameters. Manual and deep learning segmentation-derived fleck counts were similar, with a mean difference (deep learning − manual) of − 0.60 (95% CI − 9.57 to 8.37), 4.10 (95% CI − 24.01 to 32.21) and 3.00 (95% CI − 27.21 to 33.21) for the 10° disc, 10°–20° and 20°–30° rings, respectively. Total fleck area was also similar, with a mean difference (deep learning − manual) of − 0.03 (95% CI − 0.11 to 0.06), 0.01 (95% CI − 0.24 to 0.26) and − 0.03 (95% CI − 0.32 to 0.26) mm² for the 10° disc, 10°–20° and 20°–30° rings, respectively. The mean ± standard deviation (SD) dice score for FAF images with diffuse speckled patterns (N = 4) was lower than that for images with discrete flecks (N = 6): 0.54 ± 0.14 versus 0.71 ± 0.08, respectively (p < 0.04, t test).
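For reference, the agreement statistics above can be reproduced from paired per-image measurements with a short script. The following is a minimal sketch rather than the authors' code; the example counts are hypothetical and the reported 95% interval is taken here to be the mean difference ± 1.96 SD of the differences.

```python
import numpy as np

def bland_altman(manual, deep_learning):
    """Return the mean difference (deep learning - manual) and the 95%
    interval (mean difference +/- 1.96 SD of the differences)."""
    manual = np.asarray(manual, dtype=float)
    deep_learning = np.asarray(deep_learning, dtype=float)
    diff = deep_learning - manual
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)              # sample standard deviation
    return mean_diff, (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

# Hypothetical fleck counts in the central 10-degree disc for 10 test images.
manual_counts = [12, 30, 5, 18, 22, 9, 40, 15, 7, 26]
dl_counts     = [11, 28, 6, 17, 25, 8, 36, 14, 9, 24]
mean_diff, (lower, upper) = bland_altman(manual_counts, dl_counts)
print(f"mean difference {mean_diff:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
```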

Figure 2

Bland–Altman plots comparing manual and deep learning segmentation methods. (a) Difference in fleck count (deep learning − manual) plotted against the mean of manual and deep learning fleck count in the central 10° ring. Solid black line indicates the mean difference and dashed black lines indicate the 95% confidence interval, gray line indicates no difference. (b) Difference in fleck area (deep learning − manual) plotted against the mean of manual and deep learning fleck area in the central 10° ring. Other details as per panel (a). (c,d) Bland–Altman plots of fleck number and fleck area in the 20° ring, respectively. Other details as per panel (a). (e,f) Bland–Altman plots of fleck number and fleck area in the 30° ring, respectively. Other details as per panel (a).

Longitudinal analysis

Right and left eye images from 6 subjects with discrete pisciform flecks and a minimum of 12 months of follow-up (mean ± SD total follow-up period: 4.2 ± 1.8 years) underwent both manual and deep learning segmentation. Longitudinal analyses are illustrated using two case examples: Patients 4b and 1 (Fig. 3). Patient 4b, aged 21 years, showed discrete pisciform FAF lesions at the first visit (Fig. 3a), which progressed into speckled hypoautofluorescent regions by the last visit at 27 years (Fig. 3b). Manual (blue outlines) and deep learning (red outlines) segmentation results at each visit are shown side by side. In all images, three concentric circles (10°, 20°, 30° diameter) were centred on the fovea, from which the number and area of FAF flecks within each ring were analyzed. Fleck number (Fig. 3c) did not increase significantly over time in the 10° ring (p = 0.06), but the deep learning method underestimated the fleck count across most follow-up visits (p < 0.05). In the 20° and 30° rings, fleck number increased over time (both p < 0.05) and the deep learning method again underestimated the fleck count in both rings (both p < 0.05). Fleck area remained low with either method in the 10° ring over the follow-up period (Fig. 3d, left). In contrast, both methods detected an increase in fleck area over time (p < 0.05) in the 20° and 30° rings, with the deep learning method overestimating fleck area (both p < 0.05).

Fleck progression was also shown in an older patient (Patient 1, 56 years at first visit) with discrete pisciform lesions at the first visit (Fig. 3e), which progressed into discrete regions of RPE atrophy centrally and speckled hypoautofluorescent regions in the perimacula by the last visit at 62 years (Fig. 3f). There was no increase in fleck number over time in the 10° ring (p = 0.15, Fig. 3g, left) and no difference in fleck number between the two methods (p = 0.47). In the 20° and 30° rings, both manual and deep learning methods showed a significant increase in fleck number over time (p < 0.05) but no difference in fleck number estimation between the two methods (20° ring, p = 0.15; 30° ring, p = 0.07). Fleck area (Fig. 3h) increased significantly over 6 years in all three rings (all p < 0.05). The 10° ring showed a significantly larger area with the manual method (p < 0.05), but there was no difference between deep learning and manual segmentation in the outer two rings (20° ring, p = 0.27; 30° ring, p = 0.06).

Figure 3

Two examples of longitudinal data analysed via manual or deep learning segmentation. (a) Manual (blue outlines) and deep learning (red outlines) segmentation of the hyperautofluorescent flecks of Patient 4b, 21 years old, at the first visit. Each image is sub-divided into three rings (10°, 10°–20°, 20°–30° diameter), centred on the fovea. (b) Manual and deep learning segmentation of the same eye as panel (a) 6 years later at 27 years old. All other details as per panel (a). (c) Fleck number plotted against time after first visit using manual (filled) and deep learning (unfilled) segmentation in the 10° (left), 10°–20° (middle) and 20°–30° (right) rings. (d) Fleck area plotted against time after first visit using manual (filled) and deep learning (unfilled) segmentation in the 10° (left), 10°–20° (middle) and 20°–30° (right) rings. (e–h) Manual versus deep learning longitudinal results in Patient 1, 56 years old at first visit and 62 years old at last visit. Other details as per (a–d).

Overall, a total of 82 images across 12 eyes were used for longitudinal analysis. The deep learning segmentation method provided a lower fleck count than manual segmentation in 69 (84%), 72 (88%) and 70 (85%) of the 82 FAF images in the central 10° disc, 10°–20° and 20°–30° rings, respectively (Fig. 4). Conversely, the deep learning segmentation method provided a greater fleck area than manual segmentation in 52 (63%), 74 (90%) and 68 (83%) of the 82 FAF images in the central 10° disc, 10°–20° and 20°–30° rings, respectively (Fig. 4). The mean ± SD changes in fleck count derived from manual segmentation were − 1.5 ± 3.6, 8.9 ± 7.4 and 13.4 ± 13.5 flecks/year for the 10° disc, 10°–20° and 20°–30° rings, respectively, compared with − 2.2 ± 3.4, 6.9 ± 3.3 and 10.8 ± 9.8 flecks/year estimated from deep learning segmentation. The mean ± SD changes in fleck area derived from manual segmentation were − 0.01 ± 0.05, 0.05 ± 0.26 and 0.12 ± 0.18 mm²/year for the 10° disc, 10°–20° and 20°–30° rings, respectively, compared with − 0.02 ± 0.06, 0.07 ± 0.20 and 0.15 ± 0.25 mm²/year estimated from deep learning segmentation.

Figure 4

Manual versus deep learning image segmentation in longitudinal data. (a) Difference in fleck number between deep learning and manual segmentation in all 82 images available for longitudinal analysis within the 10° (left), 10°–20° (middle) and 20°–30° (right) rings. (b) Difference in fleck area between deep learning and manual segmentation in all 82 images available for longitudinal analysis, other details as per panel (a).

Discussion

The U-Net deep learning architecture employed in the current study was developed for biomedical image segmentation17, and various adaptations of the original architecture have been proposed, such as M-Net18 and TernausNet. Like the traditional UNet, the ResNet-UNet used here comprises encoder (down-sampling) and decoder (up-sampling) portions. The key difference between the ResNet-UNet and the UNet is that the down-sampling encoder adopts the ResNet-34 model33, which is widely applied in image classification and benefits from deep residual learning. The ResNet-34 used here consists, in sequence, of a 7 × 7 convolutional layer (CL), a max pooling layer and 16 residual blocks. Each residual block contains two 3 × 3 convolutional layers with ReLU and batch normalization, plus an identity shortcut connection (Fig. 5). The ResNet down-sampling portion therefore comprises 34 layers in total. To match the image size changes [32, 64, 128, 256] in the up-sampling portion, the corresponding 4 feature maps taken from the down-sampling structure are: (1) the output of the 7 × 7 CL (feature map size 256 × 256); (2) the output of the first 3 residual blocks (feature map size 128 × 128); (3) the output of the next 6 residual blocks (feature map size 64 × 64); and (4) the output of the following 4 residual blocks (feature map size 32 × 32). The up-sampling portion starts from the 512-channel 16 × 16 feature map, which is up-sampled by a 2 × 2 transposed convolution with an up-sampling factor of 2 (stride = 2). The up-sampled output (a 128-channel 32 × 32 feature map) is concatenated with the 1 × 1 convolutional output (128 channels) of the corresponding feature map from the down-sampling counterpart. The procedure is repeated until the output reaches an image size of 256 × 256. The final layer then transposes the 256-channel 256 × 256 feature map to a 1-channel 512 × 512 feature map, giving a probability map of the segmented objects.
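As an illustration of the architecture described above, the sketch below builds a ResNet-34 encoder U-Net in PyTorch using torchvision (version ≥ 0.13 assumed for the `weights` argument). It is not the authors' implementation: the decoder channel widths, the 3 × 3 convolution after each concatenation, the replication of the grayscale FAF input to three channels and the handling of pretrained weights are assumptions made for a compact, runnable example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class UpBlock(nn.Module):
    """2 x 2 transposed convolution (stride 2), concatenation with a
    1 x 1-convolved encoder skip connection, then a 3 x 3 convolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.skip = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, self.skip(skip)], dim=1)
        return self.conv(x)

class ResNetUNet(nn.Module):
    """U-Net whose down-sampling encoder is a ResNet-34 (feature maps at
    256, 128, 64, 32 and 16 pixels for a 512 x 512 input)."""
    def __init__(self, pretrained=True):
        super().__init__()
        enc = resnet34(weights="IMAGENET1K_V1" if pretrained else None)
        self.stem = nn.Sequential(enc.conv1, enc.bn1, enc.relu)   # 64 ch, 256 x 256
        self.pool = enc.maxpool                                   # 128 x 128
        self.layer1 = enc.layer1                                  # 64 ch, 128 x 128
        self.layer2 = enc.layer2                                  # 128 ch, 64 x 64
        self.layer3 = enc.layer3                                  # 256 ch, 32 x 32
        self.layer4 = enc.layer4                                  # 512 ch, 16 x 16
        self.up1 = UpBlock(512, 256, 256)                         # 16 -> 32
        self.up2 = UpBlock(256, 128, 128)                         # 32 -> 64
        self.up3 = UpBlock(128, 64, 64)                           # 64 -> 128
        self.up4 = UpBlock(64, 64, 64)                            # 128 -> 256
        self.final = nn.ConvTranspose2d(64, 1, kernel_size=2, stride=2)  # 256 -> 512

    def forward(self, x):                  # x: (batch, 3, 512, 512)
        s0 = self.stem(x)                  # 256 x 256
        s1 = self.layer1(self.pool(s0))    # 128 x 128
        s2 = self.layer2(s1)               # 64 x 64
        s3 = self.layer3(s2)               # 32 x 32
        bottom = self.layer4(s3)           # 16 x 16
        d = self.up1(bottom, s3)
        d = self.up2(d, s2)
        d = self.up3(d, s1)
        d = self.up4(d, s0)
        return self.final(d)               # 1-channel logits at 512 x 512

if __name__ == "__main__":
    logits = ResNetUNet(pretrained=False)(torch.randn(1, 3, 512, 512))
    print(logits.shape)                    # torch.Size([1, 1, 512, 512])
```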

Figure 5

Architecture of the ResNet-UNet. The ResNet-UNet follows the traditional UNet structure, which comprises encoder (down-sampling) and decoder (up-sampling) portions; the down-sampling encoder is replaced by the ResNet-34 model. Conv n*n indicates an n × n convolutional layer. TransConv 2*2, up 2 indicates a 2 × 2 transposed convolution with a stride of 2.

Contrast Limited Adaptive Histogram Equalization (CLAHE)34 was applied to the raw FAF images prior to deep learning training. The number of images in the training set was increased by extracting image patches from cropped portions (512 × 512 pixels) of the full-size CLAHE-FAF images. The training set was further augmented by using patches that had been randomly (1) rotated through 90°, 180° and 270°; (2) flipped horizontally or vertically; and (3) magnified by a scale factor of 1.0–1.3. The overall model was trained in two stages in order to take advantage of the ResNet-3433 with its pre-trained weights. The first stage froze the down-sampling encoder portion of the model and optimized only the weights of the up-sampling decoder portion. In the second stage, the ResNet-34 down-sampling portion was unfrozen to fine-tune the weights of the entire model. Adam optimization35 and BCEWithLogitsLoss were adopted for the overall training process, and the dice score between the learning results and the ground truth was used as the performance metric. The model was trained on Google Cloud (Google Colab) with a single 12 GB NVIDIA Tesla K80 GPU. The training batch size was set to 6. The initial learning rate was 0.001 and was adjusted during training for the different layers. Model training was completed within 11 h (138 epochs).
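A condensed sketch of the preprocessing and two-stage training procedure is given below, reusing the illustrative ResNetUNet class from the architecture sketch above. The CLAHE parameters, the per-stage epoch counts, the data loader (shown here as a random stand-in) and the per-layer learning rates are assumptions; only the batch size of 6, the initial learning rate of 0.001, Adam and BCEWithLogitsLoss come from the text.

```python
import cv2
import torch
from torch import nn, optim

# CLAHE preprocessing of a raw 8-bit FAF image (clip limit and tile size assumed).
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(cv2.imread("faf_image.png", cv2.IMREAD_GRAYSCALE))

model = ResNetUNet(pretrained=True)          # illustrative class sketched above
criterion = nn.BCEWithLogitsLoss()
encoder = [model.stem, model.layer1, model.layer2, model.layer3, model.layer4]
decoder = [model.up1, model.up2, model.up3, model.up4, model.final]

# Stand-in loader: batches of 6 (patch, mask) pairs; the grayscale CLAHE patch
# is assumed to be replicated to 3 channels.  Real dataset construction not shown.
train_loader = [(torch.randn(6, 3, 512, 512),
                 torch.randint(0, 2, (6, 1, 512, 512)).float())]

def run(model, loader, optimizer, epochs):
    model.train()
    for _ in range(epochs):
        for patches, masks in loader:
            optimizer.zero_grad()
            loss = criterion(model(patches), masks)
            loss.backward()
            optimizer.step()

# Stage 1: freeze the ResNet-34 encoder and optimize only the decoder weights.
for module in encoder:
    for p in module.parameters():
        p.requires_grad = False
run(model, train_loader,
    optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3),
    epochs=40)                               # per-stage epoch split assumed

# Stage 2: unfreeze the encoder and fine-tune the whole model, with a lower
# learning rate for the pretrained encoder than for the decoder.
for p in model.parameters():
    p.requires_grad = True
optimizer = optim.Adam([
    {"params": [p for m in encoder for p in m.parameters()], "lr": 1e-4},
    {"params": [p for m in decoder for p in m.parameters()], "lr": 1e-3},
])
run(model, train_loader, optimizer, epochs=98)   # 138 epochs in total per the text
```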

Applying the deep learning algorithm to segment and analyze hyperautofluorescent flecks

The FAF images were first processed by CLAHE. The original images measured 1536 × 1536 pixels and were each partitioned into 9 non-overlapping image patches (3 × 3 grid, each 512 × 512 pixels). Each 512 × 512 pixel image patch was rotated by 90°, 180° and 270° to generate 3 new image patches. The original image patch was then flipped horizontally and the flipped image rotated by 90°, 180° and 270° to generate 4 more image patches. All 8 patches were processed by the ResNet-UNet deep learning model for hyperautofluorescent region segmentation, yielding 8 outputs. The final lesion mask for each patch was the average of the 8 masks obtained by reverse-transforming the 8 model outputs. The hyperautofluorescent mask of the entire image was then assembled from the 9 partitioned 512 × 512 pixel image patches.
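The eight-fold test-time augmentation and 3 × 3 stitching described above can be expressed compactly. The sketch below is illustrative rather than the authors' code; the 0.5 probability threshold and the replication of the grayscale patch to three channels are assumptions.

```python
import numpy as np
import torch

def predict_patch(model, patch):
    """Average the model output over the 8 transformations of a 512 x 512 patch
    (identity, three rotations, horizontal flip and its three rotations),
    reversing each transformation before averaging."""
    preds = []
    for flip in (False, True):
        base = np.fliplr(patch) if flip else patch
        for k in range(4):                                   # 0, 90, 180, 270 degrees
            t = np.ascontiguousarray(np.rot90(base, k))
            x = torch.from_numpy(t).float()[None, None].repeat(1, 3, 1, 1)
            with torch.no_grad():
                prob = torch.sigmoid(model(x))[0, 0].numpy()
            prob = np.rot90(prob, -k)                        # undo the rotation
            if flip:
                prob = np.fliplr(prob)                       # undo the flip
            preds.append(prob)
    return np.mean(preds, axis=0)

def predict_image(model, clahe_image, threshold=0.5):
    """Tile a 1536 x 1536 CLAHE image into a 3 x 3 grid of 512 x 512 patches,
    segment each patch with test-time augmentation, and stitch the averaged
    probability maps into a single binary hyperautofluorescent mask."""
    model.eval()
    prob_map = np.zeros(clahe_image.shape, dtype=float)
    for row in range(3):
        for col in range(3):
            sl = (slice(512 * row, 512 * (row + 1)),
                  slice(512 * col, 512 * (col + 1)))
            prob_map[sl] = predict_patch(model, clahe_image[sl])
    return prob_map > threshold
```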

Longitudinal analysis was performed using deep learning in the same subset of 12 eyes with manually segmented flecks. Following delineation of the fleck outlines by deep learning, a mask of the fleck outlines was generated and the same registration approach described in the manual segmentation section was used to measure the number and total area of flecks within the three concentric rings (radius 5°, 10° and 15°) centred at the fovea. Figure 6 illustrates the workflow for both manual and deep learning segmentation employed in the study.
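After registration, per-ring fleck counts and areas can be extracted from the binary mask by connected-component labelling. The following is a sketch under stated assumptions (each fleck is assigned to a ring by its centroid, and the pixels-per-degree and mm²-per-pixel scale factors come from the scan metadata), not the exact analysis pipeline used in the study.

```python
import numpy as np
from scipy import ndimage

def flecks_in_rings(mask, fovea_xy, px_per_degree, mm2_per_px):
    """Count flecks and sum their area within the 0-5, 5-10 and 10-15 degree
    (radius) rings centred on the fovea.  Each connected component of the
    binary mask is one fleck; it is assigned to the ring containing its centroid."""
    labels, n_flecks = ndimage.label(mask)
    fovea_x, fovea_y = fovea_xy
    results = {"0-5": [0, 0.0], "5-10": [0, 0.0], "10-15": [0, 0.0]}
    for lab in range(1, n_flecks + 1):
        component = labels == lab
        cy, cx = ndimage.center_of_mass(component)
        radius_deg = np.hypot(cx - fovea_x, cy - fovea_y) / px_per_degree
        for name, lo, hi in (("0-5", 0, 5), ("5-10", 5, 10), ("10-15", 10, 15)):
            if lo <= radius_deg < hi:
                results[name][0] += 1                             # fleck count
                results[name][1] += component.sum() * mm2_per_px  # area in mm^2
    return {name: {"count": c, "area_mm2": a} for name, (c, a) in results.items()}
```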

Figure 6

Workflow of longitudinal image processing for hyperautofluorescent flecks for both manual and deep learning segmentation. Baseline and follow-up images were segmented via both manual and deep learning methods before image registration. Fleck quantifications from the two methods were then compared against each other.

Validation outcome measure

In order to evaluate the model’s segmentation performance on the lesions, the dice score (also known as the Sørensen–Dice coefficient; Eq. 1) was used, where the dice score (D) equals twice the number of elements common to the manual (Y) and deep learning (Ŷ) results divided by the sum of the numbers of elements in the two sets.

$$D(Y,\hat{Y}) = \frac{2|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}.$$
(1)

The dice score was calculated for each image by comparing the manually delineated regions with those marked by deep learning.
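Equation (1) translates directly to a few lines of code; the sketch below assumes two binary masks of equal size.

```python
import numpy as np

def dice_score(manual_mask, dl_mask):
    """Dice score (Eq. 1): D = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) for two binary masks."""
    y = np.asarray(manual_mask, dtype=bool)
    y_hat = np.asarray(dl_mask, dtype=bool)
    total = y.sum() + y_hat.sum()
    if total == 0:
        return 1.0                       # both masks empty: perfect agreement
    return 2.0 * np.logical_and(y, y_hat).sum() / total
```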

Statistical analysis

Bland–Altman analysis16 was employed to evaluate the differences in parameters extracted from the deep learning and manual segmentation methods in the testing set. In a subset of 12 eyes from 6 subjects, disease progression over time was assessed using both deep learning and manual segmentation methods. Fleck count and area were extracted from all images with both methods and compared across time via two-way ANOVA. All data were summarized as mean ± SD.
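As a sketch of the longitudinal comparison, a two-way ANOVA with segmentation method and visit as factors can be run with statsmodels. The data frame below is hypothetical, and the exact model specification (e.g. whether an interaction term or a repeated-measures design was used) is an assumption, as it is not detailed in the text.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format data: one row per image, with the segmentation
# method, the visit (years after baseline) and the fleck count in one ring.
data = pd.DataFrame({
    "method": ["manual", "deep_learning"] * 4,
    "years":  [0, 0, 2, 2, 4, 4, 6, 6],
    "count":  [30, 27, 38, 33, 45, 40, 52, 47],
})

# Two-way ANOVA with method and time as factors (no interaction term).
fit = ols("count ~ C(method) + C(years)", data=data).fit()
print(sm.stats.anova_lm(fit, typ=2))
```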