Background

Colorectal cancer (CRC) is the third most common cancer and the third leading cause of cancer death in the world [1]. In percentage terms, CRC accounts for 10% of worldwide cancer incidence and 9–10% of global cancer deaths [2]. Lymph node metastasis (LNM) is the main mode of metastasis in CRC. Accurate diagnosis of LNM provides a solid foundation for subsequent postoperative management and prognostic estimation. Patients diagnosed with LNM should undergo dissection of the lymph nodes surrounding the colon region to prevent further spread. However, LNM is usually diagnosed manually by clinicians with reference to medical images, which may lead to inaccurate diagnoses when clinicians work under heavy workloads and prolonged fatigue. Hence, automatic and reliable LNM diagnosis is in high demand to assist clinicians in the diagnostic process.

In recent years, convolutional neural networks (CNNs) have shown great potential in the field of medicine. In diagnostic medicine more specifically, initial results from the application of deep learning to metastasis diagnosis are very promising [3, 4]. Among CNNs, architectures inspired by the U-Net [5] have been widely used for medical segmentation due to their ability to analyze features with an encoder–decoder structure [6,11]. Double U-Net stacks two U-Net architectures on top of each other. The additional U-Net network is adopted to learn high-level global features, and these features are then fused with the results from the original U-Net in the final decoder. Triple U-Net [12] includes an RGB branch, an HE branch and a segmentation branch. The features extracted from the RGB and HE branches are fused into the segmentation branch to learn better representations. Attention U-Net [14,15,16]. This problem prevents neural networks from effectively learning general patterns of LNM. To overcome this problem, it is necessary to consider the precise boundaries of different LNM regions and explore their contextual dependencies, so that LNM regions can be completely segmented from the intricate tissue background. Thus, the key challenge is how to aggregate detailed and non-local contextual information.

In visual neuroscience, this aggregation process belongs to the high-acuity visual system, in which the retinal fovea resolves fine spatial detail while the rest of the retina receives a blurred but wide field of view [17,18,19]. For example, as shown in Fig. 1, the distribution of retinal photoreceptor cells across the eyeball is highly uneven, and many of them are concentrated at the fovea. In the periphery, photoreceptor density declines rapidly with increasing distance from the fovea. In other words, the fovea has high resolution and the periphery has low resolution. Thus, the fovea can clearly distinguish and recognize detailed information, while the low-resolution region surrounding the fovea captures non-local contextual information for quick judgment. Inspired by the fovea of the human visual system, this paper proposes Fovea-Unet, a lightweight architecture that performs effective LNM segmentation of medical images by devising a Fovea Pooling (FP) method to aggregate detailed and non-local contextual information in the U-Net encoder. FP consists of an importance-aware module and a pooling layer with an adaptive radius. First, the pixel-level importance of features in the spatial domain is calculated by the importance-aware module, which is built on the attention mechanism. Then, the pooling layer aggregates the features with a variable pooling radius that varies inversely with importance. By applying adaptive pooling with different radii, FP handles the segmentation of the regions most relevant to LNM at different resolutions. Unlike other U-Net variants that add attention mechanisms, FP improves the pooling operation itself to overcome the inherent limitation that CNNs and encoder–decoder structures cannot balance detailed and non-local contextual information. This ensures that FP can better capture non-local contextual information across the full field of view while preserving detailed information.
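To make the idea of a radius that shrinks with importance concrete, the following is a minimal one-dimensional sketch in plain Python. It is an illustration only, not the FP implementation: the linear mapping from importance to radius, the clipping of windows at the borders, the use of average aggregation, and the names `fovea_pool_1d` and `max_radius` are all assumptions made for this example.

```python
def fovea_pool_1d(features, importance, max_radius):
    """Average-pool each position over a window whose radius shrinks
    as the importance of that position grows (illustrative sketch)."""
    if len(features) != len(importance):
        raise ValueError("features and importance must have the same length")
    n = len(features)
    pooled = []
    for k in range(n):
        # Assumed mapping: importance 1 -> radius 0 (keep fine detail),
        # importance 0 -> radius max_radius (gather wide context).
        radius = round(max_radius * (1.0 - importance[k]))
        lo = max(0, k - radius)          # clip the window at the left border
        hi = min(n, k + radius + 1)      # clip the window at the right border
        window = features[lo:hi]
        pooled.append(sum(window) / len(window))
    return pooled


features = [0.0, 0.0, 10.0, 0.0, 0.0]
importance = [0.0, 0.0, 1.0, 0.0, 0.0]   # only the center is "foveal"
print(fovea_pool_1d(features, importance, max_radius=2))
```

In this toy input, the high-importance center keeps its exact value, while the low-importance neighbors are replaced by averages over wider windows, which mirrors the sharp fovea and blurred periphery described above.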

Fig. 1 Inspiration of Fovea-UNet. Left: LNM images shown from the perspective of the human eyeball, with the isodensity lines of retinal photoreceptors in the human retina drawn on the eyeball. Right: examples at different resolutions, corresponding to regions with different photoreceptor densities

However, the importance-aware module of FP places a heavy computational burden on the entire network. A feasible solution is to reduce this burden through an efficient and lightweight neural architecture design [20,21,22]. To this end, we introduce GhostNet [23] as the backbone network for feature extraction; it is a lightweight network that reduces the computational cost while retaining the intrinsic features. However, directly applying it as the backbone for LNM segmentation degrades segmentation performance, because the intrinsic feature maps computed by the ordinary convolution layers may carry insufficient detailed information. Inspired by the theory of neural network representation similarity [24], which has been shown to provide powerful insights into the properties of representations within neural networks, we adopt the Hilbert–Schmidt independence criterion (HSIC) to improve the diversity of features. The improved GhostNet, called nHSIC–GhostNet (H-GhostNet), is thus able to learn the full intrinsic information of the input medical images. Specifically, a similarity constraint, namely HSIC, is applied within each layer as a regularization term during training, with the aim of boosting the diversity of intrinsic features. Through the HSIC regularization term, the proposed H-GhostNet can obtain more feature information and redundancy, which improves the accuracy of the segmentation results.
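As a rough illustration of a channel-similarity penalty of this kind, the sketch below computes a normalized HSIC between pairs of flattened feature channels and averages it over all distinct pairs. It assumes a linear kernel, in which case the normalized HSIC of two single channels reduces to their squared Pearson correlation. The kernel choice, the averaging over pairs, and the function names are assumptions for this example and are not taken from the H-GhostNet training code.

```python
def centered(values):
    """Subtract the mean so the linear-kernel Gram matrix is centered."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def nhsic_linear(x, y):
    """Normalized HSIC of two flattened channels with a linear kernel.

    With one feature per sample this equals the squared Pearson
    correlation: (xc . yc)^2 / (|xc|^2 * |yc|^2).
    """
    xc, yc = centered(x), centered(y)
    cross = sum(a * b for a, b in zip(xc, yc)) ** 2
    norm = sum(a * a for a in xc) * sum(b * b for b in yc)
    return cross / norm if norm > 0 else 0.0


def similarity_penalty(channels):
    """Mean pairwise nHSIC over distinct channel pairs (lower = more diverse)."""
    pairs = [(i, j) for i in range(len(channels))
             for j in range(i + 1, len(channels))]
    if not pairs:
        return 0.0
    total = sum(nhsic_linear(channels[i], channels[j]) for i, j in pairs)
    return total / len(pairs)


redundant = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]    # scaled copies
diverse = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]    # uncorrelated
print(similarity_penalty(redundant), similarity_penalty(diverse))
```

Adding such a term to the segmentation loss would penalize layers whose intrinsic channels are near copies of one another, which is the intuition behind using a similarity constraint to encourage feature diversity.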

In summary, the main contributions of this paper are as follows.

  1. We develop an importance-aware Fovea Pooling (FP), a novel adaptive aggregation pooling method that enables the network to focus on the region most relevant to LNM. FP provides a better trade-off that takes both detailed information and non-local contextual information into consideration.

  2. We propose an improved H-GhostNet as a lightweight backbone network to promote discriminative and heterogeneous feature extraction through an intrinsic feature-based regularization term. The proposed training strategy combines the Ghost convolution layer with HSIC regularization to obtain effective feature representations while keeping the amount of computation small.

  3. We demonstrate the effectiveness of the proposed Fovea-Unet on a challenging practical diagnostic task. A dataset of LNM in colorectal cancer is collected and carefully annotated. In addition, comprehensive experiments show that the proposed method outperforms state-of-the-art metastasis segmentation methods in both segmentation accuracy and efficiency.

The paper is organized as follows. The LNM segmentation experimental results are given in the “Results” section. The discussion based on the experimental results is given in the “Discussion” section. The paper is summarized in the “Conclusion” section. The proposed Fovea-Unet neural network is described in the “Methods” section.

Results

Datasets description

In this work, we design the Fovea-Unet to detect colorectal cancer metastasis and segment lymph node metastasis regions. We collected paraffin samples from curative resections of colorectal cancer with lymph node metastasis at Tangshan Gongren Hospital from January 2016 to December 2018. All samples underwent hematoxylin–eosin (HE) staining: they were soaked in 10% neutral formalin solution for half an hour to fix the shape of the tissues, embedded in paraffin for half an hour for dehydration, sectioned on a paraffin microtome, and then dewaxed and stained with HE. Finally, a digital tomography scanner was used to scan the pathological sections into 81 whole slide images (WSIs). The lymph node metastasis regions in the WSIs were viewed with K-Viewer software (version 1.5.2.5; KFBIO; http://www.kfbio.cn) at a specific magnification, such as ×10 or ×20, chosen so that the field of view covers a single metastatic region. In this way, all metastatic regions were manually extracted from the WSIs and uniformly resized to 1024 × 1024. After labeling, the annotated metastatic regions were used to construct the dataset. Table 1 shows the partition of the dataset. The metastatic regions were used as input images to achieve high-precision pixel-level segmentation. The collection of these images was approved by the Ethics Committee of Tangshan Gongren Hospital (Grant No. GRYY-LL-2019-50).

Table 1 Overview of the training and testing LNM datasets

Experiment settings

The proposed Fovea-Unet is implemented in the PyTorch 1.8 framework and trained on one NVIDIA A100 SXM4 GPU with 80 GB of memory. We use the Adam optimizer with \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) to train the whole network end to end. The backbone is based on GhostNet pre-trained on ImageNet. Training runs for 300 iterations. During the first 30 iterations, freeze training is adopted: the pre-trained backbone weights are frozen, which devotes more computing resources to the network parameters containing the FP modules, prevents the pre-trained weights from being destroyed, and improves training efficiency. In this stage, the learning rate is set to \(10^{-4}\). After the freezing stage, all parameters in the model participate in training, the learning rate is set to \(10^{-5}\), and the mini-batch size is set to 4. In each stage of the encoder subnetwork, all feature maps are first reduced to one-quarter of the original number of channels by a preceding convolution layer. In the FP, we use adaptive reflect padding to keep the feature size unchanged. In addition, data augmentation strategies are used to enhance the diversity of the dataset, and the dataset is randomly divided into a training set and a test set at a ratio of 8:2.
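The two-stage schedule described above can be summarized as a small lookup, sketched here in plain Python. The helper name `training_phase`, the dictionary keys, and the 0-indexed iteration convention are assumptions for illustration; only the numeric values (30 frozen iterations out of 300, learning rates of 1e-4 and 1e-5, and the Adam betas) come from the settings stated in the text.

```python
def training_phase(iteration, freeze_iters=30, total_iters=300):
    """Return the settings in effect at a 0-indexed training iteration."""
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration must lie in [0, total_iters)")
    frozen = iteration < freeze_iters
    return {
        # Backbone weights stay fixed during the freezing stage.
        "backbone_frozen": frozen,
        # 1e-4 while frozen, 1e-5 once all parameters are trained.
        "learning_rate": 1e-4 if frozen else 1e-5,
        # Adam moment coefficients (beta1, beta2) used throughout.
        "adam_betas": (0.9, 0.999),
    }


for it in (0, 29, 30, 299):
    print(it, training_phase(it))
```

In a PyTorch training loop, the `backbone_frozen` flag would typically be applied by toggling `requires_grad` on the backbone parameters, and the learning rate by updating the optimizer's parameter groups.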

To evaluate segmentation accuracy, we use the intersection over union (IoU), Dice similarity coefficient (DSC), sensitivity (Sen), specificity (Sp), and precision (Pre) as the main evaluation metrics, which are defined below:

$$\begin{gathered} IoU = \frac{TP}{{TP + FP + FN}} \hfill \\ DSC = \frac{2 * TP}{{2 * TP + FP + FN}} \hfill \\ Sen = \frac{TP}{{TP + FN}} \hfill \\ Sp = \frac{TN}{{TN + FP}} \hfill \\ Pre = \frac{TP}{{TP + FP}} \hfill \\ \end{gathered}$$
(1)

where FP, TN, FN, and TP denote the numbers of false positive, true negative, false negative, and true positive pixels, respectively (in Eq. (1), FP stands for false positives, not Fovea Pooling). IoU ∈ [0, 1] is the ratio between the intersection and the union of the LNM regions in the ground truth and in the network segmentation results. The higher the IoU, the better the segmentation result. DSC ∈ [0, 1] is an evaluation metric often used in medical image segmentation to measure the similarity between the ground truth and the segmentation results. The higher the DSC, the better the segmentation result. Sen, Sp, and Pre also lie in [0, 1]; the closer they are to 1, the better the segmentation.
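The metrics in Eq. (1) can be computed directly from the four pixel counts, as in the following plain-Python sketch. The function name `segmentation_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for this example; the formulas themselves follow Eq. (1).

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute IoU, DSC, Sen, Sp and Pre from pixel-level counts."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "IoU": ratio(tp, tp + fp + fn),
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "Sen": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "Pre": ratio(tp, tp + fp),
    }


# Hypothetical counts for a single image, for illustration only.
metrics = segmentation_metrics(tp=80, fp=10, fn=10, tn=900)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a single set of counts, the two overlap metrics are tied by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU.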

Experiment results

This section presents the segmentation results on the lymph node metastasis dataset. In this paper, U-Net is taken as the basic reference network. On this basis, we first assess the performance of the proposed Fovea-Unet and compare it with other improved networks based on U-Net. For a fair comparison, we implement their network architectures and use the same data preparation methods. Table 2 compares the segmentation results of U-Net, U-Net++, Double U-Net, Triple U-Net, and Attention U-Net in terms of all metrics used in our experiments. Table 2 shows that all the improved networks outperform the original U-Net. As shown in Table 2, the Fovea-Unet achieves the best performance on every evaluation metric except Sp, reaching 79.38%, 88.51%, 92.82%, 96.80%, and 84.57% for IoU, DSC, Sen, Sp, and Pre, respectively. Compared with the basic U-Net, Fovea-Unet increases IoU, DSC, Sen, Sp, and Pre by 12.94%, 8.67%, 7.53%, 2.12% and 9.53%, respectively. In addition, compared with the other networks, the ability of Fovea-Unet to aggregate detailed and non-local contextual information improves accuracy metrics such as IoU and DSC. Attention U-Net, with the advantage of attention, produces IoU and DSC results of 78.22% and 87.78%, respectively, which are second only to those of our network. Notably, the parameter size of the proposed Fovea-Unet is only 23.23 MB, which is 152.5 MB smaller than that of Attention U-Net.

Table 2 Comparison results of the proposed network with other networks based on U-Net

To further verify the effectiveness and robustness of the proposed Fovea-Unet for lymph node metastasis segmentation, we selected several state-of-the-art segmentation networks for comparison, including three typical networks, namely, U-Net [5], SegNet [25], and DeepLabv3+ [26], and two lightweight segmentation networks, namely, ENet and LEDNet. Fig. 3 shows the segmentation results of the different networks for several typical metastasis images from the lymph node metastasis dataset and compares the corresponding segmentation predictions, generated by overlaying the segmentation masks on the input images. Existing state-of-the-art networks clearly under-segment regions with irregular shapes and low contrast, whereas Fovea-Unet segments them well (Rows 1–3).

Table 3 Comparison results of the proposed network with other state-of-the-art segmentation networks
Fig. 3 Segmentation results of different networks on the LNM dataset. a original input images. b labels. c Fovea-UNet. d DeepLabv3+. e U-Net. f ENet. g LEDNet. h SegNet

Discussion

The proposed Fovea-Unet improves both effectiveness and efficiency owing to two main advantages:

  1. The importance-aware Fovea Pooling (FP) aggregates detailed information and non-local contextual information, which gives the network the ability to focus on the region most relevant to LNM.

  2. The improved H-GhostNet serves as a lightweight backbone network that promotes discriminative and heterogeneous feature extraction and improves computation speed.

Impact of different pooling strategies in FP

To discuss the effectiveness of the FP in Fovea-Unet, we conduct comprehensive ablation studies in terms of the aggregation method of the pooling layer and the boundary of pooling radius.

Effect of different pooling aggregation methods

The proposed Fovea-Unet places one FP in each of the four stages of the encoder sub-network to refine and aggregate the information. To justify the effectiveness of FP, we first compare the results obtained when the FP is removed with those obtained when different pooling methods are employed inside it, including Lp Pooling [29], Average Pooling, and Mixed Pooling [30]. Note that Max Pooling was not used as a comparative method, because selecting only the largest element harms the segmentation performance of the network.

As detailed in Table 4, all three pooling methods used in the FP greatly improve segmentation performance over the baseline, in which the FP is replaced by an identity map. Here, IM denotes the identity map, LP denotes the Lp Pooling method, AP denotes Average Pooling, and MP denotes Mixed Pooling. Relative to the identity-map baseline, the IoU of the three pooling methods increases by 12.95%, 11.70%, and 12.92%, respectively, and the DSC increases by 8.66%, 7.89% and 8.66%, respectively, which demonstrates the effectiveness of FP for segmentation tasks. The segmentation metrics show that all three methods perform well, with average scores of 79.98% and 88.25% in terms of IoU and DSC. Moreover, Mixed Pooling and Lp Pooling score slightly higher than Average Pooling, indicating that retaining an appropriate proportion of maximum information is important for good segmentation performance on LNM regions. The results in Table 4 indicate that the large improvement comes from the FP itself and depends only weakly on the pooling method chosen within it.
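To illustrate how the three aggregation methods differ in the amount of maximum information they retain, the sketch below applies common textbook forms of each to a single pooling window. The exact variants used in [29] and [30] may differ: the normalized (mean-based) form of Lp Pooling, the fixed mixing weight `lam` for Mixed Pooling, and the function names are assumptions for this example.

```python
def average_pool(window):
    """Arithmetic mean of the window."""
    return sum(window) / len(window)


def lp_pool(window, p):
    """Normalized Lp pooling: (mean(|x|^p))^(1/p).

    p = 1 gives the mean of absolute values; larger p moves toward the max.
    """
    return (sum(abs(x) ** p for x in window) / len(window)) ** (1.0 / p)


def mixed_pool(window, lam):
    """Convex mix of max and average pooling with weight lam in [0, 1]."""
    return lam * max(window) + (1.0 - lam) * average_pool(window)


window = [1.0, 2.0, 3.0, 6.0]
print("average:", average_pool(window))
print("Lp (p=2):", lp_pool(window, 2))
print("mixed (lam=0.5):", mixed_pool(window, 0.5))
```

On non-negative inputs, both Lp Pooling with p > 1 and Mixed Pooling with lam > 0 return values between the average and the maximum of the window, which is one way to read the observation that some maximum information helps.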

Table 4 Comparison results of the proposed network under different pooling aggregation methods

Effect of different pooling boundaries

When the importance of a specific element \(z_{k}^{i}\) is set to zero, the pooling radius reaches its maximum, i.e., \(r = e^{\varsigma }\). Here \(\varsigma\) is an empirical value associated with the maximum pooling radius, which we call the pooling boundary [see Eq. (6) in the “Methods” section for details]. In this section, we discuss the impact of different pooling boundaries on segmentation performance and how to set their values, using a comparative experiment conducted at five different scales in each stage of the encoder sub-network. Note that we use the normalized term \(s = e^{\varsigma } /w_{i}\) to denote the pooling boundary, where \(w_{i}\) represents the spatial size of feature \(i\). For the output features of the four stages in the encoder, the parameter \(s\) is first set to 1/8, and the parameters are then adjusted stage by stage until the best results are obtained. In each stage, five experiments with pooling boundaries from 1/2 to 1/32 are conducted, as shown in Fig. 4.

Fig. 4 Performance of different pooling boundaries in each stage. a stage 1. b stage 2. c stage 3. d stage 4

The comparative results show that the Fovea-Unet performs best when \(s = 1/16\) or \(s = 1/8\). A smaller pooling boundary yields suboptimal performance, and a larger pooling boundary leads to a sharp decline in performance. In the deeper stages 3 and 4, the peak becomes more pronounced relative to both extremes, and the variance of performance nearly doubles compared with stages 1 and 2. The main reasons are as follows. In the low-level feature maps, each location represents a small local neighborhood, and the shallow features carry the majority of the image information; they are responsible for detailed contextual information but contribute to the decision only to a small extent. In contrast, as the receptive field gradually increases, each element of the high-level feature maps has a larger non-local perception and more semantic information, and contributes to the segmentation results to a greater extent. Hence, it is advantageous to employ a proper combination of pooling boundaries across stages to explore both detailed and non-local contextual information, which better guides the FP and improves the performance of the segmentation network. The optimal value of \(s\) in each stage should be set according to Fig. 4, i.e., \(s = 1/8\) in stage 1, \(s = 1/16\) in stage 2, \(s = 1/8\) in stage 3, and \(s = 1/8\) in stage 4.
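Since \(s = e^{\varsigma } /w_{i}\), the maximum pooling radius and \(\varsigma\) follow directly from \(s\) and the feature size: \(e^{\varsigma } = s \cdot w_{i}\) and \(\varsigma = \ln (s \cdot w_{i})\). The short Python sketch below works this out for the per-stage values of \(s\) listed above. The feature sizes in `assumed_widths` are hypothetical placeholders chosen for illustration and are not taken from the paper, as are the function names.

```python
import math


def max_pooling_radius(s, width):
    """Maximum radius e^sigma implied by s = e^sigma / w_i."""
    return s * width


def sigma_from_boundary(s, width):
    """Recover sigma = ln(s * w_i) from the normalized boundary s."""
    return math.log(s * width)


# Normalized pooling boundaries per encoder stage, as stated in the text.
stage_boundaries = {1: 1 / 8, 2: 1 / 16, 3: 1 / 8, 4: 1 / 8}
# Hypothetical feature sizes w_i per stage (assumption, illustration only).
assumed_widths = {1: 256, 2: 128, 3: 64, 4: 32}

for stage in sorted(stage_boundaries):
    s = stage_boundaries[stage]
    w = assumed_widths[stage]
    print(f"stage {stage}: s={s}, w={w}, "
          f"max radius={max_pooling_radius(s, w)}, "
          f"sigma={sigma_from_boundary(s, w):.4f}")
```

Because \(s\) is normalized by the feature size, the same value of \(s\) corresponds to a smaller absolute radius in deeper stages, where the feature maps are smaller.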

Impact of different backbones in Fovea-Unet

We also compare the proposed backbone H-GhostNet with other backbones. Moreover, we demonstrate the effectiveness of HSIC regularization.

Quantitative analysis of different backbones

To investigate the effect of different backbones on the proposed method, we compare the proposed H-GhostNet with five different state-of-the-art backbones, which include three normal backbones, namely, VGGNet [

Table 5 Comparison results of the proposed network under different backbones

Effectiveness of HSIC regularization

The effect of HSIC regularization is further explored by visualizing channelwise feature similarity. We use centered kernel alignment (CKA) to study the internal representation structure of specific layers, which enables quantitative comparisons of features within networks [35]. As shown in Fig. 5, the first 50 intrinsic feature maps within a specific layer are taken as the input to generate a heatmap, with the x and y axes indexing the ordered representations. A darker color represents higher similarity. The intrinsic features extracted by a specific layer show different statistical properties under different training strategies. When the Fovea-Unet is trained without the regularization (Fig. 5a–e), the extracted features tend to be homogeneous. Fig. 5f–j shows the same setting with the additional similarity regularization \(L_{HSIC}\), which results in relatively low channelwise similarity. This confirms that the H-GhostNet regularized by the similarity constraint can effectively improve the capability of Fovea-Unet. In the future, the devised H-GhostNet can be used to facilitate medical segmentation tasks with the complementary knowledge of features.

Fig. 5 CKA similarity heatmaps of the GhostNet backbone among the first fifty channels of intrinsic features for two cases: without \(L_{HSIC}\) (a–e) and with \(L_{HSIC}\) (f–j). a, f layer 8. b, g layer 10. c, h layer 12. d, i layer 14. e, j layer 16

Limitation and future work

Although promising results have been obtained, the proposed Fovea-Unet still has some limitations that should be taken into consideration. On the one hand, the attention-based importance-aware modules require a large number of floating-point operations (FLOPs), which leads to high computational costs, and the calculation of the pooling radius is relatively tedious. On the other hand, the single-head FP struggles to cope with extremely scattered metastases. In future work, more efficient computing methods can be used in the importance-aware modules, and a multi-head FP can be developed with reference to the multi-head attention mechanism in the Transformer, which would make the segmentation network more flexible in feature aggregation and further improve the quality of LNM segmentation.