Background

The adrenal glands are important secretory organs located above the kidneys on both sides of the body. Each gland can be divided into a cortical and a medullary layer, which mainly secrete adrenal corticosteroids and the catecholamines epinephrine and norepinephrine, respectively, and play an important role in maintaining the normal functioning of body organs [1, 2]. Adrenal tumors are formed by the abnormal proliferation of local adrenal tissue cells and are pathologically classified, according to their biological behavior, into benign tumors (expansive growth) and malignant tumors (invasive growth) [3,4,5]. Both benign and malignant tumors can affect hormone secretion and lead to hypertension, hyperglycemia, and cardiovascular disease [6,7,8].

Computed tomography (CT) is an effective tool for diagnosing adrenal tumors because it is non-invasive, produces clear images, and has high diagnostic efficiency. Because the morphology of the tumor (chiefly the smoothness of its edge) and the statistical characteristics of the tumor region (chiefly the variation in density and gray level within the tumor) are two crucial features for diagnosis and differential diagnosis, precise segmentation of the tumor is essential. Many studies have evaluated traditional methods for adrenal tumor segmentation, such as classifying CT images pixel by pixel with a random forest classifier [9] and applying a localized region-based level set method (LRLSM) to segment the region of interest (ROI) [10]. Other studies have proposed segmentation frameworks that combine several algorithms to obtain better segmentation performance [11,12,13], including the K-singular value decomposition (K-SVD) algorithm, region growing (RG), K-means clustering, and image erosion.

In recent years, with the continuous improvement in computing power and the development of deep learning techniques [14, 30], deep learning-based segmentation has become a relatively mature application, covering brain tumors [31, 32], the kidney [33], and kidney tumors [34].

Network encoder

(1) Feature embedding

After four down-sampling stages, the encoder produces a high-level feature map of size \(\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\). Instead of directly splitting the original image into patches, the Transformer can thus model local contextual features in the spatial and depth dimensions of the high-level feature map. The Transformer requires sequence data as input, so we use the Reshape operation to flatten the feature map to \(256 \times \mathrm{N}\), where \(\mathrm{N}=\frac{\mathrm{W}}{16}\times \frac{\mathrm{H}}{16}\times \frac{\mathrm{D}}{16}\).

The position of each pixel in the image cannot be ignored; such spatial information is indispensable for the accurate segmentation of tumors. We therefore also add a learnable position embedding to restore location information.
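As a concrete illustration, here is a minimal PyTorch sketch of this flatten-and-embed step; the module name and the (N, 256) token layout (the transpose of the 256 × N written above) are our assumptions:

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Flatten a high-level 3D feature map into a token sequence and add a
    learnable position embedding (a sketch; names and layout are assumed)."""

    def __init__(self, channels=256, num_tokens=128):
        super().__init__()
        # num_tokens = (W/16) * (H/16) * (D/16) for the input volume
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, channels))

    def forward(self, x):
        # x: (B, 256, W/16, H/16, D/16) feature map from the encoder
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, 256)
        return tokens + self.pos_embed         # restore location information
```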

(2) Transformer layer

The architecture of the Transformer layer, which consists of a multi-head attention (MHA) module and a feed-forward network, is shown in Fig. 1.

The MHA consists of eight single attention heads, each of which can be viewed as mapping a set of query vectors to output vectors based on key-value pairs. The details are shown in Formulas 1 and 2.

$$\begin{array}{c}MultiHead\left(\mathrm{Q},\mathrm{K},\mathrm{V}\right)=Concat\left({\mathrm{head}}_{1},\dots ,{\mathrm{head}}_{8}\right){\mathrm{W}}^{\mathrm{O}}\\ where\ {\mathrm{head}}_{\mathrm{i}}=Attention\left({\mathrm{QW}}_{\mathrm{i}}^{\mathrm{Q}},{\mathrm{KW}}_{\mathrm{i}}^{\mathrm{K}},{\mathrm{VW}}_{\mathrm{i}}^{\mathrm{V}}\right)\end{array}$$
(1)
$$\begin{array}{c}Attention\left(\mathrm{Q},\mathrm{K},\mathrm{V}\right)=softmax\left(\frac{{\mathrm{QK}}^{\mathrm{T}}}{\sqrt{{\mathrm{d}}_{\mathrm{k}}}}\right)V\end{array}$$
(2)

where \({\mathrm{W}}^{\mathrm{O}}\in {\mathrm{R}}^{8{\mathrm{d}}_{\mathrm{v}}\times \mathrm{D}}\), \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{Q}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{k}}}\), \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{K}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{k}}}\), and \({\mathrm{W}}_{\mathrm{i}}^{\mathrm{V}}\in {\mathrm{R}}^{\mathrm{D}\times {\mathrm{d}}_{\mathrm{v}}}\) are learnable parameter matrices, and Q, K, and V are the query, key, and value matrices, respectively.

The feed-forward network comprises a fully connected neural network and an activation function.
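Putting Formulas 1 and 2 together with the feed-forward network, a Transformer layer might be sketched in PyTorch as follows; the pre-norm placement, residual connections, GELU activation, and hidden size are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Transformer layer per Formulas 1 and 2: eight-head attention
    followed by a feed-forward network. A sketch; the pre-norm placement,
    residual connections, GELU, and hidden size are assumptions."""

    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # nn.MultiheadAttention computes softmax(QK^T / sqrt(d_k))V per head
        # and the output projection W^O of Formula 1.
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(  # fully connected layers + activation
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, x):  # x: (B, N, dim) token sequence
        h = self.norm1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]  # self-attention
        return x + self.ffn(self.norm2(x))
```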

Network decoder

Before up-sampling, the token sequence must be mapped (feature mapping) back to the original spatial shape; the up-sampling operation is then performed. The decoder up-samples four times, mirroring the down-sampling in the encoder. Because some spatial context information is inevitably lost during down-sampling, we use skip connections to fuse the feature maps from the corresponding down-sampling stages. These skip connections ensure that the new feature maps contain both shallow low-level information and high-level abstract semantic information.
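A sketch of one decoder stage with its skip connection follows; the use of transposed convolution, the normalization choice, and the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One up-sampling stage with a skip connection. A sketch; the use of
    transposed convolution and the normalization choice are assumptions."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                   # double the spatial size
        x = torch.cat([x, skip], dim=1)  # fuse shallow encoder features
        return self.conv(x)              # mix low- and high-level information
```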

Comparison details

To verify the effectiveness of our proposed method, we compared it with mainstream medical image segmentation methods. The implementation details of the compared methods are as follows:

  a. 3D U-Net first constructs a \(7\times 7\) convolution block and then four encoder and four decoder blocks. A final convolution block, comprising a transposed convolution and two sub-convolution blocks, completes the network.

  b. TransBTS starts with four encoder blocks, followed by a classical Transformer module containing four Transformer layers, each with eight attention heads. Four decoder blocks are then added, and the architecture is completed with a convolutional layer and a softmax function.

  c. ResUNet combines ResNet and U-Net by integrating a residual block into each encoder and decoder block. During the skip-connection phase, additional convolutional blocks are constructed to match the dimensions of the encoder output and the decoder output at the corresponding stage.

  d. UNet++ aggregates U-Nets of depths 1 to 4 and appends a convolutional layer and a sigmoid function at the end.

  e. The structure of Attention U-Net is broadly the same as that of U-Net, except that Attention U-Net adds a layer of attention gates before each skip connection (a minimal sketch of such a gate follows this list).

  f. Channel U-Net builds six encoder and six decoder blocks and adds the Global Attention Upsample module before each skip connection.
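As an illustration of item e, here is a minimal additive attention gate in the spirit of Attention U-Net; the layer sizes and the assumption that the gating signal already matches the skip features' spatial size are ours:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate applied to skip features before concatenation,
    in the spirit of Attention U-Net. A sketch; it assumes the gating signal
    already matches the skip features' spatial size."""

    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)
        self.w_g = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, skip, gate):
        # alpha in [0, 1] re-weights each voxel of the skip features
        alpha = torch.sigmoid(self.psi(torch.relu(self.w_x(skip) + self.w_g(gate))))
        return skip * alpha
```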

Training details

All networks were implemented with the PyTorch framework and trained on four NVIDIA RTX 3080 GPUs with 10 GB of memory each. We divided the entire data set into an 80% training set and a 20% testing set; the testing set was used to evaluate the segmentation performance of our proposed model, and the results can be seen in Fig. 1. Given the large size of a single sample, 32 consecutive slices were randomly selected from each sample as input data. We adopted the Adam optimizer with a weight decay of \(1\times {10}^{-5}\); the learning rate was \(2\times {10}^{-4}\) at epoch 0 and decayed to \(4\times {10}^{-7}\) by epoch 999 (all networks were trained for 1000 epochs). The batch size was set to 2, and the random seed was set to 1000.
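A sketch of this training configuration follows. The shape of the learning-rate decay between the two stated endpoints is not given in the text; cosine annealing is one assumption that matches them, and the model is a placeholder:

```python
import torch

# Sketch of the training configuration described above. The decay curve
# between 2e-4 (epoch 0) and 4e-7 (epoch 999) is not stated in the text;
# cosine annealing is one assumption matching the endpoints.
torch.manual_seed(1000)                      # fixed random seed, as in the text
model = torch.nn.Conv3d(1, 2, 3, padding=1)  # placeholder for the segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=999, eta_min=4e-7)

for epoch in range(1000):
    # ... one pass over mini-batches of size 2 (32 consecutive slices each) ...
    scheduler.step()
```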

Evaluation metrics

To evaluate the effectiveness of our proposed method, we used the Dice similarity coefficient (DSC) and intersection over union (IOU), which are widely used to evaluate the similarity between segmentation results and ground truth data in medical image segmentation. Furthermore, we used the Hausdorff distance and the average surface distance (ASD) to evaluate the similarity between the surfaces of the segmentation results and the ground truth. The mean absolute error (MAE) is used to assess the absolute error. These metrics are defined in Eqs. 3, 4, 5, 6, and 7, respectively:

$$\begin{array}{c}Dice=\frac{2\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\right|+\left|\mathrm{Y}\right|}\end{array}$$
(3)
$$\begin{array}{c}IOU=\frac{\left|\mathrm{X}\bigcap \mathrm{Y}\right|}{\left|\mathrm{X}\bigcup \mathrm{Y}\right|}\end{array}$$
(4)
$$\begin{array}{c}H\left(\mathrm{A},\mathrm{B}\right)=max\left(\mathrm{h}\left(\mathrm{A},\mathrm{B}\right),\mathrm{h}\left(\mathrm{B},\mathrm{A}\right)\right)\\ where\ \mathrm{h}\left(\mathrm{A},\mathrm{B}\right)=\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{max}}\,\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{min}}\Vert \mathrm{a}-\mathrm{b}\Vert ,\ \mathrm{h}\left(\mathrm{B},\mathrm{A}\right)=\underset{\mathrm{b}\in \mathrm{B}}{\mathrm{max}}\,\underset{\mathrm{a}\in \mathrm{A}}{\mathrm{min}}\Vert \mathrm{b}-\mathrm{a}\Vert \end{array}$$
(5)
$$\begin{array}{c}ASD\left(\mathrm{X},\mathrm{Y}\right)=\frac{1}{\left|\mathrm{X}\right|}{\sum }_{\mathrm{x}\in \mathrm{X}}\underset{\mathrm{y}\in \mathrm{Y}}{\mathrm{min}}\,d\left(\mathrm{x},\mathrm{y}\right)\end{array}$$
(6)
$$\begin{array}{c}MAE=\frac{1}{\mathrm{m}}{\sum }_{\mathrm{i}=1}^{\mathrm{m}}\left|{\mathrm{y}}_{\mathrm{i}}-\mathrm{f}\left({\mathrm{x}}_{\mathrm{i}}\right)\right|\end{array}$$
(7)
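A minimal NumPy/SciPy sketch of Eqs. 3-7 for binary masks follows; extracting the surface-point coordinates used by the two distance metrics is left to the caller and is an assumption:

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def dice(x, y):
    """Eq. 3: 2|X ∩ Y| / (|X| + |Y|) for binary masks."""
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def iou(x, y):
    """Eq. 4: |X ∩ Y| / |X ∪ Y| for binary masks."""
    return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()

def hausdorff(a_pts, b_pts):
    """Eq. 5: symmetric Hausdorff distance between two point sets,
    e.g. surface-voxel coordinates extracted from the masks."""
    return max(directed_hausdorff(a_pts, b_pts)[0],
               directed_hausdorff(b_pts, a_pts)[0])

def asd(x_pts, y_pts):
    """Eq. 6: mean distance from each point of X to its nearest point of Y."""
    return cdist(x_pts, y_pts).min(axis=1).mean()

def mae(pred, target):
    """Eq. 7: mean absolute error between prediction and ground truth."""
    return np.abs(pred - target).mean()
```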

For statistical analysis, we compared the prediction results of the proposed method with those of the other methods. We first used Levene's test to check the homogeneity of variance and then conducted Student's t-test between our proposed method and each compared method.
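A SciPy sketch of this two-step comparison; the per-case scores shown are hypothetical placeholders:

```python
from scipy import stats

# Sketch of the statistical comparison: Levene's test for homogeneity of
# variance, then an independent two-sample t-test. `ours` and `other` are
# hypothetical per-case Dice scores for two methods.
ours = [0.91, 0.88, 0.93, 0.90]   # placeholder values
other = [0.85, 0.86, 0.89, 0.84]  # placeholder values

_, p_levene = stats.levene(ours, other)
# equal_var follows the Levene result (Welch's correction otherwise)
_, p_value = stats.ttest_ind(ours, other, equal_var=p_levene > 0.05)
print(f"Levene p={p_levene:.3f}, t-test p={p_value:.3f}")
```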

Table 3 Quantitative analysis of different transformer layers