Learning class-agnostic masks with cross-task refinement for weakly supervised semantic segmentation

Xu, Lian; Bennamoun, Mohammed; Boussaid, Farid; Ouyang, Wanli; Xu, Dan

doi:10.1007/s00521-023-08826-0


Original Article
Open access
Published: 19 July 2023

Volume 35, pages 20189–20205, (2023)
Neural Computing and Applications

Lian Xu (ORCID: orcid.org/0000-0002-1759-2941)¹, Mohammed Bennamoun¹, Farid Boussaid², Wanli Ouyang³ & Dan Xu⁴


Abstract

Weakly supervised semantic segmentation (WSSS) commonly relies on Class Activation Mapping (CAM) to produce pseudo semantic labels using image-level annotations. However, because CAM maps often form sparse object regions with poor boundaries, they cannot provide sufficient segmentation supervision. Because off-the-shelf saliency maps can provide rich object boundaries that can be leveraged to improve semantic segmentation, we propose to jointly learn semantic segmentation and class-agnostic masks by using image-level annotations and off-the-shelf saliency maps as supervision. We also propose a cross-task label refinement mechanism, which takes advantage of the learned class-agnostic masks and semantic segmentation masks, to refine the pseudo labels and provide more accurate supervision to both tasks. Moreover, we introduce a new normalization method for CAM to generate more complete class-specific localization maps. The improved CAM maps complement our learned class-agnostic masks, leading to high-quality pseudo semantic segmentation labels. Extensive experiments demonstrate the effectiveness of the proposed approach, with state-of-the-art WSSS results established on PASCAL VOC 2012 and MS COCO.


1 Introduction

Semantic segmentation is a fundamental visual task of assigning a class to every pixel of a given image [1,2,3]. With deep convolutional neural networks (CNNs), semantic segmentation has progressed noticeably in fully supervised settings [4,5,6]. This however needs massive pixel-level dense annotations, which are difficult to acquire due to the expensive and laborious data-labeling process. It is thus desirable to develop segmentation techniques, which can rely on weak supervision to achieve performance on par with the one achieved with strong supervision. Recent years have witnessed many research efforts in semantic segmentation using weak labels, such as points [7], scribbles [8, 9], bounding boxes [10, 11] and image labels [12,13,14]. In particular, image-level labels are easy to acquire and annotate, while they also indicate the least information, i.e., the classes present in an image without information about object localization.

A critical step for WSSS is to use weak labels to produce pseudo segmentation labels. Given image labels, techniques for model interpretation such as CAM [15] and Grad-CAM [16] are able to extract object localization maps from the intermediate layers of CNNs. However, these CAM maps only indicate the most discriminative object regions, which are incomplete and do not provide sufficient semantic segmentation supervision. To obtain advanced segmentation pseudo labels, prior works have developed various strategies to discover non-discriminative object regions [17,18,19] and expand CAM maps. However, resulting CAM maps still exhibit inaccurate boundaries, leading to incorrect segmentation predictions. Besides, previous methods [20,21,22] commonly applied a pre-trained saliency detector on a target segmentation dataset, to extract useful object localization information to assist pseudo semantic label generation.

However, for the task of semantic segmentation, off-the-shelf saliency maps can also introduce misleading object information, because the salient objects learned by the pre-trained model may differ from the objects of interest. For example, the salient object detected by a pre-trained saliency model is the dog house (Fig. 1b top), while the target object for segmentation is the dog (Fig. 1a bottom). This can lead to inaccurate pseudo labels, and the network training is vulnerable to those errors.

These observations indicate that having precise object boundary information is important to both segmentation prediction and pseudo segmentation label generation. We thus propose to construct a class-agnostic mask learning task while exploiting supervision from off-the-shelf saliency maps. This is performed jointly with semantic segmentation using a multi-branch network. The benefits are two-fold: (i) the multi-task joint learning can regularize feature learning and learn more robust features for semantic segmentation, especially to differentiate between foreground objects and backgrounds; moreover, it can effectively improve the generalization ability of the network by leveraging the useful information from related tasks, resulting in a stronger inductive bias compared to single-task learning and thus reducing overfitting; (ii) the learned class-agnostic maps can also contribute to the pseudo segmentation label generation. In particular, off-the-shelf saliency maps are not only utilized to generate initial pseudo semantic labels (Fig. 1b bottom), but are also used to provide initial supervision for learning class-agnostic masks, where they are combined with online class-agnostic mask predictions for self-refinement. Compared to off-the-shelf saliency maps, such learned class-agnostic masks are more adaptive to the target semantic segmentation task and contain more accurate object localization information. We further propose a cross-task label refinement mechanism to take advantage of the learned semantic segmentation masks and class-agnostic masks, thereby producing refined pseudo labels for both tasks (Fig. 1c). Moreover, we propose a new normalization method for CAM to generate class-specific localization maps, which can cover entire object regions. By combining the improved CAM maps and the proposed discriminative foreground-background class-agnostic masks, pseudo semantic labels can be substantially improved to better optimize the whole deep neural network.

Our contributions are summarized as follows:

We propose to jointly learn class-agnostic masks and semantic segmentation using image labels and off-the-shelf saliency maps. This approach is shown to improve segmentation performance and to provide more reliable class-agnostic masks for pseudo label generation.
We leverage a new normalization method for CAM to produce class-specific localization maps (i.e., pCAM), which can cover entire object regions. The resulting pCAM maps complement the class-agnostic maps, producing high-quality pseudo semantic labels.
We introduce a cross-task label refinement mechanism, which jointly leverages predictions from the tasks of class-agnostic and semantic segmentation with pCAM maps to refine their pseudo labels. This mechanism is shown to effectively correct the errors brought by the pre-trained saliency model, providing more accurate supervision to learn semantic segmentation and class-agnostic masks.
The proposed method achieves superior WSSS results compared to state-of-the-art methods on PASCAL VOC 2012 and MS COCO (Sect. 4.2).

The rest of this paper is organized as follows. We review related work in Sect. 2 and describe the proposed approach in Sect. 3. Section 4 presents experimental details with ablation studies and discusses the results. Section 5 concludes the paper.

2 Related work

This section presents a literature review on recent image-label supervised semantic segmentation approaches, including CAM-based refinement and semantic prediction-based refinement.

2.1 CAM map-based refinement

Most existing WSSS approaches have utilized the object localization information of the CAM maps to produce pseudo semantic labels. However, the raw CAM maps only indicate the most discriminative object regions, which are small and sparse. A typical example of improving CAM maps is the heuristics-based object mining. By iteratively erasing the detected object regions of input images [23], the network is driven to learn new patterns for classification. Similar techniques based on heuristic erasing have been presented in [24, 25]. Jiang et al. [21] observe that the classification network attends to different object regions during training and obtain an integral object localization map by accumulating CAM maps online. Since the sole reliance on the conventional classification objective loss function leads to incomplete CAM maps, prior works apply different regularization methods for training the classification network to obtain improved CAM maps. Wang et al. [26] suggest that imposing an equivariance constraint on the CAM maps under any spatial affine transformation can result in maps which better fit the shape of objects. Fan et al. [18] observe that the standard classification objective only focuses on the discrimination between different object classes, ignoring the boundaries between each class and the backgrounds. They thus propose to learn an intra-class boundary based on the implicit feature manifold. Chang et al. [27] re-formulate the problem into a fine-grained classification task, for which the pseudo labels of sub-categories are extracted from unsupervised feature clustering. Moreover, cross-image relations have also been explored to enhance the representations for extracting CAM maps, e.g., in [19] and [28]. A recent work by Zhang et al. [29] argues that it is the confounding context from the dataset that causes the ambiguous boundaries of CAM maps. 
Subsequently, they propose to use class-specific average segmentation masks to approximate the confounding set and incorporate it into the image classification to obtain better CAM maps. In contrast, we propose a different normalization method for CAM, which generates more complete CAM maps, compared to the original CAM maps.

2.2 Semantic prediction-based refinement

There are several other methods which focus on refining the pseudo segmentation labels, which are usually produced by the raw CAM maps, by taking advantage of the segmentation predictions. For instance, Wang et al. [30] refine semantic pseudo labels by discovering object affinities based on super-pixel regions derived from the segmentation prediction. Similarly, Wang et al. [31] iteratively select reliable regions from the segmentation outputs to learn pixel-wise affinities, which are then utilized to refine the segmentation results and produce pseudo segmentation labels. Araslanov et al. [32] propose to refine the segmentation results based on image local consistency so as to obtain pseudo semantic labels to enable the optimization of the segmentation network.

Although more non-discriminative object regions are discovered by these complex methods, their resulting pseudo segmentation labels generally have coarse boundaries. Therefore, a number of methods [19,20,21, 23, 24, 33, 34] have exploited background cues from off-the-shelf saliency maps to assist pseudo semantic label generation. However, these pre-trained saliency models are generally not well adapted to the semantic segmentation task. In this work, we address this problem by formulating a task of learning class-agnostic masks and incorporating it into a joint learning framework with semantic segmentation to obtain more generalizable representations. Moreover, in order to provide better supervision to the learning of class-agnostic maps, we combine the pre-trained saliency maps and the online predicted class-agnostic maps, which together provide complementary and progressively more accurate class-agnostic localization information. We also propose a cross-task label refinement mechanism to further refine pseudo labels to learn both class-agnostic and semantic segmentation masks.

3 The proposed method

This section starts with an overview, followed by the description of our weakly supervised multi-task network architecture. The subsequent subsection describes the proposed normalization method for CAM to produce improved class-specific localization maps, which constitute a key component of the pseudo semantic label generation process. The following subsection elaborates the details of the proposed cross-task pseudo label generation for class-agnostic and semantic segmentation tasks. The final subsection presents the model training and inference processes.

3.1 Overview

Figure 2 presents an overview of the proposed approach. We build a multi-branch network to jointly perform semantic segmentation, class-agnostic segmentation and image classification tasks, with only image-level annotations. Besides, we utilize a general pre-trained saliency model to generate binary maps as a guide to provide supervision for the learning of the other two tasks. More specifically, we propose a different normalization method for CAM maps, generating more complete class-specific object localization maps. The improved CAM maps are combined with the pre-trained saliency maps to produce better initial pseudo semantic segmentation labels. For the class-agnostic learning, the pre-trained saliency maps are initially used as pseudo labels and are gradually refined by combining the online class-agnostic predictions. Once the training is complete, we propose a cross-task label refinement mechanism, which jointly takes advantage of the class-agnostic and semantic segmentation predictions to produce improved pseudo class-agnostic and semantic segmentation labels. The refined pseudo labels are then leveraged to fine-tune the multi-task network, leading to improved semantic segmentation results.

3.2 Weakly supervised multi-task network architecture

We build our deep network based on ResNet38 [35], which has 38 convolutional layers with wide channels. Following [36], we make modifications to the original ResNet38 to construct a backbone network with an output stride of 8. In order to learn more robust and informative representations for weakly supervised semantic segmentation, we adopt three branches following the backbone network, i.e., an image classifier, a class-agnostic segmentation decoder, and a semantic segmentation decoder. More specifically, given an RGB image as input, the backbone network produces an activation map ${\textbf{F}} \in {\mathbb {R}}^{H\times W\times K}$, with K, H and W indicating its number of channels and two spatial dimensions, respectively. For the classification task, a Global Average Pooling (GAP) layer is applied on the backbone feature maps. The resulting feature vector is forwarded into a fully connected (fc) layer, predicting the class probabilities. For the class-agnostic segmentation branch, the backbone feature maps are forwarded to a DenseASPP module [37], which is composed of three cascaded atrous convolutional layers (aconv) (rates = 6, 12, and 18). Finally, a $1\times 1$ convolutional (conv) layer, with a sigmoid layer, is applied to predict the class-agnostic masks. Moreover, the segmentation decoder includes three aconv layers (rates = 6, 12, and 18), and one last $1\times 1$ conv layer, with a softmax layer, for semantic segmentation prediction.
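The classification branch described above can be restated compactly. The following is a sketch rather than the authors' stated formulation: the logit symbol $s_c$ and the bias term $b_c$ are assumptions introduced here for illustration, ${\textbf{F}}$, K, H and W are as defined above, and ${\textbf{W}}_{k}^{c}$ denotes the fc weight linking channel k to class c.

$$\begin{aligned} s_c = \sum _{k=1}^{K}{\textbf{W}}_{k}^{c}\left( \frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W}{\textbf{F}}_{k}(i,j)\right) + b_c = \frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W}\sum _{k=1}^{K}{\textbf{W}}_{k}^{c}{\textbf{F}}_{k}(i,j) + b_c. \end{aligned}$$

Because GAP and the fc layer are both linear, the order of the two sums can be exchanged. The inner sum over k is the per-pixel quantity that Eq. 1 evaluates, so each class logit is, up to the assumed bias, the spatial average of that class's CAM map.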

3.3 Generating class probability-based CAM maps

We use CAM [15] to produce class-specific localization maps for the generation of semantic segmentation pseudo labels. More specifically, for a given class c and spatial coordinates (i, j), the CAM map is calculated as follows:

$$\begin{aligned} \textbf{CAM}_c(i, j) = \sum _{k=1}^{K}{\textbf{W}}_{k}^{c}{\textbf{F}}_{k}(i,j), \end{aligned}$$

(1)

where ${\textbf{W}} \in {\mathbb {R}}^{K\times C}$ is the weight matrix of the last fc layer, with C denoting the number of classes, and ${\textbf{W}}_{k}^{c}$ represents the importance score of the channel k to the class c. As shown in Fig. 3, the generated CAM map for the class c is processed via the min-max normalization along the spatial dimensions, referred to as $\textrm{sCAM}$:

$$\begin{aligned} \textbf{sCAM}_{c}(i,j)=\frac{\textrm{ReLU}(\textbf{CAM}_c(i,j))}{\max _{(i,j)}\textbf{CAM}_{c}}. \end{aligned}$$

(2)

In contrast to sCAM, we propose to use a different normalization method to generate CAM maps based on class probabilities, hereinafter referred to as pCAM. More specifically, as illustrated in Fig. 3, pCAM maps are produced by applying the softmax operation along the channel dimension. As a result, each spatial vector of the resulting pCAM map represents the class probability distribution of the corresponding pixel:

$$\begin{aligned} \textbf{pCAM}_{c}(i,j) = \frac{\exp {(\textbf{CAM}_{c}(i,j))}}{\sum \nolimits _{c'=1}^{C}\exp {(\textbf{CAM}_{c'}(i,j))}}. \end{aligned}$$

(3)

sCAM tends to highlight the most discriminative regions among all spatial locations. In contrast, the proposed pCAM focuses on highlighting the pixels which have large probabilities for the given class. For the classes present in a given image, their corresponding CAM maps tend to have higher activation values, compared to those CAM maps of classes absent in the image. Therefore, the class activated regions by pCAM are larger than those given by sCAM.
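The difference between the two normalizations in Eqs. 2 and 3 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the nested-list layout (channels first) and the toy values are assumptions chosen for illustration, and a real pipeline would operate on tensors.

```python
import math

def cam(features, weights):
    """Eq. 1: CAM_c(i, j) = sum_k W[k][c] * F[k][i][j].

    features: K x H x W nested lists; weights: K x C nested lists.
    Returns C x H x W nested lists.
    """
    K, H, W = len(features), len(features[0]), len(features[0][0])
    C = len(weights[0])
    return [[[sum(weights[k][c] * features[k][i][j] for k in range(K))
              for j in range(W)] for i in range(H)] for c in range(C)]

def scam(cam_maps, eps=1e-8):
    """Eq. 2: ReLU, then divide each class map by its own spatial maximum."""
    out = []
    for m in cam_maps:
        peak = max(max(row) for row in m)
        # Guard (an assumption, not in Eq. 2): avoid dividing by a non-positive peak.
        denom = peak if peak > eps else 1.0
        out.append([[max(v, 0.0) / denom for v in row] for row in m])
    return out

def pcam(cam_maps):
    """Eq. 3: softmax over the class dimension at every pixel."""
    C, H, W = len(cam_maps), len(cam_maps[0]), len(cam_maps[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            logits = [cam_maps[c][i][j] for c in range(C)]
            shift = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - shift) for v in logits]
            total = sum(exps)
            for c in range(C):
                out[c][i][j] = exps[c] / total
    return out

# Toy example: K = 2 channels, C = 2 classes, a 1 x 3 feature map.
F = [[[4.0, 2.0, 0.0]], [[0.0, 0.0, 1.0]]]
Wt = [[1.0, 0.0], [0.0, 1.0]]
maps = cam(F, Wt)
s_maps, p_maps = scam(maps), pcam(maps)
# Number of pixels of class 0 above a 0.8 threshold under each normalization.
n_scam = sum(v > 0.8 for v in s_maps[0][0])
n_pcam = sum(v > 0.8 for v in p_maps[0][0])
```

In this toy case the middle pixel has half the peak activation of class 0, so sCAM scores it 0.5, whereas pCAM scores it about 0.88 because class 0 still dominates class 1 at that pixel. With the 0.8 threshold, sCAM keeps one pixel and pCAM keeps two, which mirrors the observation that pCAM regions are larger.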

3.4 Cross-task pseudo label generation

This section presents the proposed two-step method for generating class-agnostic and semantic segmentation pseudo labels.

3.4.1 Initial pseudo label generation

To learn class-agnostic masks without ground truth, we propose to utilize a coarse saliency label map ${{\textbf{P}}}{{\textbf{t}}}_{sal}$ estimated by a pre-trained saliency model as an initial guide and to incorporate complementary information from online predictions. More specifically, the pre-trained saliency model generally yields reasonable results on the source images, but it is error-prone when applied to complex images with low contrast or complex backgrounds, owing to its limited generalization ability on different target datasets. Moreover, the detected salient object may not be the object of interest for the target task. In contrast, with the shared backbone features, the predicted class-agnostic mask, denoted as ${{\textbf{P}}}{{\textbf{r}}}_\text {ca}$, contains useful object localization information, which becomes more reliable as training progresses. Therefore, we propose to generate pseudo class-agnostic label masks ${\textbf{G}}^{init}_\text {ca}$ by fusing these two complementary sources through a Conditional Random Field (CRF) model:

$$\begin{aligned} \mathbf {{\textbf{G}}}^{init}_\text {ca} = \textrm{CRF}_d\left(\frac{\mathbf {{{\textbf{P}}}{{\textbf{r}}}}_\text {ca} + \mathbf {{{\textbf{P}}}{{\textbf{t}}}}_{sal}}{2}\right), \end{aligned}$$

(4)

where $\textrm{CRF}_d(\cdot )$ denotes a densely connected CRF [38] which uses the average of ${{\textbf{P}}}{{\textbf{r}}}_\text {ca}$ and ${{\textbf{P}}}{{\textbf{t}}}_{sal}$ as a unary term. The fused output from the CRF model is more adapted to the target dataset, thereby providing better supervision to learn class-agnostic masks. To generate initial pseudo segmentation labels ${\textbf{G}}^{init}_\text {seg}$, we follow previous works [19,

3.4.2 Cross-task label refinement

When the joint multi-task optimization converges, the improved predictions from all three tasks can be utilized for cross-task refinement. This yields improved pseudo labels for class-agnostic and semantic segmentation, which can further boost multi-task learning. Figure 4 depicts the computation flow of the proposed cross-task refinement module. Given the predicted class-agnostic mask ${{\textbf{P}}}{{\textbf{r}}}_\text {ca}$ and the predicted semantic map ${{\textbf{P}}}{{\textbf{r}}}_\text {seg}$, we perform a structured fusion of these two types of predictions to obtain the refined class-agnostic pseudo label mask ${\textbf{G}}^{ref}_\text {ca}$ as follows:

$$\begin{aligned} {\textbf{G}}^{ref}_\text {ca} = \textrm{CRF}_d\left(\frac{{{\textbf{P}}}{{\textbf{r}}}_\text {ca} + \textrm{Br}_s({{\textbf{P}}}{{\textbf{r}}}_\text {seg})}{2}\right), \end{aligned}$$

(5)

where $\textrm{Br}_s(\cdot )$ is a binarization operation on the segmentation probability map, outputting a one-channel map ${{\textbf{P}}}{{\textbf{r}}}'_\text {seg}$ with values of 0 and 1; the model $\textrm{CRF}_d$ shares the same parameters as the one used in Eq. 4. More specifically, $\textrm{Br}_s$ first converts the segmentation map ${{\textbf{P}}}{{\textbf{r}}}_\text {seg}$ into a one-channel map and then binarizes it, with label 1 representing ‘foreground’ and label 0 ‘background’, as follows:

$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}&= \mathop {\mathrm {arg\,max}}\limits _{c}\textrm{Supp}({{\textbf{P}}}{{\textbf{r}}}_\text {seg}), \end{aligned}$$

(6)

$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j)&= {\left\{ \begin{array}{ll} 1&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) > 0, \\ 0&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) = 0, \end{array}\right. } \end{aligned}$$

(7)

where $\textrm{Supp}(\cdot )$ denotes a suppression function, which multiplies ${{\textbf{P}}}{{\textbf{r}}}_\text {seg}$ by the image-level labels across the class channel to suppress incorrect predictions. Then, similar to the procedures of initial pseudo semantic label generation, we combine the pCAM map and the refined class-agnostic pseudo label mask ${\textbf{G}}^{ref}_\text {ca}$ to obtain refined pseudo semantic label ${\textbf{G}}^{ref}_\text {seg}$. Finally, the refined pseudo class-agnostic label masks ${\textbf{G}}^{ref}_\text {ca}$ and the refined pseudo semantic label masks ${\textbf{G}}^{ref}_\text {seg}$ are used together with the image labels to re-train the overall network.
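The binarization in Eqs. 6 and 7 and the averaged unary term inside $\textrm{CRF}_d$ in Eq. 5 can be sketched as follows. This is an illustrative pure-Python sketch under stated assumptions, not the authors' code: the function names are hypothetical, channel 0 of the segmentation map is assumed to be the background and is assumed to be always kept by the suppression step, and the dense CRF itself is not implemented (only its input is computed).

```python
def supp(pr_seg, image_labels):
    """Supp(.): zero out channels of classes absent from the image-level labels.

    pr_seg: (C+1) x H x W probabilities; channel 0 is background (assumption).
    image_labels: multi-hot list of length C for the foreground classes.
    """
    keep = [1.0] + [float(l) for l in image_labels]  # background always kept (assumption)
    return [[[v * keep[c] for v in row] for row in ch]
            for c, ch in enumerate(pr_seg)]

def binarize_seg(pr_seg, image_labels):
    """Br_s(.): Eq. 6 (argmax over classes), then Eq. 7 (label > 0 becomes 1)."""
    s = supp(pr_seg, image_labels)
    H, W = len(s[0]), len(s[0][0])
    out = []
    for i in range(H):
        row = []
        for j in range(W):
            # Ties resolve to the lowest class index, i.e. toward background.
            label = max(range(len(s)), key=lambda c: s[c][i][j])
            row.append(1.0 if label > 0 else 0.0)
        out.append(row)
    return out

def crf_unary(pr_ca, pr_seg, image_labels):
    """Argument of CRF_d in Eq. 5: per-pixel average of Pr_ca and Br_s(Pr_seg)."""
    fg = binarize_seg(pr_seg, image_labels)
    return [[(a + b) / 2.0 for a, b in zip(ra, rb)] for ra, rb in zip(pr_ca, fg)]

# Toy example: background + 2 classes on a 1 x 3 map; only class 1 is in the image.
pr_seg = [[[0.7, 0.2, 0.25]],   # background
          [[0.2, 0.5, 0.05]],   # class 1 (present)
          [[0.1, 0.3, 0.70]]]   # class 2 (absent, suppressed)
pr_ca = [[0.1, 0.9, 0.4]]
unary = crf_unary(pr_ca, pr_seg, [1, 0])
```

In the toy example the third pixel would be labeled as the absent class 2 without suppression. After suppression the background wins there, so the pixel is binarized to 0 and the averaged unary drops to 0.2. The same averaging pattern applies to Eq. 4, with the pre-trained saliency map in place of the binarized segmentation map.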

3.5 Training and inference

3.5.1 Training

Our overall learning objective function is formulated as follows:

$$\begin{aligned} {\mathcal {L}}_\text {all}= {} {\mathcal {L}}_\text {cls} + {\mathcal {L}}_\text {ca} + {\mathcal {L}}_\text {seg}, \end{aligned}$$

(8)

$$\begin{aligned} {\mathcal {L}}_\text {cls}= -\sum \limits ^{N}_{i=1}\left[ {\textbf{G}}_\text {cls}^{i}\log \frac{\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})} + (1-{\textbf{G}}_\text {cls}^{i})\log \frac{1}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}\right] , \end{aligned}$$

(9)

$$\begin{aligned} {\mathcal {L}}_\text {ca}= -\sum \limits ^{M}_{j=1}\big [{\textbf{G}}_\text {ca}^{j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j}+ (1-{\textbf{G}}_\text {ca}^{j})\log (1-{{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j})\big ], \end{aligned}$$

(10)

$$\begin{aligned} {\mathcal {L}}_\text {seg}= -\sum \limits ^{M}_{j=1}\sum \limits ^{N}_{i=1} {\textbf{G}}_\text {seg}^{i,j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {seg}^{i,j}, \end{aligned}$$

(11)

where ${\mathcal {L}}_\text {cls}$ is a multi-label soft margin loss calculated between the class predictions ${{\textbf{P}}}{{\textbf{r}}}_\text {cls}$ and the multi-hot image labels ${\textbf{G}}_\text {cls}$; ${\mathcal {L}}_\text {ca}$ is a binary cross-entropy loss computed between the predicted class-agnostic masks ${{\textbf{P}}}{{\textbf{r}}}_\text {ca}$ and the class-agnostic pseudo label masks ${\textbf{G}}_\text {ca}$; and ${\mathcal {L}}_\text {seg}$ is a pixel-wise cross-entropy loss computed between the semantic segmentation predictions ${{\textbf{P}}}{{\textbf{r}}}_\text {seg}$ and the pseudo semantic labels ${\textbf{G}}_\text {seg}$. N and M denote the numbers of classes of a dataset and pixels of an input image, respectively.
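The three loss terms and their unweighted sum in Eqs. 8 to 11 can be written out for a single image as a minimal pure-Python sketch. The function names, the flat-list inputs and the clamping constant `eps` are assumptions made for illustration; the sums follow the equations as printed, whereas a framework implementation may average instead of sum.

```python
import math

def loss_cls(logits, labels):
    """Eq. 9: multi-label soft margin loss, summed over the N class logits."""
    total = 0.0
    for x, g in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-x))  # equals exp(x) / (1 + exp(x))
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total

def loss_ca(pred, target, eps=1e-12):
    """Eq. 10: binary cross-entropy summed over the M pixels (flat lists)."""
    return -sum(g * math.log(max(p, eps)) + (1 - g) * math.log(max(1.0 - p, eps))
                for p, g in zip(pred, target))

def loss_seg(pred, target, eps=1e-12):
    """Eq. 11: pixel-wise cross-entropy.

    pred[j][i] is the predicted probability of class i at pixel j;
    target[j] is the one-hot pseudo label of pixel j.
    """
    return -sum(g * math.log(max(p, eps))
                for pj, gj in zip(pred, target) for p, g in zip(pj, gj))

def loss_all(l_cls, l_ca, l_seg):
    """Eq. 8: the overall objective is the unweighted sum of the three terms."""
    return l_cls + l_ca + l_seg

# Toy example: one class logit, one mask pixel, one two-class segmentation pixel.
total = loss_all(loss_cls([0.0], [1]),
                 loss_ca([0.5], [1]),
                 loss_seg([[0.5, 0.5]], [[1, 0]]))
```

Each term in the toy example evaluates to log 2, since every prediction assigns probability 0.5 to the labeled outcome, so the total is 3 log 2.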

Figure 5 illustrates the proposed pipeline. More specifically, the classification branch of the multi-task network is first trained for 15 epochs, with the other two branches frozen, to extract pCAM maps. The initial pseudo semantic label masks are then produced by fusing pCAM maps and off-the-shelf saliency maps. With the initial pseudo labels, the entire network is then trained for 15 epochs. Afterward, we perform the cross-task label refinement using the learned class-agnostic masks and semantic segmentation masks and obtain refined pseudo labels for the two tasks. Subsequently, the overall multi-task model is re-trained for 15 epochs with these refined pseudo labels.

4 Experiments

4.1 Experimental settings

4.1.1 Datasets

To evaluate the proposed method, we conducted experiments on PASCAL VOC 2012 [39] and MS COCO datasets [40]. PASCAL VOC has 20 object classes and one background class for semantic segmentation. This dataset is split into training (train), validation (val) and test sets with 1464, 1449 and 1456 images, respectively. Following common practice, e.g., [41, 42], additional images from [43] are used to augment the training set, resulting in a total of 10,582 training images. MS COCO has 80 object classes, one background class, 80K images for training and 40K images for validation.

4.1.2 Evaluation metrics

We followed the prior works [17, 20, 21, 30, 33, 36, 44] to use the mean Intersection-over-Union (mIoU) of all classes between the predicted semantic segmentation masks and the pixel-wise ground-truth label masks to evaluate the segmentation performance of the proposed method. Moreover, mIoU and F1-score were used to evaluate the quality of the pseudo segmentation labels. The results on the PASCAL VOC test set were obtained from the official PASCAL VOC online evaluation server.
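For reference, the two metrics named above can be computed from per-class true positives, false positives and false negatives, as in the following pure-Python sketch. It makes several assumptions not stated in the text: labels are given as flat lists, pixels with ground-truth label 255 are ignored (the usual void label in PASCAL VOC), classes absent from both prediction and ground truth are skipped, and F1 is averaged over classes. Benchmark evaluation typically accumulates the counts over the whole dataset rather than per image.

```python
def per_class_iou_f1(pred, gt, num_classes, ignore_index=255):
    """Return per-class IoU and F1 lists from flat label lists."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_index:
            continue
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted class p where it does not belong
            fn[g] += 1   # missed a pixel of class g
    ious, f1s = [], []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom == 0:   # class absent from both prediction and ground truth
            continue
        ious.append(tp[c] / denom)                            # IoU = TP / (TP + FP + FN)
        f1s.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))   # F1 = 2TP / (2TP + FP + FN)
    return ious, f1s

def mean(values):
    return sum(values) / len(values)

# Toy example with two classes and four pixels.
ious, f1s = per_class_iou_f1(pred=[0, 1, 1, 1], gt=[0, 0, 1, 1], num_classes=2)
miou, mf1 = mean(ious), mean(f1s)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving an mIoU of 7/12. F1 is never lower than IoU for the same counts, because it counts the true positives twice in both numerator and denominator.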

Table 1 Segmentation results of WSSS methods in mIoU (%) on the PASCAL VOC val and test sets


4.1.3 Implementation details

We used PyTorch [51] to implement all experiments. To train the proposed network, we used data augmentation techniques including random horizontal flipping, random scaling with a factor of $\pm\, 0.3$, random cropping to size $321\times 321$ and color jittering. Besides, we used the stochastic gradient descent (SGD) optimizer with a mini-batch size of 4, and we set the initial learning rate to 0.001 with a polynomial decay schedule with a power of 0.9. The off-the-shelf saliency maps were generated from the pre-trained DSS model [38] (widely adopted in prior art [18, 19, 21, 24]), unless specified otherwise. For the pseudo semantic label generation, the thresholds to determine the potential object regions from class-agnostic masks and pCAM maps were empirically set as 0.5 and 0.8, respectively. For testing, we used CRFs with the hyper-parameters suggested in [41] to postprocess the segmentation predictions.
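The learning-rate schedule mentioned above can be sketched as follows, assuming it is the polynomial decay commonly used in semantic segmentation, in which the rate is scaled by (1 - step / max_steps) raised to the power. The exact formula, the function name `poly_lr` and the way the schedule length is derived here (one 15-epoch stage over the 10,582 augmented VOC training images, keeping the final partial batch) are assumptions, not details given in the text.

```python
import math

def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power

# Hypothetical schedule length for one training stage.
steps_per_epoch = math.ceil(10582 / 4)   # mini-batch size of 4
max_steps = steps_per_epoch * 15         # 15 epochs per stage

lr_start = poly_lr(0.001, 0, max_steps)
lr_mid = poly_lr(0.001, max_steps // 2, max_steps)
lr_end = poly_lr(0.001, max_steps, max_steps)
```

With a power below 1, the rate decays slightly more slowly than linearly for most of training and drops to zero at the final step.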

Table 2 Per-class segmentation results of state-of-the-art WSSS methods in IoU (%) on PASCAL VOC


4.2 Comparisons with state-of-the-art methods

4.2.1 PASCAL VOC

Table 1 reports the segmentation results of the proposed approach against those of state-of-the-art WSSS approaches on PASCAL VOC. The proposed approach achieved mIoUs of 69.7% and 69.9% on the val and test sets, respectively, outperforming other methods. In particular, even without cross-task label refinement, the proposed network obtained superior results compared to most recent methods. Detailed per-class segmentation IoU results are shown in Table 2. Figure 6 visualizes our predicted segmentation masks on the PASCAL VOC val set, showing accurate boundaries with fine-grained details.

Table 3 Segmentation results of WSSS methods in mIoU (%) on the MS COCO val set


4.2.2 MS COCO

We also compared our results with recent WSSS methods on the MS COCO val set in Table 3 and provided the detailed per-class IoU results in Table 4. Our method achieves an mIoU of 33.3% (see Table 3 for the comparison with state-of-the-art approaches). Several qualitative segmentation results in Fig. 7 show that our approach can segment objects well at different scales in various indoor and outdoor scenes.

4.3 Ablation analysis

4.3.1 Effect of jointly learning multiple tasks

In Table 5, we compared the performance of jointly learning multiple weakly supervised tasks with the baseline method, which only performs semantic segmentation. Note that we used the same initial pseudo semantic segmentation labels (see Fig. 5a) to train the different variants of the network. We can observe that jointly learning either image classification or class-agnostic masks with semantic segmentation under weak supervision significantly improves the segmentation results. In particular, learning class-agnostic masks attains a larger performance boost of 2%. Furthermore, jointly learning all three tasks attains the best mIoU of 65% without postprocessing. This indicates that jointly learning multiple weakly supervised tasks can boost the feature learning of semantic segmentation to achieve more accurate predictions.

Table 4 Per-class segmentation results of state-of-the-art WSSS methods in IoU (%) on the COCO validation set


Table 5 Segmentation performance using different architecture configurations on PASCAL VOC 2012 val set in mIoU (%)


Table 6 Comparison between sCAM and pCAM in terms of their resulting semantic segmentation (SS) pseudo labels and semantic segmentation performance on PASCAL VOC


Table 7 Evaluation of semantic segmentation pseudo labels before and after applying the proposed cross-task label refinement (CTLR) with different pre-trained saliency models on the PASCAL VOC train set


4.3.2 Comparison of different CAM maps

We compared two types of CAM maps (i.e., pCAM and sCAM, which use different normalization methods). As shown in Fig. 8, the sCAM maps only focus on small and local discriminative regions. In contrast, the proposed pCAM maps cover entire object regions. We also compared, in terms of mIoU and F1-score, the pseudo semantic labels generated by pCAM with those generated by sCAM when incorporating the same off-the-shelf saliency maps. As shown in Table 6, the semantic segmentation pseudo labels generated by pCAM are significantly better than those generated by sCAM in both mIoU and F1-score. Accordingly, pCAM achieves superior segmentation results, outperforming sCAM by a large margin.

4.3.3 Effect of cross-task label refinement

We evaluated the quality of the pseudo semantic labels using mIoU and F1-score, which account for both precision and recall and are thus indicative of the accuracy and completeness of the labeling; we also evaluated the resulting segmentation performance. Table 7 shows that after applying the proposed cross-task label refinement, both the mIoU and F1-score of the pseudo segmentation labels increase significantly. Figure 9 reports similar increasing trends in the segmentation performance using the DSS [38] and DHS [52] pre-trained saliency models, respectively. As visualized in Fig. 10, compared to the off-the-shelf saliency maps from the pre-trained DSS model, the refined class-agnostic masks exhibit more accurate object boundary information in various challenging scenarios, such as images of multiple object instances or objects with low contrast or with complex background. As shown in the last two rows, more pixels are correctly labeled in the updated semantic segmentation pseudo labels, which are closer to the ground truth. This indicates that the proposed cross-task label refinement can provide better pseudo semantic labels.

4.3.4 Effect of postprocessing

We evaluated the effects of two postprocessing methods, which are commonly used in fully supervised semantic segmentation, i.e., (i) testing with inputs of multiple scales (e.g., 0.5, 0.75, 1.0, 1.25, 1.5 are used in this experiment) and (ii) using CRF. As shown in Table 8, without postprocessing, the proposed method produces an mIoU of 66.1%. Fusing the results of multi-scale inputs by max-pooling boosts mIoU to 67.2%. Only using CRF brings an improvement of 2.1%, compared to not using any postprocessing method. With both multi-scale testing and CRF, the proposed model yields the best segmentation result of 69.7%.
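The multi-scale fusion step described above can be sketched in pure Python as an element-wise maximum over the per-scale class probability maps, followed by a per-pixel argmax. The sketch assumes that the prediction for each input scale has already been resized to a common resolution and is stored as C x H x W nested lists; the function names and toy values are illustrative only, and the CRF step is not included.

```python
def fuse_max(prob_maps):
    """Element-wise maximum over per-scale maps of identical shape (C x H x W)."""
    fused = []
    for channels in zip(*prob_maps):      # the same class channel across scales
        fused.append([[max(vals) for vals in zip(*rows)]
                      for rows in zip(*channels)])
    return fused

def argmax_labels(fused):
    """Per-pixel label with the highest fused score (ties go to the lowest index)."""
    H, W = len(fused[0]), len(fused[0][0])
    return [[max(range(len(fused)), key=lambda c: fused[c][i][j])
             for j in range(W)] for i in range(H)]

# Toy example: two scales, two classes, a 1 x 2 map.
scale_a = [[[0.6, 0.4]], [[0.4, 0.6]]]
scale_b = [[[0.3, 0.7]], [[0.7, 0.3]]]
fused = fuse_max([scale_a, scale_b])
labels = argmax_labels(fused)
```

In the toy example each scale is confident about a different class at each pixel, and the maximum keeps the more confident of the two predictions. Note that the fused scores no longer sum to one per pixel, which does not matter for the argmax.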

Table 8 Segmentation performance of the proposed approach with different postprocessing methods on the PASCAL VOC val set


5 Conclusion

In this work, we propose to improve WSSS by learning and refining class-agnostic masks. This brings two significant benefits: an enhanced feature representation for semantic segmentation and improved object localization information for pseudo semantic label generation. For the latter, we propose a new normalization method to generate improved CAM maps. In addition, we propose a cross-task label refinement mechanism that jointly uses class-agnostic and semantic segmentation predictions to generate further refined pseudo labels for both tasks. We have conducted extensive experiments on PASCAL VOC and MS COCO, achieving state-of-the-art WSSS results. A limitation of the proposed approach is its reliance on a multi-step training procedure. A potential future improvement would be an end-to-end framework that integrates the cross-task pseudo label updating process into the training of the weakly supervised multi-task network, thereby achieving online cross-task label refinement.

Data availability statement

The PASCAL VOC dataset is available at https://doi.org/10.1007/s11263-009-0275-4. The MS COCO dataset is available at https://doi.org/10.1007/978-3-319-10602-1_48.

References

Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40(4):834–848
Zhang L, Xu D, Arnab A, Torr PHS (2020) Dynamic graph message passing networks. In: CVPR
Xu D, Alameda-Pineda X, Ouyang W, Ricci E, Wang X, Sebe N (2020) Probabilistic graph attention network with conditional kernels for pixel-wise prediction. TPAMI. https://doi.org/10.1109/TPAMI.2020.3043781
Zhang L, Sheng Z, Li Y, Sun Q, Zhao Y, Feng D (2020) Image object detection and semantic segmentation based on convolutional neural network. Neural Comput Appl 32(7):1949–1958
Jiang F, Grigorev A, Rho S, Tian Z, Fu Y, Jifara W, Adil K, Liu S (2018) Medical image semantic segmentation based on deep learning. Neural Comput Appl 29(5):1257–1265
Meraj T, Rauf HT, Zahoor S, Hassan A, Lali MI, Ali L, Bukhari SAC, Shoaib U (2021) Lung nodules detection using semantic segmentation and classification with optimal features. Neural Comput Appl 33(17):10737–10750
Bearman A, Russakovsky O, Ferrari V, Fei-Fei L (2016) What’s the point: semantic segmentation with point supervision. In: ECCV
Lin D, Dai J, Jia J, He K, Sun J (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In: CVPR
Tang M, Djelouah A, Perazzi F, Boykov Y, Schroers C (2018) Normalized cut loss for weakly-supervised cnn segmentation. In: CVPR
Hu R, Dollár P, He K, Darrell T, Girshick R (2018) Learning to segment every thing. In: CVPR
Song C, Huang Y, Ouyang W, Wang L (2019) Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In: CVPR
Zhang L, Gao Y, Xia Y, Lu K, Shen J, Ji R (2014) Representative discovery of structure cues for weakly-supervised image segmentation. TMM 16(2):470–479
Zhang T, Lin G, Cai J, Shen T, Shen C, Kot AC (2019) Decoupled spatial neural attention for weakly supervised semantic segmentation. TMM
Zhou L, Gong C, Liu Z, Fu K (2021) Sal: selection and attention losses for weakly supervised semantic segmentation. TMM 23:1035–1048
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: CVPR
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV
Huang Z, Wang X, Wang J, Liu W, Wang J (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In: CVPR
Fan J, Zhang Z, Song C, Tan T (2020) Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In: CVPR
Sun G, Wang W, Dai J, Van Gool L (2020) Mining cross-image semantics for weakly supervised semantic segmentation. In: ECCV
Wei Y, Xiao H, Shi H, Jie Z, Feng J, Huang TS (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In: CVPR
Jiang P-T, Hou Q, Cao Y, Cheng M-M, Wei Y, Xiong H-K (2019) Integral object mining via online attention accumulation. In: ICCV
Xu L, Xue H, Bennamoun M, Boussaid F, Sohel F (2021) Atrous convolutional feature network for weakly supervised semantic segmentation. Neurocomputing 421:115–126
Wei Y, Feng J, Liang X, Cheng MM, Zhao Y, Yan S (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: CVPR
Hou Q, Jiang P, Wei Y, Cheng MM (2018) Self-erasing network for integral object attention. In: NeurIPS
Li K, Wu Z, Peng KC, Ernst J, Fu Y (2018) Tell me where to look: Guided attention inference network. In: CVPR
Wang Y, Zhang J, Kan M, Shan S, Chen X (2020) Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In: CVPR
Chang YT, Wang Q, Hung WC, Piramuthu R, Tsai YH, Yang MH (2020) Weakly-supervised semantic segmentation via sub-category exploration. In: CVPR
Fan J, Zhang Z, Tan T, Song C, Xiao J (2020) Cian: Cross-image affinity net for weakly supervised semantic segmentation. In: AAAI
Zhang D, Zhang H, Tang J, Hua X, Sun Q (2020) Causal intervention for weakly-supervised semantic segmentation. In: NeurIPS
Wang X, You S, Li X, Ma H (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In: CVPR
Wang X, Liu S, Ma H, Yang M-H (2020) Weakly-supervised semantic segmentation by iterative affinity learning. IJCV 128(6):1736–1749
Araslanov N, Roth S (2020) Single-stage semantic segmentation from image labels. In: CVPR
Lee J, Kim E, Lee S, Lee J, Yoon S (2019) Ficklenet: Weakly and semi-supervised segmentation using stochastic inference. In: CVPR
Xu Y, Xu D, Hong X, Ouyang W, Ji R, Zhao G (2019) Structured modeling of joint deep feature and prediction refinement for salient object detection. In: ICCV
Wu Z, Shen C, Van Den Hengel A (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recogn 90:119–133
Ahn J, Kwak S (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: CVPR
Zhang J, Yu X, Li A, Song P, Liu B, Dai Y (2020) Weakly-supervised salient object detection via scribble annotations. In: CVPR
Hou Q, Cheng M, Hu X, Borji A, Tu Z, Torr P (2019) Deeply supervised salient object detection with short connections. PAMI 41(4):815–828
Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. IJCV 88(2):303–338
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: ECCV
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR
Kolesnikov A, Lampert CH (2016) Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In: ECCV
Hariharan B, Arbeláez P, Bourdev L, Maji S, Malik J (2011) Semantic contours from inverse detectors. In: ICCV
Xu D, Ouyang W, Wang X, Sebe N (2018) Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: CVPR
Zhang B, Xiao J, Wei Y, Sun M, Huang K (2020) Reliability does matter: an end-to-end weakly supervised semantic segmentation approach. In: AAAI
Luo W, Yang M (2020) Learning saliency-free model with generic features for weakly-supervised semantic segmentation. In: AAAI
Zhang T, Lin G, Liu W, Cai J, Kot A (2020) Splitting vs. merging: Mining object regions with discrepancy and intersection loss for weakly supervised semantic segmentation. In: ECCV
Yao Y, Chen T, Xie GS, Zhang C, Shen F, Wu Q, Tang Z, Zhang J (2021) Non-salient region object mining for weakly supervised semantic segmentation. In: CVPR
Xu L, Ouyang W, Bennamoun M, Boussaid F, Sohel F, Xu D (2021) Leveraging auxiliary tasks with affinity learning for weakly supervised semantic segmentation. In: ICCV
Al-Huda Z, Peng B, Yang Y, Algburi RNA, Ahmad M, Khurshid F, Moghalles K (2021) Weakly supervised semantic segmentation by iteratively refining optimal segmentation with deep cues guidance. Neural Comput Appl, pp 1–26
Paszke A, Gross S, Chintala S, Chanan G (2017) Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration
Liu N, Han J (2016) Dhsnet: Deep hierarchical saliency network for salient object detection. In: CVPR


Acknowledgements

This research is supported by Australian Research Council Grant DP150104251.

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions.

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, University of Western Australia, Stirling Hwy, Perth, WA, 6009, Australia
Lian Xu & Mohammed Bennamoun
Department of Electrical, Electronics and Computer Engineering, University of Western Australia, Stirling Hwy, Perth, WA, 6009, Australia
Farid Boussaid
Department of Electrical and Information Engineering, University of Sydney, City Road, Sydney, NSW, 2006, Australia
Wanli Ouyang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay Peninsula, Hong Kong, 999077, China
Dan Xu

Corresponding author

Correspondence to Lian Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Xu, L., Bennamoun, M., Boussaid, F. et al. Learning class-agnostic masks with cross-task refinement for weakly supervised semantic segmentation. Neural Comput & Applic 35, 20189–20205 (2023). https://doi.org/10.1007/s00521-023-08826-0


Received: 01 December 2021
Accepted: 28 June 2023
Published: 19 July 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s00521-023-08826-0
