1 Introduction

Semantic segmentation is a fundamental visual task of assigning a class to every pixel of a given image [1,2,3]. With deep convolutional neural networks (CNNs), semantic segmentation has progressed noticeably in fully supervised settings [4,5,6]. Full supervision, however, requires massive pixel-level dense annotations, which are difficult to acquire because the data-labeling process is expensive and laborious. It is thus desirable to develop segmentation techniques that rely on weak supervision yet achieve performance on par with that obtained under strong supervision. Recent years have witnessed many research efforts in weakly supervised semantic segmentation (WSSS), which uses weak labels such as points [7], scribbles [8, 9], bounding boxes [10, 11] and image labels [12,13,14]. In particular, image-level labels are the easiest to acquire and annotate, but they also carry the least information, i.e., the classes present in an image without any information about object localization.

A critical step for WSSS is to use weak labels to produce pseudo segmentation labels. Given image labels, model interpretation techniques such as class activation mapping (CAM) [15] and Grad-CAM [16] are able to extract object localization maps from the intermediate layers of CNNs. However, these CAM maps only indicate the most discriminative object regions, which are incomplete and do not provide sufficient supervision for semantic segmentation. To obtain better pseudo segmentation labels, prior works have developed various strategies to discover non-discriminative object regions [17,18,19] and expand CAM maps. However, the resulting CAM maps still exhibit inaccurate boundaries, leading to incorrect segmentation predictions. In addition, previous methods [20,21,22] commonly applied a pre-trained saliency detector to the target segmentation dataset to extract object localization information that assists pseudo semantic label generation.

Fig. 1

Column a shows a supervised semantic segmentation training example including an image (top) and a pixel-wise class label mask (bottom) which is not available in the WSSS task. We propose to learn class-agnostic masks for WSSS. Column b shows a pre-trained saliency map (top) which is used to initially supervise the class-agnostic mask learning and to guide the generation of an initial pseudo semantic label mask (bottom). Column c shows the improved pseudo labels for both tasks using the proposed cross-task label refinement, which provides better supervision for the network training

However, for the task of semantic segmentation, off-the-shelf saliency maps can also introduce misleading object information due to the gap between the salient objects learned during pre-training and the objects of interest. For example, the salient object detected by a pre-trained saliency model is the dog house (Fig. 1b top), while the target object for segmentation is the dog (Fig. 1a bottom). This can lead to inaccurate pseudo labels, and the network training is vulnerable to such errors.

These observations indicate that having precise object boundary information is important to both segmentation prediction and pseudo segmentation label generation. We thus propose to construct a class-agnostic mask learning task while exploiting supervision from off-the-shelf saliency maps. This is performed jointly with semantic segmentation using a multi-branch network. The benefits are two-fold: (i) the multi-task joint learning can regularize feature learning and learn more robust features for semantic segmentation, especially to differentiate between foreground objects and backgrounds; moreover, it can improve the generalization ability of the network by leveraging useful information from related tasks, which yields a stronger inductive bias than single-task learning and thus reduces overfitting; (ii) the learned class-agnostic maps can also contribute to the pseudo segmentation label generation. In particular, off-the-shelf saliency maps are utilized not only to generate initial pseudo semantic labels (Fig. 1b bottom) but also to provide initial supervision for learning class-agnostic masks, with online class-agnostic mask predictions incorporated for self-refinement. Compared to off-the-shelf saliency maps, such learned class-agnostic masks are more adaptive to the target semantic segmentation task and contain more accurate object localization information. We further propose a cross-task label refinement mechanism to take advantage of the learned semantic segmentation masks and class-agnostic masks, thereby producing refined pseudo labels for both tasks (Fig. 1c). Moreover, we propose a new normalization method for CAM to generate class-specific localization maps, which can cover entire object regions. By combining the improved CAM maps and the proposed discriminative foreground-background class-agnostic masks, pseudo semantic labels can be substantially improved to better optimize the whole deep neural network.

Our contributions are summarized as follows:

  • We propose to jointly learn class-agnostic masks and semantic segmentation using image labels and off-the-shelf saliency maps. Such an approach is shown to lead to an improved segmentation performance, and this is also shown to provide more reliable class-agnostic masks for pseudo label generation. We leverage a new normalization method for CAM to produce class-specific localization maps (i.e., pCAM), which can cover entire object regions. The resulting pCAM maps complement the class-agnostic maps, producing high-quality pseudo semantic labels.

  • We introduce a cross-task label refinement mechanism, which jointly leverages predictions from the tasks of class-agnostic and semantic segmentation with pCAM maps to refine their pseudo labels. This mechanism is shown to effectively correct the errors brought by the pre-trained saliency model, providing more accurate supervision to learn semantic segmentation and class-agnostic masks.

  • The proposed method achieves superior WSSS results compared to state-of-the-art methods on PASCAL VOC 2012 and MS COCO (Sect. 4.2).

The rest of this paper is organized as follows. We review related work in Sect. 2 and describe the proposed approach in Sect. 3. Section 4 presents experimental details with ablation studies and discusses the results. Section 5 concludes the paper.

2 Related work

This section reviews recent image-label supervised semantic segmentation approaches, covering CAM map-based refinement and semantic prediction-based refinement.

2.1 CAM map-based refinement

Most existing WSSS approaches have utilized the object localization information of the CAM maps to produce pseudo semantic labels. However, the raw CAM maps only indicate the most discriminative object regions, which are small and sparse. A typical example of improving CAM maps is the heuristics-based object mining. By iteratively erasing the detected object regions of input images [23], the network is driven to learn new patterns for classification. Similar techniques based on heuristic erasing have been presented in [24, 25]. Jiang et al. [21] observe that the classification network attends to different object regions during training and obtain an integral object localization map by accumulating CAM maps online. Since the sole reliance on the conventional classification objective loss function leads to incomplete CAM maps, prior works apply different regularization methods for training the classification network to obtain improved CAM maps. Wang et al. [26] suggest that imposing an equivariance constraint on the CAM maps under any spatial affine transformation can result in maps which better fit the shape of objects. Fan et al. [18] observe that the standard classification objective only focuses on the discrimination between different object classes, ignoring the boundaries between each class and the backgrounds. They thus propose to learn an intra-class boundary based on the implicit feature manifold. Chang et al. [27] re-formulate the problem into a fine-grained classification task, for which the pseudo labels of sub-categories are extracted from unsupervised feature clustering. Moreover, cross-image relations have also been explored to enhance the representations for extracting CAM maps, e.g., in [19] and [28]. A recent work by Zhang et al. [29] argues that it is the confounding context from the dataset that causes the ambiguous boundaries of CAM maps. Subsequently, they propose to use class-specific average segmentation masks to approximate the confounding set and incorporate it into the image classification to obtain better CAM maps. In contrast, we propose a different normalization method for CAM, which generates more complete maps than the original CAM.

2.2 Semantic prediction-based refinement

There are several other methods which focus on refining the pseudo segmentation labels, which are usually produced by the raw CAM maps, by taking advantage of the segmentation predictions. For instance, Wang et al. [30] refine semantic pseudo labels by discovering object affinities based on super-pixel regions derived from the segmentation prediction. Similarly, Wang et al. [31] iteratively select reliable regions from the segmentation outputs to learn pixel-wise affinities, which are then utilized to refine the segmentation results and produce pseudo segmentation labels. Araslanov et al. [32] propose to refine the segmentation results based on image local consistency so as to obtain pseudo semantic labels to enable the optimization of the segmentation network.

Although these complex methods discover more non-discriminative object regions, their resulting pseudo segmentation labels generally have coarse boundaries. Therefore, a number of methods [19,20,21, 23, 24, 33, 34] have exploited background cues from off-the-shelf saliency maps to assist pseudo semantic label generation. However, these pre-trained saliency models are generally not well adapted to the semantic segmentation task. In this work, we address this problem by formulating a task of learning class-agnostic masks and incorporating it into a joint learning framework with semantic segmentation to obtain more generalizable representations. Moreover, in order to provide better supervision for the learning of class-agnostic maps, we combine the pre-trained saliency maps and the online predicted class-agnostic maps, which provide complementary and progressively more accurate class-agnostic localization information. We also propose a cross-task label refinement mechanism to further refine pseudo labels to learn both class-agnostic and semantic segmentation masks.

Fig. 2

An overview of the proposed method. We propose to improve WSSS by learning class-agnostic masks, which is formulated as a multi-task problem, i.e., semantic segmentation (SS), class-agnostic (CA) segmentation and multi-label image classification. Because only image labels are provided, to supervise the learning of class-agnostic and semantic segmentation masks, we use a pre-trained saliency map as (f) an initial pseudo class-agnostic mask and progressively refine it by incorporating the online class-agnostic prediction. Moreover, we combine (c) the proposed pCAM maps (illustrated in Fig. 3) extracted from the classification branch and a pre-trained saliency map to generate (e) an initial pseudo semantic segmentation mask using the conventional thresholding-based procedures. Once the initial network training converges, we introduce a cross-task label refinement module (illustrated in Fig. 4) to use (b) the predicted semantic segmentation mask and (d) the predicted class-agnostic mask to produce (g) a refined pseudo semantic segmentation mask and (h) a refined pseudo class-agnostic mask to re-train the network

3 The proposed method

This section starts with an overview, followed by the description of our weakly supervised multi-task network architecture. The subsequent subsection describes the proposed normalization method for CAM to produce improved class-specific localization maps, which constitute a key component of the pseudo semantic label generation process. The following subsection elaborates the details of the proposed cross-task pseudo label generation for class-agnostic and semantic segmentation tasks. The final subsection presents the model training and inference processes.

3.1 Overview

Figure 2 presents an overview of the proposed approach. We build a multi-branch network to jointly perform semantic segmentation, class-agnostic segmentation and image classification tasks, with only image-level annotations. Besides, we utilize a general pre-trained saliency model to generate binary maps as a guide to provide supervision for the learning of the other two tasks. More specifically, we propose a different normalization method for CAM maps, generating more complete class-specific object localization maps. The improved CAM maps are combined with the pre-trained saliency maps to produce better initial pseudo semantic segmentation labels. For the class-agnostic learning, the pre-trained saliency maps are initially used as pseudo labels and are gradually refined by combining the online class-agnostic predictions. Once the training is complete, we propose a cross-task label refinement mechanism, which jointly takes advantage of the class-agnostic and semantic segmentation predictions to produce improved pseudo class-agnostic and semantic segmentation labels. The refined pseudo labels are then leveraged to fine-tune the multi-task network, leading to improved semantic segmentation results.

3.2 Weakly supervised multi-task network architecture

We build our deep network based on ResNet38 [35], which has 38 convolutional layers with wide channels. Following [36], we make modifications to the original ResNet38 to construct a backbone network with an output stride of 8. In order to learn more robust and informative representations for weakly supervised semantic segmentation, we adopt three branches following the backbone network, i.e., an image classifier, a class-agnostic segmentation decoder, and a semantic segmentation decoder. More specifically, given an RGB image as input, the backbone network produces an activation map \({\textbf{F}} \in {\mathbb {R}}^{H\times W\times K}\), with K, H and W indicating its number of channels and two spatial dimensions, respectively. For the classification task, a Global Average Pooling (GAP) layer is applied on the backbone feature maps. The resulting feature vector is forwarded into a fully connected (fc) layer, predicting the class probabilities. For the class-agnostic segmentation branch, the backbone feature maps are forwarded to a DenseASPP module [37], which is composed of three cascaded atrous convolutional layers (aconv) (rates = 6, 12, and 18). Finally, a \(1\times 1\) convolutional (conv) layer, followed by a sigmoid layer, is applied to predict the class-agnostic masks. Similarly, the semantic segmentation decoder includes three aconv layers (rates = 6, 12, and 18) and a final \(1\times 1\) conv layer, followed by a softmax layer, for semantic segmentation prediction.
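To make the tensor shapes of the three branches concrete, the following NumPy sketch reproduces only the final layer of each branch. It is an illustration under stated assumptions rather than the actual network: the backbone and the atrous layers are omitted, the sizes are hypothetical, the \(1\times 1\) convolutions are written as per-pixel matrix products, and the semantic head is assumed to have one extra background channel.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def three_heads(F, W_cls, W_ca, W_seg):
    """Final layer of each branch. F: backbone activation map, shape (H, W, K)."""
    # Classification: global average pooling followed by a fully connected layer.
    cls_logits = F.mean(axis=(0, 1)) @ W_cls          # (C,)
    # Class-agnostic head: a 1x1 conv is a per-pixel linear map, then sigmoid.
    ca_mask = sigmoid(F @ W_ca)[..., 0]               # (H, W)
    # Semantic head: 1x1 conv, then softmax over the class channel.
    seg_prob = softmax(F @ W_seg, axis=-1)            # (H, W, C_seg)
    return cls_logits, ca_mask, seg_prob

rng = np.random.default_rng(0)
H, W, K, C = 56, 56, 32, 20   # hypothetical sizes; e.g. a 448-pixel side at stride 8
C_seg = C + 1                 # assumption: one additional background channel
F = rng.standard_normal((H, W, K))
cls_logits, ca_mask, seg_prob = three_heads(
    F,
    rng.standard_normal((K, C)),
    rng.standard_normal((K, 1)),
    rng.standard_normal((K, C_seg)),
)
print(cls_logits.shape, ca_mask.shape, seg_prob.shape)
```

The sigmoid output gives one foreground probability per pixel, whereas the softmax output gives a distribution over classes per pixel, which is why the two decoders end in different activations.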

3.3 Generating class probability-based CAM maps

We use CAM [15] to produce class-specific localization maps for the generation of semantic segmentation pseudo labels. More specifically, for a given class c and spatial coordinates (i, j), the CAM map is calculated as follows:

$$\begin{aligned} \textbf{CAM}_c(i, j) = \sum _{k=1}^{K}{\textbf{W}}_{k}^{c}{\textbf{F}}_{k}(i,j), \end{aligned}$$
(1)

where \({\textbf{W}} \in {\mathbb {R}}^{K\times C}\) is the weight matrix of the last fc layer, with C denoting the number of classes, and \({\textbf{W}}_{k}^{c}\) represents the importance score of the channel k to the class c. As shown in Fig. 3, the generated CAM map for the class c is processed via the min-max normalization along the spatial dimensions, referred to as \(\textrm{sCAM}\):

$$\begin{aligned} \textbf{sCAM}_{c}(i,j)=\frac{\textrm{ReLU}(\textbf{CAM}_c(i,j))}{\max _{(i,j)}\textbf{CAM}_{c}}. \end{aligned}$$
(2)

Fig. 3

Illustration of the normalization methods to generate sCAM and the proposed pCAM maps: sCAM maps [15] are generated by applying the min-max normalization along the spatial dimensions (i.e., \(H~\times ~W\)); our proposed pCAM maps are generated by applying softmax along the channel dimension (i.e., C)
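Eqs. 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration on random toy data, not the authors' implementation; the function names and the small `eps` guard are additions that do not appear in the equations.

```python
import numpy as np

def cam_maps(F, W):
    """Eq. 1: CAM_c(i, j) = sum_k W[k, c] * F[i, j, k].

    F: (H, W, K) backbone activations; W: (K, C) weights of the last fc layer.
    Returns an array of shape (H, W, C).
    """
    return np.einsum("ijk,kc->ijc", F, W)

def scam_maps(cam, eps=1e-8):
    """Eq. 2: ReLU, then division by the per-class spatial maximum."""
    pos = np.maximum(cam, 0.0)
    peak = cam.max(axis=(0, 1), keepdims=True)
    # eps (an addition, not part of Eq. 2) guards against non-positive peaks.
    return pos / np.maximum(peak, eps)

rng = np.random.default_rng(0)
F = rng.random((8, 8, 16))          # toy non-negative features
W = rng.standard_normal((16, 5))    # toy fc weights for 5 classes
cam = cam_maps(F, W)
scam = scam_maps(cam)

# GAP followed by the fc layer (bias omitted) equals the spatial mean of the CAM.
logits = F.mean(axis=(0, 1)) @ W
print(np.allclose(logits, cam.mean(axis=(0, 1))))
```

The final check holds because both GAP and the fc layer are linear, which is why the fc weights can be reused directly as channel importance scores.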

In contrast to sCAM, we propose to use a different normalization method to generate CAM maps based on class probabilities, hereinafter referred to as pCAM. More specifically, as illustrated in Fig. 3, pCAM maps are produced by applying the softmax operation along the channel dimension. As a result, each spatial vector of the resulting pCAM map represents the class probability distribution of the corresponding pixel:

$$\begin{aligned} \textbf{pCAM}_{c}(i,j) = \frac{\exp (\textbf{CAM}_{c}(i,j))}{\sum \nolimits _{c'=1}^{C}\exp (\textbf{CAM}_{c'}(i,j))}. \end{aligned}$$
(3)

sCAM tends to highlight the most discriminative regions among all spatial locations. In contrast, the proposed pCAM highlights the pixels which have large probabilities for the given class. For the classes present in a given image, the corresponding CAM maps tend to have higher activation values than the CAM maps of classes absent from the image. Therefore, the regions activated by pCAM are larger than those activated by sCAM.
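The difference between the two normalizations can be seen on a hand-made toy CAM. The sketch below is illustrative only: the activation values and the 0.5 threshold are assumptions chosen to make the effect visible, not values from the paper.

```python
import numpy as np

def pcam_maps(cam):
    """Eq. 3: softmax over the class channel at every pixel. cam: (H, W, C)."""
    z = cam - cam.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scam_maps(cam, eps=1e-8):
    """Eq. 2, repeated here so that the block is self-contained."""
    peak = cam.max(axis=(0, 1), keepdims=True)
    return np.maximum(cam, 0.0) / np.maximum(peak, eps)

# Toy CAM for two classes on a 1 x 5 strip: class 0 is present, class 1 is absent.
# Class 0 has one sharp peak (discriminative part) and a moderately activated body.
cam = np.zeros((1, 5, 2))
cam[0, :, 0] = [10.0, 3.0, 3.0, 3.0, 0.0]
cam[0, :, 1] = [0.0, 0.0, 0.0, 0.0, 0.0]

thr = 0.5  # illustrative threshold, not a value from the paper
area_s = int((scam_maps(cam)[..., 0] > thr).sum())
area_p = int((pcam_maps(cam)[..., 0] > thr).sum())
print(area_s, area_p)  # prints "1 4"
```

Dividing by the spatial maximum keeps only the peak pixel above the threshold, while the channel-wise softmax also keeps the three body pixels, because their activation for the present class exceeds that of the absent class.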

3.4 Cross-task pseudo label generation

This section presents the proposed two-step method for generating class-agnostic and semantic segmentation pseudo labels.

3.4.1 Initial pseudo label generation

To learn class-agnostic masks without ground truth, we propose to utilize a coarse saliency label map \({{\textbf{P}}}{{\textbf{t}}}_{sal}\) estimated by a pre-trained saliency model as an initial guide and to incorporate complementary information from online predictions. More specifically, the pre-trained saliency model generally yields reasonable results on its source images, but it is error-prone when applied to images with low contrast or complex backgrounds, owing to its limited generalization ability on different target datasets. Moreover, the detected salient object may not be the object of interest for the target task. In contrast, with the shared backbone features, the predicted class-agnostic mask, denoted as \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\), contains useful object localization information, which becomes more reliable as the learning progresses. Therefore, we propose to generate pseudo class-agnostic label masks \({\textbf{G}}^{init}_\text {ca}\) by fusing these two complementary sources through a Conditional Random Field (CRF) model:

$$\begin{aligned} \mathbf {{\textbf{G}}}^{init}_\text {ca} = \textrm{CRF}_d\left(\frac{\mathbf {{{\textbf{P}}}{{\textbf{r}}}}_\text {ca} + \mathbf {{{\textbf{P}}}{{\textbf{t}}}}_{sal}}{2}\right), \end{aligned}$$
(4)

where \(\textrm{CRF}_d(\cdot )\) denotes a densely connected CRF [38] which uses the average of \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and \({{\textbf{P}}}{{\textbf{t}}}_{sal}\) as a unary term. The fused output from the CRF model is more adapted to the target dataset, thereby providing better supervision to learn class-agnostic masks. To generate initial pseudo segmentation labels \({\textbf{G}}^{init}_\text {seg}\), we follow previous works [19,

Fig. 4

An illustration of the proposed cross-task label refinement for the pseudo label update of the class-agnostic (CA) and the semantic segmentation (SS) tasks

3.4.2 Cross-task label refinement

When the joint multi-task optimization converges, the improved predictions from all three tasks can be utilized for cross-task refinement. This yields improved pseudo labels for class-agnostic and semantic segmentation, which can further boost multi-task learning. Figure 4 depicts the computation flow of the proposed cross-task refinement module. Given the predicted class-agnostic mask \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and the predicted semantic map \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\), we perform a structured fusion of these two types of predictions to obtain the refined class-agnostic pseudo label mask \({\textbf{G}}^{ref}_\text {ca}\) as follows:

$$\begin{aligned} {\textbf{G}}^{ref}_\text {ca} = \textrm{CRF}_d\left(\frac{{{\textbf{P}}}{{\textbf{r}}}_\text {ca} + \textrm{Br}_s({{\textbf{P}}}{{\textbf{r}}}_\text {seg})}{2}\right), \end{aligned}$$
(5)

where \(\textrm{Br}_s(\cdot )\) is a binarization operation on the segmentation probability map, outputting a one-channel map \({{\textbf{P}}}{{\textbf{r}}}'_\text {seg}\) with values of 0 and 1; the model \(\textrm{CRF}_d\) shares the same parameters as the one used in Eq. 4. More specifically, \(\textrm{Br}_s\) first converts the segmentation map \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) into a one-channel map and then binarizes it, with label 1 representing ‘foreground’ and label 0 ‘background’, as follows:

$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}&= \mathop {\mathrm {arg\,max}}\limits _{c}\textrm{Supp}({{\textbf{P}}}{{\textbf{r}}}_\text {seg}), \end{aligned}$$
(6)
$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j)&= {\left\{ \begin{array}{ll} 1&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) > 0, \\ 0&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) = 0, \end{array}\right. } \end{aligned}$$
(7)

where \(\textrm{Supp}(\cdot )\) denotes a suppression function, which multiplies \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) by the image-level labels across the class channel to suppress incorrect predictions. Then, similar to the procedures of initial pseudo semantic label generation, we combine the pCAM map and the refined class-agnostic pseudo label mask \({\textbf{G}}^{ref}_\text {ca}\) to obtain refined pseudo semantic label \({\textbf{G}}^{ref}_\text {seg}\). Finally, the refined pseudo class-agnostic label masks \({\textbf{G}}^{ref}_\text {ca}\) and the refined pseudo semantic label masks \({\textbf{G}}^{ref}_\text {seg}\) are used together with the image labels to re-train the overall network.
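The binarization of Eqs. 6 and 7 and the averaged input of Eq. 5 can be sketched as follows. Several points are assumptions made only for this illustration: channel 0 of the segmentation map is taken to be the background and is never suppressed, the function names are hypothetical, and the dense CRF is not run, so a plain 0.5 threshold stands in for \(\textrm{CRF}_d\).

```python
import numpy as np

def supp(pr_seg, image_labels):
    """Supp(.): zero out channels of classes absent from the image.

    pr_seg: (H, W, C+1) probabilities, channel 0 = background (assumption).
    image_labels: (C,) multi-hot vector over the foreground classes.
    """
    keep = np.concatenate(([1.0], image_labels))   # background is always kept
    return pr_seg * keep

def br_s(pr_seg, image_labels):
    """Eqs. 6-7: arg max over classes, then foreground (1) vs background (0)."""
    label_map = supp(pr_seg, image_labels).argmax(axis=-1)
    return (label_map > 0).astype(np.float64)

def crf_unary(pr_ca, pr_seg, image_labels):
    """The averaged map passed to CRF_d in Eq. 5 (the CRF itself is not run)."""
    return 0.5 * (pr_ca + br_s(pr_seg, image_labels))

# Toy 1 x 3 image with 2 foreground classes; only class 1 is present.
pr_seg = np.array([[[0.7, 0.2, 0.1],     # background wins
                    [0.2, 0.7, 0.1],     # class 1 wins
                    [0.3, 0.1, 0.6]]])   # absent class 2 wins before suppression
labels = np.array([1.0, 0.0])
pr_ca = np.array([[0.1, 0.9, 0.4]])

binary = br_s(pr_seg, labels)
unary = crf_unary(pr_ca, pr_seg, labels)
# Stand-in for CRF_d: a plain 0.5 threshold on the averaged map.
g_ref_ca = (unary > 0.5).astype(np.uint8)
print(binary, unary, g_ref_ca)
```

In the third pixel, the segmentation branch favours a class that the image labels rule out; after suppression the pixel falls back to background, which shows how the image-level labels filter out incorrect predictions before fusion.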

3.5 Training and inference

3.5.1 Training

Our overall learning objective function is formulated as follows:

$$\begin{aligned} {\mathcal {L}}_\text {all}= {} {\mathcal {L}}_\text {cls} + {\mathcal {L}}_\text {ca} + {\mathcal {L}}_\text {seg}, \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {L}}_\text {cls}= -\sum \limits ^{N}_{i=1}\left[ {\textbf{G}}_\text {cls}^{i}\log \frac{\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})} + (1-{\textbf{G}}_\text {cls}^{i})\log \frac{1}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}\right] , \end{aligned}$$
(9)
$$\begin{aligned} {\mathcal {L}}_\text {ca}= -\sum \limits ^{M}_{j=1}\big [{\textbf{G}}_\text {ca}^{j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j}+ (1-{\textbf{G}}_\text {ca}^{j})\log (1-{{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j})\big ], \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal {L}}_\text {seg}= -\sum \limits ^{M}_{j=1}\sum \limits ^{N}_{i=1} {\textbf{G}}_\text {seg}^{i,j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {seg}^{i,j}, \end{aligned}$$
(11)

where \({\mathcal {L}}_\text {cls}\) is a multi-label soft margin loss calculated between the class predictions \({{\textbf{P}}}{{\textbf{r}}}_\text {cls}\) and the multi-hot image labels \({\textbf{G}}_\text {cls}\); \({\mathcal {L}}_\text {ca}\) is a binary cross-entropy loss computed between the predicted class-agnostic masks \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and the class-agnostic pseudo label masks \({\textbf{G}}_\text {ca}\); and \({\mathcal {L}}_\text {seg}\) is a pixel-wise cross-entropy loss computed between the semantic segmentation predictions \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) and the pseudo semantic labels \({\textbf{G}}_\text {seg}\). N and M denote the numbers of classes of a dataset and pixels of an input image, respectively.
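The objective of Eqs. 8 to 11 can be written compactly in NumPy for a single image. This is a sketch that follows the equations as printed (plain sums, no averaging or weighting); the function names and the clipping constants are additions for numerical safety and are not taken from the paper.

```python
import numpy as np

def loss_cls(logits, g_cls):
    """Eq. 9: multi-label soft margin loss on raw class scores. Shapes: (N,)."""
    # log(sigmoid(x)) = -log(1 + exp(-x)); log(1 - sigmoid(x)) = -log(1 + exp(x))
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    return -np.sum(g_cls * log_p + (1.0 - g_cls) * log_not_p)

def loss_ca(pr_ca, g_ca, eps=1e-12):
    """Eq. 10: binary cross-entropy summed over the M pixels. Shapes: (M,)."""
    p = np.clip(pr_ca, eps, 1.0 - eps)    # clipping is an added numerical guard
    return -np.sum(g_ca * np.log(p) + (1.0 - g_ca) * np.log(1.0 - p))

def loss_seg(pr_seg, g_seg, eps=1e-12):
    """Eq. 11: pixel-wise cross-entropy. Shapes: (M, N), g_seg one-hot per pixel."""
    return -np.sum(g_seg * np.log(np.clip(pr_seg, eps, 1.0)))

def loss_all(logits, g_cls, pr_ca, g_ca, pr_seg, g_seg):
    """Eq. 8: unweighted sum of the three terms."""
    return loss_cls(logits, g_cls) + loss_ca(pr_ca, g_ca) + loss_seg(pr_seg, g_seg)

# Toy example with N = 2 classes and M = 2 pixels.
logits = np.array([0.0, 0.0])
g_cls = np.array([1.0, 0.0])
pr_ca = np.array([0.5, 0.5])
g_ca = np.array([1.0, 0.0])
pr_seg = np.array([[0.5, 0.5], [0.5, 0.5]])
g_seg = np.array([[1.0, 0.0], [0.0, 1.0]])
total = loss_all(logits, g_cls, pr_ca, g_ca, pr_seg, g_seg)
print(total)  # six terms of log 2 each in this symmetric toy case
```

Note that the classification term takes raw scores, since the sigmoid is folded into Eq. 9, whereas the other two terms take probabilities produced by the sigmoid and softmax layers of the decoders.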

Fig. 5

The learning pipeline of the proposed approach. Our training process has four steps. a The classification branch is first optimized to produce pCAM maps, which are then utilized to produce initial pseudo semantic label masks with the guide of pre-trained saliency maps. b Then, the entire multi-branch network is trained to predict semantic segmentation and class-agnostic masks using the initial pseudo semantic labels and the pre-trained saliency maps. c The network predictions are jointly used in the proposed cross-task label refinement, producing refined pseudo semantic and class-agnostic label masks. d Finally, we re-train the multi-task network using the updated pseudo labels produced by (c)

Figure 5 illustrates the proposed pipeline. More specifically, the classification branch of the multi-task network is first trained for 15 epochs, with the other two branches frozen, to extract pCAM maps. The initial pseudo semantic label masks are then produced by fusing pCAM maps and off-the-shelf saliency maps. With these initial pseudo labels, the entire network is then trained for 15 epochs. Afterward, we perform the cross-task label refinement using the learned class-agnostic masks and semantic segmentation masks and obtain refined pseudo labels for the two tasks. Subsequently, the overall multi-task model is re-trained for 15 epochs with the refined pseudo labels.
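The four steps can be summarized as a small schedule description. The sketch below only records what the text states (which branches are trained, which labels supervise them, and for how many epochs); the `Stage` structure and its field names are hypothetical, and whether the backbone is updated in step (a) is not specified here, so it is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple      # branches updated in this stage
    supervision: str      # source of the labels used in this stage
    epochs: int

SCHEDULE = (
    Stage("a: classification warm-up", ("cls",), "image labels", 15),
    Stage("b: joint training", ("cls", "ca", "seg"),
          "initial pseudo labels (pCAM + saliency)", 15),
    # Step (c) only regenerates pseudo labels, so no training epochs are spent.
    Stage("c: cross-task label refinement", (), "network predictions", 0),
    Stage("d: re-training", ("cls", "ca", "seg"), "refined pseudo labels", 15),
)

total_epochs = sum(s.epochs for s in SCHEDULE)
for s in SCHEDULE:
    print(f"{s.name}: train {s.trainable or 'nothing'} for {s.epochs} epochs")
print("total:", total_epochs)
```

Under this reading, the full procedure amounts to 45 training epochs, with the label refinement in step (c) performed offline between the two joint training phases.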