OCNet: Object Context for Semantic Segmentation

Abstract

In this paper, we address the semantic segmentation task with a new context aggregation scheme named object context, which focuses on enhancing the role of object information. Motivated by the fact that the category of each pixel is inherited from the object it belongs to, we define the object context for each pixel as the set of pixels that belong to the same category as the given pixel in the image. We use a binary relation matrix to represent the relationship between all pixels, where the value one indicates the two selected pixels belong to the same category and zero otherwise. We propose to use a dense relation matrix to serve as a surrogate for the binary relation matrix. The dense relation matrix is capable to emphasize the contribution of object information as the relation scores tend to be larger on the object pixels than the other pixels. Considering that the dense relation matrix estimation requires quadratic computation overhead and memory consumption w.r.t. the input size, we propose an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices. To capture richer context information, we further combine our interlaced sparse self-attention scheme with the conventional multi-scale context schemes including pyramid pooling (Zhao et al. 2017) and atrous spatial pyramid pooling (Chen et al. 2018). We empirically show the advantages of our approach with competitive performances on five challenging benchmarks including: Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.


Notes

  1. There exist two kinds of multiple-scale problems: (i) objects of different categories have multiple scales given that their distances to the camera are the same, e.g., the “car” is larger than the “person”; (ii) objects of the same category have multiple scales given that their distances to the camera are different, e.g., the closer “person” is larger than the distant “person”.

  2. https://www.cityscapes-dataset.com/

  3. https://groups.csail.mit.edu/vision/datasets/ADE20K/

  4. http://sysu-hcp.net/lip/

  5. https://cs.stanford.edu/~roozbeh/pascal-context/

  6. https://github.com/nightrome/cocostuff

References

  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.

  • Bulò, S.R., Porzi, L., & Kontschieder, P. (2018). In-place activated batchnorm for memory-optimized training of dnns. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 5639–5647.

  • Caesar, H., Uijlings, J.R.R., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 1209–1218.

  • Chen, L., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. CoRR. arXiv:1706.05587.

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Cheng, B., Chen, L., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T.S., Hwu, W., & Shi, H. (2019). Spgnet: Semantic prediction guidance for scene parsing. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 5217–5227.

  • Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating long sequences with sparse transformers. CoRR. arXiv:1904.10509.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp 3213–3223.

  • Ding, H., Jiang, X., Shuai, B., Liu, A.Q., & Wang, G. (2018). Context contrasted feature and gated multi-scale aggregation for scene segmentation. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 2393–2402.

  • Ding, H., Jiang, X., Liu, A.Q., Magnenat-Thalmann, N., & Wang, G. (2019a). Boundary-aware feature propagation for scene segmentation. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 6818–6828.

  • Ding, H., Jiang, X., Shuai, B., Liu, A.Q., & Wang, G. (2019b). Semantic correlation promoted shape-variant context for segmentation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 8885–8894.

  • Divvala, S. K., Hoiem, D., Hays, J., Efros, A. A., & Hebert, M. (2009). An empirical study of context in object detection. In: 2009 IEEE computer society conference on computer vision and pattern recognition, CVPR 2009, Miami, Florida, USA, June 20-25, 2009, pp 1271–1278.

  • Ferrari, V., Hebert, M., Sminchisescu, C., & Weiss, Y. (eds) (2018). Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, Lecture Notes in Computer Science, vol 11205, Springer.

  • Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019a). Dual attention network for scene segmentation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 3146–3154.

  • Fu, J., Liu, J., Wang, Y., Li, Y., Bao, Y., Tang, J., & Lu, H. (2019b). Adaptive context network for scene parsing. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 6747–6756.

  • Gong, K., Liang, X., Zhang, D., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp 6757–6765.

  • Gonzalez-Garcia, A., Modolo, D., & Ferrari, V. (2018). Objects as context for detecting their semantic parts. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 6907–6916.

  • Greenspun, P. (1999). Philip and Alex’s guide to Web publishing. Morgan Kaufmann.

  • He, J., Deng, Z., Zhou, L., Wang, Y., & Qiao, Y. (2019). Adaptive pyramid context network for semantic segmentation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 7519–7528.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp 770–778.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R.B. (2017). Mask R-CNN. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp 2980–2988.

  • Hoyer, L., Munoz, M., Katiyar, P., Khoreva, A., & Fischer, V. (2019). Grid saliency for context explanations of semantic segmentation. In: Advances in neural information processing systems 32: Annual conference on neural information processing systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp 6459–6470.

  • Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., & Liu, W. (2019). Ccnet: Criss-cross attention for semantic segmentation. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 603–612.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6-11 July 2015, pp 448–456.

  • Kong, S., & Fowlkes, C.C. (2018). Recurrent scene parsing with perspective understanding in the loop. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 956–965.

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.

  • Kuo, W., Angelova, A., Malik, J., & Lin, T. (2019). Shapemask: Learning to segment novel objects by refining shape priors. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 9206–9215.

  • Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., & Liu, H. (2019). Expectation-maximization attention networks for semantic segmentation. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 9166–9175.

  • Li, Y., & Gupta, A. (2018). Beyond grids: Learning graph representations for visual recognition. In: Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9245–9255.

  • Liang, X., Hu, Z., Zhang, H., Lin, L., & Xing, E.P. (2018a). Symbolic graph reasoning meets convolutions. In: Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 1858–1868.

  • Liang, X., Zhou, H., & Xing, E.P. (2018b). Dynamic-structured semantic propagation network. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 752–761.

  • Liang, X., Gong, K., Shen, X., & Lin, L. (2019). Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4), 871–885.

  • Lin, G., Milan, A., Shen, C., & Reid, I.D. (2017a). Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp 5168–5177.

  • Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: common objects in context. In: Computer Vision - ECCV 2014 - 13th European conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pp 740–755.

  • Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., & Belongie, S.J. (2017b). Feature pyramid networks for object detection. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp 936–944.

  • Liu, W., Rabinovich, A., & Berg, A. C. (2015). ParseNet: Looking wider to see better. CoRR. arXiv:1506.04579.

  • Luo, Y., Zheng, Z., Zheng, L., Guan, T., Yu, J., & Yang, Y. (2018). Macro-micro adversarial network for human parsing. In: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IX, pp 424–440.

  • Ma, N., Zhang, X., Zheng, H., & Sun, J. (2018). Shufflenet V2: practical guidelines for efficient CNN architecture design. In: Computer Vision - ECCV 2018 - 15th European conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pp 122–138.

  • Massa, F., & Girshick, R. (2018). Maskrcnn-benchmark: Fast, modular reference implementation of instance segmentation and object detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark.

  • Mottaghi, R., Chen, X., Liu, X., Cho, N., Lee, S., Fidler, S., Urtasun, R., & Yuille, A.L. (2014). The role of context for object detection and semantic segmentation in the wild. In: 2014 IEEE conference on computer vision and pattern recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp 891–898.

  • Nie, X., Feng, J., & Yan, S. (2018). Mutual learning to adapt for joint human parsing and pose estimation. In: Computer Vision - ECCV 2018 - 15th European conference, Munich, Germany, September 8-14, 2018, Proceedings, Part V, pp 519–534.

  • Pang, Y., Li, Y., Shen, J., & Shao, L. (2019). Towards bridging semantic gap to improve semantic segmentation. In: 2019 IEEE/CVF International conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 4229–4238.

  • Roelofs, G., & Koman, R. (1999) PNG: the definitive guide. O’Reilly & Associates, Inc.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III, pp 234–241.

  • Ruan, T., Liu, T., Huang, Z., Wei, Y., Wei, S., & Zhao, Y. (2019). Devil in the details: Towards accurate single and multiple human parsing. In: The Thirty-Third AAAI conference on artificial intelligence, AAAI 2019, The Thirty-first innovative applications of artificial intelligence conference, IAAI 2019, The Ninth AAAI symposium on educational advances in artificial intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp 4814–4821.

  • Shelhamer, E., Long, J., & Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651.

  • Shen, Z., Zhang, M., Zhao, H., Yi, S., & Li, H. (2018). Efficient attention: Attention with linear complexities. arXiv:1812.01243.

  • Shetty, R., Schiele, B., & Fritz, M. (2019). Not using the car to see the sidewalk - quantifying and controlling the effects of context in classification and segmentation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 8218–8226.

  • Shuai, B., Zuo, Z., Wang, B., & Wang, G. (2018). Scene segmentation with dag-recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1480–1493.

  • Sun, K., Xiao, B., Liu, D., & Wang, J. (2019a). Deep high-resolution representation learning for human pose estimation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 5693–5703.

  • Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., & Liu, D., et al. (2019b). High-resolution representations for labeling pixels and regions. CoRR. arXiv:1904.04514.

  • Tian, Z., He, T., Shen, C., Yan, Y. (2019). Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 3126–3135.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. In: Advances in neural information processing systems 30: Annual conference on neural information processing systems 2017, NeurIPS 2017, December 4-9, 2017, Long Beach, CA, USA, pp 5998–6008.

  • Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., & Cottrell, G.W. (2018a). Understanding convolution for semantic segmentation. In: 2018 IEEE winter conference on applications of computer vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pp 1451–1460.

  • Wang, W., Zhang, Z., Qi, S., Shen, J., Pang, Y., & Shao, L. (2019). Learning compositional neural information fusion for human parsing. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 5702–5712.

  • Wang, X., Girshick, R.B., Gupta, A., & He, K. (2018b). Non-local neural networks. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 7794–7803.

  • Wu, T., Tang, S., Zhang, R., Cao, J., & Li, J. (2019). Tree-structured kronecker convolutional network for semantic segmentation. In: IEEE international conference on multimedia and expo, ICME 2019, Shanghai, China, July 8-12, 2019, pp 940–945.

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In: Computer Vision - ECCV 2018 - 15th European conference, Munich, Germany, September 8-14, 2018, Proceedings, Part V, pp 432–448.

  • Xie, G., Wang, J., Zhang, T., Lai, J., Hong, R., & Qi, G. (2018). Interleaved structured sparse convolutional neural networks. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 8847–8856.

  • Yang, M., Yu, K., Zhang, C., Li, Z., & Yang, K. (2018). Denseaspp for semantic segmentation in street scenes. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 3684–3692.

  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018a). Bisenet: Bilateral segmentation network for real-time semantic segmentation. In: Computer Vision - ECCV 2018 - 15th European conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pp 334–349.

  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018b). Learning a discriminative feature network for semantic segmentation. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 1857–1866.

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In: 4th international conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

  • Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. In: Computer Vision - ECCV 2020 - 16th European conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, pp 173–190.

  • Yue, K., Sun, M., Yuan, Y., Zhou, F., Ding, E., & Xu, F. (2018). Compact generalized non-local network. In: Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 6511–6520.

  • Zhang, F., Chen, Y., Li, Z., Hong, Z., Liu, J., Ma, F., Han, J., & Ding, E. (2019a). Acfnet: Attentional class feature network for semantic segmentation. In: 2019 IEEE/CVF international conference on computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 6797–6806.

  • Zhang, H., Dana, K.J., Shi, J., Zhang, Z., Wang, X., Tyagi, A., & Agrawal, A. (2018). Context encoding for semantic segmentation. In: 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp 7151–7160.

  • Zhang, H., Zhang, H., Wang, C., & Xie, J. (2019b). Co-occurrent features in semantic segmentation. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 548–557.

  • Zhang, R., Tang, S., Zhang, Y., Li, J., & Yan, S. (2017a). Scale-adaptive convolutions for scene parsing. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp 2050–2058.

  • Zhang, T., Qi, G., Xiao, B., & Wang, J. (2017b). Interleaved group convolutions. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp 4383–4392.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp 6230–6239.

  • Zhao, H., Zhang, Y., Liu, S., Shi, J., Loy, C.C., Lin, D., & Jia, J. (2018). Psanet: Point-wise spatial attention network for scene parsing. In: Computer Vision - ECCV 2018 - 15th European conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IX, pp 270–286.

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp 5122–5130.

  • Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., & Liang, J. (2018). Unet++: A nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis - and - multimodal learning for clinical decision support - 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, proceedings, pp 3–11.

  • Zhu, Z., Xu, M., Bai, S., Huang, T., & Bai, X. (2019). Asymmetric non-local neural networks for semantic segmentation. In: 2019 IEEE/CVF international conference on computer vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp 593–602.

Download references

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grants 62071013 and 61671027, and in part by the National Key R&D Program of China under Grant 2018AAA0100300.

Author information

Corresponding author

Correspondence to Yuhui Yuan.

Additional information

Communicated by Cha Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

1.1 A. More Discussion on the Benefits of Object Context

We give a further explanation of why we believe OC is superior to the two previous representation methods, PPM (Zhao et al. 2017) and ASPP (Chen et al. 2017), as follows:

Fig. 10

Effect of background context on semantic segmentation and the context explanation provided by grid saliency for an erroneous prediction; the image is taken from MS COCO (Lin et al. 2014). The grid saliency (a) shows the responsible context for misclassifying the cow (green) as horse (purple) in the semantic segmentation (b). It shows the training bias that horses are more likely on road than cows. Removing the background context (c) yields a correctly classified cow (d)

  • In theory, enhancing the object information in the context can decrease the variance of the context information; in other words, the context of PPM and ASPP suffers from a larger variance than the OC context, because the OC context only contains the variance of the {object information} while the context of PPM/ASPP further contains the variance of the {object information, useful background information, irrelevant background information}. Recent studies (Hoyer et al. 2019; Shetty et al. 2019) have verified that the overuse of noisy context information based on PPM leads to poor generalization ability. For example, the “cow” pixels might be mis-classified as “horse” pixels when the “cow” appears on the road. We directly use Fig. 10 from Hoyer et al. (2019) to support our point. In summary, explicitly enhancing the object information might decrease the variance of the context information and thus increase the generalization ability of the model.

  • In experiments, according to the results in Table 2 (in the main paper), we have verified that OCNet outperforms both PSPNet and DeepLabv3 under fair comparison settings.

1.2 B. Formulation of PPM Context

We illustrate the definition of the context \({\mathcal {I}}_i\) based on the PPM (Zhao et al. 2017) scheme:

$$\begin{aligned} {\mathcal {I}}_i = \left\{ j\in {\mathcal {I}}~|~\left\lfloor \frac{(j - 1)k}{N} \right\rfloor = \left\lfloor \frac{(i - 1)k}{N} \right\rfloor \right\} , \end{aligned}$$
(22)

where \(k \in \{2, 3, 6\}\) represents the different pyramid region partitions. Such a context is an aggregation of the pixels with the same quotient.
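
To make Eq. (22) concrete, the following sketch (our own illustration, not the authors' code; the function name and toy sizes are hypothetical) enumerates the PPM-style context set of a pixel for one partition level:

```python
def ppm_context(i, N, k):
    """Context set of pixel i (1-based) among N pixels for partition level k, per Eq. (22)."""
    bin_of = lambda idx: ((idx - 1) * k) // N   # floor((idx - 1) * k / N)
    return {j for j in range(1, N + 1) if bin_of(j) == bin_of(i)}

# Toy example: N = 12 pixels split into k = 3 bins; pixel 5 shares its bin with pixels 5-8.
print(sorted(ppm_context(5, N=12, k=3)))   # [5, 6, 7, 8]
```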

1.3 C. Formulation of Permutation Matrix

We illustrate the definition of each value \(p_{i,j}\) in the permutation matrix \({\mathbf {P}}\):

$$\begin{aligned} p_{i,j} = {\left\{ \begin{array}{ll} 1, &{} j = ((i-1) \bmod P) \times P + \lfloor \frac{i-1}{P}\rfloor + 1; \\ 0, &{} \mathrm {otherwise}, \end{array}\right. } \end{aligned}$$
(23)

where, according to \({\mathbf {W}} = {\mathbf {W}}^l {\mathbf {P}}^{\top } {\mathbf {W}}^g {\mathbf {P}}\), multiplying the permutation matrix \({\mathbf {P}}\) (resp. \({\mathbf {P}}^{\top }\)) on the right side of \({\mathbf {W}}^g\) (resp. \({\mathbf {W}}^l\)) permutes the i-th column of \({\mathbf {W}}^g\) (resp. \({\mathbf {W}}^l\)) to the j-th column whenever \(p_{i,j}=1\) (resp. \(p_{j,i}=1\)).
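
The sketch below (our own; it assumes \(N = P^2\), i.e., \(P\) groups of \(P\) positions, which is what the indexing in Eq. (23) implies) materializes \({\mathbf {P}}\) and checks that right-multiplying by it reproduces the familiar reshape-transpose-flatten interlacing used in the implementation:

```python
import numpy as np

def permutation_matrix(P):
    """Build the N x N permutation matrix of Eq. (23) with N = P * P (1-based indices)."""
    N = P * P
    M = np.zeros((N, N), dtype=int)
    for i in range(1, N + 1):
        j = ((i - 1) % P) * P + (i - 1) // P + 1
        M[i - 1, j - 1] = 1
    return M

P = 3
M = permutation_matrix(P)
x = np.arange(P * P)
# Right-multiplication by M moves the i-th entry to the j-th slot, which is the same
# interlacing as reshape(P, P) -> transpose -> flatten.
assert np.array_equal(x @ M, x.reshape(P, P).T.reshape(-1))
```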

1.4 D. Why is the Sparse Relation More Efficient?

For the convenience of analysis, we rewrite the mathematical formulations for computing the context representations (without the transform functions \(\delta (\cdot )\) and \(\rho (\cdot )\)) under the dense relation scheme and the sparse relation scheme as follows. The formulation of the dense relation scheme is \({\mathbf {Z}} = {\mathbf {W}} {\mathbf {X}}\) and the formulation of the sparse relation scheme is \({\mathbf {Z}} =({\mathbf {W}}^{l}{\mathbf {P}}^{\top }{\mathbf {W}}^{g}{\mathbf {P}}){\mathbf {X}}\). As written, the sparse relation scheme still requires \({\mathcal {O}}(N^2)\) GPU memory to store the reconstructed dense relation matrix \({\mathbf {W}}^{l}{\mathbf {P}}^{\top }{\mathbf {W}}^{g}{\mathbf {P}}\). To avoid such expensive GPU memory consumption, we rewrite the sparse relation scheme as \({\mathbf {Z}} ={\mathbf {W}}^{l}({\mathbf {P}}^{\top }({\mathbf {W}}^{g}({\mathbf {P}}{\mathbf {X}})))\) according to the associative law. Because both \({\mathbf {W}}^{l}\) and \({\mathbf {W}}^{g}\) are sparse block matrices and each block is independent of the other blocks, we compute the block-matrix multiplications concurrently by aligning the blocks along the batch dimension. Besides, we implement the permutation matrix via a combination of the permute and reshape operations provided in PyTorch. More details are given in the discussion in Sect. 3.3 and Algorithm 1.
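
As a concrete illustration, the PyTorch sketch below (our own simplification, not the released implementation) computes \({\mathbf {Z}} ={\mathbf {W}}^{l}({\mathbf {P}}^{\top }({\mathbf {W}}^{g}({\mathbf {P}}{\mathbf {X}})))\) without materializing any \(N\times N\) matrix: the blocks of \({\mathbf {W}}^{g}\) and \({\mathbf {W}}^{l}\) are stacked along the batch dimension and processed with batched matrix products, and the permutation is realized by a view/transpose. The 1-D grouping convention (consecutive positions form local groups, strided positions form global groups) is an assumption of this toy example.

```python
import torch

def interlaced_aggregate(W_g, W_l, X, P, Q):
    """Z = W^l (P^T (W^g (P X))) with N = P * Q positions and C channels.

    X:   (N, C) features.
    W_g: (Q, P, P) - one small relation matrix per global group (strided positions).
    W_l: (P, Q, Q) - one small relation matrix per local group (consecutive positions).
    """
    N, C = X.shape
    # P X: regroup so that positions far apart (one from each local group) sit together.
    Xg = X.view(P, Q, C).transpose(0, 1)        # (Q, P, C)
    Xg = torch.bmm(W_g, Xg)                     # W^g (P X), batched over the Q global groups
    Xl = Xg.transpose(0, 1)                     # P^T (...): back to the original grouping
    Z = torch.bmm(W_l, Xl)                      # W^l (...), batched over the P local groups
    return Z.reshape(N, C)

# Toy usage.
P, Q, C = 3, 4, 8
X = torch.randn(P * Q, C)
W_g, W_l = torch.rand(Q, P, P), torch.rand(P, Q, Q)
print(interlaced_aggregate(W_g, W_l, X, P, Q).shape)   # torch.Size([12, 8])
```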

Fig. 11

Illustrating how the sparse relation approximates the dense relation. We use \(\mathrm {A}_1,\mathrm {B}_1,...,\mathrm {B}_3\) to represent the different input positions. The gray arrows represent the information propagation path from one input position to one output position. In (a) Dense Relation, each output position connects with all input positions directly, thus, the relation matrix is fully dense. We use the dense relation matrix \({\mathbf {W}}\) to record the weights on all connections. In (b) Sparse Relation, we have two relation matrices and each relation matrix only contains the sparse connections to a small set of selected pixels, and the combination of the two sparse connections ensures that each output position has direct or indirect relations with all input positions. We use the two sparse relation matrices \({\mathbf {W}}^g\) and \({\mathbf {W}}^l\) to record the weights on all sparse connections

Fig. 12

Illustrating the Interlaced Sparse Self-Attention with Indices Permutation. We mark the positions in the input feature map with the indices from a to p, e.g., the index a represents the spatial position (1, 1) and the index p represents the spatial position (4, 4). We illustrate how we permute the indices in all stages as follows: (i) For the Permute in the global relation stage, we permute the positions according to the remainder of the indices divided by the group numbers, e.g., 2 for both the height and width dimensions. For example, all positions including \(\{(1, 1), (1, 3), (3, 1), (3, 3)\}\) share the same remainder (1, 1) when we divide the indices by 2 for both dimensions, thus, we group the positions \(\{\mathrm{a}, \mathrm{c}, \mathrm{i}, \mathrm{k}\}\) together. Similarly, we get the other 3 groups of positions: \(\{\mathrm{b}, \mathrm{d}, \mathrm{j}, \mathrm{l}\}\) (share the same remainder (1, 0)), \(\{\mathrm{e}, \mathrm{g}, \mathrm{m}, \mathrm{o}\}\) (share the same remainder (0, 1)) and \(\{\mathrm{f}, \mathrm{h}, \mathrm{n}, \mathrm{p}\}\) (share the same remainder (0, 0)). (ii) For the Permute in the local relation stage, we permute the positions according to the quotient of the indices divided by the group numbers (2, 2). Similarly, we get 4 groups of positions: \(\{\mathrm{a}, \mathrm{b}, \mathrm{e}, \mathrm{f}\}\) (share the same quotient (0, 0)), \(\{\mathrm{c}, \mathrm{d}, \mathrm{g}, \mathrm{h}\}\) (share the same quotient (0, 1)), \(\{\mathrm{i}, \mathrm{j}, \mathrm{m}, \mathrm{n}\}\) (share the same quotient (1, 0)) and \(\{\mathrm{k}, \mathrm{l}, \mathrm{o}, \mathrm{p}\}\) (share the same quotient (1, 1))

1.5 E. Intuitive Example of the Sparse Relation Scheme

We use a one-dimensional example in Fig. 11 to explain why the combination of two sparse relation matrices is capable of approximating the dense relation matrix. In other words, both the dense relation and the sparse relation ensure that each output position is connected with all input positions. Specifically, in Fig. 11 (b), the output position \({\mathrm{A}}_1\) has direct relations with \(\{{\mathrm{A}}_2, {\mathrm{A}}_3, {\mathrm{B}}_1\}\) and indirect relations with \(\{{\mathrm{B}}_2, {\mathrm{B}}_3\}\) via \({\mathrm{B}}_1\).
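
This claim can also be checked numerically. The snippet below (our own sketch, reusing the 1-D grouping convention of the previous sketch) builds random block-diagonal \({\mathbf {W}}^{g}\) and \({\mathbf {W}}^{l}\) and verifies that the product \({\mathbf {W}}^{l}{\mathbf {P}}^{\top }{\mathbf {W}}^{g}{\mathbf {P}}\) has no zero entries, i.e., every output position is connected, directly or indirectly, with every input position:

```python
import numpy as np

P, Q = 3, 4                                       # 3 local groups of 4 positions, N = 12
N = P * Q
perm = np.arange(N).reshape(P, Q).T.reshape(-1)   # interlacing permutation (cf. Eq. (23))
Pm = np.eye(N)[perm]                              # permutation matrix

def block_diag(blocks):
    out, start = np.zeros((N, N)), 0
    for b in blocks:
        out[start:start + len(b), start:start + len(b)] = b
        start += len(b)
    return out

Wg = block_diag([np.random.rand(P, P) for _ in range(Q)])   # sparse global relations
Wl = block_diag([np.random.rand(Q, Q) for _ in range(P)])   # sparse local relations
W = Wl @ Pm.T @ Wg @ Pm
print((W > 0).all())   # True: the reconstructed dense relation connects all position pairs
```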

1.6 F. Complexity Analysis

We present the proof of the complexity of the interlaced sparse self-attention scheme:

Proof

The shapes of the inputs and outputs are: \({\mathbf {X}}\) is of shape \(HW\times C\); \(\theta ({\mathbf {X}}), \phi ({\mathbf {X}}), \delta ({\mathbf {X}}) \in {\mathbb {R}}^{HW\times \frac{C}{2}}\); and \(\rho ({\mathbf {X}}) \in {\mathbb {R}}^{HW\times C}\).

In the global relation stage of ISA, the overall complexity of \(\theta (\cdot )\), \(\phi (\cdot )\), \(\delta (\cdot )\), and \(\rho (\cdot )\) is \({\mathcal {O}}(HWC^2)\). The overall complexity of \(\theta ({\mathbf {X}}^{g}_p)\phi ({\mathbf {X}}^{g}_p)^{\top }\) and \({\mathbf {W}}_p^{g} \delta ({\mathbf {X}}^{g}_p)\) is \({\mathcal {O}}((\frac{HW}{P_{h} P_{w}})^{2}C)\) within each group of positions, and there exist \(P_{h}P_{w}\) such groups in the global relation stage. Thus, the overall complexity of the global relation stage of ISA is:

$$\begin{aligned} \begin{aligned} T({\mathrm{ISA/global}}) =&T(\theta (\cdot )) + T(\phi (\cdot )) + T(\delta (\cdot )) + T(\rho (\cdot )) \\&+ P_{h}P_{w} T(\theta ({\mathbf {X}}^{g}_p)\phi ({\mathbf {X}}^{g}_p)^{\top }) \\ =&{\mathcal {O}}\left( HWC^2+(HW)^2{C}\frac{1}{{P}_{h}{P}_{w}}\right) , \end{aligned} \end{aligned}$$
(24)

Similarly, we can get the complexity of the local relation stage in ISA:

$$\begin{aligned} \begin{aligned} T({\mathrm{ISA/local}}) = {\mathcal {O}}\left( HWC^2+(HW)^2{C}\frac{1}{{Q}_{h}{Q}_{w}}\right) , \end{aligned} \end{aligned}$$
(25)

In summary, we compute the final complexity of ISA by adding \(T({\mathrm{ISA/global}})\) and \(T({\mathrm{ISA/local}})\):

$$\begin{aligned} \begin{aligned} T({\mathrm{ISA}})&= {\mathcal {O}}\left( HWC^2+(HW)^2{C}\left( \frac{1}{{P}_{h}{P}_{w}}+\frac{1}{{Q}_{h}{Q}_{w}}\right) \right) \end{aligned} \end{aligned}$$
(26)

where we achieve the minimal computation complexity of \({\mathcal {O}}(HWC^2+(HW)^{\frac{3}{2}}{C})\) when \({P_{h}P_{w}} = {Q_{h}Q_{w}} = \sqrt{HW}\) is satisfied (by the inequality of arithmetic and geometric means). \(\square \)
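
As a quick numerical illustration of Eq. (26) (our own back-of-the-envelope sketch with hypothetical sizes; the constraint that the two group counts multiply to \(HW\) reflects \(P_hQ_h=H\) and \(P_wQ_w=W\)), the attention term is smallest when the two group counts are balanced:

```python
H = W = 96                # hypothetical feature-map size, so HW = 9216
C = 512
N = H * W

def isa_cost(groups_global, groups_local):
    """Operation count of Eq. (26); the two group counts must multiply to N."""
    assert groups_global * groups_local == N
    return N * C ** 2 + N ** 2 * C * (1 / groups_global + 1 / groups_local)

print(f"unbalanced (16 x 576): {isa_cost(16, 576):.2e}")        # ~5.2e9
print(f"balanced   (96 x 96):  {isa_cost(96, 96):.2e}")         # ~3.3e9, the minimum
print(f"dense self-attention:  {N * C ** 2 + N ** 2 * C:.2e}")  # ~4.6e10
```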

1.7 G. Illustrating the Permutation Scheme of ISA

To help readers understand how we select and permute the indices within the interlaced sparse self-attention, we use the example in Fig. 12 to explain the details.
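
For completeness, the short script below (our own sketch; it uses 0-based indices, so the remainder/quotient values differ from the 1-based ones in the caption, while the resulting groups are identical) reproduces the grouping of positions a–p in Fig. 12:

```python
import string

H = W = 4
G = 2                                                   # group number along each dimension
positions = [(r, c) for r in range(H) for c in range(W)]
names = dict(zip(positions, string.ascii_lowercase))    # (0,0) -> 'a', ..., (3,3) -> 'p'

global_groups, local_groups = {}, {}
for (r, c), n in names.items():
    global_groups.setdefault((r % G, c % G), []).append(n)    # same remainder -> global stage
    local_groups.setdefault((r // G, c // G), []).append(n)   # same quotient  -> local stage

print(global_groups)  # {(0,0): ['a','c','i','k'], (0,1): ['b','d','j','l'],
                      #  (1,0): ['e','g','m','o'], (1,1): ['f','h','n','p']}
print(local_groups)   # {(0,0): ['a','b','e','f'], (0,1): ['c','d','g','h'],
                      #  (1,0): ['i','j','m','n'], (1,1): ['k','l','o','p']}
```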

1.8 H. More Details of Pyramid-OC

We explain the details of Pyramid-OC as follows: given an input feature map \({\mathbf {X}}\) of shape \({H\times W\times C}\), we first divide it into \(k\times k\) groups (\(k\in \{1,2,3,6\}\)) following the pyramid partitions of PPM (Zhao et al. 2017):

$$\begin{aligned} {\mathbf {X}} \rightarrow \begin{bmatrix} {\mathbf {X}}_{1,1} &{} {\mathbf {X}}_{1,2} &{} \cdots &{} {\mathbf {X}}_{1,k} \\ {\mathbf {X}}_{2,1} &{} {\mathbf {X}}_{2,2} &{} \cdots &{} {\mathbf {X}}_{2,k} \\ \vdots &{}\vdots &{} \ddots &{} \vdots \\ {\mathbf {X}}_{k,1} &{} {\mathbf {X}}_{k,2} &{} \cdots &{} {\mathbf {X}}_{k,k} \\ \end{bmatrix}, \end{aligned}$$
(27)

where each \({\mathbf {X}}_{i,j}\) is of shape \({\frac{H}{k}\times \frac{W}{k}\times C}, \forall ~i, j\in \{1,2,\ldots ,k\}\). We apply object context pooling (OCP) to each group \({\mathbf {X}}_{i,j}\) (note that the parameters of OCP are shared across the groups within the same partition) to compute the context representations; then we concatenate the context representations to obtain the output feature \({\mathbf {Z}}^{k}\) of shape \({H\times W\times C}\):

$$\begin{aligned} \begin{bmatrix} \mathrm {OCP}({\mathbf {X}}_{1,1}) &{} \mathrm {OCP}({\mathbf {X}}_{1,2}) &{} \cdots &{} \mathrm {OCP}({\mathbf {X}}_{1,k}) \\ \mathrm {OCP}({\mathbf {X}}_{2,1}) &{} \mathrm {OCP}({\mathbf {X}}_{2,2}) &{} \cdots &{} \mathrm {OCP}({\mathbf {X}}_{2,k}) \\ \vdots &{}\vdots &{} \ddots &{} \vdots \\ \mathrm {OCP}({\mathbf {X}}_{k,1}) &{} \mathrm {OCP}({\mathbf {X}}_{k,2}) &{} \cdots &{} \mathrm {OCP}({\mathbf {X}}_{k,k}) \\ \end{bmatrix} \rightarrow {\mathbf {Z}}^{k}. \end{aligned}$$
(28)

We compute four different context feature maps \(\{{\mathbf {Z}}^{1}, {\mathbf {Z}}^{2}, {\mathbf {Z}}^{3}, {\mathbf {Z}}^{6}\}\) based on the four different pyramid partitions. Finally, we concatenate these context feature maps:

$$\begin{aligned} {\mathbf {Z}} = \mathrm{concate}({\mathbf {Z}}^{1}, {\mathbf {Z}}^{2}, {\mathbf {Z}}^{3}, {\mathbf {Z}}^{6}). \end{aligned}$$
(29)
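
To make the pipeline of Eqs. (27)–(29) concrete, here is a structural PyTorch sketch (our own, not the released OCNet code). The `SelfAttentionOCP` module is only a stand-in for the actual OCP block, the handling of feature maps whose size is not divisible by \(k\) (adaptive pooling plus bilinear resizing) is our simplifying assumption, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionOCP(nn.Module):
    """Stand-in OCP: plain self-attention over the pixels of one region."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                  # x: (B, C, h, w)
        B, C, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, hw, C/2)
        k = self.key(x).flatten(2)                         # (B, C/2, hw)
        v = self.value(x).flatten(2).transpose(1, 2)       # (B, hw, C)
        attn = torch.softmax(q @ k, dim=-1)                # (B, hw, hw) relation matrix
        return (attn @ v).transpose(1, 2).reshape(B, C, h, w)

class PyramidOC(nn.Module):
    def __init__(self, channels, partitions=(1, 2, 3, 6)):
        super().__init__()
        self.partitions = partitions
        # One OCP per pyramid level; its parameters are shared by the k x k regions of that level.
        self.ocp = nn.ModuleList(SelfAttentionOCP(channels) for _ in partitions)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        outs = []
        for k, ocp in zip(self.partitions, self.ocp):
            # Resize so the map divides evenly into k x k regions (Eq. 27), then
            # fold the regions into the batch dimension so they are processed independently.
            hk, wk = max(H // k, 1), max(W // k, 1)
            xk = F.adaptive_avg_pool2d(x, (k * hk, k * wk))
            regions = xk.reshape(B, C, k, hk, k, wk).permute(0, 2, 4, 1, 3, 5)
            regions = regions.reshape(B * k * k, C, hk, wk)
            z = ocp(regions)                               # Eq. (28), shared OCP parameters
            z = z.reshape(B, k, k, C, hk, wk).permute(0, 3, 1, 4, 2, 5).reshape(B, C, k * hk, k * wk)
            outs.append(F.interpolate(z, size=(H, W), mode="bilinear", align_corners=False))
        return torch.cat(outs, dim=1)                      # Eq. (29): concatenate Z^1, Z^2, Z^3, Z^6

# Toy usage on a small hypothetical feature map.
feats = torch.randn(2, 32, 24, 24)
print(PyramidOC(32)(feats).shape)   # torch.Size([2, 128, 24, 24])
```
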
Fig. 13

Illustrating that the shape of the checkerboard in the dense relation (based on interlaced sparse self-attention) is the same as the shape of the global relation map. The left / right two columns present the example from the 2nd / 1st row of Fig. 6 (b) / (c), respectively

1.9 I. Checkered Artefact with ISA

We can observe a checkered artefact in Figs. 6 and 7, which is caused by our implementation of the visualization of the global relation, local relation and dense relation. We illustrate the related pseudo-code in Algorithm 2. Specifically, for a selected pixel, we multiply a set of global relation matrices (associated with the pixels that belong to the same group as the selected pixel in the local relation stage) with its local relation matrix. Therefore, the shape of the checkerboard is exactly the same as the shape of the global relation map. For example, we zoom in on the dense relation and global relation of some examples in Fig. 13.
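
For reference, the sketch below (our own, not Algorithm 2 itself; it reuses the 1-D grouping convention of the earlier sketches) reconstructs the dense relation map of one selected pixel in exactly this way: the pixel's local relation row weights the global relation rows of the pixels in its local group, which is why the resulting map inherits the block pattern of the global relation.

```python
import numpy as np

def dense_relation_row(W_l, W_g, pixel, P, Q):
    """Row of W = W^l P^T W^g P for one pixel, built from the sparse relations only.

    W_l: (P, Q, Q) local relations, W_g: (Q, P, P) global relations, N = P * Q pixels.
    """
    N = P * Q
    g, q = divmod(pixel, Q)                 # local group g, offset q inside the group
    row = np.zeros(N)
    for j in range(Q):                      # local neighbours of the selected pixel
        for p in range(P):                  # members of neighbour j's global group
            row[p * Q + j] += W_l[g, q, j] * W_g[j, g, p]
    return row

P, Q = 3, 4
W_l, W_g = np.random.rand(P, Q, Q), np.random.rand(Q, P, P)
print(dense_relation_row(W_l, W_g, pixel=5, P=P, Q=Q).reshape(P, Q))
# Every entry is non-zero, and the values vary block-wise with the global relation,
# which produces the checkerboard-like pattern discussed above.
```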



About this article

Cite this article

Yuan, Y., Huang, L., Guo, J. et al. OCNet: Object Context for Semantic Segmentation. Int J Comput Vis 129, 2375–2398 (2021). https://doi.org/10.1007/s11263-021-01465-9
