NASformer: Neural Architecture Search for Vision Transformer

  • Conference paper

Pattern Recognition (ACPR 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13188)

Abstract

Vision transformers have shown strong representation power by modeling long-range dependencies. However, the self-attention mechanism in the transformer has quadratic complexity in the sequence length, which limits generalization to dense-prediction downstream tasks. Moreover, designing vision transformer architectures by hand is challenging. To alleviate these two issues, we propose an efficient and parameter-free self-attention mechanism, named dilated window, which limits self-attention to non-overlapping windows while retaining the ability to refer to global features. The dilated window scheme relies only on changes to the data layout (reshapes and transpositions) and can be implemented in one line of code. Building on it, we propose an efficient and effective hierarchical vision transformer architecture called NASformer, obtained via one-shot neural architecture search. The searched architectures achieve better performance than recent state-of-the-art models such as ViT, DeiT, and Swin on many vision tasks, including image classification, object detection, and semantic segmentation.
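
The dilated window idea above can be made concrete with a short sketch. What follows is a minimal PyTorch illustration of the data-layout trick, not the authors' released code: the helper name dilated_window_partition, the (B, H, W, C) input layout, and the assumption that the window size w divides H and W are ours.

    import torch

    def dilated_window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
        """Group tokens into dilated windows using only reshapes/permutes.

        x: feature map of shape (B, H, W, C), with w dividing H and W.
        Returns (B * (H//w) * (W//w), w*w, C): each window holds tokens
        sampled with stride H//w (resp. W//w), so it spans the whole map
        rather than a local patch.
        """
        B, H, W, C = x.shape
        r_h, r_w = H // w, W // w  # dilation rates along each axis
        # Split each spatial axis into (intra-window position, window offset).
        x = x.view(B, w, r_h, w, r_w, C)
        # Bring the window offsets forward so strided tokens land together.
        x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
        return x.view(-1, w * w, C)

    # Toy usage: a 16x16 map with 4x4 windows gives 16 windows per image.
    feats = torch.randn(2, 16, 16, 96)
    windows = dilated_window_partition(feats, w=4)
    print(windows.shape)  # torch.Size([32, 16, 96])

Self-attention is then applied within each window; because a window's tokens are sampled at a fixed stride across the entire feature map, information mixes globally while the attention cost stays proportional to the number of windows, consistent with the abstract's claim that the scheme is parameter-free and needs only reshapes and transpositions.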


References

  1. Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying one-shot architecture search. In: ICML (2018)

  2. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: SMASH: one-shot model architecture search through hypernetworks. In: ICLR (2018)

  3. Chen, C.F., Panda, R., Fan, Q.: RegionViT: regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689 (2021)

  4. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

  5. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: ICCV (2019)

  6. Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840 (2021)

  7. Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

  8. MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation

  9. ONNX Contributors: Open Neural Network Exchange (2020). https://github.com/onnx/onnx

  10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)

  11. Dong, X., Yang, Y.: Searching for a robust neural architecture in four GPU hours. In: CVPR (2019)

  12. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)

  13. Ghiasi, G., Lin, T.Y., Le, Q.V.: NAS-FPN: learning scalable feature pyramid architecture for object detection. In: CVPR, pp. 7036–7045 (2019)

  14. Guo, Z., et al.: Single path one-shot neural architecture search with uniform sampling. In: ECCV (2020)

  15. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)

  16. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)

  17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  18. Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302 (2021)

  19. Hu, H., Zhang, Z., Xie, Z., Lin, S.: Local relation networks for image recognition. In: ICCV, pp. 3464–3473 (2019)

  20. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)

  21. Li, C., et al.: BossNAS: exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. arXiv preprint arXiv:2103.12424 (2021)

  22. Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. In: UAI (2019)

  23. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  24. Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. In: ICLR (2019)

  25. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

  26. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)

  27. Luo, R., Tian, F., Qin, T., Chen, E., Liu, T.Y.: Neural architecture optimization. In: NeurIPS (2018)

  28. Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: ICML (2018)

  29. Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: CVPR (2020)

  30. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909 (2019)

  31. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: AAAI (2019)

  32. So, D., Le, Q., Liang, C.: The evolved transformer. In: ICML (2019)

  33. Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605 (2021)

  34. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers and distillation through attention. arXiv preprint arXiv:2012.12877 (2020)

  35. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021)

  36. Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR, pp. 12894–12904 (2021)

  37. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)

  38. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: ACL (2019)

  39. Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C., Han, S.: HAT: hardware-aware transformers for efficient natural language processing. In: ACL (2020)

  40. Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C.: Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 108–126. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_7

  41. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)

  42. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992). https://doi.org/10.1007/BF00992696

  43. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)

  44. Yang, Z., et al.: CARS: continuous evolution for efficient neural architecture search. In: CVPR, pp. 1829–1838 (2020)

  45. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

  46. Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986 (2021)

  47. Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: VOLO: vision outlooker for visual recognition (2021)

  48. Zhang, Q., Yang, Y.: ResT: an efficient transformer for visual recognition. arXiv preprint arXiv:2105.13677 (2021)

  49. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR (2017)

  50. Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)

  51. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: ICLR (2017)

  52. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61976208.

Author information

Corresponding author

Correspondence to Gaofeng Meng.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ni, B., Meng, G., Xiang, S., Pan, C. (2022). NASformer: Neural Architecture Search for Vision Transformer. In: Wallraven, C., Liu, Q., Nagahara, H. (eds) Pattern Recognition. ACPR 2021. Lecture Notes in Computer Science, vol 13188. Springer, Cham. https://doi.org/10.1007/978-3-031-02375-0_4

  • DOI: https://doi.org/10.1007/978-3-031-02375-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-02374-3

  • Online ISBN: 978-3-031-02375-0

  • eBook Packages: Computer Science (R0)
