Abstract
Vision transformers have shown strong representation power by modeling long-range dependencies. However, the self-attention mechanism in the transformer has quadratic complexity with respect to the sequence length, which limits generalization to dense-prediction downstream tasks. Besides, it is challenging to design the architecture of a vision transformer manually. To alleviate these two issues, we propose an efficient and parameter-free self-attention mechanism, named dilated window, which restricts self-attention to non-overlapping windows while retaining the ability to refer to global features. The dilated window scheme relies only on changes to the data layout (reshapes and transpositions) and can be implemented with one line of code. Furthermore, building on this scheme, we propose an efficient and effective hierarchical vision transformer architecture called NASformer by using one-shot neural architecture search. The searched architectures outperform recent state-of-the-art models such as ViT, DeiT, and Swin on many vision tasks, including image classification, object detection, and semantic segmentation.
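For illustration only, the data-layout idea described above can be sketched as follows. This is not the authors' implementation: it assumes a PyTorch feature map of shape (B, H, W, C), a window size ws and a dilation rate r with H = W = ws * r, and the helper names dilated_window_partition / dilated_window_reverse are hypothetical.

import torch

def dilated_window_partition(x, ws, r):
    # x: (B, H, W, C) feature map, assuming H = W = ws * r for simplicity
    B, H, W, C = x.shape
    # split each spatial axis into (within-window index, dilation offset)
    x = x.reshape(B, ws, r, ws, r, C)
    # group tokens sharing the same dilation offset into one window;
    # tokens inside a window are spaced r apart on the original grid,
    # so each window sparsely covers the whole feature map
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B * r * r, ws * ws, C)

def dilated_window_reverse(windows, ws, r, B, C):
    # inverse layout change, restoring the (B, H, W, C) feature map
    x = windows.reshape(B, r, r, ws, ws, C)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, ws * r, ws * r, C)

# example: ws = 7, r = 8 gives a 56x56 map; attention would be computed
# over only 49 tokens per window, yet each window spans the full map
x = torch.randn(2, 56, 56, 96)
windows = dilated_window_partition(x, ws=7, r=8)   # (2*64, 49, 96)
assert torch.equal(dilated_window_reverse(windows, 7, 8, 2, 96), x)

In this sketch the quadratic cost of self-attention is paid only within each ws * ws window, while the strided sampling keeps every window's receptive field global.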
References
Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying one-shot architecture search. In: ICML (2018)
Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Smash: one-shot model architecture search through hypernetworks. In: ICLR (2018)
Chen, C.F., Panda, R., Fan, Q.: RegionViT: regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689 (2021)
Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: ICCV (2019)
Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840 (2021)
Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)
Contributors, M.: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation (2020)
Contributors, O.: Open neural network exchange (2020). https://github.com/onnx/onnx
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Dong, X., Yang, Y.: Searching for a robust neural architecture in four GPU hours. In: CVPR (2019)
Dosovitskiy, A., et al.: An image is worth \(16\times 16\) words: transformers for image recognition at scale. In: ICLR (2021)
Ghiasi, G., Lin, T.Y., Le, Q.V.: NAS-FPN: learning scalable feature pyramid architecture for object detection. In: CVPR (2019)
Guo, Z., et al.: Single path one-shot neural architecture search with uniform sampling. In: ECCV (2020)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302 (2021)
Hu, H., Zhang, Z., Xie, Z., Lin, S.: Local relation networks for image recognition. In: ICCV (2019)
Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)
Li, C., et al.: BossNAS: exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. arXiv preprint arXiv:2103.12424 (2021)
Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. In: UAI (2019)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. In: ICLR (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Luo, R., Tian, F., Qin, T., Chen, E., Liu, T.Y.: Neural architecture optimization. In: NeurIPS (2018)
Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: ICML (2018)
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: CVPR (2020)
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909 (2019)
Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: AAAI (2019)
So, D., Le, Q., Liang, C.: The evolved transformer. In: ICML (2019)
Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605 (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers and distillation through attention. arXiv preprint arXiv:2012.12877 (2020)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021)
Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: ACL (2019)
Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C., Han, S.: HAT: hardware-aware transformers for efficient natural language processing. In: ACL (2020)
Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C.: Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 108–126. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_7
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992). https://doi.org/10.1007/BF00992696
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
Yang, Z., et al.: CARS: continuous evolution for efficient neural architecture search. In: CVPR (2020)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986 (2021)
Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: VOLO: vision outlooker for visual recognition (2021)
Zhang, Q., Yang, Y.: ResT: an efficient transformer for visual recognition. arXiv preprint arXiv:2105.13677 (2021)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR (2017)
Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: ICLR (2016)
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61976208.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ni, B., Meng, G., Xiang, S., Pan, C. (2022). NASformer: Neural Architecture Search for Vision Transformer. In: Wallraven, C., Liu, Q., Nagahara, H. (eds) Pattern Recognition. ACPR 2021. Lecture Notes in Computer Science, vol 13188. Springer, Cham. https://doi.org/10.1007/978-3-031-02375-0_4
DOI: https://doi.org/10.1007/978-3-031-02375-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-02374-3
Online ISBN: 978-3-031-02375-0
eBook Packages: Computer Science, Computer Science (R0)