Sliced Recursive Transformer

Shen, Zhiqiang; Liu, Zechun; Xing, Eric

doi:10.1007/978-3-031-20053-3_42

Zhiqiang Shen,
Zechun Liu &
Eric Xing

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13684)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We present a neat yet effective recursive operation on vision transformers that improves parameter utilization without adding parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method obtains a substantial gain (\(\sim \)2%) using only a naïve recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining the superior accuracy, we propose an approximation based on multiple sliced group self-attentions across recursive layers, which reduces the computational cost by 10–30% without sacrificing performance. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other designs for efficient ViT architectures. Our best model achieves a significant improvement on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight-sharing mechanism, built on the sliced recursion structure, allows us to build a transformer with more than 100 or even 1000 shared layers with ease while keeping a compact size (13–15 M parameters), avoiding the optimization difficulties that arise when a model is too large. This flexible scalability shows great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
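The two ideas in the abstract, reusing one block's weights across depth and restricting self-attention to slices of the token sequence, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention with a residual connection only (no MLP, normalization, or multi-head projection), it slices tokens into contiguous groups, and all function names (`block`, `recursive_block`, `attention_score_pairs`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a list of token vectors."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def block(tokens, weights, groups=1):
    """Residual attention block; with groups > 1, attention stays inside each slice.

    Assumption: slices are contiguous and equally sized.
    """
    if len(tokens) % groups != 0:
        raise ValueError("number of tokens must be divisible by groups")
    size = len(tokens) // groups
    update = []
    for g in range(groups):
        update.extend(self_attention(tokens[g * size:(g + 1) * size], *weights))
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def recursive_block(tokens, weights, loops=2, groups=1):
    """Apply the same block (same weights) `loops` times: more depth, no new parameters."""
    for _ in range(loops):
        tokens = block(tokens, weights, groups)
    return tokens


def attention_score_pairs(n_tokens, groups):
    """Query-key pairs scored per block: groups * (n_tokens / groups)^2."""
    return groups * (n_tokens // groups) ** 2


random.seed(0)
dim, n_tokens = 4, 8
# One set of (W_q, W_k, W_v) matrices, shared by every recursive application.
shared = tuple(
    [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    for _ in range(3)
)
tokens = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
deep = recursive_block(tokens, shared, loops=3, groups=2)
```

In this toy setting, splitting N tokens into G groups reduces the number of scored query-key pairs from N² to N²/G, while the parameter count stays fixed at the three shared matrices regardless of `loops`. The 10–30% saving quoted in the abstract refers to the full model and is not something this sketch reproduces.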

Notes

1. In a broader sense, the recurrent neural network is a type of recursive neural network.
2. In practice, the FLOPs of the two forms are not identical, because the self-attention module includes extra operations such as softmax and multiplication with the scale and the attention values, and these are multiplied by the recursive operation.
3. We observed a minor issue in the soft distillation implementation in DeiT (https://github.com/facebookresearch/deit/blob/main/losses.py#L56). According to the formulation of KL-divergence or cross-entropy, it is unnecessary to apply a logarithm to the teacher's output (logits). Applying the logarithm to both the teacher's and the student's outputs makes the resulting KL value extremely small and effectively negligible. We argue that soft labels provide fine-grained information for distillation, and we consistently achieve better results using soft labels in a proper way than using one-hot labels with hard distillation, as shown in Sect. 5.3.
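For reference, the textbook form of the two distillation losses contrasted in note 3 can be written out directly. This is a minimal sketch assuming plain Python lists of logits; it is neither the DeiT code nor the authors' implementation, and the function names and the `temperature` parameter are illustrative assumptions. In the KL term the teacher's probabilities act as weights on the student's log-probabilities, and the teacher's own log term contributes only a constant with respect to the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_distillation_kl(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) = sum_i p_t[i] * (log p_t[i] - log p_s[i])."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # Terms with p_t[i] == 0 contribute nothing to the sum.
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0.0)


def hard_distillation_ce(student_logits, teacher_logits):
    """Cross-entropy against the teacher's argmax, i.e. a one-hot (hard) label."""
    p_s = softmax(student_logits)
    label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return -math.log(p_s[label])
```

The soft loss is zero when student and teacher agree exactly and grows as their full distributions diverge, whereas the hard loss only looks at the teacher's top class.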

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, USA
Zhiqiang Shen & Eric Xing
Hong Kong University of Science and Technology, Hong Kong, China
Zhiqiang Shen & Zechun Liu
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Zhiqiang Shen & Eric Xing
Reality Labs, Meta Inc., Menlo Park, USA
Zechun Liu

Corresponding author

Correspondence to Zhiqiang Shen.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1183 KB)

About this paper

Cite this paper

Shen, Z., Liu, Z., Xing, E. (2022). Sliced Recursive Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_42

DOI: https://doi.org/10.1007/978-3-031-20053-3_42
Published: 06 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3
eBook Packages: Computer Science, Computer Science (R0)
