Abstract
We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across depth of transformer networks. The proposed method can obtain a substantial gain (\(\sim \)2%) simply using naïve recursive operation, requires no special or sophisticated knowledge for designing principles of networks, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by recursive operation while maintaining the superior accuracy, we propose an approximating method through multiple sliced group self-attentions across recursive layers which can reduce the cost consumption by 10–30% without sacrificing performance. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other designs for efficient ViT architectures. Our best model establishes significant improvement on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight sharing mechanism by sliced recursion structure allows us to build a transformer with more than 100 or even 1000 shared layers with ease while kee** a compact size (13–15 M), to avoid optimization difficulties when the model is too large. The flexible scalability has shown great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
In a broader sense, the recurrent neural network is a type of recursive neural network.
- 2.
In practice, the FLOPs of the two forms are not identical as self-attention module includes extra operations like softmax, multiplication with scale and attention values, which will be multiples by the recursive operation.
- 3.
We observed a minor issue of soft distillation implementation in DeiT (https://github.com/facebookresearch/deit/blob/main/losses.py#L56). Basically, it is unnecessary to use logarithm for teacher’s output (logits) according to the formulation of KL-divergence or cross-entropy. Adding log on both teacher and student’s logits will make the results of KL to be extremely small and intrinsically negligible. We argue that soft labels can provide fine-grained information for distillation, and consistently achieve better results using soft labels in a proper way than one-hot label + hard distillation, as shown in Sect. 5.3.
References
https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md
Bagherinezhad, H., Horton, M., Rastegari, M., Farhadi, A.: Label refinery: Improving imagenet classification through label progression. ar**v preprint ar**v:1805.02641 (2018)
Bai, S., Kolter, J.Z., Koltun, V.: Deep equilibrium models. In: Proceedings of the International Conference on Neural Information Processing Systems (2019)
Bai, S., Kolter, J.Z., Koltun, V.: Trellis networks for sequence modeling. In: ICLR (2019)
Bai, S., Koltun, V., Kolter, J.Z.: Multiscale deep equilibrium models. In: Proceedings of the International Conference on Neural Information Processing Systems (2020)
Brown, T.B., et al.: Language models are few-shot learners. ar**v preprint ar**v:2005.14165 (2020)
Chen, M., Peng, H., Fu, J., Ling, H.: Autoformer: searching transformers for visual recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
Chowdhury, J.R., Caragea, C.: Modeling hierarchical structures with continuous recursive neural networks. In: Proceedings of the 38th International Conference on Machine Learning, pp. 1975–1988 (2021)
Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., Kaiser, L.: Universal transformers. In: International Conference on Learning Representations (2018)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019)
Dong, L., Xu, S., Xu, B.: Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888 (2018)
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Guo, Q., Yu, Z., Wu, Y., Liang, D., Qin, H., Yan, J.: Dynamic recursive neural network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2019)
Han, K., **s in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Hendricks, L.A., Mellor, J., Schneider, R., Alayrac, J.B., Nematzadeh, A.: Decoupling the role of data, attention, and losses in multimodal transformers. ar**v preprint ar**v:2102.00529 (2021)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). ar**v preprint ar**v:1606.08415 (2016)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. ar**v preprint ar**v:2103.16302 (2021)
Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1645 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. ar**v preprint ar**v:1412.6980 (2014)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: Albert: a lite bert for self-supervised learning of language representations. ar**v preprint ar**v:1909.11942 (2019)
Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T.: Visualizing the loss landscape of neural nets. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6391–6401 (2018)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: bringing locality to vision transformers. ar**v preprint ar**v:2104.05707 (2021)
Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3367–3375 (2015)
Liu, F., Gao, M., Liu, Y., Lei, K.: Self-adaptive scaling for learnable residual structure. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) (2019)
Liu, S., Yang, N., Li, M., Zhou, M.: A recursive recurrent neural network for statistical machine translation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1491–1500 (2014)
Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach. ar**v preprint ar**v:1907.11692 (2019)
Liu, Y., et al.: Cbnet: a novel composite backbone network architecture for object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11653–11660 (2020)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. ar**v preprint ar**v:2103.14030 (2021)
Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421 (2015)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. ar**v preprint ar**v:1412.6550 (2014)
Shen, Z., He, Z., Xue, X.: Meal: multi-model ensemble via adversarial learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4886–4893 (2019)
Shen, Z., Liu, Z., Xu, D., Chen, Z., Cheng, K.T., Savvides, M.: Is label smoothing truly incompatible with knowledge distillation: an empirical study. In: International Conference on Learning Representations (2021)
Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., Xue, X.: Dsod: learning deeply supervised object detectors from scratch. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1919–1927 (2017)
Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Trans. Neural Netw. 8(3), 714–735 (1997)
Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. ar**v preprint ar**v:2101.11605 (2021)
Tolstikhin, I., et al.: Mlp-mixer: an all-mlp architecture for vision. ar**v preprint ar**v:2105.01601 (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. ar**v preprint ar**v:2012.12877 (2020)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. ar**v preprint ar**v:2103.17239 (2021)
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. ar**v preprint ar**v:2102.12122 (2021)
Wang, Y., et al.: Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6778–6782. IEEE (2021)
Wu, H., et al.: Cvt: introducing convolutions to vision transformers. ar**v preprint ar**v:2103.15808 (2021)
Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. ar**v preprint ar**v:2104.06399 (2021)
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: Xlnet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems (2019)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. ar**v preprint ar**v:2101.11986 (2021)
Zhou, D., et al.: Deepvit: towards deeper vision transformer. ar**v preprint ar**v:2103.11886 (2021)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shen, Z., Liu, Z., **ng, E. (2022). Sliced Recursive Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_42
Download citation
DOI: https://doi.org/10.1007/978-3-031-20053-3_42
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3
eBook Packages: Computer ScienceComputer Science (R0)