EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

  • Conference paper
Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13807)


Abstract

In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. The code and models are available at https://t.ly/_Vu9.
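The SDTA design sketched in the abstract (channel-group splitting, depth-wise convolutions, and self-attention applied across the channel rather than the spatial dimension) can be illustrated with a short PyTorch snippet. The sketch below is a minimal reconstruction for illustration only, not the authors' released implementation: the class name `SDTASketch`, the 3x3 kernel size, the Res2Net-style cascade between groups, and the omission of pieces such as positional encoding, normalization layers, and the MLP are all assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDTASketch(nn.Module):
    """Hypothetical sketch of a split depth-wise transpose attention block."""

    def __init__(self, dim: int, num_groups: int = 4, num_heads: int = 4):
        super().__init__()
        assert dim % num_groups == 0 and dim % num_heads == 0
        g = dim // num_groups
        # One depth-wise 3x3 convolution per channel group.
        self.dw = nn.ModuleList(
            nn.Conv2d(g, g, kernel_size=3, padding=1, groups=g)
            for _ in range(num_groups)
        )
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        # 1) Split channels into groups; each group also receives the previous
        #    group's output (a Res2Net-style cascade, assumed here), which
        #    implicitly grows the receptive field.
        outs, prev = [], None
        for conv, grp in zip(self.dw, torch.chunk(x, len(self.dw), dim=1)):
            prev = conv(grp if prev is None else grp + prev)
            outs.append(prev)
        x = torch.cat(outs, dim=1)

        # 2) Transposed attention: the attention map is computed between
        #    channels (d x d per head), so its cost grows linearly with the
        #    number of spatial tokens N = H * W.
        t = x.flatten(2).transpose(1, 2)                      # (B, N, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        d = C // self.num_heads
        # Rearrange to (B, heads, d, N) so channels attend over tokens.
        q, k, v = (z.reshape(B, -1, self.num_heads, d).permute(0, 2, 3, 1)
                   for z in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)      # (B, h, d, d)
        t = (attn @ v).permute(0, 3, 1, 2).reshape(B, -1, C)  # (B, N, C)
        return self.proj(t).transpose(1, 2).reshape(B, C, H, W)
```

A quick shape check: `SDTASketch(64)(torch.randn(1, 64, 32, 32))` returns a tensor of shape `(1, 64, 32, 32)`, i.e. the block is a resolution-preserving mixer. The channel-wise attention map is what keeps the cost linear in image size, which is the property the abstract leans on for edge deployment.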

M. Maaz and A. Shaker—Equal contribution.


Notes

  1. https://developer.nvidia.com/embedded/jetson-nano-developer-kit.
  2. https://github.com/NVIDIA/TensorRT.


Author information

Authors and Affiliations

Linköping University, Linköping, Sweden

Authors

Corresponding author

Correspondence to Muhammad Maaz.

Editor information

Editors and Affiliations


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Maaz, M. et al. (2023). EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13807. Springer, Cham. https://doi.org/10.1007/978-3-031-25082-8_1

  • DOI: https://doi.org/10.1007/978-3-031-25082-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25081-1

  • Online ISBN: 978-3-031-25082-8

  • eBook Packages: Computer Science, Computer Science (R0)
