DeoT: an end-to-end encoder-only Transformer object detector

  • Original Research Paper
  • Published:
Journal of Real-Time Image Processing

Abstract

With the rapid adoption of Transformers in object detection, detection performance has improved significantly. However, Transformer-based detectors generally suffer from high complexity and slow training convergence, and they still lag behind some convolutional neural network (CNN)-based detectors. To address these problems and bring Transformer-based detection to the state-of-the-art level, this paper proposes DeoT, an end-to-end encoder-only Transformer object detector. First, we design a feature pyramid fusion module (FPFM) to generate fused features with rich semantic information. The FPFM not only improves detection accuracy but also handles objects of different sizes. Second, we propose an encoder-only Transformer module (E-OTM) that achieves a global representation of features by exploiting deformable multi-head self-attention (DMHSA). Furthermore, we design a Transformer block residual structure (TBRS) in the E-OTM, which refines the output features of the Transformer module using the channel attention of a channel refinement module (CRM) and the spatial attention of a spatial refinement module (SRM). The encoder-only Transformer module not only alleviates the complexity and convergence problems of the model but also improves detection accuracy. We conduct extensive experiments on the MS COCO and Cityscapes object detection datasets, achieving 50.9 AP after 34 epochs on the COCO 2017 test-dev set and 30.1 AP at 38 FPS on Cityscapes. DeoT is therefore efficient to train while remaining real-time and accurate at inference.
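The abstract does not specify the TBRS in detail, but its channel and spatial refinement steps read as standard channel/spatial attention. The sketch below is a hypothetical PyTorch interpretation, assuming an SE-style CRM, a CBAM-style SRM, and a simple residual connection around them; the module names, layer sizes, and reduction ratio are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch, NOT the paper's code: one plausible reading of the TBRS,
# where Transformer encoder features are refined by channel attention (CRM) and
# spatial attention (SRM) and added back through a residual connection.
import torch
import torch.nn as nn


class CRM(nn.Module):
    """Channel refinement: squeeze-and-excitation style channel attention (assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-weight each channel


class SRM(nn.Module):
    """Spatial refinement: CBAM-style spatial attention from pooled channel maps (assumed)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        w = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * w  # re-weight each spatial position


class TBRS(nn.Module):
    """Transformer block residual structure: refine encoder features, add residual."""

    def __init__(self, channels: int):
        super().__init__()
        self.crm = CRM(channels)
        self.srm = SRM()

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        refined = self.srm(self.crm(encoder_features))
        return encoder_features + refined  # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 256, 32, 32)   # (batch, channels, H, W) encoder output
    print(TBRS(256)(feats).shape)          # torch.Size([2, 256, 32, 32])
```

Applying channel re-weighting before spatial re-weighting mirrors common attention-refinement designs; the actual ordering and formulation in DeoT may differ.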

Data availability

The data that support the findings of this study are available from the corresponding author on reasonable request.

Contributions

Formal analysis, TD and KF. Methodology, TD and KF. Supervision, YH and TL. Writing—original draft, TD and KF. Writing—review and editing, TD and YW.

Corresponding author

Correspondence to Tianping Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ding, T., Feng, K., Wei, Y. et al. DeoT: an end-to-end encoder-only Transformer object detector. J Real-Time Image Proc 20, 1 (2023). https://doi.org/10.1007/s11554-023-01280-0

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11554-023-01280-0
