DeoT: an end-to-end encoder-only Transformer object detector

  • Original Research Paper
  • Published:
Journal of Real-Time Image Processing

Abstract

With the rapid adoption of Transformers in object detection, detection performance has improved significantly. However, Transformer-based detectors generally suffer from high complexity and slow training convergence, and they still lag behind some convolutional neural network (CNN)-based detectors. To address these problems and bring Transformer-based detection to the state-of-the-art level, this paper proposes DeoT, an end-to-end encoder-only Transformer object detector. First, we design a feature pyramid fusion module (FPFM) to generate fused features with rich semantic information. The FPFM not only improves detection accuracy but also handles objects of different sizes. Second, we propose an encoder-only Transformer module (E-OTM) that achieves a global representation of features by exploiting deformable multi-head self-attention (DMHSA). Furthermore, we design a Transformer block residual structure (TBRS) in the E-OTM, which refines the output features of the Transformer module using the channel attention of a channel refinement module (CRM) and the spatial attention of a spatial refinement module (SRM). The encoder-only Transformer module not only alleviates the complexity and convergence problems of the model but also improves detection accuracy. We conduct extensive experiments on the MS COCO and Cityscapes object detection datasets, achieving 50.9 AP after 34 epochs on the COCO 2017 test-dev set and 30.1 AP at 38 FPS on Cityscapes. DeoT is therefore efficient to train while remaining real-time and accurate at inference.
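The abstract does not specify the TBRS in detail, but its channel and spatial refinement steps read as standard channel/spatial attention. The sketch below is a hypothetical PyTorch interpretation, assuming an SE-style CRM, a CBAM-style SRM, and a simple residual connection around them; the module names, layer sizes, and reduction ratio are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch, NOT the paper's code: one plausible reading of the TBRS,
# where Transformer encoder features are refined by channel attention (CRM) and
# spatial attention (SRM) and added back through a residual connection.
import torch
import torch.nn as nn


class CRM(nn.Module):
    """Channel refinement: squeeze-and-excitation style channel attention (assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-weight each channel


class SRM(nn.Module):
    """Spatial refinement: CBAM-style spatial attention from pooled channel maps (assumed)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        w = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * w  # re-weight each spatial position


class TBRS(nn.Module):
    """Transformer block residual structure: refine encoder features, add residual."""

    def __init__(self, channels: int):
        super().__init__()
        self.crm = CRM(channels)
        self.srm = SRM()

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        refined = self.srm(self.crm(encoder_features))
        return encoder_features + refined  # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 256, 32, 32)   # (batch, channels, H, W) encoder output
    print(TBRS(256)(feats).shape)          # torch.Size([2, 256, 32, 32])
```

Applying channel re-weighting before spatial re-weighting mirrors common attention-refinement designs; the actual ordering and formulation in DeoT may differ.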

Data availability

The data that support the findings of this study are available from the corresponding author on reasonable request.

Contributions

Formal analysis, TD and KF. Methodology, TD and KF. Supervision, YH and TL. Writing—original draft, TD and KF. Writing—review and editing, TD and YW.

Corresponding author

Correspondence to Tianping Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ding, T., Feng, K., Wei, Y. et al. DeoT: an end-to-end encoder-only Transformer object detector. J Real-Time Image Proc 20, 1 (2023). https://doi.org/10.1007/s11554-023-01280-0

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11554-023-01280-0
