End-to-end acceleration of the YOLO object detection framework on FPGA-only devices

  • Original Article
Neural Computing and Applications

Abstract

Object detection has been revolutionized by convolutional neural networks (CNNs), but their high computational complexity and heavy data access requirements make implementing these algorithms on edge devices challenging. To address this issue, we propose an efficient object detection accelerator for the YOLO series of algorithms. Our architecture exploits multiple dimensions of parallelism to accelerate the convolution computation. We employ line-buffer-based parallel data caches and dedicated data access units to minimize off-chip bandwidth pressure. Additionally, our design accelerates not only the convolutional computation but also the control-intensive post-processing, achieving low detection latency. We evaluate the final design on a Xilinx V7-690t FPGA device, achieving a throughput of 525 GOP/s for a batch size of 1 and 914 GOP/s for a batch size of 2. Compared with state-of-the-art YOLOv2 and YOLOv3 implementations, our proposed accelerator offers up to 9× throughput improvement and 5× shorter latency.
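
The line-buffer idea mentioned in the abstract can be illustrated with a minimal software sketch. This is a generic illustration of the technique, not the authors' hardware design: the function names `sliding_windows` and `conv2d_valid`, the 3×3 kernel, and the 4×4 test image are all assumptions chosen for the example. The point it shows is that a convolution over a streamed image only needs the most recent k rows held on chip, so each pixel is fetched from external memory once.

```python
from collections import deque


def sliding_windows(image, k=3):
    """Yield (row, col, window) for every valid k x k window of `image`.

    The image is consumed one row at a time, as if streamed from off-chip
    memory. Only the k most recent rows are retained (the line buffer).
    """
    line_buffer = deque(maxlen=k)  # oldest row is dropped automatically
    for r, row in enumerate(image):
        line_buffer.append(list(row))
        if len(line_buffer) < k:
            continue  # not enough rows buffered to form a window yet
        for c in range(len(row) - k + 1):
            # Slice the same k columns out of each buffered row.
            window = [buffered[c:c + k] for buffered in line_buffer]
            yield r - k + 1, c, window


def conv2d_valid(image, kernel):
    """Plain 'valid' 2D convolution (no padding, stride 1) using the line buffer."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r, c, win in sliding_windows(image, k):
        out[r][c] = sum(win[i][j] * kernel[i][j]
                        for i in range(k) for j in range(k))
    return out


# Hypothetical 4x4 input with values 0..15 and an all-ones 3x3 kernel.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
kernel = [[1] * 3 for _ in range(3)]
result = conv2d_valid(image, kernel)
print(result)  # [[45, 54], [81, 90]]
```

In hardware, the k windows along a row would typically be read in parallel from separate on-chip memories rather than sliced sequentially as here; the sketch only captures the data-reuse pattern.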

Availability of data and materials

All data generated or analyzed during this study are included in this published article.

Acknowledgements

This work was supported by the Beijing Natural Science Foundation under Grant No. 4202063 and the National Key Research and Development Program of China under Grant No. 2019YFB2204200.

Author information

Corresponding author

Correspondence to Dong Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Zhang, D., Wang, A., Mo, R. et al. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput & Applic 36, 1067–1089 (2024). https://doi.org/10.1007/s00521-023-09078-8

  • DOI: https://doi.org/10.1007/s00521-023-09078-8
