End-to-end acceleration of the YOLO object detection framework on FPGA-only devices

Zhang, Dezheng; Wang, Aibin; Mo, Ruchan; Wang, Dong

doi:10.1007/s00521-023-09078-8

Original Article
Published: 13 November 2023

Volume 36, pages 1067–1089, (2024)

Neural Computing and Applications

Dezheng Zhang^1,2,
Aibin Wang^1,2,
Ruchan Mo^1,2 &
…
Dong Wang ORCID: orcid.org/0000-0002-0068-8824^1,2

Abstract

Object detection has been revolutionized by convolutional neural networks (CNNs), but their high computational complexity and heavy data access requirements make implementing these algorithms on edge devices challenging. To address this issue, we propose an efficient object detection accelerator for the YOLO series of algorithms. Our architecture exploits multiple dimensions of parallelism to accelerate the convolution computation. We employ line-buffer-based parallel data caches and dedicated data access units to minimize off-chip bandwidth pressure. Additionally, our design accelerates not only the convolutional computation but also the control-intensive post-processing, achieving low detection latency. We evaluate the final design on a Xilinx V7-690t FPGA device, achieving a throughput of 525 GOP/s for a batch size of 1 and 914 GOP/s for a batch size of 2. Compared with state-of-the-art YOLOv2 and YOLOv3 implementations, our proposed accelerator offers up to 9\(\times\) throughput improvement and 5\(\times\) shorter latency.
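
The abstract names line-buffer-based data caches without describing them further, so the following is only a minimal software sketch of the general line-buffer technique, not the authors' hardware design. It assumes a 3×3 kernel, stride 1, no padding, and a single channel. The function name `conv3x3_stream` and all variable names are hypothetical. The point it illustrates is that each input pixel is read exactly once in raster order, while two buffered rows and a small sliding window supply the remaining eight operands.

```python
def conv3x3_stream(pixels, width, kernel):
    """Stream pixels in raster order and yield valid 3x3 window sums.

    Illustrative model only: two row-sized line buffers plus a 3x3
    window register stand in for on-chip storage (assumed structure).
    """
    K = 3
    # Line buffers hold the two most recent rows, indexed by column.
    line_buf = [[0] * width for _ in range(K - 1)]
    # Sliding window of the last K columns across K rows.
    window = [[0] * K for _ in range(K)]

    for idx, px in enumerate(pixels):
        row, col = divmod(idx, width)
        # Vertical slice at this column: two buffered pixels plus the new one.
        column = [line_buf[0][col], line_buf[1][col], px]
        # Shift the line buffers up by one row at this column.
        line_buf[0][col] = line_buf[1][col]
        line_buf[1][col] = px
        # Shift the window left and append the new column on the right.
        for r in range(K):
            window[r] = window[r][1:] + [column[r]]
        # The window is fully valid once K rows and K columns have been seen.
        if row >= K - 1 and col >= K - 1:
            yield sum(window[r][c] * kernel[r][c]
                      for r in range(K) for c in range(K))


# Usage: a 4x4 image with values 0..15 and an all-ones kernel.
image = list(range(16))
ones = [[1] * 3 for _ in range(3)]
print(list(conv3x3_stream(image, 4, ones)))  # [45, 54, 81, 90]
```

In this model every pixel is fetched from the input stream once, which is the property that relieves off-chip bandwidth. A hardware implementation would typically replicate such buffers across channels and kernels to obtain parallelism.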

Availability of data and materials

All data generated or analyzed during this study are included in this published article.

Acknowledgements

This work was supported by the Beijing Natural Science Foundation under Grant No. 4202063 and the National Key Research and Development Program of China under Grant No. 2019YFB2204200.

Author information

Authors and Affiliations

Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, China
Dezheng Zhang, Aibin Wang, Ruchan Mo & Dong Wang
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, 100044, China
Dezheng Zhang, Aibin Wang, Ruchan Mo & Dong Wang

Corresponding author

Correspondence to Dong Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, D., Wang, A., Mo, R. et al. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput & Applic 36, 1067–1089 (2024). https://doi.org/10.1007/s00521-023-09078-8

Received: 23 December 2022
Accepted: 14 September 2023
Published: 13 November 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s00521-023-09078-8
