Abstract
Object detection has been revolutionized by convolutional neural networks (CNNs), but their high computational complexity and heavy data access requirements make these algorithms challenging to implement on edge devices. To address this issue, we propose an efficient object detection accelerator for the YOLO series of algorithms. Our architecture exploits multiple dimensions of parallelism to accelerate the convolution computation. We employ line-buffer-based parallel data caches and dedicated data access units to minimize off-chip bandwidth pressure. Additionally, our design accelerates not only the convolutional computation but also the control-intensive post-processing, achieving low detection latency. We evaluate the final design on a Xilinx V7-690t FPGA device, achieving a throughput of 525 GOP/s for a batch size of 1 and 914 GOP/s for a batch size of 2. Compared with state-of-the-art YOLOv2 and YOLOv3 implementations, our proposed accelerator offers up to 9\(\times\) higher throughput and 5\(\times\) shorter latency.
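To make the line-buffer idea concrete, the following is a minimal software sketch of the general technique, not the accelerator's actual hardware design. It assumes a single-channel image streamed row by row and a square kernel; the names `sliding_windows` and `conv2d_valid` are hypothetical and chosen for illustration only. Only the most recent k rows are held at any time, so each pixel is fetched once from the (simulated) off-chip stream and then reused across all overlapping windows.

```python
from collections import deque


def sliding_windows(rows, width, k=3):
    """Yield (row, col, window) for every valid k x k window of a streamed image.

    `rows` is an iterable of pixel rows arriving in raster order. Only the
    last k rows are kept, which mimics a line buffer holding k - 1 stored
    lines plus the line currently arriving.
    """
    line_buffer = deque(maxlen=k)  # the oldest row is evicted automatically
    for r, row in enumerate(rows):
        if len(row) != width:
            raise ValueError("every row must contain `width` pixels")
        line_buffer.append(list(row))
        if len(line_buffer) < k:
            continue  # not enough rows buffered to form a window yet
        for c in range(width - k + 1):
            window = [line[c:c + k] for line in line_buffer]
            yield r - k + 1, c, window


def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) 2D convolution driven by the line buffer."""
    k = len(kernel)
    width = len(image[0])
    out = [[0] * (width - k + 1) for _ in range(len(image) - k + 1)]
    for r, c, win in sliding_windows(iter(image), width, k):
        out[r][c] = sum(
            win[i][j] * kernel[i][j] for i in range(k) for j in range(k)
        )
    return out


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    ones = [[1] * 3 for _ in range(3)]
    print(conv2d_valid(image, ones))
```

In hardware, the k window rows would typically be read in parallel from separate on-chip memories; the sequential loop here only models the data reuse pattern, not the parallelism.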
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig12_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig13_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig14_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig15_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig16_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig17_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig18_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig19_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-023-09078-8/MediaObjects/521_2023_9078_Fig20_HTML.png)
Availability of data and materials
All data generated or analyzed during this study are included in this published article.
Acknowledgements
This work was supported by the Beijing Natural Science Foundation under Grant No. 4202063 and the National Key Research and Development Program of China under Grant No. 2019YFB2204200.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Zhang, D., Wang, A., Mo, R. et al. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput & Applic 36, 1067–1089 (2024). https://doi.org/10.1007/s00521-023-09078-8
DOI: https://doi.org/10.1007/s00521-023-09078-8