Introduction

The main idea of neural networks (NNs) comes from the structure of the biological nervous system, which consists of many connected elements called neurons [1]. In biological systems, a neuron receives signals through its dendrites and passes them on to the next neurons through its axon, as shown in Fig. 1.

Fig. 1 Typical biological neurons [20]

Neural networks are made up of artificial neurons to handle brain-like tasks such as learning, recognition and optimization. In this structure, the nodes act as neurons, the links can be considered synapses, and the biases act as activation thresholds [2]. Each layer extracts some information related to the features and forwards it, with a weight, to the next layer. The output is the sum of all these information gains multiplied by their related weights. Figure 2 represents a simple artificial neural network structure.

Fig. 2 Simple artificial neural network structure

Deep neural networks are complex artificial neural networks with more than two layers. Nowadays, these networks are widely used for several scientific and industrial purposes such as visual object detection, segmentation, image classification, speech recognition, natural language processing, genomics, drug discovery, and many other areas [3].

Deep learning is a relatively new subset of machine learning comprising algorithms that learn concepts at different levels of abstraction by utilizing artificial neural networks [4].

As Fig. 3 shows, if each input and its weight are denoted by \(X_{i}\) and \(W_{ij}\) respectively, the output \(Y_{j}\) is:

Fig. 3 A typical deep neural network structure

$$Y_{j}=\sigma \left(\sum_{i=1}^{n}W_{ij}X_{i}\right)$$
(1)

where \(\sigma\) is the activation function. A popular activation function in deep neural networks is the ReLU (Rectified Linear Unit) function, defined in Eq. (2).

$$\left[\sigma (z)\right]_{j}=\max \{z_{j},\, 0\}$$
(2)

Leaky ReLU, tanh and the sigmoid function, defined in Eq. (3), are other activation functions with less frequent usage [5].

$$\sigma \left(z\right)=\frac{1}{1+{e}^{-z}}$$
(3)
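As a concrete illustration of Eqs. (1)–(3), the short Python sketch below (using NumPy; the function names `relu`, `sigmoid` and `neuron_output` are our own illustrative choices, not taken from the cited works) computes a single neuron's output for both activation functions.

```python
import numpy as np

def relu(z):
    # Eq. (2): element-wise max(z_j, 0)
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Eq. (3): 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, activation=relu):
    # Eq. (1): weighted sum of inputs passed through the activation function
    return activation(np.dot(w, x))

# Toy usage: one neuron with three inputs
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w, relu))     # ReLU activation
print(neuron_output(x, w, sigmoid))  # sigmoid activation
```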

As shown in Fig. 4, the role of each layer of a deep neural network is to extract certain features and send them, with their corresponding weights, to the next layer. For example, the first layer may capture color properties (green, red, blue), the next layer may determine the edges of objects, and so on.

Fig. 4 Deep learning setup for object detection [21]

Convolutional neural networks (CNNs) are a type of deep neural network mostly used for recognition, mining and synthesis applications such as face detection, handwriting recognition and natural language processing [6]. Since parallel computation is an unavoidable part of CNNs, several efforts and research works have been devoted to designing optimized hardware for them, as illustrated by the sketch below. As a result, many application-specific integrated circuits (ASICs) acting as hardware accelerators have been introduced and evaluated in the recent decade [7]. In the next section, some of the most successful and influential works on CNN accelerators are introduced.
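To make the multiply-accumulate workload of a convolutional layer concrete, the following minimal sketch (our own illustrative loop-based code, not the design of any cited accelerator) computes one output feature map for a single input channel; every output element is an independent sum of products, which is exactly the parallelism that CNN hardware accelerators exploit.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Valid 2D convolution (cross-correlation) of one input channel with one
    filter; each output element is an independent multiply-accumulate."""
    H, W = image.shape
    R, S = kernel.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for y in range(out.shape[0]):          # output rows (parallelizable)
        for x in range(out.shape[1]):      # output columns (parallelizable)
            acc = 0.0
            for r in range(R):             # multiply-accumulate over the window
                for s in range(S):
                    acc += image[y + r, x + s] * kernel[r, s]
            out[y, x] = acc
    return out

# Toy usage: 6x6 input and 3x3 averaging filter -> 4x4 output feature map
img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0
print(conv2d_single_channel(img, k))
```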

Related works

Tianshi et al. [8] proposed DianNao, a hardware accelerator for large-scale convolutional neural networks (CNNs) and deep neural networks (DNNs). The main focus of the proposed model is a memory structure optimized for large neural network computations. The experimental results showed a speedup in computation and a reduction in performance and energy overhead. This research also demonstrated that the accelerator can be implemented in a very small area, on the order of 3 mm², with 485 mW of power.

Zidong et al. [9] suggested ShiDianNao, a CNN accelerator for image processing placed close to a CMOS or CCD sensor. The performance and energy of this architecture were compared to a CPU, a GPU and DianNao, which was discussed in the previous work [8]. Utilizing SRAM instead of DRAM made it 60 times more energy efficient than DianNao. It is also 50×, 30× and 1.87× faster than a mainstream CPU, GPU and DianNao respectively, implemented in a 65 nm process with only 320 mW of power.

Wenyan et al. [6] offered a flexible dataflow accelerator for convolutional neural networks called FlexFlow. Support for different types of parallelism is the main contribution of this model. Test results showed a 2–10× performance speedup and 2.5–10× better power efficiency in comparison with three investigated baseline architectures.

Eyeriss is a spatial architecture for energy-efficient dataflows for CNNs, presented by Yu-Hsin et al. [10]. This hardware model is based on a dataflow named row stationary (RS), which minimizes energy consumption by reusing filter weights across computations. The proposed RS dataflow was evaluated on the AlexNet CNN configuration and demonstrated an improvement in energy efficiency.
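The energy argument behind such dataflows can be sketched in software. In the deliberately simplified 1D convolution below, the filter weights are fetched once into a local buffer and then reused for every output position instead of being re-read from a larger memory for each multiply-accumulate; this is only a loose analogy of weight reuse, not the actual Eyeriss row-stationary mapping, and all names are our own.

```python
def conv1d_weight_stationary(inputs, weights):
    """1D convolution in which the filter weights are loaded once into a
    local buffer and reused for every output, loosely mimicking the weight
    reuse that dataflows such as row stationary exploit in hardware."""
    local_weights = list(weights)               # single fetch into a local buffer
    outputs = []
    for i in range(len(inputs) - len(local_weights) + 1):
        acc = 0.0
        for j, w in enumerate(local_weights):   # weights reused from the local buffer
            acc += w * inputs[i + j]
        outputs.append(acc)
    return outputs

# Toy usage: 5 input samples, 3-tap filter -> 3 outputs
print(conv1d_weight_stationary([1, 2, 3, 4, 5], [0.2, 0.5, 0.3]))
```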

Morph is a flexible accelerator for 3D CNN-based video processing offered by Kartik et al. [7]. Since previous works and proposed architectures did not specifically focus on video processing, this model can be considered a novelty in this area. A comparison of the energy consumption of this architecture with the earlier Eyeriss design [10] showed a large reduction, i.e. considerable energy saving. The main reason for this improvement is effective data reuse, which reduces accesses to higher-level buffers and costly off-chip memory.

Michael et al. [11] described Buffets, an efficient and composable accelerator storage idiom that is independent of any particular design. In this research, explicit decoupled data orchestration (EDDO) is introduced, which allows the energy efficiency of accelerators to be evaluated. The results of this work showed that higher energy efficiency and lower control overhead are achieved with a smaller area.

Deep learning applications

Deep learning has a wide range of applications in recognition, classification and prediction, and since it tends to work like the human brain and can consequently perform human tasks more accurately and at lower cost, its usage is increasing dramatically. A review of more than 100 papers published from 2015 to 2020 helped categorize its main applications as follows:

  • Computer vision

  • Translation

  • Smart cars

  • Robotics

  • Health monitoring

  • Disease prediction

  • Medical image analysis

  • Drug discovery

  • Biomedicine

  • Bioinformatics

  • Smart clothing

  • Personal health advisors

  • Pixel restoration for photos

  • Sound restoration in videos

  • Describing photos

  • Handwriting recognition

  • Predicting natural disasters

  • Cyber physical security systems [12]

  • Intelligent transportation systems [13]

  • Computed tomography image reconstruction [14]

Method

As mentioned previously, artificial intelligence and deep learning applications are growing drastically, but they involve highly complex computation, high energy consumption, high cost and large memory bandwidth. All these reasons were major motivations for developing deep learning accelerators (DLAs) [15]. A DLA is a hardware architecture that is specially designed and optimized for deep learning purposes. Recent DLA architectures (often programmed through frameworks such as OpenCL) have mainly focused on maximizing computation reuse and minimizing memory bandwidth, which leads to higher speed and performance [16].

Generally, most accelerators support only a fixed dataflow and are not reconfigurable, but for large deployments they need to be programmable. Hyoukjun et al. [15] proposed a novel architecture named MAERI (Multiply-Accumulate Engine with Reconfigurable Interconnects), which is reconfigurable and employs an ART (augmented reduction tree); it showed 8–459% better utilization across different dataflows compared with a rigid network-on-chip (NoC) fabric. Figure 5 shows the overall structure of the MAERI DLA.

Fig. 5 MAERI microarchitecture [15]
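As a simple software analogy for the reduction step of such a multiply-accumulate engine (our own illustration, not MAERI's actual hardware or API), the sketch below sums partial products pairwise in a binary-tree pattern rather than sequentially; in hardware, a reconfigurable tree of adders lets these groupings be changed to match different layer shapes.

```python
def tree_reduce(values):
    """Pairwise (binary-tree) reduction of partial sums, analogous to an
    adder tree in a multiply-accumulate engine."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])   # one adder per pair
        if len(level) % 2:                        # odd leftover forwarded unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Partial products of a small dot product, reduced in logarithmic depth
products = [x * w for x, w in zip([1, 2, 3, 4], [0.5, 0.25, 0.1, 0.2])]
print(tree_reduce(products))
```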

In another study, Hyoukjun et al. offered a framework called MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Resource Occupancy) for predicting performance and energy efficiency in DLAs [15].