Introduction

Today’s widespread applications of computer vision spanning surveillance1, disaster management2, camera traps for wildlife monitoring3, autonomous driving, smartphones, etc., are fueled by the remarkable technological advances in image sensing platforms4 and the ever-improving field of deep learning algorithms5. However, hardware implementations of vision sensing and vision processing platforms have traditionally been physically segregated. For example, current vision sensor platforms based on CMOS technology act as transduction entities that convert incident light intensities into digitized pixel values, through a two-dimensional array of photodiodes6. The vision data generated from such CMOS Image Sensors (CIS) are often processed elsewhere in a cloud environment consisting of CPUs and GPUs7. This physical segregation leads to bottlenecks in throughput, bandwidth, and energy-efficiency for applications that require transferring large amounts of data from the image sensor to the back-end processor, such as object detection and tracking from high-resolution images/videos.

To address these bottlenecks, many researchers are trying to bring intelligent data processing closer to the source of the vision data, i.e., closer to the CIS, taking one of three broad approaches—near-sensor processing8,9, in-sensor processing10, and in-pixel processing11,12,13. Near-sensor processing aims to incorporate a dedicated machine learning accelerator chip on the same printed circuit board8, or even 3D-stacked with the CIS chip9. Although this enables processing of the CIS data closer to the sensor rather than in the cloud, it still suffers from the data transfer costs between the CIS and the processing chip. On the other hand, in-sensor processing solutions10 integrate digital or analog circuits within the periphery of the CIS sensor chip, reducing the data transfer between the CIS sensor and processing chips. Nevertheless, these approaches still often require data to be streamed (or read in parallel) through a bus from CIS photo-diode arrays into the peripheral processing circuits10. In contrast, in-pixel processing solutions, such as11,12,13,14,15, aim to embed processing capabilities within the individual CIS pixels. Initial efforts have focused on the in-pixel analog convolution operation14,15, but many11,14,15,16 require the use of emerging non-volatile memories or 2D materials. Unfortunately, these technologies are not yet mature and thus not amenable to the existing foundry manufacturing of CIS. Moreover, these works fail to support multi-bit, multi-channel convolution operations, batch normalization (BN), and Rectified Linear Units (ReLU) needed for most practical deep learning applications. Furthermore, works that target digital CMOS-based in-pixel hardware, organized as pixel-parallel single instruction multiple data (SIMD) processor arrays12, do not support the convolution operation and are thus limited to toy workloads, such as digit recognition.
Many of these works rely on digital processing, which typically yields lower levels of parallelism than their analog in-pixel alternatives. In contrast, the work in13 leverages in-pixel parallel analog computing, wherein the weights of a neural network are represented as the exposure time of individual pixels. Their approach requires weights to be made available for manipulating pixel-exposure time through control pulses, leading to a data transfer bottleneck between the weight memories and the sensor array. Thus, an in-situ CIS processing solution, where both the weights and input activations are available within individual pixels, that efficiently implements critical deep learning operations such as multi-bit, multi-channel convolution, BN, and ReLU has remained elusive. Furthermore, all existing in-pixel computing solutions have targeted datasets that do not represent realistic applications of machine intelligence mapped onto state-of-the-art CIS. Specifically, most of the existing works are focused on simplistic datasets like MNIST12, while a few13 use the CIFAR-10 dataset, whose input images have a significantly lower resolution (\(32\times 32\)) than images captured by state-of-the-art high-resolution CIS.

Towards that end, we propose a novel in-situ computing paradigm at the sensor nodes called Processing-in-Pixel-in-Memory (P2M), illustrated in Fig. 1, that incorporates both the network weights and activations to enable massively parallel, high-throughput intelligent computing inside CISs. In particular, our circuit architecture not only enables in-situ multi-bit, multi-channel dot-product analog acceleration needed for convolution, but re-purposes the on-chip digital correlated double sampling (CDS) circuit and single slope ADC (SS-ADC) typically available in conventional CIS to implement all the required computational aspects for the first few layers of a state-of-the-art deep learning network. Furthermore, the proposed architecture is coupled with a circuit-algorithm co-design paradigm that captures the circuit non-linearities, limitations, and bandwidth reduction goals for improved latency and energy-efficiency. The resulting paradigm is the first to demonstrate feasibility for enabling complex, intelligent image processing applications (beyond toy datasets) on high-resolution images of the Visual Wake Words (VWW) dataset, catering to a real-life TinyML application. We choose to evaluate the efficacy of P2M on TinyML applications, as they impose tight compute and memory budgets that are otherwise difficult to meet with current in- and near-sensor processing solutions, particularly for high-resolution input images. Key highlights of the presented work are as follows:

Figure 1

Existing and proposed solutions to alleviate the energy, throughput, and bandwidth bottleneck caused by the segregation of Sensing and Compute.

  1.

    We propose a novel processing-in-pixel-in-memory (P2M) paradigm for resource-constrained sensor intelligence applications, wherein novel memory-embedded pixels enable massively parallel dot product acceleration using in-situ input activations (photodiode currents) and in-situ weights all available within individual pixels.

  2.

    We propose re-purposing of on-chip memory-embedded pixels, CDS circuits and SS-ADCs to implement positive and negative weights, BN, and digital ReLU functionality within the CIS chip, thereby mapping all the computational aspects for the first few layers of a complex state-of-the-art deep learning network within CIS.

  3.

    We further develop a compact MobileNet-V2 based model optimized specifically for P2M-implemented hardware constraints, and benchmark its accuracy and energy-delay product (EDP) on the VWW dataset, which represents a common use case of visual TinyML.

The remainder of the paper is organized as follows. Section "Challenges and opportunities in P2M" discusses the challenges and opportunities for P2M. Section "P2M circuit implementation" explains our proposed P2M circuit implementation using manufacturable memory technologies. Then, Sect. "P2M-constrained algorithm-circuit co-design" discusses our approach for P2M-constrained algorithm-circuit co-design. Section "Experimental results" presents our TinyML benchmarking dataset, model architectures, test accuracy and EDP results. Finally, some conclusions are provided in Sect. "Conclusions".

Challenges and opportunities in P2M

The ubiquitous presence of CIS-based vision sensors has driven the need to enable machine learning computations closer to the sensor nodes. However, given the computing complexity of modern CNNs, such as ResNet-18.

Multi-channel, multi-bit weight embedded pixels

Our modified pixel circuit builds upon the standard three-transistor pixel by embedding additional transistors \(W_i\)s that represent the weights of the CNN layer, as shown in Fig. 2. Each weight transistor \(W_i\) is connected in series with the source-follower transistor \(G_s\). When a particular weight transistor \(W_i\) is activated (by pulling its gate voltage to \(V_{DD}\)), the pixel output is modulated both by the driving strength of the transistor \(W_i\) and the voltage at the gate of the source-follower transistor \(G_s\). A higher photo-diode current implies the PMOS source follower is strongly ON, resulting in an increase in the output pixel voltage. Similarly, a higher width of the weight transistor \(W_i\) results in lower transistor resistance and hence lower source degeneration for the source-follower transistor, resulting in a higher pixel output voltage. Figure 3a, obtained from SPICE simulations using 22 nm GlobalFoundries technology, exhibits the desired dependence on transistor width and input photo-diode current. Thus, the pixel output performs an approximate multiplication of the input light intensity (voltage at the gate of transistor \(G_s\)) and the weight (or driving strength) of the transistor \(W_i\), as exhibited by the plot in Fig. 3b. The approximation stems from the fact that transistors are inherently non-linear. In “P2M-constrained algorithm-circuit co-design” section, we leverage our hardware-algorithm co-design framework to incorporate the circuit non-linearities within the CNN training framework, thereby maintaining close to state-of-the-art classification accuracy. Multiple weight transistors \(W_i\)s are incorporated within the same pixel and are controlled by independent gate control signals. Each weight transistor implements a different channel in the output feature map of the layer. Thus, the gate signals represent select lines for specific channels in the output feature map.
Note, it is desirable to reduce the number of output channels so as to reduce the total number of weight transistors embedded within each pixel while ensuring high test accuracy for VWW. For our work, using a holistic hardware-algorithm co-design framework (“Classification accuracy” section), we were able to reduce the number of channels in the first layer from 16 to 8; this implies the proposed circuit requires 8 weight transistors per pixel, which can be reasonably implemented.

The presented circuit can support both overlapping and non-overlapping strides depending on the number of weight transistors \(W_i\)s per pixel. Specifically, each stride for a particular kernel can be mapped to a different set of weight transistors over the pixels (input activations). The transistors \(W_i\)s represent multi-bit weights, as the driving strength of the transistors can be controlled over a wide range based on transistor width, length, and threshold voltage.
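The multiplicative behavior described above can be captured in a small behavioural model. The sketch below is a toy Python model, not the paper's SPICE characterization: the function name `pixel_output` and the saturating non-linearity with constant `alpha` are our own illustrative assumptions standing in for the true transistor characteristics. It shows how the pixel output approximates a product of light intensity and weight while remaining monotonic in both operands.

```python
def pixel_output(light_intensity, weight_width, alpha=0.1):
    """Toy model of the in-pixel multiply: the output voltage grows with
    both the photo-diode current (light_intensity) and the weight-transistor
    width (weight_width). A mild saturating term stands in for the
    transistor non-linearity; all constants are illustrative."""
    ideal = light_intensity * weight_width
    return ideal / (1.0 + alpha * ideal)  # saturating non-ideality

# Output increases monotonically with either operand, as in Fig. 3a
assert pixel_output(0.8, 1.0) > pixel_output(0.4, 1.0)
assert pixel_output(0.5, 2.0) > pixel_output(0.5, 1.0)
```

A training framework can absorb such a smooth, monotonic deviation from the ideal product, which is the premise of the co-design approach discussed later in the text.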

In-situ multi-pixel convolution operation

To achieve the convolution operation, we simultaneously activate multiple pixels. In the specific case of VWW, we activate \(X\times Y\times 3\) pixels at the same time, where X and Y denote the spatial dimensions and 3 corresponds to the RGB (red, green, blue) channels in the input activation layer. For each activated pixel, the pixel output is modulated by the photo-diode current and the weight of the activated \(W_i\) transistor associated with the pixel, in accordance with Fig. 3a,b. For a given convolution operation, only one weight transistor is activated per pixel, corresponding to a specific channel in the first layer of the CNN. The weight transistors \(W_i\) represent multi-bit weights through their driving strength. As detailed in “Multi-channel, multi-bit weight embedded pixels” section, for each pixel, the output voltage approximates the multiplication of light intensity and weight. For each bit line, shown as vertical blue lines in Fig. 2, the cumulative pull-up strength of the activated pixels connected to that line drives it high. The increases in pixel output voltages accumulate on the bit lines, implementing an analog summation operation. Consequently, the voltage at the output of the bit lines represents the convolution between the input activations and the weights stored inside the pixels.

Figure 3c plots the output voltage (at node Analog Convolution Output in Fig. 2) as a function of the normalized ideal convolution operation. The plot in the figure was generated by activating 75 pixels simultaneously. For each line in Fig. 3c, the activated weight transistors \(W_i\) are chosen to have the same width, and the set of colored lines represents the range of widths. For each line, the input I is swept from its minimum to maximum value, and the ideal dot product is normalized and plotted on the x-axis. The y-axis plots the actual SPICE circuit output. The largely linear nature of the plot indicates that the circuits are working as expected, and the small amount of non-linearity is captured in our training framework described in “Custom convolution for the first layer modeling circuit non-idealities” section.

Note, in order to generate multiple output feature maps, the convolution operation has to be repeated for each channel in the output feature map. The corresponding weight for each channel is stored in a separate weight transistor embedded inside each pixel. Thus, there are as many weight transistors embedded within a pixel as there are channels in the output feature map. Note that even though we can reduce the number of filters to 8 without any significant drop in accuracy for the VWW dataset, if needed, it is possible to increase the number of filters to 64 (many SOTA CNN architectures have up to 64 channels in their first layer) without a significant increase in area using advanced 3D integration, as described in “CIS process integration and area considerations” section.

In summary, the presented scheme can perform in-situ multi-bit, multi-channel analog convolution operation inside the pixel array, wherein both input activations and network weights are present within individual pixels.
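The bit-line accumulation can be sketched as a behavioural model. In the code below (our own illustrative abstraction, not the paper's circuit simulation), each activated pixel produces a slightly saturating product of its activation and the selected channel weight, and the shared bit line sums these per-pixel outputs into a single analog value, i.e., a dot product. The non-ideality model and the constant `alpha` are assumptions carried over from the toy pixel model.

```python
import numpy as np

def in_pixel_convolution(activations, weights, alpha=0.1):
    """Behavioural sketch of the multi-pixel convolution: per-pixel
    (saturating) products are accumulated on the bit line into one
    analog sum. Positive weights only; sign handling is done by the
    CDS/SS-ADC stage described later in the text."""
    products = activations * weights
    products = products / (1.0 + alpha * products)  # per-pixel non-ideality
    return float(products.sum())                    # bit-line accumulation

rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 1.0, size=(5, 5, 3))   # X x Y x 3 activated pixels
w = rng.uniform(0.1, 1.0, size=(5, 5, 3))      # one weight per pixel, one channel
analog_out = in_pixel_convolution(acts, w)
ideal = float((acts * w).sum())                # ideal dot product for comparison
```

With small `alpha`, the analog sum stays within a few percent of the ideal dot product, mirroring the largely linear transfer curve of Fig. 3c.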

Figure 3

(a) Pixel output voltage as a function of weight (transistor width) and input activation (normalized photo-diode current) simulated on the GlobalFoundries 22 nm FD-SOI node. As expected, the pixel output increases as a function of both weight and input activation. (b) A scatter plot comparing pixel output voltage to the ideal multiplication value of Weights\(\times\)Input activation (Normalized \(W\times I\)). (c) Analog convolution output voltage versus ideal normalized convolution value when 75 pixels are activated simultaneously.

Re-purposing digital correlated double sampling circuit and single-slope ADCs as ReLU neurons

Weights in a CNN layer span positive and negative values. As discussed in the previous sub-section, weights are mapped to the driving strength (or width) of transistors \(W_i\)s. As the width of transistors cannot be negative, the \(W_i\) transistors themselves cannot represent negative weights. Interestingly, we circumvent this issue by re-purposing the on-chip digital CDS circuit present in many state-of-the-art commercial CIS20,21. A digital CDS is usually implemented in conjunction with column-parallel Single Slope ADCs (SS-ADCs). A single slope ADC consists of a ramp generator, a comparator, and a counter (see Fig. 2). An input analog voltage is compared through the comparator to a ramping voltage with a fixed slope, generated by the ramp generator. A counter, which is initially reset and supplied with an appropriate clock, keeps counting until the ramp voltage crosses the analog input voltage. At this point, the output of the counter is latched and represents the converted digital value for the input analog voltage. A traditional CIS digital CDS circuit takes as input two correlated samples at two different time instances. The first sample corresponds to the reset noise of the pixel and the second sample to the actual signal superimposed with the reset noise. A digital CIS CDS circuit then takes the difference between the two samples, thereby eliminating the reset noise during ADC conversion. In an SS-ADC, the difference is taken by simply making the counter ‘up’ count for one sample and ‘down’ count for the second.

Figure 4

(a) A typical timing waveform, showing double sampling (one sample for positive and the other for negative weights). The numerical labels in the figure correspond to the numerical labels in the circuit shown in Fig. 2. (b) Typical timing waveform for the SS-ADC showing comparator output (Comp), counter enable (trigger), ramp generator output, and counter clock (Counter).

We utilize the noise-cancelling, differencing behavior of the CIS digital CDS circuit already available on commercial CIS chips to implement positive and negative weights as well as ReLU. First, each weight transistor embedded inside a pixel is ‘tagged’ as a positive or a negative weight by connecting it to the ‘red lines’ (marked as VDD for positive weights in Fig. 2) or the ‘green lines’ (marked as VDD for negative weights in Fig. 2). For each channel, we activate multiple pixels to perform an inner product and read out two samples. The first sample corresponds to a high VDD voltage applied on the ‘red lines’ while the ‘green lines’ are kept at ground. The accumulated multi-bit dot product result is digitized by the SS-ADC, while the counter is ‘up’ counting. The second sample, on the other hand, corresponds to a high VDD voltage applied on the ‘green lines’ while the ‘red lines’ are kept at ground. The accumulated multi-bit dot product result is again digitized and also subtracted from the first sample by the SS-ADC, while the counter is ‘down’ counting. Thus, the digital CDS circuit first accumulates the convolution output for all positive weights and then subtracts the convolution output for all negative weights for each channel, controlled by the respective select lines for individual channels. Note, possible sneak currents flowing between weight transistors representing positive and negative weights can be obviated by integrating a diode in series with the weight transistors, or by simply splitting each weight transistor into two series-connected transistors, where the channel select lines control one of the series-connected transistors, while the other transistor is controlled by a select line representing positive/negative weights.

Interestingly, re-purposing the on-chip CDS for implementing positive and negative weights also allows us to easily implement a quantized ReLU operation inside the SS-ADC. ReLU clips negative values to zero. This can be achieved by ensuring that the final count value latched from the counter (after the CDS operation consisting of ‘up’ counting and then ‘down’ counting) is either positive or zero. Interestingly, before performing the dot product operation, the counter can be reset to a non-zero value representing the scale factor of the BN layer, as described in “P2M-constrained algorithm-circuit co-design” section. Thus, by embedding the multi-pixel convolution operation and re-purposing the on-chip CDS and SS-ADC circuits for implementing positive/negative weights, batch normalization, and the ReLU operation, our proposed P2M scheme can implement all the computational aspects for the first few layers of a complex CNN within the pixel array, enabling massively parallel in-situ computations.
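The up/down counting scheme reduces to simple integer arithmetic, sketched below. This is our own behavioural abstraction of the mechanism described above (the function name and integer inputs are hypothetical): the counter is preset to the quantized batch-norm shift, ‘up’ counts while the positive-weight sample is digitized, ‘down’ counts for the negative-weight sample, and the latched value is clamped at zero to give the quantized ReLU.

```python
def cds_relu_readout(pos_dot, neg_dot, bn_shift=0):
    """Behavioural sketch of the re-purposed digital CDS + SS-ADC.
    pos_dot/neg_dot stand in for the digitized dot products of the
    positive-weight and negative-weight samples, respectively."""
    counter = bn_shift   # counter reset to a non-zero value (BN shift term)
    counter += pos_dot   # first sample: positive-weight convolution, up-count
    counter -= neg_dot   # second sample: negative-weight convolution, down-count
    return max(counter, 0)  # latch only non-negative values -> quantized ReLU
```

For example, `cds_relu_readout(120, 150, bn_shift=40)` latches 10, while the same two samples with no shift latch 0, illustrating how the preset counter folds the BN shift into the ReLU threshold.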

Putting these features together, our proposed P2M circuit computes one channel at a time and has three phases of operation:

  1.

    Reset Phase: First, the voltage on the photodiode node M (see Fig. 2) is pre-charged or reset by activating the reset transistor \(G_r\). Note, since we aim at performing multi-pixel convolution, the set of \(X\times Y\times 3\) pixels is reset simultaneously.

  2.

    Multi-pixel Convolution Phase: Next, we discharge the gate of the reset transistor \(G_r\), which deactivates \(G_r\). Subsequently, \(X\times Y\times 3\) pixels are activated by pulling the gates of the respective \(G_H\) transistors to VDD. Within the activated set of pixels, a single weight transistor corresponding to a particular channel in the output feature map is activated by pulling its gate voltage high through the select lines (labeled as select lines for multiple channels in Fig. 2). As the photodiode is sensitive to the incident light, photo-current is generated as light shines upon the diode (for a duration equal to the exposure time), and the voltage on the gate of \(G_s\) is modulated in accordance with the photodiode current, which is proportional to the intensity of the incident light. The pixel output voltage is a function of the incident light (voltage on node M) and the driving strength of the activated weight transistor within each pixel. Pixel outputs from multiple pixels are accumulated on the column lines and represent the multi-pixel analog convolution output. The SS-ADC in the periphery converts the analog output to a digital value. Note, the entire operation is repeated twice, once for positive weights (‘up’ counting) and once for negative weights (‘down’ counting).

  3.

    ReLU Operation: Finally, the output of the counter is latched and represents a quantized ReLU output. It is ensured that the latched output is either positive or zero, thereby mimicking the ReLU functionality within the SS-ADC.

The entire P2M circuit is simulated using a commercial 22 nm GlobalFoundries FD-SOI (fully depleted silicon-on-insulator) technology; the SS-ADCs are implemented using a bootstrap ramp generator and dynamic comparators. Assuming the counter output, which represents the ReLU function, is an N-bit integer, a single conversion needs \(2^N\) cycles. The ADC is supplied with a 2 GHz clock for the counter circuit. SPICE simulations exhibiting the multiplicative nature of the weight-transistor-embedded pixels with respect to photodiode current are shown in Fig. 3a,b. The functional behavior of the circuit for the analog convolution operation is depicted in Fig. 3c. A typical timing waveform showing the pixel operation along with the SS-ADC operation, simulated on the 22 nm GlobalFoundries technology node, is shown in Fig. 4.
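The conversion-latency relation above (up to \(2^N\) counter cycles per N-bit single-slope conversion) is easy to make concrete. The helper below is a simple arithmetic sketch assuming the 2 GHz counter clock quoted in the text; the function name is ours.

```python
def ss_adc_conversion_time(n_bits, clock_hz=2e9):
    """Worst-case single-slope conversion time: the counter must be able
    to count through all 2**n_bits codes at the given clock rate."""
    return (2 ** n_bits) / clock_hz

# e.g., an 8-bit conversion takes 256 cycles, i.e., 128 ns at 2 GHz
t_8bit = ss_adc_conversion_time(8)
```

The exponential dependence on bit width is why the ADC precision is a natural knob in the circuit-algorithm co-design: each extra output bit doubles the conversion time.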

It may also be important to note that a highlight of our proposal is that we re-purpose various circuit functions already available in commercial cameras. This ensures most of the existing peripheral and corresponding timing control blocks would require only minor modifications to support our proposed P2M computations. Specifically, instead of activating one row at a time in a rolling shutter manner, P2M requires the simultaneous activation of a group of rows, corresponding to the size of the kernels in the first layers. Multiple groups of rows would then be activated in a typical rolling shutter format. Overall, the sequencing of pixel activation (except for the fact that groups of rows have to be activated instead of a single row), CDS, ADC operation, and bus-readout would be similar to typical cameras22.

CIS process integration and area considerations

In this section, we would like to highlight the viability of the proposed P2M paradigm featuring memory-embedded pixels with respect to its manufacturability using existing foundry processes. A representative illustration of a heterogeneously integrated system catering to the needs of the proposed P2M paradigm is shown in Fig. 5. The figure consists of two key elements: (i) a backside illuminated CMOS image sensor (Bi-CIS), consisting of photo-diodes, read-out circuits, and pixel transistors (reset, source follower, and select transistors), and (ii) a die consisting of multiple weight transistors per pixel (see Fig. 2). From Fig. 2, it can be seen that each pixel consists of multiple weight transistors, which would lead to an exceptionally high area overhead. However, with the presented heterogeneous integration scheme of Fig. 5, the weight transistors are vertically aligned below a standard pixel, thereby incurring no (or minimal) increase in footprint. Specifically, each Bi-CIS chip can be implemented in a leading or lagging technology node. The die consisting of weight transistors can be built on an advanced planar or non-planar technology node such that the multiple weight transistors can be accommodated in the same footprint occupied by a single pixel (assuming pixel sizes are larger than the weight-transistor-embedded memory circuit configuration). The Bi-CIS image sensor chip/die is heterogeneously integrated through a bonding process (die-to-die or die-to-wafer) onto the die consisting of weight transistors. Preferably, a die-to-wafer low-temperature metal-to-metal fusion with a dielectric-to-dielectric direct bonding hybrid process can achieve high-throughput sub-micron pitch scaling with precise vertical alignment23. One of the advantages of adopting this heterogeneous integration technology is that chips of different sizes can be fabricated at distinct foundry sources, in distinct technology nodes, and with distinct functions, and then integrated together.
In case there are any limitations due to the increased number of transistors in the die consisting of the weights, a conventional pixel-level integration scheme, such as Stacked Pixel Level Connections (SPLC), which shields the logic CMOS layer from the incident light through the Bi-CIS chip region, would also provide a high pixel density and a large dynamic range24. Alternatively, one could also adopt the through silicon via (TSV) integration technique for front-side illuminated CMOS image sensor (Fi-CIS), wherein the CMOS image sensor is bonded onto the die consisting of memory elements through a TSV process. However, in the Bi-CIS, the wiring is moved away from the illuminated light path allowing more light to reach the sensor, giving better low-light performance25.

Figure 5

Representative illustration of heterogeneously integrated system featuring P2M paradigm, built on backside illuminated CMOS image sensor (Bi-CIS). \(\textcircled {1}\) Micro lens, \(\textcircled {2}\) Light shield, \(\textcircled {3}\) Backside illuminated CMOS Image Sensor (Bi-CIS), \(\textcircled {4}\) Backend of line of the Bi-CIS, \(\textcircled {5}\) Die consisting of weight transistors, \(\textcircled {6}\) solder bumps for input/output bus (I/O).

Advantageously, the heterogeneous integration scheme can be used to manufacture P2M sensor systems on existing as well as emerging technologies. Specifically, the die consisting of weight transistors could use a ROM-based structure as shown in “P2M circuit implementation” section or other emerging programmable non-volatile memory technologies like PCM26, RRAM27, MRAM28, ferroelectric field effect transistors (FeFETs)29, etc., manufactured in distinct foundries and subsequently heterogeneously integrated with the CIS die. Thus, the proposed heterogeneous integration allows us to achieve a lower area overhead, while simultaneously enabling seamless, massively parallel convolution. Specifically, based on reported contacted poly pitch and metal pitch numbers30, we estimate that more than 100 weight transistors can be embedded in a 3D integrated die using a 22 nm technology, assuming the underlying pixel area (dominated by the photodiode) is 10 \(\upmu\)m \(\times\) 10 \(\upmu\)m. The availability of back-end-of-line monolithically integrated two-terminal non-volatile memory devices could allow denser integration of weights within each pixel. Such weight-embedded pixels give individual pixels in-situ access to both activations and weights, as needed by the P2M paradigm, which obviates the need to transfer weights or activations from one physical location to another through a bandwidth-constrained bus. Hence, unlike other multi-chip solutions9, our approach does not incur such energy bottlenecks.
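The per-pixel weight-count estimate can be reproduced with back-of-the-envelope arithmetic. In the sketch below, the 10 µm pixel footprint comes from the text, but the ~0.9 µm² area per weight-transistor cell (a wide multi-bit device plus local wiring) is our own illustrative assumption, chosen only to be consistent with the paper's ">100 transistors" estimate rather than derived from the cited pitch data.

```python
def weights_per_pixel(pixel_um=10.0, cell_um2=0.9):
    """Rough count of weight transistors fitting under one pixel:
    photodiode-dominated pixel footprint (pixel_um on a side) divided
    by an assumed area per weight-transistor cell."""
    return int((pixel_um ** 2) // cell_um2)
```

With these assumed numbers, `weights_per_pixel()` returns 111, comfortably above the 64 first-layer channels of many SOTA CNN architectures mentioned earlier in the text.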

Figure 6

Algorithm-circuit co-design framework to enable our proposed P2M approach to optimize both the performance and energy-efficiency of vision workloads. We propose the use of ① large strides, ② large kernel sizes, ③ a reduced number of channels, ④ P2M custom convolution, and ⑤ a shifted ReLU operation to incorporate the shift term of the batch normalization layer, for emulating accurate P2M circuit behaviour.