Introduction

Artificial intelligence (AI) has achieved tremendous success during the past decade, driven by the explosive growth of big data and deep learning algorithms as well as rapidly evolving GPU and application-specific integrated circuit (ASIC) chips. However, a huge gap still exists between state-of-the-art AI hardware and the human brain, especially in terms of energy efficiency1. Conventional computing hardware based on Si CMOS and the von Neumann architecture, with separated memory and computing units, faces several key bottlenecks. For example, the slowing down of Moore’s law scaling hinders the continuous improvement of integration density and performance of individual two-dimensional (2D) chips2. Moreover, the long latency of memory access and the limited data bandwidth between memory and computing units have become a key limiting factor of system performance, known as the von Neumann bottleneck.

The two different types of RRAMs on the 2nd and 3rd layers were intentionally chosen to fulfill their functions as analog CIM and digital memory, respectively. To further verify the structural integrity, cross-sectional TEM analyses of the RRAMs in the last two layers were carried out after the fabrication of the M3D-LIME chip. Figure 2c shows the TEM image of the analog RRAM in the CIM layer, in which the material stack of TiN/HfAlOx/TaOx/TiN is clearly resolved. Figure 2d shows the cross-sectional TEM image of the digital RRAM in the TCAM layer, which has a different material stack of Pd/TaOx/Ta2O5/Pt to achieve a large on/off ratio of HRS/LRS. To reveal the different resistive switching characteristics of these two RRAMs, direct current (DC) I-V tests were carried out. Both RRAMs were tested as 1T1R cells, and the results are shown in Fig. 2e. The Ta2O5-based RRAM exhibited abrupt resistive switching in the set and reset processes with a large on/off ratio of HRS/LRS > 300, which is favorable for memory applications such as TCAM and buffer. Meanwhile, the HfAlOx-based RRAM exhibited more gradual resistive switching that enables excellent multi-level programming capability, which is considered favorable for CIM applications44.

Furthermore, array-level electrical measurements were conducted on the fabricated 1k-bit analog RRAM array for CIM, and the detailed schematic of the array is shown in Supplementary Fig. 1. Figure 2f shows the analog resistive switching characteristics with good linearity and symmetry, where 20 1T1R cells were measured under a series of set and reset pulses with a width of 50 ns. The set, reset and read voltages are 1.6 V, 2.6 V and 0.2 V, respectively. Figure 2g shows the cumulative probability distribution of the CIM array with 32 equally distributed conductance states, equivalent to 5-bit programming precision. Here 128 1T1R cells in each state were programmed using the standard write-verify scheme, as shown in Supplementary Fig. 2. In addition, the results of array-level retention and endurance tests are shown in Supplementary Fig. 3, where excellent endurance over \(10^{6}\) cycles and data retention exceeding \(10^{4}\) s were achieved. These results confirm the superior analog switching characteristics of the HfAlOx-based RRAM array for CIM.
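The write-verify scheme used here is detailed in Supplementary Fig. 2 and is not reproduced in this section. As a rough illustration of the general idea only, the following Python sketch programs a toy cell to each of 32 equally spaced levels by alternating pulses and read-backs. The class `ToyAnalogCell`, the normalized conductance range [0, 1], the step size, the noise model and the tolerance are all hypothetical choices made for this example, not device parameters from this work.

```python
import math
import random


class ToyAnalogCell:
    """Hypothetical analog cell: each pulse shifts a normalized conductance by a noisy step."""

    def __init__(self, g=0.5, step=0.01, rng=None):
        self.g = g                          # normalized conductance in [0, 1]
        self.step = step                    # nominal change per pulse (assumed)
        self.rng = rng or random.Random(0)  # seeded for reproducibility

    def pulse(self, direction):
        """direction=+1 models a set pulse (conductance up), -1 a reset pulse (down)."""
        delta = self.step * (1.0 + self.rng.uniform(-0.5, 0.5))
        self.g = min(1.0, max(0.0, self.g + direction * delta))

    def read(self):
        return self.g


def write_verify(cell, target, tol=0.01, max_pulses=500):
    """Pulse, read back, and repeat until the cell is within tol of target.

    Returns the number of pulses applied, or -1 if the budget is exhausted.
    """
    for n_pulses in range(max_pulses + 1):
        error = cell.read() - target
        if abs(error) <= tol:
            return n_pulses
        cell.pulse(-1 if error > 0 else +1)
    return -1


N_LEVELS = 32
targets = [k / (N_LEVELS - 1) for k in range(N_LEVELS)]  # equally spaced states
bits = math.log2(N_LEVELS)                               # 32 states correspond to 5 bits
```

In this toy model the tolerance window (2 × `tol`) is wider than the largest possible step, so a pulse cannot jump across the window and the loop always terminates inside it. A real device would need a scheme matched to its own switching statistics.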

Figure 2h illustrates the implementation of MVM on the array, where the weight matrix is first mapped onto the CIM array as the RRAM conductance and the vector is applied as the voltage pulse inputs to the BLs. By virtue of Ohm’s law and Kirchhoff’s current law, the current outputs on the SLs represent the MVM results36. To demonstrate this process on the fabricated 1k-bit CIM array, an 8×128 weight matrix used later in the feature extraction was first mapped onto the CIM array as shown in Fig. 2i, and the corresponding mapping error is shown in Fig. 2j. The conductance values of the RRAM cells were proportional to the magnitude of the elements in the weight matrix, which were quantized into 4 bits. After the weight mapping, a series of 8-bit vectors, whose elements followed Gaussian distributions, were input into the RRAM array to perform MVM operations. Following Kirchhoff’s current law and Ohm’s law, the RRAM array completed the MVM calculations and output a series of 128-bit vectors. All the elements of these vectors, which represent the inner products, were normalized and plotted against their theoretically expected results in Fig. 2k. The linear fitting suggests good consistency between them, with a high R-square of 0.96. This result confirms the feasibility of performing MVM operations efficiently on the analog RRAM array for CIM. The detailed experimental method is described in Supplementary Note 1, and the characterizations of the CIM array in Fig. 2 were performed using on-chip control circuits (including the WL address decoder and switch) with the assistance of an off-chip test system.
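The MVM principle described above can be sketched numerically. The following Python example is a simplified model under stated assumptions, not the measurement procedure of Supplementary Note 1: weights are taken as non-negative and normalized, the mapping error is modeled as Gaussian noise with a hypothetical `noise_sigma`, and the helper names (`quantize_weights`, `crossbar_mvm`, `r_squared`, `demo`) are introduced only for illustration.

```python
import random


def quantize_weights(weights, bits=4):
    """Quantize weight magnitudes to 2**bits - 1 uniform steps, normalized to [0, 1]."""
    levels = 2 ** bits - 1
    w_max = max(abs(w) for row in weights for w in row) or 1.0
    return [[round(abs(w) / w_max * levels) / levels for w in row] for row in weights]


def crossbar_mvm(conductance, voltages):
    """Output current of column j: I_j = sum_i V_i * G_ij.

    Each cell contributes V_i * G_ij (Ohm's law) and the currents sharing a
    column add up (Kirchhoff's current law).
    """
    n_cols = len(conductance[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductance))
            for j in range(n_cols)]


def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def demo(seed=0, noise_sigma=0.02):
    """Compare ideal and noisy MVM outputs for one 8-element input on an 8x128 array."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(128)] for _ in range(8)]
    ideal_g = quantize_weights(weights, bits=4)
    # Assumed mapping error: additive Gaussian noise on each conductance.
    mapped_g = [[g + rng.gauss(0.0, noise_sigma) for g in row] for row in ideal_g]
    voltages = [rng.gauss(0.0, 1.0) for _ in range(8)]  # normalized Gaussian inputs
    expected = crossbar_mvm(ideal_g, voltages)
    measured = crossbar_mvm(mapped_g, voltages)
    return r_squared(expected, measured)
```

Calling `demo()` returns the R-square between the noise-free and noisy outputs for one input vector. The value depends entirely on the assumed noise level and should not be compared directly with the measured 0.96.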

TCAM array with digital Ta2O5-based RRAM and CNTFET

In the 3rd layer of the M3D-LIME chip, CNTFETs and Ta2O5-based digital RRAMs were used to fabricate the 2T2R TCAM array, which performs template storage and matching for the one-shot/few-shot learning task. The fabrication process was optimized and is elaborated in the Methods section; one of the key challenges is to achieve high device performance within the thermal budget. The TEM structural analysis of the Ta2O5-based RRAM was already presented in Fig. 2. Figure 3a further shows a false-color SEM image of a 1\(\times\)5 2T2R TCAM array for electrical measurements. To build this array, the fabrication process of back-gated CNTFETs followed by Ta2O5-based RRAMs was carefully optimized with a maximum temperature of 250 °C. The fabricated CNTFET achieved an on-state current density Ion/W of up to 60 μA/μm, and the electrical data are shown in Supplementary Fig. 4. The high current density is favorable for driving the Ta2O5-based RRAM and achieving a high on/off ratio of HRS/LRS for memory applications. Before the fabrication of the Ta2O5-based RRAM, the CNTFETs need to be carefully passivated to preserve their high performance. Here 10 nm-thick yttrium (Y) was first deposited on the CNTFETs and then naturally oxidized into Y2O3 to yield an excellent interface with CNT47. A bilayer oxide of 35 nm Al2O3/10 nm HfO2 was then deposited by ALD at 250 °C to fully passivate the devices. The statistical results of 500 passivated CNTFETs are shown in Supplementary Fig. 5, with an average Ion/W of 28 μA/μm at VDS = −1 V and a high Ion/Ioff ratio close to \(10^{5}\).

Fig. 3: Characterizations of digital RRAM-based TCAM array.

a False-color SEM image of a 1\(\times\)5 2T2R TCAM array fabricated with CNTFET and Ta2O5-based digital RRAM. Scale bar: 50 μm. b Illustration of Hamming distance calculation using 2T2R TCAM. c The discharging waveform of ML as a function of the number of mismatched bits Nmis (from 0 to 5). d Histogram of τ measured from 100 2T2R TCAM arrays with 1-bit mismatch (Nmis = 1). e Fitting of the measured τ versus Nmis. The purple dots are the experimental data measured from 5 TCAM arrays. Inset: \(\tau^{-1}\) exhibits a linear dependence on Nmis. f Hamming distance (predicted) calculated by TCAM as a function of the actual Hamming distance (expected). The blue line plots the linear fitting with a high R-square of 0.9996, demonstrating proper function of the Hamming distance calculation.

After that, Pd/TaOx/Ta2O5/Pt digital RRAMs were fabricated on top of the drain terminals of the CNTFETs to complete the 2T2R TCAM array. Here the sputtered 10 nm Ta2O5 and 20 nm TaOx served as the RSL and the oxygen reservoir layer, respectively, to achieve a large HRS/LRS ratio. After the etching of the RRAM stack, 45 nm-thick Al2O3 was deposited by ALD at 150 °C to further passivate and protect the RRAM during operation. The fabricated 1\(\times\)5 2T2R TCAM array consists of five 2T2R unit cells and a pre-charge transistor. Figure 3b shows its circuit diagram, which employs CNTFETs to build the pre-charge circuitry, along with an illustration of its use for calculating the Hamming distance. The sense amplifiers are not integrated on-chip in this study. The TEs of all the RRAMs are wired together as the match line (ML), which is periodically charged and discharged under the control of the pre-charge transistor. Meanwhile, the drain electrodes of all the CNTFETs are wired together as the SL, while the gate electrodes of the CNTFETs in a 2T2R pair are connected to the search lines (SEL and \(\overline{\mathrm{SEL}}\)). The working principle for data storing and searching is illustrated in Supplementary Table 1. The template vector is stored in the TCAM while the extracted feature vector is input via SEL and \(\overline{\mathrm{SEL}}\) to calculate their Hamming distance. The discharging resistance of the ML is determined by the number of mismatched bits (Nmis), giving rise to a discharging time constant (τ) that is inversely proportional to Nmis (\(\tau^{-1} \propto N_{\mathrm{mis}}\)). When calculating, the ML is first charged by the pre-charge transistor, and the extracted feature vector is then input. After turning off the pre-charge transistor and discharging the ML for a given time, the Nmis that indicates the Hamming distance can be calculated by measuring the voltage of the ML.
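The inverse proportionality between τ and Nmis can be motivated with a simple parallel-resistor picture. The derivation below is a sketch under assumptions not stated in the text: every mismatched cell contributes the same low discharge resistance \(R_{\mathrm{mis}}\), every matched cell the same high resistance \(R_{\mathrm{match}}\), all \(N\) cells discharge the ML in parallel, and the ML is modeled as a single capacitance \(C_p\).

\[
\frac{1}{R_{\mathrm{discharge}}} = \frac{N_{\mathrm{mis}}}{R_{\mathrm{mis}}} + \frac{N - N_{\mathrm{mis}}}{R_{\mathrm{match}}},
\qquad
\tau = R_{\mathrm{discharge}}\, C_p
\]

\[
\tau^{-1} = \frac{1}{C_p}\left(\frac{N_{\mathrm{mis}}}{R_{\mathrm{mis}}} + \frac{N - N_{\mathrm{mis}}}{R_{\mathrm{match}}}\right)
\approx \frac{N_{\mathrm{mis}}}{R_{\mathrm{mis}}\, C_p}
\quad \text{for } R_{\mathrm{match}} \gg R_{\mathrm{mis}},\; N_{\mathrm{mis}} \ge 1
\]

Under these assumptions, the larger the resistance ratio between matched and mismatched cells, the smaller the second term and the closer \(\tau^{-1}\) comes to being strictly linear in Nmis.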

The cross-sectional TEM image of a 1T1R half-cell in the TCAM layer is shown in Supplementary Fig. 6. Excellent endurance > 5\(\times 10^{5}\) cycles and retention exceeding \(10^{4}\) s at 120 °C were also verified. Furthermore, the characterization of 2T2R TCAM cells in the TCAM layer shown in Supplementary Fig. 8 reveals a large resistance ratio >300× between the matched and mismatched cells, which is stable over a large number of search cycles (>\(10^{5}\)). To further demonstrate the proper function of the TCAM array, electrical measurements were carried out to complete the search operation. As an example, a binary vector of “00000” was stored in the TCAM array by programming each 2T2R cell according to Supplementary Table 1. The source of the pre-charge transistor was fixed at a voltage of 1 V, and the ML was initially charged to 1 V by applying a voltage of -3 V to the gate. When calculating, the 5 SEL and 5 \(\overline{\mathrm{SEL}}\) lines were biased at 1 V or -3 V according to the input vector. After that, the pre-charge transistor was turned off by applying a gate voltage of 1 V, and the ML then started to discharge with a time constant depending on Nmis (\(\tau \propto N_{\mathrm{mis}}^{-1}\)). Figure 3c plots the discharging waveform measured by an oscilloscope. To measure the discharging waveform for different Nmis, the input vector was varied from “00000” to “11111”, with the corresponding Nmis increasing from 0 to 5. Thanks to the optimized 2T2R array with a large resistance ratio between the matched and mismatched cells, the discharging waveforms for different Nmis were well separated, leading to an accurate calculation of the Hamming distance. The plot in Supplementary Fig. 9 confirms the inversely proportional relationship between τ and Nmis, which also verifies the proper search function of the RRAM-based TCAM. The detailed experimental method is described in Supplementary Note 2.
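The search procedure above (charge the ML to 1 V, apply the query, let the ML discharge, then read its voltage) can be sketched as follows. This is a minimal model assuming an ideal single-exponential RC discharge and a negligible discharge for a full match. The helper names, the sampling time `T_SAMPLE` and the 1-bit-mismatch time constant `TAU_1` are illustrative values chosen for this example, not taken from the measurement setup.

```python
import math

V0 = 1.0          # initial ML voltage in volts (the ML is pre-charged to 1 V)
TAU_1 = 8.0e-6    # assumed time constant for a 1-bit mismatch, in seconds
T_SAMPLE = 2.0e-6 # assumed delay between releasing the ML and sampling it


def time_constant(n_mis, tau_1=TAU_1):
    """Time constant for n_mis mismatched bits, assuming tau is proportional to 1/n_mis."""
    if n_mis == 0:
        return math.inf  # idealization: a full match does not discharge the ML
    return tau_1 / n_mis


def ml_voltage(t, n_mis, tau_1=TAU_1, v0=V0):
    """ML voltage after discharging for time t: v0 * exp(-t / tau)."""
    tau = time_constant(n_mis, tau_1)
    if math.isinf(tau):
        return v0
    return v0 * math.exp(-t / tau)


def estimate_n_mis(v_ml, t=T_SAMPLE, tau_1=TAU_1, v0=V0):
    """Invert the discharge model: n_mis = tau_1 * ln(v0 / v_ml) / t, rounded to an integer."""
    return round(tau_1 * math.log(v0 / v_ml) / t)
```

For instance, `estimate_n_mis(ml_voltage(T_SAMPLE, 3))` recovers 3 in this idealized model. With real waveforms, noise and the finite matched-cell resistance would set how many mismatch levels can be told apart at a given sampling time.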

To further validate the proper function of template matching, multiple TCAM arrays were measured for statistical analysis. Figure 3d plots the histogram of τ for 100 2T2R TCAM arrays with 1-bit mismatch. For this measurement, the 2T2R cells in the arrays were programmed to store the vector “00000”. After that, a randomly selected vector from “10000” to “00001” with Nmis = 1 was input to each array. By measuring the discharging resistance (Rdischarge) of each array with a read voltage of 0.15 V, τ can be quickly calculated as τ = Rdischarge \(\times\) Cp, where Cp is the parasitic capacitance. The statistical results indicate that the measured τ values exhibit a relatively narrow distribution, with an average of 8.2 μs and a standard deviation of 0.48 μs by Gaussian fitting. Furthermore, the discharging waveforms of TCAM arrays with Nmis increasing from 0 to 5 were also measured in the same way as in Fig. 3c to obtain a statistical result of τ. Figure 3e plots the extracted τ and \(\tau^{-1}\) (inset), and the fitting curve confirms the linear dependence of \(\tau^{-1}\) on Nmis. Based on these experimental results, we further simulated the calculation of Hamming distance for a larger TCAM array with 128-bit vector inputs, as shown in Fig. 3f. The Hamming distances calculated by the TCAM array achieve excellent consistency with the theoretical values, as confirmed by the linear fitting with a high R-square of 0.9996. These results demonstrate that the fabricated Ta2O5-based digital RRAM arrays can successfully fulfill the desired TCAM operations.

Implementation of One-Shot Learning

Based on the above measurement results of each functional layer, one-shot/few-shot learning was further implemented on the M3D-LIME chip to evaluate its system-level performance. Here a widely used MANN with the hybrid memory architecture is adopted, whose structure and key parameters are shown in Supplementary Fig. 10 and Supplementary Table 2. Using M3D-LIME, the CNN for feature extraction is implemented by the 2nd layer of CIM arrays with the help of the buffer arrays in the 3rd TCAM layer and the data interface in the 1st CMOS layer. Meanwhile, the templates are stored in the 3rd layer of TCAM arrays, while template matching is performed by massively parallel search operations. Figure 4a illustrates the data flow for implementing one-shot/few-shot learning with this MANN. In this task, large amounts of data need to be frequently transferred among the three functional layers, which can take advantage of the high on-chip bandwidth enabled by the ultra-dense ILVs of M3D.

Fig. 4: Implementation of one-shot learning on M3D-LIME chip.

a Illustration of data flow on the M3D-LIME chip for implementing one-shot/few-shot learning. b Schematic of the implementation of one-shot/few-shot learning. It consists of two steps. For feature extraction, a CNN is implemented by the CIM arrays to perform MVM calculations with the help of the data interface in the 1st layer of Si CMOS logic and the 1T1R memory arrays for buffer in the 3rd layer. Template storage and matching are performed by the TCAM in the 3rd layer. c The classification accuracy of one-shot/few-shot learning on the Omniglot dataset using GPU and the M3D-LIME. The accuracy is the average of 5 randomly selected classes (5-way) in the dataset. d Benchmark of the energy efficiency of the M3D-LIME chip and GPU. e Benchmark of the execution time on the M3D-LIME and 2D chip baseline.

Figure 4b shows the detailed implementations of feature extraction and template matching using M3D-LIME. For feature extraction, input data are first stored in the input buffer of the CIM array and then fed into the array via the data interface, such as a digital-to-analog converter (DAC). After that, the key operation of MVM is performed by the CIM array, and the output is read out by the data interface, such as an analog-to-digital converter (ADC). Other operations such as max pooling and ReLU activation can be implemented by the CMOS logic in the 1st layer using standard Si circuit design, as previously demonstrated48,49,50. The results are stored in the output buffer to complete the calculation of one layer in the CNN. This process is repeated until the entire CNN is completed to extract features from a given image, which are then quantized into binary feature vectors. For template matching, binary feature vectors extracted from training are stored in the RRAM-based TCAM arrays and then compared with a given input feature vector for inference. Here the Hamming distance is calculated via a large-scale parallel search operation on the TCAM, as experimentally illustrated in Fig. 3. To classify a new query, the category with the minimum Hamming distance (i.e., the slowest discharging ML) is taken as the final classification result.
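The template-matching step can be written down in a few lines of software. The Python sketch below assumes a simple threshold for the binary quantization (the text does not specify the quantization rule) and a hypothetical `templates` dictionary that maps each class label to its stored binary vectors. It searches the templates sequentially, whereas the TCAM evaluates all stored rows in parallel.

```python
def binarize(features, threshold=0.0):
    """Quantize real-valued features to bits; the threshold rule is an assumption."""
    return [1 if f > threshold else 0 for f in features]


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify(query, templates):
    """Return (label, distance) of the stored template closest to the query.

    templates maps a class label to a list of stored binary vectors:
    one vector per class for one-shot learning, several for few-shot learning.
    """
    best_label, best_dist = None, None
    for label, vectors in templates.items():
        for vec in vectors:
            dist = hamming_distance(query, vec)
            if best_dist is None or dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist
```

As a usage example, with `templates = {"a": [[0, 0, 0, 0, 0]], "b": [[1, 1, 1, 1, 1]]}` the call `classify([0, 1, 0, 0, 0], templates)` selects class `"a"` at a distance of 1, which corresponds in hardware to the ML that discharges most slowly.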

Following the above workflow, the network is used to implement one-shot/few-shot learning on the Omniglot dataset and to benchmark the system performance of M3D-LIME. As shown in Fig. 4c, accuracies of 89% and 96% are achieved in the 5-way 1-shot and 5-shot learning, respectively. Both values are close to those obtained by GPU (93% and 98%). Furthermore, the execution time and energy consumption were also evaluated to demonstrate the advantages of M3D-LIME compared to GPU and to the 2D chip architecture illustrated in Supplementary Fig. 11. The benchmark results in Figs. 4d and 4e show that M3D-LIME could achieve 18.3× higher energy efficiency than GPU (Nvidia Tesla V100 as a commonly used reference for benchmark37, 51), as well as a 2.73× speedup over its 2D counterpart. The detailed execution time and energy efficiency benchmarks are shown in Supplementary Tables 3 and 4, respectively. The reported energy efficiency value of the Nvidia Tesla V100 GPU, which has been extensively used as a standard reference in numerous prior studies, is included for comparison in the energy efficiency benchmark against GPUs. The pipeline implementation of MANN is also illustrated in Supplementary Fig. 12. For the 2D chip, the buffer is realized by a global cache near the CIM array, and data are transferred between the cache and the CIM array via a bus, where the bus bandwidth limits the number of CIM arrays computing in parallel and introduces additional data transfer latency. By contrast, in the M3D-LIME chip, the buffer of each CIM array is realized by a local RRAM array located directly above it. As a result, data can be transferred more efficiently between the CIM array and the buffer directly via the high-density ILVs, which helps significantly reduce the latency.