Introduction

Neuromorphic devices offer the potential for a fundamentally new computing paradigm, one based on a brain-inspired architecture that promises enormous efficiency gains over conventional computing architectures1,2,3,4,5,6,7,8,9,10,11. A particularly successful neuromorphic computing approach is the implementation of spike-based neural network algorithms in CMOS-based neuromorphic hardware2,12,13,14,15,16,17. An alternate neuromorphic computing approach is to exploit brain-like physical properties exhibited by novel nano-scale materials and structures18,19,20,21,22, including, in particular, the synapse-like dynamics of resistive memory (memristive) switching4,23,24,25,26,27,28,29,30,31.

This study focuses on a class of neuromorphic devices based on memristive nanowire networks (NWNs)32,33. NWNs are composed of metal-based nanowires that form a heterogeneous network structure similar to a biological neural network34,35,36,37,38. Additionally, nanowire–nanowire cross-point junctions exhibit memristive switching attributed to the evolution of a metallic nano-filament due to electro-chemical metallisation39,40.

Fig. 2: Input–output mapping of MNIST digits in a nanowire network device.

Images of MNIST digits ‘0’ and ‘5’ averaged across 100 samples randomly selected from the training set (column 1), their corresponding input voltage streams (column 2, red) and readout voltages from multiple channels (columns 3–7, blue).

Each row in Fig. 2 shows the averaged image, input and readout data for 100 MNIST samples randomly selected from the training set for the corresponding digit class.

For each class, the readout voltages from each channel (columns 3–7, blue) are distinctly different from the corresponding input voltages and exhibit diverse characteristics across the readout channels. This demonstrates that the NWN nonlinearly maps the input signals into a higher-dimensional space. Rich and diverse dynamical features are embedded into the channel readouts from the spatially distributed electrodes, which are in contact with different parts of the network (see Supplementary Fig. S6 for additional non-equilibrium dynamics under non-adiabatic conditions). We show below how the inter-class distinctiveness of these dynamical features, as well as their intra-class diversity, can be harnessed to perform online classification of the MNIST digits.

Online learning

Table 1 presents the MNIST handwritten digit classification results using the online method (external weights trained by a recursive least squares (RLS) algorithm). Results are shown for one and five readout channels. For comparison, the corresponding classification results using the offline batch method (external weights trained by backpropagation with gradient descent) are also shown. Both classifiers learn from the dynamical features extracted from the NWN, with readouts delivered to the two classifiers separately. For both classifiers, accuracy increases with the number of readout channels, reflecting the nonlinearity embedded by the network in the readout data. For the same number of channels, however, the online method outperforms the batch method. In addition to achieving a higher classification accuracy, the online classifier W requires only a single epoch of 50,000 training samples, compared to 100 training epochs for the batch method using 500 mini-batches of 100 samples and a learning rate η = 0.1. The accuracy of the online classifier becomes comparable to that of the batch classifier when active error correction is not used in the RLS algorithm (see Supplementary Table 1). A key advantage of the online method is that continuous learning from the streaming input data enables relatively rapid convergence, as shown next.
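As an illustrative sketch only (variable names, the initialisation constant and array shapes are our assumptions, not the authors' implementation), a single error-correcting RLS update of the external weights W could look like:

```python
import numpy as np

def rls_update(W, P, x, y, lam=1.0):
    """One recursive least squares (RLS) step for the linear readout W.

    W   : (n_classes, n_features) external weight matrix
    P   : (n_features, n_features) running inverse correlation matrix
    x   : (n_features,) readout features for one streamed sample
    y   : (n_classes,) one-hot target for the digit class
    lam : forgetting factor (1.0 = remember all past samples)
    """
    Px = P @ x
    k = Px / (lam + x @ Px)           # gain vector
    e = y - W @ x                     # a-priori error (active error correction)
    W = W + np.outer(e, k)            # correct weights in proportion to the error
    P = (P - np.outer(k, Px)) / lam   # update inverse correlation estimate
    return W, P

# Single-epoch streaming: one update per training sample.
n_features, n_classes = 5 * 784, 10
W = np.zeros((n_classes, n_features))
P = np.eye(n_features) / 1e-3         # P0 = I/delta; delta is a small regulariser (assumed)
```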

Table 1 MNIST digit classification using an NWN device with one and five readout channels for online and batch (offline) linear classifiers

To better understand how learning is achieved with the NWN device, we further investigated the dependence of classification accuracy on the number of training samples and on the device readouts. Figure 3a shows classification accuracy as a function of the number of digits presented to the classifier during training (see Supplementary Fig. S7 for classification results using different electrode combinations for input/drain/readouts and different voltage ranges). The classification accuracy consistently increases as more readout samples are presented to the classifier to update W, plateauing at ≃92% after ≃10,000 samples. Classification accuracy also increases with the number of readout channels, corresponding to an increase in the number of dynamical features (i.e., 5 × 784 features per digit for 5-channel readouts; channels are added in the order 1, 2, 13, 15, 12) that become sufficiently distinguishable to improve classification. However, as shown in Fig. 3b, this increase is not linear, with the largest improvement observed from 1 to 2 channels. Figure 3c shows the confusion matrix for the classification result using 5 readout channels after learning from 50,000 digit samples. The classification results for 8 digits lie within 1.5σ (where σ = 3% is the standard deviation) of the average (93.4%). Digit ‘1’ exhibits significantly higher accuracy because of its simpler structure, while ‘5’ is an outlier because of the irregular variance of handwriting styles and low pixel resolution (see Supplementary Fig. S8 for examples of misclassified digits).
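For illustration, the stacking of per-channel readouts into the 5 × 784 feature vector could be sketched as follows (the dict-based readout layout is an assumption; the channel order follows the text above):

```python
import numpy as np

# Channels are added in the order reported above.
CHANNEL_ORDER = [1, 2, 13, 15, 12]

def build_features(readouts, n_channels):
    """Concatenate per-pixel readout voltages into one feature vector.

    readouts   : dict mapping channel id -> (784,) voltage array for one digit
    n_channels : number of readout channels to use (1 to 5)
    """
    chans = CHANNEL_ORDER[:n_channels]
    return np.concatenate([readouts[c] for c in chans])  # (n_channels * 784,)
```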

Fig. 3: Online learning of MNIST digits.

a Testing accuracy as a function of the number of training samples read out from one and five channels. The inset shows a magnified view of the converged region of the curve. b Maximum testing accuracy achieved after 50,000 training samples as a function of the number of readout channels used by the online linear classifier. Error bars indicate the standard error of the mean of 5 measurements with shuffled training samples. c Confusion matrix for online classification using 5 readout channels.

Mutual information

Mutual information (MI) is an information-theoretic metric that can help uncover the inherent information content within a system and provide a means to assess learning progress during training. Figure 4a shows the learning curve of the classifier, represented by the mean of the magnitude of the change in the weight matrix, \(\overline{|\Delta \mathbf{W}|}\), as a function of the number of sample readouts for 5 channels. Learning peaks at ≃\(10^2\)–\(10^3\) samples, after which it declines rapidly and becomes negligible by \(10^4\) samples. This is reflected in the online classification accuracy (cf. Fig. 3a), which begins to saturate by ~\(10^4\) samples. The rise and fall of the learning rate profile can be interpreted in terms of maximal dynamical information being extracted by the network. This is indicated by Fig. 4b, which presents the MI between the 10 MNIST digit classes and each of the NWN device readouts used for online classification (cf. Fig. 3). The MI values for each channel are calculated by averaging the values across the 784 pixel positions. The coincidence of the saturation in MI with the peak in \(\overline{|\Delta \mathbf{W}|}\) between \(10^2\) and \(10^3\) samples demonstrates that learning is associated with information dynamics. Note that by ≃\(10^2\) samples, the network has received approximately 10 samples for each digit class (on average). It is also noteworthy that the MI for the input channel is substantially smaller.
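One plausible way to estimate these per-channel MI values, binning each pixel's readout voltage and averaging over the 784 pixel positions, is sketched below; the binning scheme is our assumption and not necessarily the estimator used here:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def channel_mi(voltages, labels, n_bins=16):
    """Average MI (nats) between digit class and one channel's readout.

    voltages : (n_samples, 784) readout voltages for one channel
    labels   : (n_samples,) integer digit-class labels
    n_bins   : number of bins used to discretise each pixel's voltage
    """
    mi = np.empty(voltages.shape[1])
    for p in range(voltages.shape[1]):
        v = voltages[:, p]
        edges = np.linspace(v.min(), v.max() + 1e-12, n_bins + 1)
        mi[p] = mutual_info_score(labels, np.digitize(v, edges))
    return mi.mean()  # average across the 784 pixel positions
```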

Fig. 4: Learning rate and mutual information.

a Mean of the magnitude of changes in the linear weight matrix, \(\overline{|\Delta \mathbf{W}|}\), as a function of the number of samples learned by the network. b Corresponding mutual information (MI) for each of the 5 channels used for online classification (cf. Fig. 3) and for input channel 0.

Figure 5 shows MI estimated statically, i.e., from all samples combined after the whole training dataset has been presented to the network. The MI maps are arranged according to the digit classes and averaged within each class. The maps suggest that distinctive information content is extracted when digit samples from different classes are streamed into the network. This is particularly evident when comparing the summed maps for each digit (bottom row of Fig. 5). Additionally, comparison with the classification confusion matrix shown in Fig. 3c reveals that the class with the highest total MI value (‘1’) exhibits the highest classification accuracy (98.4%), while the lowest-MI classes (‘5’ and ‘8’) exhibit the lowest accuracies (89.6% and 89.5%), although the trend is less evident for intermediate MI values.

Fig. 5: Spatial distribution of mutual information (MI) for readout channels (rows), sorted by digit class (columns).

MI maps summed over all channels are shown in the bottom row, and mean MI is shown above each map.

Sequence memory task

As mentioned earlier, RC is most suitable for time-dependent information processing. Here, an RC framework with online learning is used to demonstrate the capacity of NWNs to recall a target digit in a temporal digit sequence constructed from the MNIST database. The sequence memory task is summarised in Fig. 6. A semi-repetitive sequence of 8 handwritten digits is delivered consecutively into the network in the same way as individual digits were delivered for the MNIST classification task. In addition to readout voltages, the network conductance is calculated from the output current. Using a sliding memory window, the earliest (first) digit is reconstructed from the memory features embedded in the conductance readout of subsequent digits. Figure 6 shows digit ‘7’ reconstructed using the readout features from the network corresponding to the following 3 digits, ‘5’, ‘1’ and ‘4’. See “Methods” for details.
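A minimal sketch of the sliding-window feature construction, under the illustrative assumption of a flat time series with one time step per pixel (T = 784 per digit), is:

```python
import numpy as np

T = 784  # time steps per digit (one per pixel); an illustrative assumption

def window_features(volt_ch7, conductance, start, L=4):
    """Assemble the features used to recall the earliest digit in a window.

    volt_ch7    : (seq_len * T,) voltage readout from channel 7
    conductance : (seq_len * T,) network conductance time series
    start       : index of the window's earliest digit within the sequence
    """
    i0 = start * T
    target_volt = volt_ch7[i0 : i0 + T]           # earliest (target) digit
    memory = conductance[i0 + T : i0 + L * T]     # later L - 1 digits in the window
    return np.concatenate([target_volt, memory])  # (L * T,) feature vector
```

The linear reconstruction weights are then trained on these features with the same online (RLS) method as before.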

Fig. 6: Schematic outline of the sequence memory task.

Samples of a semi-repetitive 8-digit sequence (14751479) constructed from the MNIST dataset are temporally streamed into the NWN device through one input channel. A memory window of length L (L = 4 shown as an example) slides through each digit of the readouts from 2 channels (7 and 13) as well as the network conductance. In each sliding window, the first (earliest) digit is selected for recall as its image is reconstructed from the voltage of one readout channel (channel 7) and memory features embedded in the conductance time series of later L − 1 digits in the memory window. Linear reconstruction weights are trained by the online learning method and reconstruction quality is quantified using the structural similarity index measure (SSIM). The shaded grey box shows an example of the memory exclusion process, in which columns of conductance memory features are replaced by voltage readouts from another channel (channel 13) to demonstrate the memory contribution to image reconstruction (recall) of the target digit.

Figure 7a shows the network conductance time series and readout voltages for one of the digit sequence samples. The readout voltages exhibit near-instantaneous responses to high pixel-intensity inputs, with dynamic ranges that vary distinctively among channels. The conductance time series also exhibits a large dynamic range (at least 2 orders of magnitude) and, additionally, delayed dynamics. This can be attributed to recurrent loops (i.e., delay lines) in the NWN and to memristive dynamics determined by nano-scale electro-ionic transport. The delayed dynamics demonstrate that NWNs retain memory of previous inputs (see Supplementary Fig. S9 for an example showing the fading memory property of the NWN reservoir). Figure 7b shows the respective digit images and I − V curves for the sequence sample. The NWN is driven to different internal states as different digits are delivered to the network in sequence. While the dynamics corresponding to digits from the same class show some similar characteristics in the I − V phase space (e.g., digit ‘1’), they generally exhibit distinctive characteristics due to their sequence position. For example, the first instance of ‘4’ exhibits dynamics that explore more of the phase space than the second instance. This may be attributed to differences in the embedded memory patterns: the first ‘4’ is preceded by ‘91’ while the second is preceded by ‘51’, and ‘9’ and ‘5’ have distinctly different phase-space characteristics, themselves influenced by their sequence positions as well as their uniqueness.
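The conductance time series is derived from the measured output current; a minimal sketch, assuming G(t) = I(t)/V(t) with a guard against near-zero input voltages (both choices are ours), is:

```python
import numpy as np

def conductance(i_out, v_in, v_min=1e-6):
    """Network conductance time series G(t) = I(t) / V(t).

    Time steps where the input voltage is near zero (e.g., black pixels)
    are masked as NaN rather than divided through.
    """
    v = np.where(np.abs(v_in) < v_min, np.nan, v_in)
    return i_out / v
```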

Fig. 7: NWN sequence memory.

a Conductance time series (G) and readout voltages for one full sequence cycle during the sequence memory task. For better visualisation, voltage readout curves are smoothed by averaging over a moving window of length 0.05 s and values for channel 7 are magnified by ×10. b Corresponding digit images and memory patterns in I − V phase space.

Figure 8a shows the image reconstruction quality for each digit in the sequence as a function of memory window length. Structural similarity (SSIM) is calculated using a testing group of 500 sets, and the maximum values achieved after learning from 7000 training sets are presented (see Supplementary Fig. S10 for the learning curve for L = 4 and Supplementary Fig. S11 for average SSIM across all digits). The best reconstruction results are achieved for digits ‘1’ and ‘7’, which are repeated digits with relatively simple structures. In contrast, digit ‘4’, which is also repeated but has a more complex structure, is reconstructed less faithfully. This indicates that the repeated digits produce memory traces that are not completely forgotten before each repetition (i.e., nano-filaments in memristive junctions do not completely decay). On average, the linear reconstructor is able to recall these digits better than the non-repeated digits. For the non-repeated digits (‘5’ and ‘9’), the reconstruction results are more interesting: digit ‘5’ is consistently reconstructed with the lowest SSIM, which correlates with its low classification accuracy (cf. Fig. 3c), while ‘9’ exhibits a distinctive jump from L = 4 to L = 5 (see also Fig. 8b). This reflects the contextual information used in the reconstruction: for L = 4, ‘9’ is reconstructed from the sub-sequence ‘147’, the same sub-sequence as for ‘5’, but for L = 5, ‘9’ is uniquely reconstructed from the sub-sequence ‘1475’, with a corresponding increase in SSIM. This is not observed for digit ‘5’; upon closer inspection, it appears that the reconstruction of ‘5’ suffers interference from ‘9’ (see Supplementary Fig. S12) due to the common sub-sequence ‘147’ and to the larger variance of ‘5’ in the MNIST dataset (which also contributes to its misclassification). A similar jump in SSIM is evident for the repeated digit ‘7’ from L = 2 to L = 3. For L = 3, the first instance of ‘7’ (green curve) is reconstructed from ‘51’, while the second instance (pink curve) is reconstructed from ‘91’, so the jump in SSIM from L = 2 may be attributed to digit ‘7’ leaving more memory traces in digit ‘1’, which has a simpler structure than either ‘9’ or ‘5’.
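Reconstruction and SSIM scoring could be sketched as follows, using scikit-image's structural_similarity; the linear reconstructor W_rec and the 28 × 28 reshape are illustrative assumptions:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def recall_ssim(W_rec, features, target):
    """Reconstruct the target digit from window features and score it.

    W_rec    : (784, n_features) linear reconstruction weights (trained online)
    features : (n_features,) window features (voltage + conductance memory)
    target   : (28, 28) ground-truth MNIST digit image
    """
    recon = (W_rec @ features).reshape(28, 28)
    rng = float(max(target.max(), recon.max()) - min(target.min(), recon.min()))
    return ssim(target, recon, data_range=rng)
```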

Fig. 8: Digit reconstruction in sequence memory task using the online learning algorithm.

a Maximum SSIM for each digit in the sequence as a function of the memory window length after the network learned from 7000 training sets. The testing set is comprised of 500 sequence samples. Error bars indicate the standard error of the mean across the samples within each digit class. b An example of a target digit from the MNIST dataset and the reconstructed digits using memory windows of different lengths. c Maximum SSIM with respect to the number of columns excluded in the memory feature space. The results are averaged across all testing digits using L = 4, and the standard error of the mean is indicated by the shading. Dashed blue lines indicate when whole digits are excluded.

While the SSIM curves for each individual digit in the sequence increase only gradually with memory window length, their average (shown in Supplementary Fig. S11) shows an increase up to L = 5, followed by saturation. This reflects the repetition length of the sequence.

Figure 8c shows the maximum SSIM, averaged over all reconstructed digits using L = 4, as memory is increasingly excluded from the online reconstruction. SSIM decreases as more columns of conductance features are excluded (replaced with memoryless voltage features). This demonstrates that the memory embedded in the conductance features enhances online learning by the reconstructor. In particular, the maximum SSIM plateaus when ~28 and ~56 columns (corresponding to whole digits) are excluded, and decreases significantly when the number of columns excluded is approximately 14, 42 or 70, indicating that most of the memory traces are embedded in the central image pixels.
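The memory-exclusion procedure could be sketched as below, under the assumption (consistent with 28 excluded columns spanning one whole digit) that features are stored with 28 pixels per image column:

```python
import numpy as np

def exclude_memory(mem_features, volt_ch13, n_cols):
    """Replace the first n_cols image columns of conductance memory features
    with memoryless voltage readouts from channel 13.

    mem_features, volt_ch13 : ((L-1) * 784,) arrays covering the later digits
    n_cols                  : number of 28-pixel image columns to exclude;
                              every 28 columns removes one whole digit's memory
    """
    out = mem_features.copy()
    out[: n_cols * 28] = volt_ch13[: n_cols * 28]
    return out
```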