Introduction

At the core of today’s technological challenges is the ability to process information with ever greater speed and accuracy. Despite the large-scale success of deep learning approaches in producing exciting new possibilities1,2,3,4,5,6,7, such methods generally rely on training large neural network models, which poses severe limitations on their deployment in the most common applications8. In fact, there is a growing demand for developing small, lightweight models that are capable of fast inference as well as fast adaptation, inspired by the fact that biological systems such as the human brain accomplish highly accurate and reliable information processing across different scenarios at only a tiny fraction of the energy that would be needed by large neural networks.

As an alternative direction to the current deep learning paradigm, research into so-called neuromorphic computing has been attracting significant interest9. Neuromorphic computing generally focuses on developing novel types of computing systems that operate at a fraction of the energy of current transistor-based computers, often deviating from the von Neumann architecture and drawing inspiration from biological and physical principles10. Within the broader field of neuromorphic computing, an important family of models known as reservoir computing (RC) has progressed significantly over the past two decades11,12. RC conceptualizes how a brain-like system operates, with a core three-layer architecture (see Box 1 and Box 2): an input (sensing) layer, which receives information and performs some pre-processing; a middle (processing) layer, typically defined by some nonlinear recurrent network dynamics with the input signals acting as stimulus; and an output (control) layer, which recombines signals from the processing layer to produce the final output. Reminiscent of many biological neuronal systems, the front end of an RC network, comprising its input and processing layers, is fixed and non-adaptive and transforms the input signals before they reach the output layer; in the final, output part of an RC, the signals are combined in some optimized way to achieve the desired task. An important aspect of the output layer is its simplicity: typically a weighted sum is sufficient. This is reminiscent of how common mechanical and electrical systems operate, with a complicated core that works internally and a control layer that enables simple adaptation to the specific application scenario.

Can such an architecture work? This question was explored in the early 2000s by Jaeger (echo state networks (ESNs)11) and Maass (liquid state machines (LSMs)12), achieving a surprisingly high level of prediction accuracy in systems that exhibit strong nonlinearity and chaotic behavior. These two initially distinct lines of work were later reconciled into a unified reservoir computing framework by Schrauwen and Verstraeten13, explicitly defining a new area of research that touches upon nonlinear dynamics, complex networks and machine learning. Research in RC over the past twenty years has produced significant results in mathematical theory and computational methods as well as experimental prototypes and realizations, as summarized in Fig. 1. Despite successes in those respective directions, large-scale, industry-wide adoption of RC, or broadly convincing “killer applications” beyond synthetic and laboratory experiments, has yet to materialize. This is not due to a lack of potential applications. In fact, thanks to its compact design and fast training, RC has long been sought after as an ideal solution for many industry-level signal processing and learning tasks, including nonlinear distortion compensation in optical communications, real-time speech recognition and active noise control, among others. For practical applications, an integrated RC approach is much needed and can hardly be derived from existing work that focuses on either the algorithm or the experiment alone. This perspective offers a unified overview of the current status of theoretical, algorithmic and experimental RC research, to identify critical gaps that prevent industry adoption of RC and to discuss remedies.

Fig. 1: Selected research milestones of RC encompassing system and algorithm designs, representing theory, experimental realizations as well as applications.

For each category, a selection of representative publications is highlighted.

Theory and algorithm design of RC systems

The core idea of RC is to design and use a dynamical system as a reservoir that adaptively generates a basis of signals according to the input data and combines them in some optimal way to mimic the dynamic behavior of a desired process. From this angle, we review and discuss important results on representing, designing and analyzing RC systems.

Mathematical representation of an RC system

The mathematical abstraction of an RC can generally be described in the language of dynamical systems, as follows. Consider a coupled system of equations

$$\left\{\begin{array}{l}\Delta x=F(x;u;p),\\ y=G(x;u;q).\end{array}\right.$$
(1)

Here the operator Δ acting on x becomes \(\frac{{\rm{d}}x}{{\rm{d}}t}\) for a continuous-time system, x(t + 1) − x(t) for a discrete-time system, and a compound of these two operations for a hybrid system. Additionally, \(u\in {{\mathbb{R}}}^{d}\), \(x\in {{\mathbb{R}}}^{n}\), and \(y\in {{\mathbb{R}}}^{m}\) are generally referred to as the input, internal state and output of the system, respectively, with vector field F, output function G and parameters p (fixed) and q (learnable) representing their functional couplings. Once the system is set up by fixing the vector field F, the output function G and the parameters p, one can use the RC system to perform learning tasks, typically on time-series data. Given a time series \({\{z(t)\in {{\mathbb{R}}}^{m}\}}_{t\in {\mathbb{N}}}\), an optimization problem is usually formulated to determine the best q:

$$\mathop{\min }\limits_{q}\int_{t}\left(\parallel G(x(t);u(t);q)-z(t){\parallel }^{2}+\beta R(q)\right)\,{\rm{d}}t,$$
(2)

where R(q) is a regularization term.

Also, when z(t) is seen as a driving signal, the optimization problem can be regarded as a driving-response synchronization problem of finding appropriate parameters q14. Since RC is often simulated on classical computers, the most commonly used RC formulation takes discrete time steps:

$$\left\{\begin{array}{l}x(t+1)=(1-\gamma )x(t)+\gamma f(Wx(t)+{W}^{(in)}u(t)+b),\\ y(t)={W}^{(out)}x(t),\end{array}\right.$$
(3)

which is a special form of (1), but with the time steps and network parameters expressed more explicitly. In this form, f is usually a component-wise nonlinear activation function (e.g., \(\tanh\)), the input-to-internal and internal-to-output mappings are encoded by the matrices W(in) and W(out), and the internal network is represented by the matrix W. The additional parameters b and γ are used to ensure that the dynamics of x is bounded, non-diminishing and (ideally) exhibits rich patterns that enable later extraction. Given some training time series data {z(t)} (assumed to be scalar for notational convenience), once the RC system is set up by fixing the choice of f, γ, b, W(in) and W, the output weight matrix W(out) can be obtained by minimizing a loss function. A commonly used loss function is

$${W}^{(out)\top }=\arg \mathop{\min }\limits_{w}\left(\parallel Xw-z{\parallel }^{2}+\beta \parallel w{\parallel }^{2}\right),$$
(4)

where \(X={(x{(1)}^{\top },x{(2)}^{\top },\ldots,x{(T)}^{\top })}^{\top }\), z = (z(1), z(2), …, z(T)) and β ∈ [0, 1] is a prescribed parameter. This problem is a special form of Tikhonov regularization and yields the explicit solution \({W}^{(out)\top }={\left({X}^{\top }X+{\beta }^{2}I\right)}^{-1}{X}^{\top }z\).
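
To make this formulation concrete, the following minimal sketch implements the discrete-time update of Eq. (3) and the closed-form Tikhonov solution of Eq. (4) using NumPy. The reservoir size, leak rate, regularization strength and the toy one-step-ahead prediction task are illustrative assumptions, not recommended settings.

```python
# Minimal discrete-time RC (Eq. (3)) with a ridge-trained readout (Eq. (4)).
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 200, 1, 2000                    # reservoir size, input dim, length
gamma, beta = 0.3, 1e-6                   # leak rate and Tikhonov parameter

# Toy task: predict a scalar signal one step ahead.
u = np.sin(0.1 * np.arange(T + 1))[:, None]
z, u = u[1:], u[:-1]                      # targets z(t) = u(t+1)

# Fixed, random input and internal couplings (the default RC setting).
W_in = rng.uniform(-0.5, 0.5, size=(n, d))
W = rng.normal(size=(n, n))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius
b = rng.uniform(-0.1, 0.1, size=n)

# Run the reservoir, Eq. (3), and stack the internal states into X.
x, X = np.zeros(n), np.empty((T, n))
for t in range(T):
    x = (1 - gamma) * x + gamma * np.tanh(W @ x + W_in @ u[t] + b)
    X[t] = x

# Closed-form Tikhonov solution of Eq. (4) for the readout weights.
W_out = np.linalg.solve(X.T @ X + beta**2 * np.eye(n), X.T @ z)
print("training RMSE:", float(np.sqrt(np.mean((X @ W_out - z) ** 2))))
```

In practice the first few states are usually discarded as a transient washout before solving Eq. (4), so that the fit does not depend on the arbitrary initial condition x(0).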

Common RC designs

Design is a crucial step toward obtaining a powerful RC network, yet there are still no complete guidelines for constructing an optimal RC for a given need. With the unified forms of Eqs. (1) and (2) in mind, a standard RC system as initially proposed keeps everything random and fixed, including the input and internal matrices W(in) and W, and leaves the parameters γ and β to be chosen according to heuristic rules. Starting from this default setting, different RC designs can generally be interpreted as optimizing one or several of the following parts.

The first direction is RC coupling parameter search, with the goal of selecting a good, potentially optimal, coupling parameter γ that keeps the RC dynamics bounded and produces rich patterns, so that the internal states form a signal basis that can later be combined to approximate the desired series {z(t)}. Empirical studies have shown that choosing γ so that the system operates around the edge of chaos15 typically produces the best outcome, which is supported by a necessary but not sufficient condition imposed on the largest singular value of the effective stability matrix Wγ = (1 − γ)I + γW.

The second direction is RC output training, whose design commonly involves two aspects. One is determining the right optimization objective, for instance the one in Eq. (4); common generalizations include changing the norms used in the objective, in particular the term ∥w∥, to enforce sparsity, or imposing additional prior information by replacing β∥w∥ with ∥Lw∥ for some matrix L encoding that prior information. The other is, once the objective is chosen, determining its parameters, e.g., β in Eq. (4). Although no theoretically guaranteed optimal choice exists in general, several common methods can be used, e.g., the cross-validation techniques well developed in the literature on computational inverse problems (a minimal selection sketch is given below).

The third direction is RC network design, which aims to determine a good internal coupling network W and is crucial for the dynamic characteristics of the reservoir. This direction has received much attention and many novel proposals, including structured graphs with random as well as non-random weights16,17, and networks that are layered and deep or hierarchically coupled18,19,20. Furthermore, these designs are sometimes themselves coupled with the way the input and output parts of the system are used, for example in solving partial differential equations (PDEs)21,22 or in representing the dynamics of multivariate time series23.

Finally, RC input design, although it received relatively little attention until recently, turns out to play a very important role in the system’s performance. Input design is generally interpreted to include not only the design of the input coupling matrix W(in) but also potentially some (non)linear transformation of the input u(t) and/or the target variable z(t) prior to setting up the rest of the RC system. The so-called next-generation RC (NG-RC) is one such example24, demonstrating the great potential of input design for improving the data efficiency (less data required for training) of an RC.
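
As an illustration of the output-training step, the sketch below selects the Tikhonov parameter β of Eq. (4) by simple hold-out validation over a candidate grid, using the closed-form ridge solution. The grid, the chronological 80/20 split and the synthetic placeholder data are assumptions made purely for illustration.

```python
# A hedged sketch of choosing the Tikhonov parameter beta in Eq. (4) by
# hold-out validation; the candidate grid and the 80/20 chronological split
# are illustrative choices, not recommendations from the literature.
import numpy as np

def ridge_readout(X, z, beta):
    """Closed-form solution of Eq. (4)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + beta**2 * np.eye(n), X.T @ z)

def select_beta(X, z, betas, train_frac=0.8):
    """Return the candidate beta with the smallest validation error."""
    cut = int(train_frac * len(z))
    X_tr, z_tr, X_va, z_va = X[:cut], z[:cut], X[cut:], z[cut:]
    errors = [np.mean((X_va @ ridge_readout(X_tr, z_tr, b) - z_va) ** 2)
              for b in betas]
    return betas[int(np.argmin(errors))]

# Usage with synthetic placeholder "reservoir states" X and targets z.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 50))
z = X @ rng.normal(size=(50, 1)) + 0.01 * rng.normal(size=(1000, 1))
print("selected beta:", select_beta(X, z, [1e-8, 1e-6, 1e-4, 1e-2, 1.0]))
```

The split is kept chronological (no shuffling) because reservoir states are temporally correlated; more elaborate cross-validation schemes can be substituted in the same place.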

In addition to the separate designs of the individual parts of an RC, the concept of neural architecture search (NAS) has motivated research on hyperparameter optimization11. On the theoretical side, existing analyses of RC can be roughly grouped into several categories. A first category concerns conditions for the echo state property (ESP), i.e., the requirement that the internal state asymptotically depends only on the input history and not on the initial condition. Ref. 11 considers RC networks with sigmoid nonlinearity and unit output function and shows that if the largest singular value of the weight matrix W is less than one then the system has the ESP, whereas if the spectral radius of W is larger than one then the system is asymptotically unstable and thus cannot have the ESP. Tighter bounds were subsequently derived in ref. 41. In particular, the spectral radius condition provides a practical way of ruling out bad RCs and can be seen as a necessary condition for an RC to function properly (a small screening sketch is given below).
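
The spectral conditions above translate into a simple numerical screening step. The sketch below, intended only as a hedged illustration and not as a complete ESP test, computes the spectral radius and the largest singular value of a candidate internal matrix W and reports which of the two bounds discussed above applies.

```python
# Screening a candidate internal matrix W using the two spectral bounds
# discussed in the text: largest singular value < 1 (sufficient for the ESP
# in the setting of ref. 11) and spectral radius > 1 (rules the ESP out).
import numpy as np

def esp_screen(W):
    """Return the spectral radius, largest singular value and a verdict."""
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    sigma = np.linalg.norm(W, 2)          # largest singular value
    if sigma < 1.0:
        verdict = "sufficient condition met: ESP holds"
    elif rho > 1.0:
        verdict = "necessary condition violated: ESP ruled out"
    else:
        verdict = "inconclusive: between the two bounds"
    return rho, sigma, verdict

rng = np.random.default_rng(2)
W = 0.4 * rng.normal(size=(100, 100)) / np.sqrt(100)   # illustrative scaling
print(esp_screen(W))
```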

The second category concerns memory capacity. Defined as the sum over delays of the (squared) linear correlations between the delayed input sequence and the trained output states, the memory capacity was shown not to exceed the reservoir size N for an i.i.d. input stream42; it can be approached with arbitrary precision using simple linear cyclic reservoirs16, and it can be improved by using time delays in the reservoir neurons43 (an estimation sketch is given below).
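
A rough numerical estimate of the memory capacity can be obtained by training, for each delay k, a separate linear readout that reconstructs the delayed input u(t − k) and summing the squared correlations between targets and reconstructions. The sketch below follows this recipe under illustrative assumptions (random reservoir, uniform i.i.d. input, a fixed maximum delay); it is meant as an approximation of the quantity discussed above, not a definitive implementation.

```python
# Estimating linear memory capacity: one ridge-trained readout per delay k,
# summing the squared correlation between u(t-k) and its reconstruction.
import numpy as np

def memory_capacity(X, u, max_delay, beta=1e-6):
    n, mc = X.shape[1], 0.0
    for k in range(1, max_delay + 1):
        Xk, zk = X[k:], u[:-k]                       # pair x(t) with u(t-k)
        w = np.linalg.solve(Xk.T @ Xk + beta**2 * np.eye(n), Xk.T @ zk)
        mc += np.corrcoef(Xk @ w, zk)[0, 1] ** 2
    return mc

# Drive a small random reservoir with an i.i.d. input stream (illustrative).
rng = np.random.default_rng(3)
n, T = 50, 5000
W = rng.normal(size=(n, n))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # spectral radius 0.9
W_in = rng.uniform(-0.5, 0.5, size=n)
u = rng.uniform(-1, 1, size=T)
x, X = np.zeros(n), np.empty((T, n))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    X[t] = x
print("estimated memory capacity (bounded by n =", n, "):",
      memory_capacity(X, u, 100))
```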

Universal approximation theorems can be regarded as a category of their own. Prior to research on RC, universal representation theorems by Boyd and Chua showed that any time-invariant continuous nonlinear operator with fading memory can be approximated either by a Volterra series or, alternatively, by a linear dynamical system with nonlinear readout44. RC's representation power has attracted significant recent interest: ESNs were shown to be universal approximators for discrete-time fading-memory processes that are uniformly bounded45, and, furthermore, the approximating family can be chosen to consist of networks with the ESP and fading memory46. For discrete-time stochastic inputs, linear reservoir systems with either polynomial or neural network readout maps are universal, and so are ESNs with linear outputs under additional exponential moment constraints imposed on the input process47. Structurally stable systems can be approximated (up to topological conjugacy) by a sufficiently large ESN48. In particular, ESNs whose output weights are trained with Tikhonov regularization have been shown to approximate ergodic dynamical systems49. It has also been shown rigorously that the RC dynamics provides a higher-dimensional embedding of the input nonlinear dynamics43. In addition, explicit error bounds have been derived for ESNs and general RCs with the ESP and fading-memory properties under input sequences with given dependency structures50. Finally, based on conventional and generalized embedding theories, RCs with time delays can be constructed with significantly reduced network sizes, and can sometimes achieve dynamics reconstruction with a reservoir of a single neuron43.

The last category includes research on linear versus nonlinear transformations and next-generation RC. Focusing on linear reservoirs (possibly with pre-transformations of the input states), recent work showed that the output states of an RC can be expressed in terms of a controllability matrix together with the network-encoded inputs17. Moreover, a simplified class of RCs has been shown to be equivalent to general vector autoregressive (VAR) processes51; combined with possible nonlinear basis expansions, this equivalence forms the theoretical foundation for the recently coined concept of next-generation RC24 (a minimal sketch is given below).
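
The VAR viewpoint can be made concrete with a short sketch in which the reservoir state is simply a vector of delayed inputs augmented with their quadratic monomials, combined by a ridge-trained linear readout, in the spirit of NG-RC24. The number of delays, the quadratic feature set and the logistic-map toy series below are assumptions made purely for illustration.

```python
# A minimal nonlinear-VAR sketch in the spirit of NG-RC: features are a bias,
# k delayed inputs and their pairwise quadratic products; the readout is linear.
import numpy as np

def ngrc_features(u, k=2):
    """Stack [1, u(t), ..., u(t-k+1)] with all quadratic products of the lags."""
    T = len(u)
    lags = np.column_stack([u[k - 1 - j : T - j] for j in range(k)])
    quad = np.column_stack([lags[:, i] * lags[:, j]
                            for i in range(k) for j in range(i, k)])
    return np.hstack([np.ones((lags.shape[0], 1)), lags, quad])

# One-step-ahead prediction of a toy nonlinear series via the closed-form
# ridge solution of Eq. (4).
u = np.empty(3000)
u[0] = 0.5
for t in range(len(u) - 1):
    u[t + 1] = 3.9 * u[t] * (1 - u[t])                # logistic map generator
X = ngrc_features(u[:-1], k=2)
z = u[2:]                                             # targets u(t+1)
beta = 1e-8
W_out = np.linalg.solve(X.T @ X + beta**2 * np.eye(X.shape[1]), X.T @ z)
print("training RMSE:", float(np.sqrt(np.mean((X @ W_out - z) ** 2))))
```

Because the quadratic features contain u(t)² exactly, the logistic map is reproduced essentially perfectly here; for real data the appropriate delays and monomials are not known in advance and become part of the input-design problem discussed above.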

Research on how to design RC architectures, how to train them and why they work has, over the past two decades following the pioneering works of Jaeger and Maass, led to a much more refined view of the capabilities as well as the limitations of the RC framework for learning. On the one hand, simulation and numerical research has produced many new network architectures that improve the performance of RC beyond purely random connections; future work can either adopt a one-size-fits-all approach and investigate very large random RCs or, perhaps more likely, follow the concept of domain-specific architecture (DSA)52 to explore structured classes of RCs that achieve optimal performance for particular types of applications, with Bayesian optimization26,