1 Introduction

From diagnosing diseases to translating languages or forecasting the weather, machine learning has changed our lives in many ways by giving computer systems new powers. The successes achieved by machine learning systems, however, rely heavily on experienced machine learning experts who design specific machine learning pipelines. These pipelines are typically composed of various components, such as data preprocessors, feature extractors and training algorithms. In order to create high-performing models using these pipelines, machine learning experts need to deal with the complex task of selecting suitable components among many available options, as well as finding performance-optimising hyperparameter settings for each component. The many choices that have to be made in this context typically interact in ways even seasoned experts find difficult to predict; achieving good results therefore relies on experience, intuition and substantial amounts of experimentation. Furthermore, beyond the design of pipelines for offline modelling, there is a need for automatic adaptation and generation of new models in autonomous systems with machine-learning components. Such systems should be able to automatically deal with changes in their environment, as well as with modified or new system components, without the need for manually updating the underlying machine-learning pipelines.

Automated machine learning, in short AutoML, is a rapidly growing research area that strives to make machine learning available to non-machine learning experts by automating the steps that need to be taken in creating high-performance machine-learning pipelines for given use cases. The topic of AutoML is strongly tied to the broader theme of automating data science (De Bie et al. 2022), which aims at automating the full spectrum of tasks performed in the context of deriving insight from raw data. Generally, automating data science considers automation of four key steps: (1) data exploration to understand and interpret data, (2) data engineering to create a dataset ready to be fed into a machine learning pipeline through data cleaning and outlier removal, data augmentation and feature engineering, (3) model building by algorithm selection and hyperparameter optimisation, and (4) exploitation of data in decision-making. Among these, current AutoML systems mainly focus on automated model building for a given dataset. Therefore, in this survey, we mainly cover research in Step 3. More concretely, AutoML can be considered as the process of automatically selecting at least one of the following: (a) the structure of a machine learning pipeline, (b) steps of the pipeline and (c) hyperparameters of steps of the pipeline. It should be mentioned that there are automated systems, such as the Automatic Statistician (Steinruecken et al. 2019), that have focused on automating other data science tasks (e.g., automatically creating human-readable reports from raw data). However, these fall outside the scope of this article.

Foundations of modern AutoML systems were laid as early as 1976, when John Rice introduced the algorithm selection problem (Rice 1976). Earlier work on automating steps of machine learning pipelines considered algorithm selection (Ali and Smith 2006; Brazdil and Soares 2000) and hyperparameter optimisation (Bengio 2000; Bergstra and Bengio 2012) separately. The real power of AutoML was unlocked through the definition of the combined algorithm selection and hyperparameter optimisation problem, and the subsequent demonstration that this problem can be solved effectively, with the development of Auto-WEKA (Thornton et al. 2013), which used this concept to select a model from a search space consisting of classic machine learning algorithms and their hyperparameters. More recently, AutoML research has also placed a strong focus on neural architecture search (NAS) (Liu et al. 2018a), an area specifically concerned with automatically designing one class of machine-learning models, namely neural networks.

There has recently been broad and substantial impact of AutoML research within academia and industry, leading to a steep increase in research activities in the area. Results of ongoing academic research in AutoML include systems such as auto-sklearn (Feurer et al. 2015), TPOT (Olson et al. 2016a) and AutoKeras (Jin et al. 2019). Inspired by these works, many industrial AutoML platforms have also been designed for large-scale deployments, such as Google Cloud AutoML (Bisong 2019), Amazon SageMaker Autopilot (Das et al. 2020) and Oracle AutoML (Yakovlev et al. 2020). Thanks to these systems, high-performance machine learning models and pipelines can now be constructed with minimal effort from users.

Overall, we aim to provide a survey for researchers and practitioners who are new to AutoML but have a working knowledge of machine learning. We provide the extensive background needed to understand the approaches taken in research on both AutoML and NAS systems. Specifically, we

  • provide comprehensive background information needed to understand the advanced approaches taken in AutoML research;

  • cover the topic of hyperparameter optimisation, a core technique used in most AutoML systems;

  • provide an overview of the available state-of-the-art AutoML systems for classic machine learning and deep neural networks; and

  • present and discuss a list of open challenges and future research directions.

Following previous work (see, e.g., Hutter et al. 2019; Elsken et al. 2019b), we focus our discussion on three key components found in all AutoML systems: (1) the search space, which incorporates all design choices in a given machine-learning pipeline (i.e., all algorithms in a pipeline, their hyperparameters, or architectural hyperparameters of a neural network); (2) the search strategy used to find good candidate solutions within the search space, i.e., performance-optimising instantiations of the design choices; and (3) the performance evaluation (or estimation) technique used to automatically assess the performance of candidate solutions on a fraction of a given dataset.

Using these components as a structuring principle, we provide a comprehensive overview of the most relevant past and present work in AutoML. In this, we do not aim for completeness (which, in light of the large number of publications on the topic, would hardly be feasible), but rather focus on what can reasonably be considered the most important and impactful approaches. After introducing definitions of the problems encountered and solved in the area of AutoML, we discuss the three key components in more detail, followed by brief general introductions to two additional topics of broad importance to AutoML research: benchmarks and meta-learning techniques.

We review the techniques used for both AutoML and NAS systems. We start by discussing background topics for both categories of work by introducing the AutoML problem definitions, the search space and the search strategy. Next, we cover hyperparameter optimisation techniques used for tuning machine learning algorithms defined with a fixed search space and general performance estimation approaches. Following this background material, we present three categories of AutoML approaches: hyperparameter optimisation techniques, which assume that the only design choices to be made are hyperparameters of a given machine-learning pipeline or system; NAS techniques, which focus on design choices related to the architecture of deep neural networks; and broad-spectrum AutoML systems that consider design spaces containing broad classes of machine-learning algorithms, but are not specifically focused on deep neural networks. Finally, we discuss several directions for future research in the area of AutoML that we believe are particularly promising.

We note that there are several earlier survey papers on AutoML. Yao et al. (2021) and Salehin et al. (2024) provide AutoML surveys mainly focused on NAS. In contrast, we cover hyperparameter optimisation, NAS and broad-spectrum AutoML systems, emphasising connections between those sub-areas of AutoML. The survey article by Salehin et al. (2024) provides a more general overview of NAS methods, along with introductory information suitable for readers with limited machine learning knowledge (e.g., different tasks in machine learning pipelines). Compared to that survey, ours discusses relevant methods from the literature in more detail. We also provide a more detailed background on important topics, such as important benchmarks and libraries. The short survey of Zöller and Huber (2021) mainly provides background information to understand their benchmarking of a number of available AutoML systems. It covers hyperparameter optimisation techniques but does not provide much detail about the underlying search spaces or about NAS systems. The survey by Escalante (2021), as the name suggests, mainly provides a brief overview of the state of AutoML research without providing details on any of the techniques. Del Valle et al. (2023) focus on reviewing AutoML solutions specifically designed for multi-label and multi-target regression problems, while we provide a comprehensive and detailed overview of AutoML systems for a broad class of machine learning problems.

There are several other papers that focus purely on NAS (Elsken et al. 2019b; Wistuba et al. 2019), mainly covering early research and the main components of NAS. Compared to the more recent survey by Ren et al. (2022), our discussion of NAS is supported by extensive background information (e.g., hyperparameter optimisation approaches) to provide a deeper understanding of the underlying topics. There are also a number of dedicated survey articles on topics such as evolutionary NAS (Liu et al. 2021) and reinforcement learning for NAS (Jaafra et al. 2019). While we cover important work on these topics in our survey, the scope of our work is different, as we aim to cover the broader research area of AutoML; thus, we consider these surveys as complementary to our work.

Overall, we aim to provide a useful entry point into the area of AutoML for researchers and practitioners new to the area, as well as a basis for further discussion and development for AutoML experts.

2 Background

In this section, we mainly focus on introducing the concepts and techniques that are referred to in the context of AutoML later in this paper.

2.1 Problem definitions

Machine learning models are generated by machine learning algorithms, which internally optimise a number of parameters. Machine learning algorithms typically also have several hyperparameters that need to be set externally by users and control various aspects of the training process; their values usually do not change during training. For example, a neural network model can have millions of parameters, such as weights and bias values. The learning algorithm that determines these parameters has several hyperparameters, such as the learning rate, the optimisation schedule and how to perform data augmentation. Next to the choice of the algorithm, the choice of hyperparameters has a strong influence on the final performance of the model and the efficacy of the learning process.

AutoML aims to produce the best model for a dataset of interest, given a large set of machine learning algorithms and their hyperparameter spaces. There are two major goals in finding these models: (1) to obtain a model with the highest performance for a given machine learning task (e.g., classification or regression accuracy) on the dataset, and (2) to ensure that good “out of sample generalisation” of the model on unseen data is achieved. Hyperparameter optimisation (HPO) strategies facilitate reaching the first goal, while towards the second, techniques such as cross-validation are used to compare models obtained using different hyperparameter configurations.

There are three closely related AutoML problems, namely, (i) algorithm (or model) selection, which deals with multiple machine learning algorithms with fixed hyperparameter settings; (ii) hyperparameter optimisation, which deals with a single algorithm; and (iii) combined algorithm selection and hyperparameter optimisation (Thornton et al. 2013). To formally define these problems, we use the following notation:

  • \(\varvec{\Lambda } = \Lambda _1 \times \Lambda _2 \times \cdots \times \Lambda _n\) denotes the configuration space of algorithm A induced by n hyperparameters, where \(\Lambda _i\) denotes the domain of the ith hyperparameter.

  • \(\mathcal {A} = \{A^{(1)},\cdots ,A^{(R)}\}\) denotes a set of learning algorithms. The hyperparameters of each algorithm \(A^{(j)}\) are defined over the configuration space \(\pmb \Lambda ^{(j)}\).

  • \(\mathcal {D} = \{(\textbf{x}_1, y_1), \cdots , (\textbf{x}_m, y_m)\}\) denotes the training dataset, where \(\textbf{x}_i\) and \(y_i\) represent the values of the features and the target variable, respectively (we use m for the number of training examples, since n already denotes the number of hyperparameters). To perform k-fold cross-validation, \(\mathcal {D}\) is further split into k equally-sized partitions composed of training (\(\mathcal {D}_{train}^{(1)}, \cdots , \mathcal {D}_{train}^{(k)}\)) and validation sets (\(\mathcal {D}_{valid}^{(1)}, \cdots , \mathcal {D}_{valid}^{(k)}\)) such that \(\mathcal {D}_{train}^{(i)}= \mathcal {D} \setminus \mathcal {D}_{valid}^{(i)}\). Note that there is usually also a test set, but this is considered to be completely external to the selection or optimisation procedure.

  • Generalisation performance is evaluated by running training algorithm A on \(\mathcal {D}_{train}^{(i)}\) and evaluating the resulting model on \(\mathcal {D}_{valid}^{(i)}\), measuring the loss with respect to a performance metric of interest (e.g., the error rate in the case of classification), denoted by \(\mathcal {L}(A, \mathcal {D}_{train}^{(i)}, \mathcal {D}_{valid}^{(i)})\). This evaluation approach corresponds to k-fold cross-validation, which has also been the method of choice in the AutoML literature thus far. In principle, however, other validation techniques can be used.

Definition 1

(Algorithm (or Model) Selection Problem) Given a set \(\mathcal {A}\) of learning algorithms (associated with specific classes of models), the algorithm selection problem can then be defined as finding \(A^* \in \mathcal {A}\) that yields the model with the best performance on the validation set (Thornton et al. 2013):

$$\begin{aligned} A^* \in \mathop {\mathrm {arg\,min}}\limits _{A \in \mathcal {A}} \frac{1}{k} \cdot \sum _{i=1}^{k} \mathcal {L}\Big (A, \mathcal {D}_{train}^{(i)}, \mathcal {D}_{valid}^{(i)}\Big ) \end{aligned}$$
(1)

We note that the elements \(A \in \mathcal {A}\) can also correspond to the same learning algorithm with different hyperparameter settings, or even just different sequences of random numbers; equivalently, the problem can also be formalised for sets of models (irrespective of how these were obtained) rather than sets of algorithms.

Definition 2

(Hyperparameter optimisation problem) Given a learning algorithm A with hyperparameters \(\pmb {\Lambda }\), and letting \(A_{\pmb {\lambda }}\) denote that A with the hyperparameter vector \(\pmb {\lambda } \in \varvec{\Lambda }\), the HPO problem can be defined as finding \(\pmb {\lambda }^*\) that yields the model with the best performance on the validation set (Thornton et al. 2013):

$$\begin{aligned} \pmb {\lambda }^* \in \mathop {\mathrm {arg\,min}}\limits _{\pmb {\lambda } \in \varvec{\Lambda }} \frac{1}{k} \cdot \sum _{i=1}^k \mathcal {L}\Big (A_{\pmb {\lambda }}, \mathcal {D}_{train}^{(i)}, \mathcal {D}_{valid}^{(i)}\Big ) \end{aligned}$$
(2)

Definition 3

(Combined Algorithm Selection and Hyperparameter Optimisation (CASH) problem) Given a set of learning algorithms \(\mathcal {A}\), where the hyperparameters of each algorithm \(A^{(j)} \in \mathcal {A}\) are defined over the configuration space \(\pmb \Lambda ^{(j)}\), the CASH problem can be defined as finding the algorithm \(A^*\) and its hyperparameter configuration \(\pmb {\lambda }^*\) that yield the model with the best performance on the validation set (Thornton et al. 2013):

$$\begin{aligned} A^*, \pmb \lambda ^* \in \mathop {\mathrm {arg\,min}}\limits _{A^{(j)}\in \mathcal {A},\, \pmb \lambda \in \pmb \Lambda ^{(j)}} \frac{1}{k} \cdot \sum _{i=1}^k \mathcal {L}\Big (A_{\pmb \lambda }^{(j)}, \mathcal {D}_{train}^{(i)}, \mathcal {D}_{valid}^{(i)}\Big ). \end{aligned}$$
(3)

Dealing with the algorithm selection problem requires finding an approach for setting the hyperparameters of the algorithms. For instance, this can be done by using the default hyperparameter values or optimising the hyperparameters of each algorithm separately before performing algorithm selection. This approach is less relevant for modern AutoML research and thus will not be covered in this survey paper. The HPO problem is still frequently tackled in NAS systems, where only the hyperparameters of one type of learning algorithm are optimised.

HPO procedures are also used within approaches for solving the CASH problem. Section 3 covers HPO strategies, and in Sect. 4, we will discuss how these techniques are used in NAS. The CASH problem is the most general AutoML problem and typically considers a much more diverse search space based on different machine learning algorithms and other elements of a machine learning pipeline, such as preprocessors. In Sect. 5, we will review AutoML systems that are defined based on the CASH problem.
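
To make the CASH objective of Eq. (3) concrete, the following sketch enumerates a tiny, purely illustrative search space of two scikit-learn classifiers and scores each candidate configuration with k-fold cross-validation; realistic AutoML systems replace this brute-force enumeration with the search strategies discussed in the following sections, and the algorithms, hyperparameter values and dataset below are assumptions made for illustration only.

```python
# Brute-force sketch of the CASH objective (Eq. 3) on a tiny, illustrative search space.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Candidate algorithms A^(j) with small, illustrative configuration spaces Lambda^(j).
candidates = [
    (SVC, {"C": C, "kernel": k})
    for C in (0.1, 1.0, 10.0) for k in ("linear", "rbf")
] + [
    (RandomForestClassifier, {"n_estimators": n, "max_features": f})
    for n in (100, 300) for f in ("sqrt", "log2")
]

best_score, best = -float("inf"), None
for algo, params in candidates:
    # 1/k * sum_i L(A_lambda, D_train^(i), D_valid^(i)) via 5-fold cross-validation;
    # cross_val_score reports accuracy for classifiers, so we maximise it
    # instead of minimising a loss.
    score = cross_val_score(algo(**params), X, y, cv=5).mean()
    if score > best_score:
        best_score, best = score, (algo.__name__, params)

print(best, best_score)
```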

2.2 Components of AutoML

As outlined in Fig. 1, AutoML approaches can be defined in terms of three key components: (i) the search space, (ii) the search strategy (or algorithm), and (iii) performance evaluation (or estimation) (see, e.g., Hutter et al. 2019; Elsken et al. 2019b). In the following, we explain these components in more detail.

Fig. 1 Three important components of AutoML

2.2.1 Search space

The search space defines all design choices in the machine learning solution space (pipeline) that need to be made. These include the learning algorithms (and hence model classes) that can be selected and their hyperparameters (including, e.g., architectural hyperparameters of a neural network). By treating the choice of the algorithm as a hyperparameter, the CASH problem can, in principle, be tackled using HPO techniques.

There are various types of hyperparameters, including categorical and numerical hyperparameters. Categorical hyperparameters take values from a discrete set; for example, support vector machines can use linear, Gaussian or sigmoid kernels, and there is no notion of order among these values. Numerical hyperparameters, on the other hand, are defined over a numerical domain, e.g., the set of all positive integers or the set of all real values; they typically have a range predefined by the AutoML engineer. Examples are the learning rate of a neural network or the kernel width of the aforementioned Gaussian kernel. Hyperparameter optimisation methods and AutoML systems deal with these types of hyperparameters in different ways. While early research explicitly described how these methods handle the various types of hyperparameters (see, e.g., Bergstra et al. 2011; Hutter et al. 2011), in current work this is often handled implicitly, for example by using numerical encodings of categorical hyperparameters [see, e.g., auto-sklearn (Feurer et al. 2015)].

There can also be dependencies among hyperparameters in the search space: some hyperparameters are only relevant if a certain other hyperparameter is set to a given value. For example, the kernel width hyperparameter is only relevant for certain types of kernels (e.g., Gaussian, sigmoid). Similarly, in neural networks, the number of layers determines which per-layer hyperparameters are relevant. As mentioned previously, the choice of the learning algorithm can itself be modelled as a categorical hyperparameter, whose value determines which group of algorithm-specific hyperparameters is activated. Such hyperparameters are referred to as conditional hyperparameters and lead to a tree-structured search space, in which hyperparameters appearing in leaf nodes are only valid when their parent nodes take certain values. HPO techniques therefore have to treat conditional hyperparameters differently from unconditional ones.

In some cases, it is plausible to assume that a difference between small values of a given numerical hyperparameter has a higher impact on performance than the same difference between larger values. To make this concrete, the difference between a learning rate of 0.01 and 0.02 is typically more significant than the difference between a learning rate of 10 and 10.01. When, for example, a random search operator samples uniformly over such a range, all values are chosen with equal probability, neglecting this important aspect. To address this issue, a transformation can be applied to the hyperparameter range, such that sampling takes place in the transformed range. A common choice is the log transformation: values are sampled uniformly within the logarithmic space and then transformed back to the original space before being used to instantiate the configuration. This ensures that higher emphasis is placed on smaller values. Snoek et al. (2014) describe a method that automatically selects a flexible transformation from which to sample hyperparameter values.
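
As a concrete illustration of such a search space, the following minimal sketch uses the ConfigSpace library (which underlies systems such as SMAC and auto-sklearn) to declare a categorical kernel choice, log-scaled numerical hyperparameters and a conditional kernel-width parameter. The hyperparameter names and ranges are illustrative assumptions, and the snippet assumes a version of ConfigSpace in which these classes and methods are available.

```python
# Sketch of a small search space with categorical, log-scaled and conditional hyperparameters.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.conditions import InCondition
from ConfigSpace.hyperparameters import (CategoricalHyperparameter,
                                         UniformFloatHyperparameter)

cs = ConfigurationSpace(seed=0)
kernel = CategoricalHyperparameter("kernel", ["linear", "rbf", "sigmoid"])
# Numerical hyperparameters sampled on a log scale, as discussed above.
C = UniformFloatHyperparameter("C", 1e-3, 1e3, log=True)
gamma = UniformFloatHyperparameter("gamma", 1e-4, 1e1, log=True)
cs.add_hyperparameters([kernel, C, gamma])
# "gamma" is a conditional hyperparameter: only active for certain kernels.
cs.add_condition(InCondition(gamma, kernel, ["rbf", "sigmoid"]))

print(cs.sample_configuration(3))
```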

2.2.2 Search algorithm

The search algorithm specifies how the search space is explored in order to optimise the performance of the resulting model. Early approaches considered search spaces that were small enough to permit exhaustive enumeration. However, realistic applications of AutoML typically give rise to infinite search spaces induced by dozens of hyperparameters. While, in principle, these can still be explored using random search or grid search (after discretisation), in practice, far better results can usually be obtained using more sophisticated search strategies. In Sects. 3.2 and 4.2, we will discuss various search algorithms that have been proposed for optimising the hyperparameters of machine-learning algorithms and pipelines.

2.2.3 Performance evaluation

The search algorithm iteratively provides a series of candidate solutions (i.e., learning algorithms and/or hyperparameter settings) that could be used to solve the given machine-learning task. In order to determine which of these is the best, their performance needs to be evaluated. In addition, the final result (i.e., model) produced by an AutoML system also needs to be evaluated. It is important to ensure that the data used for this latter, outer validation is disjoint from that used for the former, inner validation, since otherwise the degree to which the model generalises to new data is likely to be overestimated. Therefore, as mentioned in Sect. 2.1, standard validation based on a single split into training and testing data does not suffice for the general purpose of AutoML. This has been taken into account in defining the \(\mathcal {L}\) function in the definitions of Sect. 2.1.

The most commonly employed method for dealing with this situation splits the data \(\mathcal {D}\) that is used by the AutoML system to produce a final model into a training set (\(\mathcal {D}_{train}\)) and a validation set (\(\mathcal {D}_{valid}\)). Any model considered by the search algorithm underlying the AutoML system is then trained on the training set and evaluated on the validation set, as mentioned earlier in Sect. 2.1. This is typically done using a k-fold (inner) cross-validation scheme, by splitting \(\mathcal {D}\) into k equally-sized partitions composed of training (\(\mathcal {D}_{train}^{(1)}, \cdots , \mathcal {D}_{train}^{(k)}\)) and validation sets (\(\mathcal {D}_{valid}^{(1)}, \cdots , \mathcal {D}_{valid}^{(k)}\)) such that \(\mathcal {D}_{train}^{(i)}= \mathcal {D} \setminus \mathcal {D}_{valid}^{(i)}\).

This overall procedure is sometimes referred to as nested evaluation or nested cross-validation (Varma and Simon 2006). In the remainder of this survey, we focus on the inner validation, which forms a key component of the AutoML system.
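
As an illustration, the following sketch nests a hyperparameter search inside an outer cross-validation loop using scikit-learn; the model and grid are illustrative assumptions. The inner GridSearchCV plays the role of the inner validation discussed above, while the outer cross_val_score estimates the generalisation performance of the whole selection procedure on data it never saw during the search.

```python
# Nested (cross-)validation: inner model selection, outer generalisation estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}  # illustrative grid

inner_search = GridSearchCV(SVC(), param_grid, cv=5)      # inner validation
outer_scores = cross_val_score(inner_search, X, y, cv=5)   # outer validation
print(outer_scores.mean())
```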

2.3 Benchmarks

In (automated) machine learning, it is quite common to develop a method that does not only solve a single problem but is meant to solve a wide range of problems. For example, the methods described in Sect. 3 all address HPO, which can be applied to many datasets. In order to justify claims about the performances of these systems, researchers and practitioners rely on benchmarks. Having good benchmarks is important, as they form the basis for a fair and accurate assessment of the state of the art, which in turn provides the basis for determining which approaches can readily be deployed in practice and for deciding which methods receive attention in terms of further research. In Sects. 3 and 4, we will give an overview of important and widely used benchmarks for HPO and NAS, respectively.

2.4 Meta-learning

The main goal of meta-learning is to observe how various machine-learning approaches perform on a range of different datasets and to use the meta-data collected from these observations to learn new tasks faster (Vanschoren 2018; Brazdil et al. 2022).

In the context of AutoML, meta-learning is typically used for warm-starting the search process by recommending good initial hyperparameter settings [as used, e.g., in Auto-sklearn (Feurer et al. 2022) and OBOE (Yang et al. 2019)]. There are also forms of meta-learning that immediately warm-start the parameters of a model [i.e., the weights of a neural network; see, e.g., MAML (Finn et al. 2017) and Reptile (Nichol et al. 2018)]. Other work studies the importance of hyperparameters across datasets (Hutter et al. 2014; van Rijn and Hutter 2018), whereas further works aim to automatically construct a good search space based on historical data (Hoos 2012; Wistuba et al. 2015a; Pfisterer et al. 2021; Perrone et al. 2019). In this survey, we will review what we consider the most impactful work in this direction.

3 Hyperparameter optimisation

We note that relatively little is known about the search landscapes in which hyperparameter optimisation and other AutoML tasks take place. Recently, Pushak and Hoos (2022) have found evidence that most AutoML loss landscapes only have a small number of strongly interacting hyperparameters, giving rise to a largely convex, unimodal structure. The idea of using efficiently computable proxies for an expensive objective function underlies several classes of HPO methods (notably, Bayesian optimisation) and has also been explored in engineering optimisation (Long et al. 2022) and image processing (Tseng et al. 2019). Furthermore, exploratory landscape analysis has recently been used for identifying efficiently computable proxy functions for HPO (Schneider et al. 2022). While all of these efforts aim at improving our understanding of HPO (and, more generally, AutoML) landscape properties, they have yet to produce major practical impact in terms of guiding the design of search spaces or search strategies.

3.1 Search space

Important design criteria for the search spaces underlying HPO are the set of hyperparameters to be optimised, the dependencies between these hyperparameters (i.e., conditional hyperparameters), the ranges of values to be considered for numerical hyperparameters, and potential transformations of these ranges. In Sects. 3.1.1 and 3.1.2, we describe functional ANOVA and ablation analysis, respectively, for determining the importance of hyperparameters on a single dataset, as a post-hoc analysis after HPO. Section 3.1.3 describes works that utilise this knowledge across datasets, which is an important step towards building configuration spaces. Section 3.1.4 describes a technique for transforming a hyperparameter search space. Section 3.1.5 outlines various philosophies behind designing a search space. Finally, Sect. 3.1.6 covers several methods that bring all these components together and automatically design a search space.

3.1.1 Hyperparameter importance through functional ANOVA

When designing a search space, one important consideration is to determine which hyperparameters should be included and which can safely be ignored because they hardly influence performance. Functional ANOVA is a statistical method commonly used for sensitivity analysis; it can determine the contribution of a single hyperparameter (or a combination of hyperparameters) to the overall variance of the result (Sobol 1993). It is based on the marginal of each subset of hyperparameters and determines their general influence on performance on a given dataset. It works on the premise that hyperparameters whose marginal shows high variance are important to optimise, while hyperparameters whose marginal shows low variance are less important to optimise.

The marginal of a given subset of hyperparameters specifies, for each joint setting of these hyperparameters, the expected performance obtained when averaging over all possible values of all other hyperparameters. Informally, for a single hyperparameter, the marginal performance is the performance per value, averaged over all possible values of all other hyperparameters. This can easily be visualised for a small number of hyperparameters. Figure 2 shows examples of marginals for two hyperparameters of a support vector machine (left, middle), as well as for the combination of these hyperparameters (right). The vertical axes show the expected performance of a given value when averaging over all possible values of all other hyperparameters. As such, a marginal is a much stronger statement about performance than simply varying a single hyperparameter over a given range, but it comes at the cost of additional computation.

Fig. 2 Marginals of two hyperparameters (complexity and gamma) of a support vector machine, and the combination of both. The horizontal axes represent the value of a given hyperparameter, and the vertical axis represents the marginal performance, in terms of predictive accuracy (van Rijn and Hutter 2018)

Averaging over all possible values of all other hyperparameters seems practically infeasible, as many search spaces are prohibitively large or even infinite. However, Hutter et al. (2014) showed how the marginal performance values for machine learning hyperparameters can be calculated using surrogate models (Eggensperger et al. 2015). This requires performance data for a large number of configurations, which can be obtained by evaluating these configurations on the current dataset. HPO methods typically evaluate many configurations across the search space and thus produce such performance data as a by-product. As a consequence, methods for determining hyperparameter importance can often be used in a post-hoc analysis, after an HPO procedure has been run to optimise the hyperparameters on a given dataset. The surrogate model is trained to map hyperparameter configurations to performance estimates; once trained, it can predict the performance of previously unseen hyperparameter configurations. Note that such surrogate models might not be fully accurate, and some effort should be devoted to ensuring that they are sufficiently accurate. Hutter et al. (2014) show specifically how tree-based surrogate models can be used to efficiently calculate marginals for hyperparameters, even for machine learning algorithms with infinitely large configuration spaces.

Once we have determined the marginal for each hyperparameter and combination of hyperparameters, we can use functional analysis of variance (ANOVA) to determine their importance. Functional ANOVA calculates the variance of the marginal of a given subset of hyperparameters and relates it to the overall variance in performance. This indicates how much a given subset of hyperparameters contributes to the variance in performance when taking all hyperparameters into consideration. Hyperparameters that yield marginals with high variance are deemed important and should likely be optimised; hyperparameters that yield marginals with low variance are deemed less important and can likely be skipped in the optimisation process. We note that, since hyperparameter importance is determined retroactively, after the optimisation process has been completed, this knowledge will typically be utilised for future optimisation processes. Section 3.1.3 describes how several works have done this.
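
To make the surrogate-based computation more concrete, the following simplified sketch (a Monte-Carlo approximation on synthetic data, not the exact tree-based algorithm of Hutter et al. 2014) fits a random-forest surrogate on previously evaluated configurations, estimates the marginal of each hyperparameter by averaging predictions over the other hyperparameters, and reports the share of variance explained by each marginal. The hyperparameter names and the synthetic performance function are illustrative assumptions.

```python
# Approximate hyperparameter-importance estimates from a random-forest surrogate.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Previously evaluated configurations and their validation accuracies
# (synthetic here; in practice these are a by-product of an HPO run).
rng = np.random.default_rng(0)
configs = pd.DataFrame({"log_C": rng.uniform(-3, 3, 500),
                        "log_gamma": rng.uniform(-4, 1, 500)})
accuracy = 0.9 - 0.05 * (configs["log_gamma"] + 1.5) ** 2 + rng.normal(0, 0.01, 500)

# Surrogate model mapping configurations to performance estimates.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(configs, accuracy)

samples = configs.sample(200, random_state=0).reset_index(drop=True)
total_variance = surrogate.predict(samples).var()

for hp in configs.columns:
    # Marginal of hp: average prediction over sampled values of all other hyperparameters.
    marginal = []
    for value in np.linspace(configs[hp].min(), configs[hp].max(), 20):
        grid = samples.copy()
        grid[hp] = value
        marginal.append(surrogate.predict(grid).mean())
    print(f"{hp}: approximate importance {np.var(marginal) / total_variance:.2f}")
```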

3.1.2 Hyperparameter importance through ablation analysis

Ablation analysis aims to determine the importance of hyperparameters by removing (ablating) them from the configuration space in order of importance (Fawcett and Hoos 2016). Ablation analysis is performed on the basis of two hyperparameter configurations; in the context of HPO, it typically uses the default configuration and an optimised configuration. Similar to functional ANOVA, it can therefore be used as a post-hoc analysis technique after an HPO method has determined the optimised configuration for a given dataset.

For each hyperparameter, there is a (possible) difference between the value in the default configuration and the optimised configuration. Ablation analysis starts with one of the configurations (e.g., the optimised configuration) and determines which hyperparameter would impact performance most if it were to be set to the value of the other configuration (e.g., the default configuration). It does so by evaluating all configurations that can be reached from the current configuration by changing a single hyperparameter to its default value. It then selects the hyperparameter (and corresponding configuration) that impacts performance most and continues the procedure from that configuration, determining which of the remaining hyperparameters would impact performance most when set to its default value. This is iterated until all hyperparameters have been set to their default values. In this way, a so-called ablation path is computed, i.e., a list of hyperparameters sorted by their impact on performance. This ablation path and the corresponding changes in performance can be used to assess hyperparameter importance. We note that ablation is a greedy procedure, and the impact of each parameter can depend sensitively on the order in which other parameter values have been changed.

In order to perform ablation analysis, we typically use the default configuration and an optimised configuration. The optimised configuration can be determined by running an HPO procedure. Note that, on top of this HPO procedure, the ablation analysis also evaluates additional configurations along the ablation path, which incurs additional computational cost. Biedenkapp et al. (2017) address this problem by introducing surrogate models that (similar to the surrogate models used for functional ANOVA) are trained on a limited set of performance evaluations and can estimate the performance of previously unseen configurations. In principle, the surrogate model can be trained on performance evaluations that have been obtained during HPO. This alleviates the need for additional costly performance evaluations, although the surrogate model might be somewhat inaccurate. Similar to functional ANOVA, ablation analysis is a post-hoc method that is employed after the completion of the optimisation process. The knowledge gained from ablation can be utilised in future runs of HPO methods, as detailed in the following section.
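
A minimal sketch of the greedy ablation-path computation described above is shown below; the generic `evaluate` function is a hypothetical stand-in for training and validating a model with a given configuration (or, in the surrogate-based variant, for querying a surrogate model).

```python
def ablation_path(optimised, default, evaluate):
    """Greedily flip hyperparameters from the optimised towards the default
    configuration, always choosing the flip that changes validation loss most."""
    current = dict(optimised)
    remaining = [hp for hp in optimised if optimised[hp] != default[hp]]
    path = []
    while remaining:
        baseline = evaluate(current)
        flips = []
        for hp in remaining:
            candidate = dict(current, **{hp: default[hp]})  # flip one hyperparameter
            flips.append((abs(evaluate(candidate) - baseline), hp, candidate))
        impact, hp, current = max(flips, key=lambda f: f[0])  # greedy choice
        path.append((hp, impact))
        remaining.remove(hp)
    return path  # hyperparameters ordered by their greedy impact on performance

# Hypothetical usage, with `evaluate` returning a validation loss for a configuration:
# path = ablation_path({"C": 10.0, "gamma": 0.1}, {"C": 1.0, "gamma": 0.001}, evaluate)
```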

3.1.3 Hyperparameter importance across datasets

While functional ANOVA (Sect. 3.1.1) and ablation analysis (Sect. 3.1.2) are designed as post-hoc methods that are typically utilised to determine the importance of hyperparameters on a given dataset after the HPO procedure has been run, we are often interested in patterns that generalise across datasets. In other words, given an algorithm, we would like to know what are generally the most important hyperparameters, regardless of the dataset.

van Rijn and Hutter (2018) combined the functional ANOVA framework with the experimental results that are available in OpenML (Vanschoren et al. 2013). They apply functional ANOVA on each dataset in OpenML100 (Bischl et al. 2021), using the performance results of configurations that were already available in OpenML, and determine for various learning algorithms (i.e., random forests, AdaBoost and support vector machines) the importance of the hyperparameters. This work showed that, for these algorithms, in many cases the same hyperparameters are important. As such, knowledge of hyperparameter importance on a collection of datasets is likely to transfer to other datasets. This work was later also extended to neural networks. Sharma et al. (2019) demonstrated similar results for residual neural networks, using a benchmark suite of image classification datasets, and Moussa et al. (2023) for quantum neural networks, using a benchmark suite of small datasets on which a quantum circuit can be simulated.

Alternatively, Probst et al. (2019) proposed a new hyperparameter importance measure dubbed tunability, which calculates, for each hyperparameter, the performance gain obtained by optimising it while all other hyperparameters are kept at their default values. Tunability is related to ablation analysis yet differs in important aspects. While ablation analysis sequentially and greedily calculates an ablation path and determines parameter importance based on the changes in performance observed along that sequence, tunability determines for each hyperparameter the performance increase if it were the first step of the ablation path. Probst et al. (2019) determined the tunability of six machine learning algorithms on a subset of binary classification datasets from the OpenML100. Similar to the conclusions of van Rijn and Hutter (2018), often the same hyperparameters emerged as being important.

3.1.4 Transformation of input spaces

For many HPO tasks, the hyperparameter space is defined by a range (minimum and maximum value) per numerical hyperparameter, across which values for this hyperparameter can be sampled uniformly. Sometimes, the search range is transformed a priori to influence the way in which configurations are sampled, emphasising or de-emphasising certain regions of the underlying search space. For example, a log transformation will ensure uniform sampling from a log-transformed space, emphasising lower values and de-emphasising higher values. While usually not detailed in scientific publications, an inspection of the source code reveals that AutoML systems such as auto-sklearn (Feurer et al. 2015) and ML-Plan (Mohr et al. 2018) rely on such transformations. Search space transformations are usually determined based on the experience of the user and can have a great impact on the performance of a given HPO procedure.

Snoek et al. (2014) proposed a principled approach to these transformations by introducing Bayesian input warping. Specifically, their approach transforms each hyperparameter according to the cumulative distribution function of a beta distribution; the authors refer to this transformation as warping. The beta distribution is a flexible parameterised probability distribution defined by two parameters, \(\alpha\) and \(\beta\); depending on their values, a wide variety of transformation functions can be obtained, including linear, exponential, logarithmic and sigmoid transformations. For a configuration space of n numerical hyperparameters, this results in \(n\times 2\) additional parameters that need to be optimised as well. While the exact details are beyond the scope of this survey, Snoek et al. (2014) use input warping in combination with a Gaussian process. They treat these additional parameters as hyperparameters of the covariance function, place a normally-distributed prior over them, and use a Markov chain Monte Carlo (MCMC) approach via slice sampling to optimise the covariance function. Once this optimisation process has been performed, the search space has been transformed for each hyperparameter according to an optimised value of the beta distribution, allowing for uniform sampling in the transformed input space.
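
As a small illustration of how a beta-shaped transformation changes sampling behaviour, the following sketch takes the sampling view used earlier in this section (rather than the full Gaussian-process treatment of Snoek et al. 2014): uniform samples are pushed through the inverse beta CDF, and the shape parameters, which are illustrative assumptions here rather than values learned via MCMC, control which part of the range is emphasised.

```python
# Emphasising small values of a normalised hyperparameter via a beta transformation.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)          # uniform samples on the unit interval

# alpha < 1 and beta > 1 place most of the mass on small values,
# similar in spirit to a log transformation.
warped = beta.ppf(u, a=0.5, b=2.0)

low, high = 1e-4, 1.0                 # e.g., an illustrative learning-rate range
learning_rates = low + warped * (high - low)
print(np.median(learning_rates))      # well below the midpoint of the range
```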

3.1.5 Philosophies on search space design

There are various approaches to search space design. The ‘programming by optimisation’ paradigm (Hoos 2012) is based on the idea of avoiding premature commitment to design choices; consequently, all design choices that cannot be made a priori on a strong basis are included in the search space, and it is up to the search algorithm to find good configurations within the resulting, typically vast search spaces. Various techniques have been proposed that work very well with large search spaces, including techniques that allow for fast pruning of parts of the space that are unlikely to improve over the best configurations encountered thus far (Wistuba et al. 2015a; Wang et al. 2021a), as well as Bayesian optimisation methods that have been demonstrated to find good candidate configurations relatively quickly within large search spaces (Bergstra et al. 2011; Hutter et al. 2011; Snoek et al. 2012).

In contrast, the paradigm of model-free optimisation utilises data from previous experiments to generate a set of complementary configurations that can then be evaluated sequentially on the target dataset (Wistuba et al. 2015b). This approach aims to find a small set of configurations that together work well on a range of datasets: if one configuration already achieves good performance on a cluster of datasets, further configurations are sought that work well on other clusters of datasets. Pfisterer et al. (2021) showed that the generation of an optimal set of configurations is NP-complete and extended the earlier greedy approach with surrogate models in order to make it applicable to machine learning algorithms with an arbitrary number of hyperparameters. Gijsbers et al. (2021) further extended this approach by introducing the notion of symbolic configurations, i.e., the list of configurations can also include dataset-dependent operators [such as the well-known random forest default, where the number of attributes considered at each split is equal to the square root of the total number of attributes (Breiman 2001)]. Symbolic defaults give rise to a richer and more flexible search space; as such, it is harder to find a complementary set of symbolic defaults, but once found, they can achieve better performance.

The programming by optimisation and model-free optimisation paradigms can be combined, as was done by Feurer et al. (2022) (see Sect. 5).

3.1.6 Automated search space construction

We now briefly outline a number of methods that autonomously construct a search space for HPO.

Perrone et al. (2019) proposed a method aimed at determining a good search space based on historical data. Specifically, they consider a scenario where a high number of evaluations from a large number of similar datasets is already available. The assumption is that regions in the search space that worked well for these historic datasets will also work well for the new dataset. We note that this can be considered a form of transfer learning. The authors aim to define a search space that contains the best solutions of all previous tasks. They suggest two specific forms, a low-volume bounding box and a low-volume ellipsoid, and provide optimisation methods that can utilise this historic data to construct such a search space.

Alternatively, the AMS system supports the machine learning expert in generating a search space (Cambronero et al. 2020). More specifically, it generates a search space based on a weak specification. In some cases, the machine learning expert wants to express a preference for a certain solution type, e.g., by suggesting a logistic regression model. AMS utilises (1) the source code documentation to suggest similar model types and (2) a corpus of source code files to suggest complementary preprocessing components as well as hyperparameters that should be explored. The corpus of source code files is scanned for similar pipelines, based on which other components are determined that are often used in combination with the models given in the weak specification. These are all added to the search space. The resulting search space can be explored using various search algorithms; notably, the authors demonstrate that their approach works well in combination with random search and genetic programming.

3.2 Search strategy

In this section, we describe various search strategies for HPO, focusing primarily on conceptual methods rather than complete systems, although some of the conceptual methods do have well-maintained packages available. The key difference from complete systems is that the latter often combine various conceptual methods and come with an implementation (such systems are listed in Sect. 5). We also list the packages that we are aware of. Table 1 gives an overview of the methods that we survey, organised by category.

Table 1 Overview of the methods surveyed in this section, categorised by their design philosophy

3.2.1 Grid search and random search

Grid search and random search are the earliest and simplest search techniques used for HPO. Grid search (also known as a parameter sweep) selects the best configuration by exhaustively evaluating all possible combinations of hyperparameter values. To use grid search on continuous hyperparameters, the respective domains have to be mapped to a set of discrete values. In general, grid search performs reasonably well in the low-dimensional search spaces induced by small numbers of hyperparameters. However, performing a grid search in high-dimensional search spaces or with high resolution for discretised hyperparameters involves the evaluation of very large numbers of configurations. To improve efficiency in such scenarios, Larochelle et al. (2007) proposed a multi-resolution approach in which, first, a configuration is selected from a coarse-grained configuration space, and next, a higher resolution search is performed in the vicinity of the selected configuration.

As an alternative to grid search, random search samples configurations from the search space at random; this does not require the discretisation of continuous hyperparameters. Grid and random search approaches share the advantage of simplicity, ease of implementation and excellent parallelisability. Bergstra and Bengio (2012) demonstrated that random search tends to outperform grid search in high-dimensional spaces when using the same computational budget. This is explained by the fact that each hyperparameter contributes differently to the overall loss. Grid search unnecessarily allocates too many resources to the evaluation of unimportant hyperparameters. Figure 3 provides an illustrative example comparing grid search and random search. Assuming that nine configurations are evaluated for optimising a function f, which is highly influenced by a function h, it can be seen that with random search, all nine evaluations explore distinct values over this function h, whereas grid search only explores three distinct values. Bergstra and Bengio (2012) further compared various sampling strategies for random search and found that Sobol sequences (Antonov and Saleev 1979) offer a particularly effective way to perform random sampling. Sobol sequences have later been adopted by systems such as SMAC3 (Lindauer et al. 2022).

Fig. 3 The difference between random search and grid search. Each approach performs 9 evaluations for optimising a function \(f(x,y)=g(x)+h(y)\) that depends on two parameters, x and y. The functions g(x) and h(y) are shown on the left and above the representation of the search space, respectively. Assume that \(f(x,y)\) is strongly influenced by h(y) and only weakly by g(x). Grid search only evaluates h(y) for three distinct values of y. However, with random search, all nine evaluations explore distinct values of y and hence of h(y), increasing the chance of also finding a good value of \(f(x,y)\) (Bergstra and Bengio 2012)

Random and grid search are both commonly used as baselines, and many popular machine learning libraries include implementations of these simple search techniques [see, e.g., Scikit-learn (Buitinck et al. 2013), Tune (Liaw et al. 2018), Talos (Talos 2019), and H2O (H2O.ai 2017)].
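
As a small example, the following sketch uses scikit-learn's RandomizedSearchCV to perform random search over log-scaled ranges; the model, ranges and dataset are illustrative assumptions.

```python
# Random search over log-uniform hyperparameter distributions with scikit-learn.
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
distributions = {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-4, 1e1)}

search = RandomizedSearchCV(SVC(), distributions, n_iter=30, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```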

3.2.2 Bayesian optimisation and variants

Evaluating a hyperparameter configuration can be a computationally expensive task since it requires training and validating a machine learning model. Both grid search and random search require a relatively large number of such function evaluations, especially when the number of hyperparameters is large. Bayesian optimisation (Garnett 2023; Hutter et al. 2011; Snoek et al. 2012), sometimes referred to as sequential model-based optimisation, aims to reduce the number of function evaluations by incorporating knowledge about the performance of configurations encountered earlier in the search process. It uses concepts from Bayesian statistics, in particular in the way a statistical model is used to estimate the probability distribution over the possible performance (loss) values of a previously unseen hyperparameter configuration.

Bayesian optimisation utilises knowledge about the search space by fitting a model to the data collected from previous evaluations of the objective function f; this model is often referred to as a surrogate model or empirical performance model. The two key ingredients of Bayesian optimisation are (1) the surrogate model and (2) the acquisition function. The surrogate model is a probabilistic model for approximating the objective function f that maps hyperparameter configurations to the respective performance values. Bayesian optimisation uses a prior belief about the shape of f and updates this prior with new evaluations of f (on selected hyperparameter settings) to achieve a posterior that better approximates f (Shahriari et al. 2016; Jones et al. 1998). The acquisition function is used to guide the process of sampling from the hyperparameter space in the next round by evaluating the utility of candidate hyperparameter settings based on the surrogate model. The acquisition function, in principle, uses the mean and the uncertainty of the posterior distribution derived from the surrogate model to determine which hyperparameter configuration should be evaluated next. Figure 4 shows a snapshot from the Bayesian optimisation procedure after seven iterations.

Fig. 4 Bayesian optimisation using a Gaussian process-based surrogate model and the upper confidence bound as acquisition function after seven function evaluations. The solid blue line (target) is the function that is to be optimised, and the dashed line and yellow surface are the mean performance and associated uncertainty derived from the surrogate model, respectively. There is no uncertainty at points where the target function has been evaluated. The acquisition function determines the utility of the configurations to be evaluated (Nogueira 2014)

Many different functions can be used as surrogate models and acquisition functions in Bayesian optimisation; for further details, we refer the interested reader to the survey on Bayesian optimisation by Shahriari et al. (2016). Many researchers have developed Bayesian optimisation methods for HPO (Snoek et al. 2012; Bergstra et al. 2011; Hutter et al. 2011). The key difference between these approaches lies in the choice of the surrogate model and the acquisition function.

We briefly introduce widely used surrogate models and acquisition functions below. For a more detailed discussion on Bayesian optimisation, we refer the reader to the textbook on Bayesian optimisation (Garnett 2023), which provides extensive information on the mathematical basis of Gaussian processes as surrogate models, on decision-theoretical questions (in particular, the determination of data points for evaluation), as well as on theoretical results and practical considerations. As Bayesian optimisation can be used to optimise any form of black-box function, the book also deals extensively with topics related to realistic problem settings that are also applicable to machine learning, such as uncertainty in the observation space.

Surrogate models The two most widely used surrogate models are Gaussian processes and random forests; the use of the former has been explored by various authors (Ginsbourger et al. 2010; Snoek et al. 2012; Desautels et al. 2014). A Gaussian process is a non-parametric statistical model defined by its prior mean and covariance functions. The posterior mean and covariance at any given point represent the model's prediction and the associated uncertainty. Such a model is suitable as a surrogate model in Bayesian optimisation, as it provides the uncertainty estimates that are needed for the acquisition function to choose the next evaluation point. The computational complexity of training a Gaussian process model is \(O(n^3)\), the cost of inverting its covariance matrix, where n is the number of observations. In the context of Bayesian optimisation, each observation is a function evaluation carried out earlier in the optimisation process. One of the main drawbacks of Gaussian processes thus arises from the fact that the computational cost of training increases with the number of evaluations that have been carried out. Gaussian processes with most standard kernels also do not scale well to high-dimensional data. However, methods such as sparsifying Gaussian processes (Seeger et al. 2003) have improved applicability to large datasets by reducing the rank of the covariance matrix.

Another limitation of Gaussian-process-based Bayesian optimisation methods is that they are only applicable to continuous hyperparameter spaces and cannot natively handle integer, categorical or conditional hyperparameters. Additional approximations need to be introduced to handle these commonly encountered types of hyperparameters. For instance, for integer-valued hyperparameters, continuous values are rounded to the closest integer after optimising the acquisition function. For categorical hyperparameters, an approximation based on a one-hot encoding can be used. Garrido-Merchán and Hernández-Lobato (2020) show that these approximations may lead to the failure of the Bayesian optimisation process, as they ignore that some configurations are invalid and may put probability mass on points where f cannot be evaluated. They further provide a systematic transformation of the categorical and integer variables that permits the assumption that the value of f remains constant in certain areas of the underlying search space.

An alternative class of surrogate models is based on random forest regression (Breiman 2001); these models have been demonstrated to be more scalable than Gaussian processes (Hutter et al. 2011). A random forest model creates an ensemble of decision trees that can collectively approximate the response surface of the given objective function f. Since training individual trees is parallelisable, and each tree is trained only on a sample of the data, this technique scales much better to large datasets. Furthermore, it can quite easily handle different hyperparameter types, including conditional hyperparameters. The drawbacks of random forest regression models are that their uncertainty estimates are known to be less accurate than those of a Gaussian process and that they do not extrapolate well outside the observed data points; the latter also applies to most Gaussian process models.

Acquisition functions The acquisition function determines, based on a given surrogate model, which point in the search space to evaluate next. Expected improvement is a commonly used acquisition function; it is composed of two terms relating to (1) the expected value at a given point of the function to be optimised and (2) the associated variance (or uncertainty). This combination naturally provides a trade-off between exploitation (around a promising point) and exploration (in unknown areas). By considering the expected value, the search is focused on regions of the search space containing good candidate solutions with high probability, while the uncertainty term encourages the exploration of regions that have been only weakly explored or in which candidate solutions of highly variable quality have been observed (Jones et al. 1998). Other acquisition functions include the upper confidence bound, which combines the predicted performance with the uncertainty at every query point (Hoffman and Shahriari 2014), and information-theoretic approaches, such as entropy search (Hennig and Schuler 2012).
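
For concreteness, under a Gaussian-process surrogate with posterior mean \(\mu (\pmb {\lambda })\) and standard deviation \(\sigma (\pmb {\lambda })\), and with \(\mathcal {L}_{\min }\) denoting the lowest validation loss observed so far, expected improvement for minimisation takes the well-known closed form (Jones et al. 1998):

$$\begin{aligned} \mathrm {EI}(\pmb {\lambda }) = \big (\mathcal {L}_{\min } - \mu (\pmb {\lambda })\big )\, \Phi (z) + \sigma (\pmb {\lambda })\, \phi (z), \qquad z = \frac{\mathcal {L}_{\min } - \mu (\pmb {\lambda })}{\sigma (\pmb {\lambda })}, \end{aligned}$$

where \(\Phi\) and \(\phi\) denote the cumulative distribution function and the density of the standard normal distribution; the first term rewards exploitation (low predicted loss), while the second rewards exploration (high predictive uncertainty).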

Parallelism Bayesian optimisation takes advantage of all information collected during the optimisation process by evaluating samples sequentially, one by one. This makes it much more data-efficient than the grid and random search approaches. However, as the surrogate model typically suggests only a single best configuration to evaluate next, it is not trivial to parallelise Bayesian optimisation. To address this issue, Ginsbourger et al. (2011) extend the original Bayesian optimisation framework by proposing a multi-point expected improvement criterion for the simultaneous selection of multiple points that are evaluated in parallel. In this approach, some evaluations can be performed while the results of previous evaluations are not yet fully available. This is enabled by injecting partial knowledge of ongoing evaluations into the expected improvement formulation.

Sequential model-based algorithm configuration (SMAC) (Hutter et al. 2011) is a Bayesian optimisation procedure for general algorithm configuration. SMAC uses random forests as surrogate models and expected improvement as an acquisition function. The random forest model allows SMAC to optimise conditional and categorical hyperparameters. Further improvements to SMAC have been proposed [e.g., pruning the search space to increase efficiency (Li et al. 2022)].

Tree-structured Parzen estimator (TPE) (Bergstra et al. 2011) is another approach based on Bayesian optimisation that uses expected improvement as an acquisition function and tree-structured Parzen estimators to model the probabilities and distributions of the target function. Gaussian process approaches model \(p({y \mid \pmb {\lambda }})\) directly, where \({\pmb {\lambda }}\) denotes a configuration, and y indicates the performance observed for that configuration. TPE, however, models \(p({\pmb {\lambda }}\mid y)\) and p(y). \(p({\pmb {\lambda }}\mid y)\) is modelled by replacing the distributions of the configuration prior with two non-parametric densities that make a distinction between good and bad configurations: (1) the density of configurations that have a loss below a given threshold (set to a quantile of the observed y values) and (2) the density of those with higher loss. By maintaining a sorted list of configurations in memory, the run-time of TPE scales linearly with respect to the number of hyperparameters. TPE can also be used for conditional hyperparameters and tree-structured configuration spaces. TPE is implemented in the HyperOpt library (Bergstra et al. 2013).
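
The core idea of TPE can be sketched for a single continuous hyperparameter as follows (using SciPy kernel density estimates as stand-ins for the Parzen estimators; the quantile \(\gamma\), the candidate count and the interface are assumptions of this sketch): observations are split into good and bad sets by a loss quantile, and the candidate maximising the density ratio \(l(\lambda )/g(\lambda )\) is proposed next.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(lambdas, losses, gamma=0.25, n_candidates=64):
    """Propose the next value of a single continuous hyperparameter,
    given previously evaluated values `lambdas` and their `losses`.
    Assumes both groups below are non-empty and non-degenerate."""
    threshold = np.quantile(losses, gamma)
    good = lambdas[losses <= threshold]   # low-loss configurations -> l(lambda)
    bad = lambdas[losses > threshold]     # high-loss configurations -> g(lambda)
    l, g = gaussian_kde(good), gaussian_kde(bad)
    candidates = l.resample(n_candidates).ravel()
    # Maximising l/g corresponds to maximising expected improvement in TPE.
    ratio = l(candidates) / (g(candidates) + 1e-12)
    return candidates[np.argmax(ratio)]
```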

3.2.3 Reinforcement learning-based approaches

Reinforcement learning concerns sequential decision processes over a state space (Sutton and Barto 2018). An agent (or controller) learns to act in a Markov decision process with a set of states \(\mathcal {S}\) and an action space \(\mathcal {U}\). In each state \(s_i \in \mathcal {S}\), there is a set of actions \(\mathcal {U}(s_i) \subseteq \mathcal {U}\) from which the agent can select. Taking an action \(u \in \mathcal {U}(s_i)\) triggers a transition from state \(s_i\) to state \(s_j\) with probability \(p_{s' \mid s,u}(s_j \mid s_i,u)\). The agent interacts with the environment at discrete points in time. At each time step t, the agent receives an immediate reward \(r_t\), based on its action u and the transition from \(s_i\) to \(s_j\). The goal in reinforcement learning is to determine a policy, i.e., a function that assigns to each state a probability distribution over actions, such that a cumulative reward function computed from the immediate rewards \(r_t\) (in many cases, a weighted sum over t) is maximised.

Value-based and policy-based approaches are two well-known classes of reinforcement learning methods. In value-based approaches, the agent estimates the optimal value function Q that defines which action to take in a particular state to achieve the maximum reward. The policy-based methods directly learn the optimal policy \(\pi\).

Q-learning (Watkins 1989) is an example of a value-based approach in which the agent learns a look-up table over states and actions by iteratively applying the following update rule (Baker et al. 2017):

$$\begin{aligned} Q_{t+1}(s_i,u) = (1-\alpha ) \cdot Q_t(s_i,u) + \alpha \cdot \Big (r_t + \gamma \cdot \underset{u'\in \mathcal {U}(s_j)}{\max } Q_t(s_j,u') \Big ) \end{aligned}$$
(4)

Here, \(\alpha\) is the Q-learning rate, determining the weight of new information relative to old information, and \(\gamma\) is a discount factor, which defines the weight given to rewards depending on how far in the future they will be collected. Q-learning was originally defined for discrete spaces, which makes it useful for optimising discrete (or discretised) hyperparameters. There are, however, extensions of Q-learning to continuous action spaces as well (Millán et al. 2002), which can potentially be used to optimise continuous hyperparameters.
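
A direct transcription of the update rule in Eq. (4) for a tabular setting, in which states and actions index a Q-table, could look as follows (a minimal sketch; the environment, the reward signal and the discretisation of the hyperparameters are left abstract):

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update, following Eq. (4).

    Q       : 2-D array of shape (n_states, n_actions)
    s, a    : current state and chosen action (integer indices)
    reward  : immediate reward r_t, e.g. derived from validation performance
    s_next  : state reached after taking action a in state s
    """
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```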

Policy-based approaches have better convergence properties than Q-learning approaches and are suited for higher-dimensional and continuous actions (i.e., optimising continuous hyperparameters). The policy-gradient approach is an example of a policy-based approach that has been used for HPO. The idea is to directly learn the optimal policy \(\pi (u_t \mid s_t)\) given the history of state-action pairs \((s_t, u_t)\). Policies are defined by a number of parameters \(\theta\), and the general goal is to optimise the parameters of the policy such that the total expected reward is maximised. The REINFORCE algorithm (Williams 1992) is a well-known algorithm for learning the policy parameters \(\theta\) by maximising the expected reward

$$\begin{aligned} J({\theta }) = \mathbb {E}_{P(u_{1:T};\,\theta )}[R] \end{aligned}$$
(5)

Here, \(P(u_{1:T};\theta )\) denotes the probability, under the policy parameterised by \(\theta\), of a sequence of T actions that results in reward R after T time steps.
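
The REINFORCE gradient estimate can be illustrated with a softmax policy over a small set of discrete hyperparameter choices; the sketch below (plain NumPy, with a toy reward standing in for validation accuracy) updates the policy parameters \(\theta\) in the direction of \(R \cdot \nabla _{\theta } \log \pi (u; \theta )\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_choices = 5                       # e.g. five candidate learning rates
theta = np.zeros(n_choices)         # policy parameters (softmax logits)

def reward(action):                 # toy stand-in for validation accuracy
    return 1.0 - abs(action - 3) / n_choices

for step in range(200):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()                      # softmax policy pi(u; theta)
    action = rng.choice(n_choices, p=probs)   # sample an action
    R = reward(action)
    grad_log_pi = -probs                      # d log pi(a)/d theta = 1{a} - probs
    grad_log_pi[action] += 1.0
    theta += 0.1 * R * grad_log_pi            # REINFORCE update (no baseline)

print(np.argmax(theta))                       # choice favoured by the learned policy
```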

The search for good hyperparameter settings is usually seen as a sequential decision process, in which the values of one or more hyperparameters are modified in each step. Different approaches have been taken so far to formulate the HPO problem within an RL framework. Each state is commonly defined by hyperparameters and their values. The way states and actions are defined could allow step-wise changes to only a single hyperparameter (Wu et al. 2020; Zoph and Le 2017), to a selected group of hyperparameters (Baker et al. 2017) or to all hyperparameters at once (Jomaa et al. 2019). Additionally, the states could hold other information, such as dataset meta-features or the history of evaluated hyperparameter configurations (Jomaa et al. 2019). Formulating states that allow changes to only a single hyperparameter at each step, rather than to a group, allows treating the configuration of individual hyperparameters as a sequential decision process as well. This makes it possible to implicitly account for the conditionality among hyperparameters in the search space by remembering previous decisions. It also impacts the size of the action space, which implicitly determines the search space. Let us assume that n hyperparameters are to be optimised, each with domain \(\Lambda _i\). Treating the hyperparameters individually will reduce the action space from the product space \(\varvec{\Lambda } = \Lambda _1 \times \Lambda _2 \times \cdots \times \Lambda _n\), whose size is the product of the domain sizes, to \(\Lambda _1 + \Lambda _2 + \cdots + \Lambda _n\), whose size is only their sum (Wu et al. 2020).

Hyp-RL (Jomaa et al. 2019) is a general HPO method based on Q-learning. In Hyp-RL, each action corresponds to a full hyperparameter configuration (setting all hyperparameters at once) and is rewarded with \(r_t\), based on the validation loss of the resulting model. The total reward is calculated by accumulating the validation losses of the models configured along a sequence of actions. The policy selects the actions that maximise the discounted cumulative reward through iterative Q-learning updates, starting from an initial state (i.e., hyperparameters initialised randomly or to their default values).

In the policy-based approach proposed by Zoph and Le (2017), each action corresponds to setting one hyperparameter, and a sequence of T actions in a trajectory, leading to the configuration of all hyperparameters, is rewarded with the validation accuracy R of the configured model. The optimal policy selects the trajectory that maximises the expected reward J; the policy gradient is computed to update the controller parameters \(\theta\) based on this reward.

In general, reinforcement learning methods have been used for HPO (Jomaa et al. 2019; Wu et al. 2020) and more commonly in neural architecture search (examples for the latter will be given in Sect. 4.2.1).

3.2.4 Evolutionary algorithms

Evolutionary algorithms (Simon 2013; Bäck et al. 1997) are population-based global optimisation algorithms inspired by biological evolution. They work on a set (population) of P solution candidates (individuals). Starting from an initial population, evolutionary algorithms iteratively vary the population (giving rise to a sequence of generations), as illustrated in Algorithm 1 (Bäck et al. 2018). In each generation, a population P(t) (parent individuals) of size \(\mu\) is selected, from which new individuals \(P'(t),P''(t)\) (offspring individuals) are generated using variation operators; then, the next set of individuals (survivors) is selected from the previous population and the offspring. These selected individuals form the new population for the next generation. The selection of parent and survivor individuals is typically based on the fitness values \(\textbf{F}(t)\) (i.e., objective values) of the individuals, where individuals with higher fitness are preferred over individuals with lower fitness. Typical variation operators are crossover and mutation. Crossover combines two or more parent individuals to transfer the beneficial features of the parents to the offspring. The mutation operator applies small random changes to individuals to increase the diversity within the population.

Algorithm 1 Evolutionary algorithm (pseudocode not reproduced here)
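
A minimal \((\mu +\lambda )\)-style instantiation of the generational loop in Algorithm 1, for continuous hyperparameters with uniform crossover and Gaussian mutation, might look as follows (illustrative only; real implementations add adaptive mutation rates, encodings for mixed variable types, constraint handling, etc.):

```python
import numpy as np

def evolutionary_search(f, bounds, mu=10, lam=20, generations=30, seed=0):
    """Minimise f over a box-constrained continuous space (minimal sketch)."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    pop = rng.uniform(low, high, size=(mu, len(bounds)))
    fitness = np.array([f(x) for x in pop])
    for _ in range(generations):
        # Variation: uniform crossover of two random parents, then Gaussian mutation.
        parents = pop[rng.integers(0, mu, size=(lam, 2))]
        mask = rng.random((lam, len(bounds))) < 0.5
        offspring = np.where(mask, parents[:, 0], parents[:, 1])
        offspring += rng.normal(0.0, 0.1 * (high - low), offspring.shape)
        offspring = np.clip(offspring, low, high)
        off_fitness = np.array([f(x) for x in offspring])
        # (mu + lambda) survivor selection: keep the best mu individuals.
        pool = np.vstack([pop, offspring])
        pool_fitness = np.concatenate([fitness, off_fitness])
        best = np.argsort(pool_fitness)[:mu]
        pop, fitness = pool[best], pool_fitness[best]
    return pop[0], fitness[0]

# Example: tune two hypothetical continuous hyperparameters.
best_x, best_f = evolutionary_search(lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2,
                                     bounds=[(0.0, 1.0), (0.0, 1.0)])
```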

In the context of HPO, crossover and mutation loosely correspond to exploitation and exploration, respectively: crossover recombines features of good configurations, while mutation introduces diversity. When applying an evolutionary algorithm to an optimisation problem, one has to decide how solutions are encoded as individuals. For example, an integer variable can be encoded directly as an integer, but also as a list of binary variables.

In evolutionary algorithms, each individual encodes a possible solution to the problem; the decoded solution is termed the phenotype, and its encoding the genotype (Bäck et al. 2018). Tani et al. (2021) evaluated an evolutionary algorithm and particle swarm optimisation (Kennedy and Eberhart 1995) on two HPO tasks, concluding that both can outperform random search and gradient descent. Two special types of evolutionary algorithms are frequently used in the context of HPO and AutoML: genetic programming and CMA-ES. In the following, these approaches are briefly outlined.

Genetic programming (Koza 1994) is a form of an evolutionary algorithm that evolves programs composed of functions, which work on primary inputs and/or outputs of other functions. An example of such a program could be a mathematical expression, where the functions are mathematical operators (e.g., addition, sine, logarithm), and the actual optimisation task could be to find an expression, which best fits some experimental data. Often a tree representation is used to represent programs in genetic programming. TPOT (Olson et al. 2016a) is an example of an AutoML system that uses genetic programming for the optimisation of machine learning pipelines and their hyperparameters (see Sect. 5.4 for more details).

Covariance matrix adaptation evolution strategy (CMA-ES) (Hansen et al. 2003) is an evolutionary algorithm that has been demonstrated to be very efficient for a number of black-box optimisation tasks, including HPO (Loshchilov et al. 2012; Watanabe and Roux 2014; Loshchilov and Hutter 2016; Jedrzejewski-Szmek et al. 2018). It samples candidate solutions from a multivariate normal distribution whose mean and covariance matrix are updated based on the performance of the individuals in the population. CMA-ES works well on non-linear and non-convex optimisation tasks; it is typically used for problems with search spaces of between three and a hundred dimensions. CMA-ES has shown good performance compared to other black-box optimisers, such as Bayesian optimisation, on continuous black-box optimisation benchmarks (Loshchilov et al. 2013). While Bayesian optimisation is recommended for conditional search spaces, CMA-ES is recommended if the search space only contains continuous hyperparameters and the objective function is cheap to evaluate, or the evaluation budget is large (Mendoza et al. 2016).
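
In practice, CMA-ES is usually accessed through an off-the-shelf implementation; the sketch below assumes the third-party cma Python package and its ask/tell interface, with a toy objective standing in for a validation loss:

```python
import cma  # third-party package (assumed): pip install cma

def validation_loss(x):
    # Toy stand-in for a validation loss over two continuous hyperparameters.
    return (x[0] - 0.3) ** 2 + (x[1] + 1.2) ** 2

es = cma.CMAEvolutionStrategy([0.0, 0.0], 0.5)   # initial mean and step size
while not es.stop():
    solutions = es.ask()                          # sample a population
    es.tell(solutions, [validation_loss(s) for s in solutions])
print(es.result.xbest)                            # best configuration found
```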

3.2.5 Monte Carlo tree search

Monte Carlo tree search (MCTS) is an approach for addressing state-space Markovian sequential decision problems (Kocsis and Szepesvári 2006) based on a randomised evaluation of a search tree. This approach has been successfully used in game AI to predict the best moves to reach a winning position in games (Chaslot et al. 2008a), such as Go (Silver et al. 2017) or Chess (Silver et al. 2018). For a given search problem, the MCTS algorithm builds a tree in which each node (representing a state) stores the current value of the position (usually the average of the results of the simulated games visiting this node) and the visit count of this position. The MCTS algorithm repeats the following steps (Chaslot et al. 2008b): (1) selection, i.e., traversing the tree by choosing the most promising next node (balancing exploitation and exploration based on the stored statistics), (2) expansion of the selected node by adding new child nodes to the tree, (3) simulation, or playout, i.e., finishing the game by making random decisions and assigning a reward based on how close the outcome of these random decisions was to a winning outcome, and (4) back-propagation, i.e., updating each node traversed in the tree based on the result of the simulation.

The main power of MCTS-based approaches lies in addressing sequential problems; they are thus best suited for optimising hyperparameters that represent ordered decisions, such as the hyperparameters involved in creating a pipeline with a fixed number of components (e.g., first selecting a data preprocessor, then a feature selector, and finally a machine learning algorithm). This works especially well in combination with pipelines with a fixed structure, as considered, for example, in MOSAIC (Rakotoarison et al. 2019) (see Sect. 5). When the pipeline structure is fixed, it can be represented as a search tree that can be traversed by MCTS. Internal nodes in the search tree represent partial configurations, in which only the first preprocessing operators are fixed, whereas a leaf node represents a full configuration. A surrogate model that generalises over the full configuration space, and that can also assess the quality of partial configurations (as represented by internal nodes), can be employed to determine the performance of a given node (Rakotoarison et al. 2019). This can be done, for example, by means of sampling techniques, where a pre-defined number of leaf nodes (representing full configurations) are evaluated, and their average performance represents the quality of the partial configuration. Once a playout operation has determined a suitable leaf node, the configuration belonging to that leaf node is instantiated and evaluated on the real data, the measured performance is backpropagated into the internal tree representation, and the surrogate is updated with this additional information. MCTS algorithms for HPO have so far been studied only to a limited degree.
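
The exploration-exploitation balance in the selection step is commonly implemented with the UCT rule; the sketch below (plain Python, with a hypothetical node structure exposing visit counts and accumulated rewards) selects the child maximising the average observed reward plus an exploration bonus:

```python
import math

def uct_select(children, c=1.4):
    """Select a child node according to the UCT rule.

    `children` is assumed to be a list of objects exposing `visits`
    (visit count) and `value` (sum of rewards from playouts through the node).
    """
    total_visits = sum(child.visits for child in children)

    def uct_score(child):
        if child.visits == 0:
            return float("inf")                 # always try unvisited children first
        exploit = child.value / child.visits    # average playout reward
        explore = c * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(children, key=uct_score)
```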

3.2.6 Gradient-based optimisation

The gradient descent algorithm, classically used for setting the parameters of models, can be extended to jointly optimise the hyperparameters of the algorithms as well. As mentioned before, random search has shown promising results in the context of optimising small numbers of hyperparameters. For moderately higher dimensions, more complex methods (e.g., Bayesian optimisation) are preferred. However, when dealing with neural networks, besides a moderate number of hyperparameters, there are also millions (if not billions) of parameters to optimise (i.e., the weights and bias values). Typically, the parameters of a neural network are optimised using a gradient descent method, whereas the hyperparameters are optimised by an HPO method. However, the gradient of the optimised validation loss with respect to the hyperparameters can be estimated, allowing gradient descent methods to traverse the loss landscape with respect to hyperparameter values. Recently, gradient-based optimisation has shown promising results for optimising very large numbers of (hyper)parameters (see, e.g., Lorraine et al. 2020) and also in meta-learning (see, e.g., Rajeswaran et al. 2019).

One of the earlier works in gradient-based optimisation for differentiable and continuous HPO was proposed by Bengio (2000). This approach uses reverse mode differentiation or backpropagation and focuses on small-scale continuous HPO based on differentiable objective functions, such as squared loss and logistic loss, that can be optimised with gradient descent. One of the main limitations of this approach is that it requires intermediate variables to be maintained in memory for the reverse pass of the backpropagation procedure. This gives rise to prohibitively large memory requirements and limits the practical applicability of this method.

There has been more recent research on memory-efficient methods for approximating the hypergradients (i.e., the gradient of the validation loss with respect to hyperparameters). These methods generally fall into two categories: (1) iterative differentiation (see, e.g., Franceschi et al. 2017; Maclaurin et al. 2015), which approximates the hypergradients by defining a sequence of functions that recursively approximate one another; and (2) approximate implicit differentiation (see, e.g., Lorraine et al. 2020), which defines an implicit function for the hypergradients through applying the implicit function theorem. Empirical results from several studies indicate that implicit differentiation methods tend to be more memory-efficient (see, e.g., Grazzi et al. 2020; Rajeswaran et al. 2019).
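
As a deliberately simple, concrete example of implicit differentiation, the hypergradient of a validation loss with respect to an L2-regularisation strength can be computed in closed form for ridge regression, because the inner problem has an explicit optimality condition; the sketch below (our own toy example, not taken from the cited works) verifies the implicit-function-theorem gradient against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(30, 5)), rng.normal(size=30)

def inner_solution(lam):
    # Closed-form minimiser of the regularised training loss (ridge regression).
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    return np.linalg.solve(A, X_tr.T @ y_tr)

def val_loss(lam):
    w = inner_solution(lam)
    return 0.5 * np.sum((X_val @ w - y_val) ** 2)

def hypergradient(lam):
    # Implicit function theorem applied to the optimality condition
    # (X'X + lam*I) w = X'y, which gives dw/dlam = -(X'X + lam*I)^{-1} w.
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    dw_dlam = -np.linalg.solve(A, w)
    return (X_val @ w - y_val) @ (X_val @ dw_dlam)

lam = 0.5
finite_diff = (val_loss(lam + 1e-5) - val_loss(lam - 1e-5)) / 2e-5
print(hypergradient(lam), finite_diff)   # the two values should agree closely
```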

Gradient-based methods are only applicable to continuous hyperparameters and twice-differentiable loss functions. It is, however, possible to extend the use of these techniques to discrete hyperparameters using continuous relaxation methods (see, e.g., Jang et al. 2017). For instance, Jang et al. (2017) propose the Gumbel-Softmax estimator to represent samples of one-hot encoded categorical variables with a differentiable distribution. As explained in Sect. 4.2.5, DARTS (Liu et al. 2019b) represents categorical hyperparameters by means of continuous variables, using the softmax operation in a similar manner.
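
For illustration, a Gumbel-Softmax sample over a categorical hyperparameter with three choices can be drawn as follows (plain NumPy; the temperature controls how close the relaxed sample is to a one-hot vector):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    """Draw a differentiable, approximately one-hot sample from a categorical
    distribution parameterised by (unnormalised) log-probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel_noise) / temperature
    scores -= scores.max()                       # numerical stability
    return np.exp(scores) / np.exp(scores).sum()

# Example: relaxed sample over three candidate optimiser types.
print(gumbel_softmax_sample(np.log(np.array([0.5, 0.3, 0.2]))))
```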

3.3 Performance evaluation techniques

HPO techniques evaluate various configurations from the search space and, based on these evaluations, in each step, select the one deemed to be most promising. As was explained in Sect. 2.2.3, this is often done using nested cross-validation (Varma and Simon 2006): the dataset is split into a training, a validation and a testing partition. All configurations are trained on the training set and evaluated on the validation set. The testing partition is set aside and only used for the final evaluation of the best configuration. We thus find evaluation procedures at an inner and outer level. The HPO method employs the evaluation procedure at the inner level to assess how well a specific configuration will perform on the given dataset. Eventually, the evaluation procedure at the outer level assesses how well the HPO method as a whole works. The inner evaluation procedure is under the control of the (user of the) HPO method, whereas this is not the case for the outer evaluation procedure. The main focus of this section is on the inner evaluation procedure. Evaluating various candidate configurations can be computationally expensive, and therefore several procedures have been developed to speed up this process. We will review three general types of procedures: racing methods, methods that evaluate at lower budgets, and learning curve extrapolation methods.

3.3.1 Multi-objective evaluation metrics and trustworthiness

In the performance evaluation process, the most critical question concerns the choice of the performance metric to be optimised for. Naturally, many HPO methods optimise for accuracy or the area under the ROC curve (see, e.g., Hutter et al. 2011; Li et al. 2017; Huisman et al. 2023); however, it is also possible to consider additional measures in a multi-objective fashion (Karl et al. 2022). Computational cost is a prominent factor to take into consideration in hyperparameter optimisation, both because of environmental concerns (Tornede et al. 2023b) and because machine learning models are increasingly deployed on devices with limited compute power (Evchenko et al. 2021). HPO techniques iterate over a range of possible configurations and are, therefore, typically computationally expensive. However, they can still search for a model that incurs low cost at inference time. When a model has low complexity, it generally requires less computation when making predictions; depending on the number of predictions made in production, this can drastically reduce the overall compute or power consumption. Lu et al. (2020) and Zhang et al. (2018) measure this computational cost in FLOPs. Other measures that can be taken into consideration are:

  • Class probabilities and uncertainty quantification Many machine learning models provide a measure of confidence per prediction, quantifying how certain the model is that an observation belongs to a certain class. However, it has been noted that these probabilities are not always well-calibrated; de Menezes et al. (2023) survey techniques that can be used to better calibrate such uncertainty estimates. König et al. (2020) utilise HPO to build a wrapper around classifiers that improves their uncertainty quantification.

  • Robustness Machine learning models, in particular deep learning models, are vulnerable to adversarial attacks (Goodfellow et al. 2015). An adversarial attack makes small perturbations to the input that force the model to misclassify an otherwise correctly classified observation. Various methods have been proposed to train models that are robust against input perturbations (Bai et al. 2021). This robustness can also be quantified by a measure called adversarial accuracy (Zhang et al. 2019b). To quantify the robustness of a model, formal verification methods have been proposed (Meng et al. 2022; Botoeva et al. 2020; Wang et al. 2021c). As these techniques are computationally expensive, it is hard to integrate them into HPO methods. However, it is possible to utilise HPO to speed up robustness verification (König et al. 2022, 2023), or to aid the selection of machine learning models that are both accurate and robust (Bosman et al. 2023).

  • Fairness and bias mitigation As the outcome of machine learning methods directly involves and affects human lives, care should be taken to ensure that the outcomes are fair, although the exact notion of fairness depends on the context of the decision situation. While one of the key elements to this is human agency and oversight (European Commission High Level Expert Group AI 2018), various measures can be optimised to steer the modelling in a useful direction, such as equalised odds or equality of opportunity (Hardt et al. 2016). Weerts et al. (2023) give an overview of challenges and opportunities of maintaining fairness in AutoML.

For a complete overview of evaluation metrics to consider and of multi-objective HPO, the reader is referred to the work by Karl et al. (2023).

Learning curve-based cross-validation (LCCV) is an extension of cross-validation that takes into account the learning curve of a given hyperparameter configuration. LCCV considers all configurations in order and works with the concept of the best configuration encountered so far. The main assumption underlying this approach is that learning curves are convex, and its authors provide empirical evidence that this holds for the observation-based learning curves of many algorithms on most datasets. Using this convexity assumption, an optimistic estimate is made of the maximum performance a given configuration can reach at a certain budget, similar to the formulations of Sabharwal et al. (2016). Using this optimistic estimate, LCCV determines whether a given configuration will still be able to improve over the currently best configuration; if this is not the case, the configuration can be discarded prematurely. Thus, similar to racing, LCCV takes a rather conservative approach and only discards a configuration when it is rather certain that it will not improve over the currently best configuration.

3.4 HPO systems and libraries

In this section, we review various well-known systems and libraries that can be used for HPO. We note that there is some overlap with the works that are described in the previous sections; here, our focus is on the implementation of the underlying methods and practical considerations regarding their use. Our descriptions are based on those given in scientific publications. We note that in some cases, development efforts may have continued, leading to improvements in functionality and usability.

SMAC (Hutter et al. 2014; Lindauer et al. 2022) is a general-purpose algorithm configurator and HPO system based on Bayesian optimisation. One of the distinguishing features of SMAC is its use of a random forest as the underlying surrogate model, rendering the Bayesian optimisation procedure broadly applicable to various types of search spaces. SMAC is used at the core of various widely used AutoML systems, including Auto-WEKA (Thornton et al. 2013) and Auto-sklearn (Feurer et al. 2015).

HyperOpt (Bergstra et al. 2013) implements various optimisation algorithms, including random search, TPE (Bergstra et al. 2011) and adaptive TPE. It can be parallelised using Apache Spark and MongoDB.

Spearmint (Snoek et al. 2012) is a Bayesian optimisation system. It uses Gaussian processes as a surrogate model and expected improvement as an acquisition function. Compared to vanilla Bayesian optimisation, Spearmint allows for effective parallelisation across multiple cores. Results of an empirical study comparing TPE, SMAC and Spearmint suggest that it is preferable to use SMAC and TPE when dealing with large and conditional search spaces, whereas Spearmint is recommended for low-dimensional and continuous problems (Eggensperger et al. 2013).

Optuna (Akiba et al. 2019) is an HPO system built upon TPE. Earlier optimisation frameworks, such as SMAC and HyperOpt, only allow a static definition of the hyperparameter space and cannot be used when no full description of the hyperparameter space is given by the user. Optuna, however, provides a define-by-run API that allows users to dynamically define and modify the search space. It also includes multi-fidelity strategies to speed up the optimisation process. Optuna covers several multi-fidelity strategies for performance evaluation, e.g., the asynchronous successive halving algorithm (Li et al. 2020a) and hyperband (Li et al. 2017).
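
The define-by-run interface allows conditional hyperparameters to be expressed as ordinary Python control flow, as in the minimal sketch below (the objective is a toy placeholder; only the basic Optuna API calls shown here are assumed):

```python
import optuna

def objective(trial):
    model = trial.suggest_categorical("model", ["svm", "random_forest"])
    if model == "svm":
        # This hyperparameter only exists when an SVM is chosen (conditional).
        c = trial.suggest_float("C", 1e-3, 1e3, log=True)
        score = 1.0 / (1.0 + abs(c - 1.0))       # toy stand-in for accuracy
    else:
        n_trees = trial.suggest_int("n_estimators", 10, 300)
        score = n_trees / 300.0                  # toy stand-in for accuracy
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```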

Bayesopt (Martinez-Cantin 2014) is a flexible framework that supports the optimisation of continuous, discrete and categorical hyperparameters. It allows users to select from many components relevant to Bayesian optimisation, such as the initialisation procedure, the acquisition function, the optimisation of the acquisition function, the surrogate model and the evaluation metric. Furthermore, Bayesopt implements an improvement to calculate the covariance matrix of Gaussian processes that, at each iteration, determines how many new elements will appear in the covariance matrix and how many will remain the same. This reduces the computational complexity of computing this matrix from \(O(n^3)\) to \(O(n^2)\).

RoBO (Klein et al. 2017b) is a package that implements various HPO algorithms based on Bayesian optimisation, including several methods focusing on multi-fidelity and auxiliary tasks, such as multi-task Bayesian optimisation (Swersky et al. 2013), BOHAMIANN (Springenberg et al. 2016), and Fabolas (Klein et al. 2017a).

BOHB (Falkner et al. 2018) implements, in addition to the equally named BOHB algorithm, various relevant baseline methods, such as successive halving and hyperband. The BOHB package supports parallel computing and aims to address various practical problems that arise when running HPO algorithms in parallel on multiple CPUs.

MOE (Yelp 2014) (metric optimisation engine) is an HPO package based on Bayesian optimisation implemented in Python. It uses expected improvement as an acquisition function and Gaussian processes as a surrogate model.

Scikit-optimize (Head et al. 2017) implements a sequential model-based approach to optimisation. It supports several methods, including sequential optimisation with decision trees/gradient boosted trees and Bayesian optimisation with Gaussian processes. For acquisition functions, it supports expected improvement, lower confidence bound, and probability of improvement.

Optunity (Claesen et al. 2014) covers several of the previously mentioned optimisers for HPO, including grid search, random search, particle swarm optimisation and CMA-ES.

Syne Tune (Salinas et al. 2022) is a package for distributed HPO. It includes a range of optimisers (including random search, Bayesian optimisation and evolutionary search) and multi-fidelity approaches (including BOHB and hyperband). In addition, it also supports more advanced methods for hyperparameter transfer learning, constrained HPO and multi-objective optimisation.

3.5 HPO benchmarks

The common way to create an HPO benchmark is to select a collection of datasets and apply a set of state-of-the-art methods to these. For a long time, this has been the standard practice in the field. However, there are also drawbacks to this approach: Bischl et al. (2021) argued that it makes it hard to compare results across studies, gives rise to cherry-picking of results and leads to publication bias. They addressed this by utilising OpenML to create benchmark suites, i.e., collections of tasks designed to thoroughly evaluate (machine learning) algorithms. They further argue that good benchmarking is done on a large collection of tasks. They integrate their approach with OpenML (Vanschoren et al. 2013), in order to make use of the API and ecosystem of OpenML, making it easy for users to create benchmark suites and to run algorithms of interest on these. By defining a standard set of benchmark tasks, the research community is encouraged to use the same type of benchmarks for similar types of algorithms. This makes it easier to compare results across studies and harder to cherry-pick results. Of course, there can be valid reasons to use a specific subset of benchmarks, for instance, when a new algorithm is conjectured to address issues arising in the context of solving the respective tasks.

Bischl et al. (2021) introduce two benchmark suites, OpenML100 and its successor, OpenML-CC18. The OpenML Curated Classification 2018 suite (OpenML-CC18) is a collection of machine learning tasks that have been carefully selected and curated from OpenML. For example, for each of the tasks, it has been verified that a corresponding scientific publication (or other documentation) exists, that the original task is indeed a classification task (and not, for example, a binarised regression task), and that the data is real-world data (to exclude artificial datasets).
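
Benchmark suites can be retrieved programmatically via the openml Python package, as in the sketch below (the numeric identifier 99 for OpenML-CC18 is an assumption that should be checked against the OpenML website):

```python
import openml

# Fetch the OpenML-CC18 benchmark suite and iterate over a few of its tasks.
suite = openml.study.get_suite(99)   # 99: assumed identifier of OpenML-CC18
for task_id in suite.tasks[:3]:
    task = openml.tasks.get_task(task_id)
    print(task_id, task.get_dataset().name)
```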

Gijsbers et al. (2019) developed the AutoML benchmark by extending OpenML-CC18. The AutoML benchmark focuses on bigger and more complex datasets than OpenML-CC18 and might for that reason be a better fit for benchmarking AutoML systems. Beyond an extended benchmark suite, the authors also developed a platform to which AutoML systems can be uploaded; these are then automatically evaluated on the benchmark and compared against previously uploaded systems.

Eggensperger et al. (2021) developed HPOBench, a workbench for benchmarking HPO methods. HPOBench provides several HPO problems, including the optimisation of the hyperparameters of a support vector machine, gradient boosting, a random forest and a multilayer perceptron. HPOBench supports multi-fidelity approaches (see Sect. 3.3), which makes it a good choice for evaluating such methods. HPOBench includes so-called raw, tabular and surrogate benchmarks. Whereas a raw benchmark contains just the dataset and task definition, implying that all configurations actually need to be run (incurring considerable computational cost), a tabular benchmark contains pre-computed evaluations for all possible hyperparameter configurations in the search space, so that using it incurs only minor computational overhead. A surrogate benchmark also contains pre-computed evaluations, but does not do so for all possible configurations (as the configuration space might be too large or even infinite). Instead, a surrogate model (Eggensperger et al. 2015) is trained and subsequently used to replace the actual function evaluation. Therefore, when running a surrogate benchmark, no actual evaluations will be performed. Overall, tabular and surrogate benchmarks provide a convenient way of speeding up the development of HPO algorithms by allowing the algorithm designer to test-run the HPO algorithm on various benchmarks with little computational effort.

4 Neural architecture search

In this section, we turn our attention to AutoML systems for automatically designing deep neural network architectures. Conventionally, neural networks are represented in the form of computational graphs (Goodfellow et al. 2016) of nodes that perform operations (e.g., addition, convolution, pooling, activation) on the input they receive from their parent nodes. The architecture of a neural network represents the parents of each node (i.e., structure or node connections), as well as the operations performed by the nodes.

Training a neural network requires setting two groups of hyperparameters. The first group consists of training hyperparameters that mainly affect the training process, such as the learning rate, the optimiser type or the batch size. The second group consists of architectural hyperparameters that define the network architecture, such as the number, sizes and operations of the layers. Neural Architecture Search (NAS) research mainly considers optimising the latter category of hyperparameters; therefore, in the rest of this section, the term hyperparameter mainly denotes architectural hyperparameters.

Network architecture engineering is still a time-consuming and expensive task that needs to be performed manually by experts. NAS methods aim at finding good architectures by optimising architectural hyperparameters. Conceptually, NAS is a sub-field of AutoML that builds, to a great extent, on the hyperparameter optimisation (HPO) techniques discussed in Sect. 3. We can frame NAS as an optimisation problem with the goal of finding an architecture that achieves the best possible performance in the target task within a predefined search space. We note that each layer within the network architecture defines new hyperparameters to be set for its operations, which leads to a tree-structured search space. This makes NAS more complex than HPO for classic machine-learning algorithms without conditional hyperparameters.

Neuro-evolution of augmenting topologies (NEAT) (Stanley and Miikkulainen 2002a) is one of the earliest approaches, proposed in the early 2000s, aimed at automating the process of designing neural network architectures by jointly optimising network topology and parameters. NEAT is based on the idea of evolving the neural network architecture (structure and connection weights) using a genetic algorithm. However, being designed to work at the level of single neurons, this approach does not scale to deep network architectures with millions of neurons and different layer types. Zoph and Le (2017) were among the first to consider the idea of ‘neural architecture search’ with the goal of automating the design of modern deep neural networks. This initial attempt at NAS relied on the availability of very substantial computational resources (800 GPUs for 4 weeks). Many different approaches have since been proposed to make NAS more computationally efficient while still attaining high performance.

NAS methods can be described in terms of the three components introduced in the context of HPO in Sect. 2.2 (Elsken et al. 2019b). In NAS, search spaces can comprise architectural and training hyperparameters. The efficient design of a search space using prior expert knowledge on the type of networks that perform best for a given class of problems can largely improve the efficiency of the search. The search strategy determines how to find good hyperparameter configurations (and hence, a good architecture) within a given, possibly vast, search space. Performance estimation approaches are used to decide which configurations will achieve high performance on a given dataset without the need to perform potentially very time-consuming, full training and validation.

In the following, we review prominent NAS methods, focusing on search space design in Sect. 4.1 and on search strategies in Sect. 4.2; Sect. 4.3 covers various performance evaluation approaches proposed for NAS. Finally, we discuss available NAS libraries and benchmarks in Sects. 4.4 and 4.5, respectively.

Table 2 An overview of NAS methods categorised based on the search strategy and the search space

Table 2 gives an overview of prominent NAS methods based on their underlying search space and search strategy. As seen in the table, we distinguish micro-level, macro-level and hierarchical search space design approaches. Widely used search strategies include methods based on reinforcement learning, Bayesian optimisation, gradient-based optimisation and evolutionary algorithms. We note that there are methods, such as Lemonade (Elsken et al. 2019a) and RENAS (Chen et al. 2019c), that appear under more than one category in our table, as their search strategy or search space covers more than one approach.

In what follows, we will describe the NAS techniques listed in Table 2 in more detail.

4.1 Search space

The search space of a given NAS method represents the space of all possible neural network architectures. However, searching within a vast space of all possible hyperparameter settings is an extremely computationally expensive task. Most research in search space design has focused on imposing simplifying constraints to reduce the number of architectural hyperparameter configurations. Two aspects have been considered in designing the search space of neural network architectures: (1) the design of the global architecture of the network, and (2) the design of sub-architectures that can be repeated to create a full neural network in a modular fashion. This directly leads to the concepts of macro-level and micro-level searches. The macro-level search considers optimising the entire network by searching for the operations and connections between layers (see, e.g., Zoph and Le 2017; Pham et al. 2018; Kandasamy et al. 2018; Brock et al. 2018). The micro-level search focuses on optimising cells or blocks (see, e.g., Liu et al. 2018a, 2019b) that will be stacked to construct the final network. There are also approaches that consider a hierarchical search space (see, e.g., Liu et al. 2018b, 2019a; Zhong et al. 2018; Tan et al. 2019) by making use of a combination of the two previously mentioned approaches that leads to a hierarchically structured search space. In the following sections, we elaborate on these approaches.

4.1.1 Macro-level search spaces

Macro-level search considers generating the entire structure of the network. Designing the architecture in a layer-wise manner greatly reduces the degrees of freedom within the global search space; at the same time, it typically still allows for an arbitrary sequence of layers, each with its own architectural hyperparameters. Macro-level search spaces typically include hyperparameters such as the number of layers, as well as conditional hyperparameters, such as the type of each layer (e.g., convolution, pooling) and the hyperparameters of these operations (e.g., filter size and stride of a convolutional layer).
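
A macro-level search space of this kind can be written down explicitly; the following sketch (pure Python, with hypothetical layer types, hyperparameter names and value ranges) samples a random chain-structured architecture description and illustrates how the choice of layer type introduces conditional hyperparameters:

```python
import random

# Hypothetical layer types with their (conditional) hyperparameters and ranges.
LAYER_CHOICES = {
    "conv": {"filters": [16, 32, 64], "kernel_size": [3, 5, 7], "stride": [1, 2]},
    "pool": {"pool_size": [2, 3]},
    "dense": {"units": [64, 128, 256]},
}

def sample_architecture(max_layers=6, seed=None):
    """Sample a random chain-structured architecture description."""
    rng = random.Random(seed)
    architecture = []
    for _ in range(rng.randint(1, max_layers)):
        layer_type = rng.choice(list(LAYER_CHOICES))
        # Which hyperparameters exist depends on the chosen layer type.
        params = {name: rng.choice(values)
                  for name, values in LAYER_CHOICES[layer_type].items()}
        architecture.append({"type": layer_type, **params})
    return architecture

print(sample_architecture(seed=0))
```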

Previously, Baker et al. (2017), Zoph and Le (2017), Kandasamy et al. (2018) and Brock et al. (2018) have considered this type of search space by generating new networks (with a pre-defined maximum number of layers) in each iteration of the search process. Baker et al. (2017) generated the network architecture by iteratively searching within the space of architectural hyperparameters for each layer (e.g., number of filters, filter height, and filter width). Zoph and Le (2017) further proposed to widen the search space by allowing skip connections and branching layers. Figure 6 shows two different examples of chain-structured networks that make up this type of search space. The network shown on the left is a simple example where every layer receives its input from the previous layer and transfers its outputs to the next. The network shown on the right has a more complex structure, including multiple branches and skip connections. In a layer-wise architecture, the space of possible networks increases exponentially with the possible number of layers. FBNET (Wu et al. 2019) and ProxylessNAS (Cai et al. 2019) also support layer-wise search for blocks to increase the diversity of the networks found. Chen et al. (2019d) proposed to use eight different types of blocks based on well-known network architectures, such as ResBlocks and Inception.

4.1.3 Hierarchical search spaces

Hierarchical search spaces combine micro-level and macro-level approaches to improve layer diversity. One way to create a hierarchical search space is by means of recursion. For instance, Liu et al. (2018b) proposed an approach starting from lower-level motifs as a small set of primitive operations (convolution, depth-wise convolution, separable convolution, max-pooling) at the bottom level of the hierarchy, from which higher-level motifs are then built recursively, such that the highest-level motif corresponds to the full architecture. At each level of this hierarchy, motifs are represented in the form of DAGs. A similar approach has been proposed by Ru et al. (2020b) to create a three-level hierarchy: At the top level, there is a graph of cells. At the middle level, each cell is represented as a graph of nodes. At the bottom level, each node is represented as a graph of operations. By varying the graph generator hyperparameters at each level, a diverse range of architectures can be obtained.

MnasNet (Tan et al. 2019) makes use of a different hierarchical search space. In this approach, a full CNN architecture is factorised into different segments, each comprising a number of identical layers. For each segment, the operations and connections of a single layer, as well as the number of layers, are optimised; the optimised layer is then repeated to create the full segment. This approach was inspired by the idea that different parts of the network should be treated differently, as they play different roles in the overall accuracy and inference latency (e.g., in a CNN architecture, earlier blocks impact the inference latency more, as they process larger input sizes). Liu et al. (2019a) took a different approach, proposing a bi-level hierarchical search space that allows the network-level structure to be selected by searching within a space of popular network designs.

4.2 Search strategy

After the search space has been decided, a search strategy is needed to find the best architecture within this space. In the following, we give an overview of search strategies used in NAS.

4.2.1 Reinforcement learning

NAS can be addressed using reinforcement learning. In this case, an agent’s goal is to generate a network from the action space of the search space. As outlined in Sect. 3, the policy-gradient approach is a well-known reinforcement learning technique that relies on optimising policies with respect to rewards. Using a policy-gradient-based method for NAS to design CNNs and RNNs was first proposed by Zoph and Le (2017). In this approach, the structure and connectivity of the elements of the neural network are encoded in the form of an architectural string (as illustrated in Fig. 8). These architectural strings are composed of a set of tokens (for CNNs, these are architectural hyperparameters such as filter height, filter width, stride height, stride width, and the number of filters for one layer). The controller is implemented as an RNN that will generate a child network by predicting the hyperparameters of such an architectural string one by one and stopping when a certain pre-defined number of layers has been generated. The hyperparameters that are predicted by the controller can be considered as a list of actions performed in the design of an architecture, and the reward corresponds to the accuracy achieved by the child network. The policy gradient is calculated to update the controller using this reward signal with the aim of maximising the expected accuracy of the generated architectures. The REINFORCE algorithm (Williams 1992) mentioned earlier in Sect. 3.2.3 is used to optimise the parameters of the controller.

Fig. 8 The controller is implemented in the form of an RNN. Here, the controller samples a feedforward neural network with only convolutional layers (empty squares) and predicts an architectural string composed of the hyperparameters of the convolutional layers (filter height, filter width, stride height, stride width and the number of filters) (Zoph and Le 2017)

MetaQNN (Baker et al. 2017) and BlockQNN (Zhong et al. 2018) use Q-learning, another popular reinforcement learning algorithm. MetaQNN (Baker et al. 2017) defines states as a group of hyperparameters and uses a learning agent to generate layers one by one, until the entire network has been generated. This approach assumes that a well-performing layer in one network will also perform well in another one. The generated network is then trained, and its accuracy is used as a reward for the agent. BlockQNN (Zhong et al. 2018) defines an action space that allows block-wise (as opposed to layer-wise) network generation, with an agent that is trained to choose layers within a block.

ENAS (Pham et al. 2018) and NASNet (Zoph et al. 2018) are examples of reinforcement learning NAS approaches with an action space that allows generating architectural cells (explained earlier in Sect. 4.1.2).

Defining an action space based on function-preserving transformations makes it possible to generate new architectures by transferring knowledge from previously trained networks, thus speeding up the evaluation process. Layer-level architecture transformations (Cai et al. 2018a) allow actions such as adding filters or layers, while path-level transformations (Cai et al. 2018b) allow actions that modify the path topology of the network.

4.2.2 Bayesian optimisation

As outlined in Sect. 3.2.2, Bayesian optimisation employs a surrogate model and an acquisition function in the optimisation process. For hyperparameter optimisation, Gaussian processes and random forests are the two commonly used surrogate models and expected improvement is a popular acquisition function.

Some of the design choices in the NAS search space cannot be encoded in the form of continuous variables (e.g., the number of layers or the types of activation functions). While Gaussian processes in their basic form are only applicable to continuous variables, by using suitably designed kernel functions they can also handle categorical and conditional hyperparameters (Swersky et al. 2020c). GP-NAS is another Gaussian process-based technique that uses a customised kernel function to measure the similarity between architecture encodings. Since the relevant operations (e.g., the number of channels, kernel size) considered for creating these encodings are diverse in type and incomparable, the kernel function used in GP-NAS works on groups of operations rather than on individual operations. NASBOT (Kandasamy et al. 2018) defines new kernel functions, referred to as optimal transport metrics for architectures of neural networks (OTMANN), to measure the similarity between two neural networks. OTMANN measures the distance between neural networks using three factors that determine the performance of a neural network: (1) the operations performed at each layer, (2) the types of these operations, and (3) how the layers are connected. The OTMANN distance can be computed by solving an optimal transport problem, a well-studied optimisation problem for which several effective solvers exist (Peyré and Cuturi 2019). Yang et al. (2023) propose a hierarchical optimal transport metric to measure the cell-level similarity (considering the similarity of nodes and information flow costs between node pairs) and network-level similarity (considering the cell-level similarity and the global position of each node within the network).

Ru et al. (

Fig. 9 a A tree that encodes one hyperparameter and its possible values (16, 32, 48, 64, 80). b The same tree restructured with bisection, which allows more information to be shared between possible paths (e.g., sampling a path to node 1 provides partial information about nodes 2 and 3) (Negrinho and Gordon 2017)

4.2.4 Monte Carlo tree search (MCTS)

Optimisation of neural architectures can also be viewed as a sequential decision process that can be solved using MCTS. The upper confidence bounds applied to trees (UCT) algorithm is a well-known Monte Carlo tree planning algorithm for tree-structured state-action spaces with a built-in exploration-exploitation mechanism that has commonly been used to search the space of network architectures (Negrinho and Gordon 2017; Wistuba 2017; Wang et al. 2020). MCTS works by defining a search tree representing the search space, which is traversed by visiting nodes based on a tree policy and a rollout policy. In the context of NAS, the internal nodes of this tree can represent hyperparameter values; a model is fully defined when a leaf node of the tree is reached. Originally, in MCTS, when visiting a node in the tree, all children of that node need to be expanded before any of their children are expanded. This is not ideal for nodes that represent numerical hyperparameters with a large range of values that lead to similar performance. To address this issue, Negrinho and Gordon (2017) proposed combining UCT with a bisection approach that restructures the branches of the tree for such hyperparameters. At each node, instead of committing to a specific value of a hyperparameter, a sequential committing process is followed: it is first decided whether the chosen hyperparameter value lies in the first or the second half of the set of possible values. The bisection structure implicitly allows information to be shared among similar paths in the search tree, enabling search in a large search space (see Fig. 9).

Various forms of information sharing have been used to improve the performance of MCTS. Wistuba (2017) defines the NAS problem as a Markov decision process in which each state describes the current network architecture and each action adds a layer to the network. To address this problem with MCTS, Wistuba (2017) proposed two further variants of UCT that share information between branches of the search tree, with the aim of reducing the time required for finding a good architecture: the first shares information for the same action in similar states, while the second shares information between similar actions. This is achieved based on predictions of the final reward from previously selected actions. To further improve the performance of MCTS, AlphaX (Wang et al. 2020) proposes to use a meta-deep neural network trained to estimate the accuracy achieved by candidate architectures.

4.2.5 Gradient-based methods

All the methods mentioned so far treat neural architecture search as a black-box optimisation problem over a discrete search space. This involves the full evaluation of neural architectures with large numbers of parameters, which takes a significant amount of time and computational resources. Another category of approaches optimises the architecture using gradient descent; this way, far less computational effort is needed for optimising the hyperparameters than in the black-box optimisation approach (Liu et al. 2019b).

Since gradient-based optimisation is only applicable to continuous search spaces, continuous relaxation approaches are used to transform the NAS search space to a continuous one. Some of the earlier approaches to creating continuous search spaces, such as (Ahmed and Torresani 2018; Saxena and Verbeek 2016; Shin et al. 2018), mainly focused on optimising a limited set of architectural hyperparameters. For instance, MaskConnect (Ahmed and Torresani 2018) focuses on optimising connectivity patterns by defining learnable masks in the form of binary vectors that are learnt jointly with network parameters, and Shin et al. (2018) focused on optimising a number of CNN hyperparameters, such as filter size, number of channels and group convolution, by defining a continuous function based on these hyperparameters.

Fig. 10 a Operations on the edges of the DAG are unknown. b The continuous relaxation approach of DARTS (Eq. 6) creates a mixture of operations. c Solving a bi-level optimisation problem allows jointly optimising the mixing weights and the network weights. d The final architecture is determined from the learned mixing weights (Liu et al. 2019b)

DARTS (Liu et al. 2019b) is a highly influential approach that considers a continuous relaxation of a cell-based search space applicable to both recurrent and convolutional networks. It aims to find optimal sub-architectures for the normal and reduction cells explained in Sect. 4.1.2. Consider a cell represented as a DAG composed of a set of nodes and edges. Each node \(x^{(i)}\) is a latent representation, and each edge (i, j) is associated with an operation \(o^{(i,j)}\) that transforms the latent representation \(x^{(i)}\) into \(x^{(j)}\).

An intermediate node is calculated from its predecessors as \(x^{(j)} = \sum _{i<j} o^{(i,j)}\left( x^{(i)} \right)\).

The DARTS approach, illustrated in Fig. 10, relaxes the categorical choice of a particular operation o on node \(x^{(i)}\) (e.g., convolution, max pooling) to a softmax over all possible operations \(\mathcal {O}\). This yields a mixture of candidate operations for each edge, denoted by \(\bar{o}^{(i,j)}(x)\) (Liu et al. 2019b):

$$\begin{aligned} \bar{o}^{(i,j)}(x) = \sum _{o\in \mathcal {O}} \left( {\frac{\exp \left( \alpha _o^{(i,j)}\right) }{\sum _{o'\in \mathcal {O}}{\exp \left( \alpha _{o'}^{(i,j)}\right) }} \cdot o(x)} \right) \; , \end{aligned}$$
(6)

where \(o(\cdot )\) denotes an operation applied to node \(x^{(i)}\), and the operation mixing weights for the pair of nodes \(x^{(i)}\) and \(x^{(j)}\) connected by the edge (i, j) are parameterised by a vector \(\alpha ^{(i,j)}\) of size \(\mid \mathcal {O}\mid\). The goal of the architecture search process is to identify an architecture encoding \(\alpha\) in the form of a set of continuous variables \(\alpha =\{ \alpha ^{(i,j)}_o\}\), each representing the weight of an operation. A discrete architecture is produced by replacing each mixed operation \(\bar{o}^{(i,j)}(x)\) with the single operation that has the highest weight, \(o^{(i,j)} = \mathop {\mathrm {arg\,max}}\limits _{o \in \mathcal {O}} \alpha _o^{(i,j)}\).

The architecture search process then aims to find the \(\alpha ^*\) that minimises the validation loss \(\mathcal {L}_{val}(w^*,\alpha ^*)\), where the weights of the network, \(w^*\), are determined by minimising the training loss, \(w^* = \underset{w}{\arg \min }\ \; \mathcal {L}_{train}(w, \alpha ^*)\). This bi-level optimisation problem can be solved using gradient descent to jointly optimise w and \(\alpha\).
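
The continuous relaxation of Eq. 6 can be made concrete with a small NumPy sketch: each candidate operation on an edge is applied to the input, and the results are blended with softmax-normalised architecture weights. The candidate operations used here are toy stand-ins, and the sketch only shows the forward mixture and the final discretisation step, not the bi-level optimisation of \(\alpha\) and w:

```python
import numpy as np

# Toy candidate operations on one edge (stand-ins for convolution, pooling, ...).
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,
    "zero":     lambda x: np.zeros_like(x),
}

def mixed_operation(x, alpha):
    """Mixture of the candidate operations on one edge, as in Eq. (6).

    alpha: unnormalised architecture weights, one entry per operation in OPS.
    """
    weights = np.exp(alpha - alpha.max())
    weights /= weights.sum()                     # softmax over operations
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

def discretise(alpha):
    """Keep only the operation with the largest architecture weight."""
    return list(OPS)[int(np.argmax(alpha))]

x = np.ones(4)
alpha = np.array([0.1, 1.5, -0.3])
print(mixed_operation(x, alpha), discretise(alpha))
```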

There are a number of issues with this continuous relaxation approach that have been addressed in follow-up research. Increasing the depth of the candidate networks considered in DARTS quickly increases the size of the space to be searched and, consequently, the GPU memory overhead. In light of this, Liu et al. (2019b) originally restricted architecture search to a network of 8 cells and evaluated the resulting cells in a network of 20 cells. ProxylessNAS (Cai et al. 2019) addresses the memory inefficiency of DARTS (Liu et al. 2019b), which otherwise forces the use of proxy tasks (i.e., training on smaller datasets or with fewer cells). This is achieved through a path binarisation approach that reduces the memory footprint: during the training of an over-parameterised network, all candidate paths normally remain active in memory, but by using binary gates instead of continuous mixing weights, only one path is kept active in memory at run-time.
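The memory saving from binary gates can be illustrated with a simplified sketch: instead of evaluating every candidate operation on an edge (as in Eq. 6), a single path is sampled according to the current architecture distribution, so only that operation's activations need to be stored. This is only a schematic view of the idea; the dedicated gradient estimator ProxylessNAS uses for the binarised gates is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGateEdge(nn.Module):
    """Edge that keeps only one sampled candidate operation active per forward pass."""

    def __init__(self, ops: list):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.alpha, dim=0)
        # Sample one path; only this operation is executed, so the activations of
        # the other candidates are never materialised in memory.
        # (The estimator used to propagate gradients to alpha is omitted here.)
        idx = int(torch.multinomial(probs, num_samples=1))
        return self.ops[idx](x)
```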

Another issue with DARTS is caused by the difference in behaviour between shallow and deep networks (i.e., faster gradient descent in shallow networks versus higher performance of deeper networks), leading to a so-called depth gap between search and evaluation: the networks preferred during the search process are not necessarily optimal for evaluation. Chen et al. (2020) found evidence that this issue is caused by unfair competition among the various operations in the bi-level optimisation process, which often leads to the dominance of skip connections. They propose a cooperative mechanism, implemented by additional activations for each parameter \(\alpha ^{(i,j)}\), that eliminates the competition by allowing operations to be switched on or off without being suppressed. Different forms of unfair competition between simultaneously trained operations in DARTS, leading to training stability issues, were identified and studied more generally by Hong et al. (2020) and Gu et al. (2021). Hong et al. (2020) proposed to address this issue through a new group drop-out operation, while Gu et al. (2021) proposed a new continuous relaxation approach that completely decouples operation and topology search in differentiable architecture search.

Neural Architecture Optimisation (NAO) (Luo et al. 2018) is another gradient-based NAS approach. While DARTS makes use of a continuous representation of the architecture, including its weights, NAO is based on a continuous embedding of the architecture search space only. In this approach, an auto-encoder is used to learn a continuous representation of neural network architectures. A surrogate model is trained on this continuous representation to predict the performance of previously unseen candidate architectures. In each iteration of the search process, the surrogate model is used in gradient-based optimisation to select a new architecture to be evaluated. This new architecture is obtained by mapping the continuous representation back to a discrete one using a decoder learned as part of the auto-encoder model. The surrogate model and auto-encoder are updated at every iteration in order to minimise the encoder-decoder loss and the performance prediction (surrogate) loss. While NAO uses gradient descent for architecture optimisation, unlike DARTS, it does not involve a bi-level optimisation of weights and architecture parameters. However, NAO can still use weight-sharing approaches (see Sect. 4.3.1) separately to speed up the evaluation of candidate networks.
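The core loop of this kind of approach can be sketched as follows: a discrete architecture is encoded into a continuous embedding, the embedding is moved a few gradient steps in the direction that increases the predicted performance, and the result is decoded back into a discrete architecture. The `encoder`, `surrogate` and `decoder` modules below are hypothetical placeholders, and the sketch is a schematic illustration of the idea rather than NAO's actual implementation.

```python
import torch

def propose_architecture(arch, encoder, surrogate, decoder,
                         steps: int = 10, step_size: float = 0.1):
    """Move an architecture embedding uphill on the surrogate's predicted
    performance, then decode it back into a discrete architecture."""
    z = encoder(arch).detach().requires_grad_(True)
    for _ in range(steps):
        predicted_perf = surrogate(z).sum()      # scalar performance prediction
        grad, = torch.autograd.grad(predicted_perf, z)
        z = (z + step_size * grad).detach().requires_grad_(True)
    return decoder(z)                            # map back to a discrete architecture
```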

Gradient-based approaches such as DARTS can be used in combination with gradient-based meta-learning approaches to further improve search efficiency. Recently, Elsken et al. (2020) have proposed MetaNAS, which integrates NAS with gradient-based meta-learning techniques. Their approach extends DARTS to optimise a meta-architecture with meta-weights during meta-learning; this facilitates the adaptation of the architecture to novel tasks.

4.2.6 Random search and grid search

As in the case of HPO, random search (Bergstra and Bengio 2012; Li and Talwalkar 2019) and grid search (Zagoruyko and Komodakis 2016) have been used in NAS. Both approaches are very simple and easy to run in parallel, but they are usually not considered to be very efficient. Nevertheless, random search has been considered a competitive baseline for NAS (Yu et al. 2020; Li and Talwalkar 2019). Comparing the performance of models on the Penn Treebank (PTB) (Marcus et al. 1994) language modelling and CIFAR-10 (Krizhevsky et al. 2010) image classification datasets, Yu et al. (2020) demonstrated that, on average, a number of state-of-the-art NAS algorithms [ENAS (Pham et al. 2018), DARTS (Liu et al. 2019b) and NAO (Luo et al. 2018)] perform similarly to random search when using the same search space for finding a recurrent and a convolutional cell.

Li and Talwalkar (2019) evaluated random search with early stopping and weight-sharing strategies (explained in Sect. 4.3.1) on the search space of DARTS and demonstrated that it achieves performance competitive with that of DARTS and ENAS. This is not due to poor performance of the latter NAS algorithms, but mainly a consequence of the constraints they impose on the search space: in a well-constrained search space, even random search can perform well. Another contributing factor could be the weight-sharing strategy, which negatively impacts architecture ranking during the search.
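For concreteness, a minimal random-search NAS baseline can be written in a few lines; `sample_random_architecture` and `train_and_evaluate` are hypothetical helpers standing in for sampling from a (well-constrained) cell-based search space and for a possibly weight-sharing-based performance evaluation, respectively.

```python
def random_search(sample_random_architecture, train_and_evaluate, budget: int = 100):
    """Keep the best of `budget` randomly sampled architectures."""
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_random_architecture()
        score = train_and_evaluate(arch)   # e.g., validation accuracy
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```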

4.3 Performance evaluation in NAS

A major performance bottleneck of NAS systems arises from the need to train each candidate neural network. This is especially the case for deep neural networks, whose training takes substantial amounts of time. Therefore, performance evaluation is a much more important topic in the context of NAS than in HPO for classic machine-learning algorithms. Early NAS systems, such as NASNet (Zoph et al. 2018) and AmoebaNet (Real et al. 2019), trained every candidate architecture from scratch, racking up thousands of days of GPU time. The design of cell-based search spaces, as discussed in Sect. 4, can, to some extent, decrease the total cost of performance evaluation. The search process can also be sped up significantly using network morphisms and function-preserving transformations (Chen et al. 2016), which allow the structure of a network to be modified without substantially changing the function it computes. Furthermore, efficient performance estimation strategies have recently become a major focus of research on NAS methods. In the following subsections, we discuss approaches taken to speed up performance evaluation in NAS.

Fig. 11: An example of a one-shot cell that receives three inputs, concatenates them, applies a 1 \(\times\) 1 convolution, followed by several other operations, and finally sums the results together. A sub-architecture, denoted by solid edges, can be evaluated by disabling the remaining inputs and operations, i.e., zeroing them out (Bender et al. 2018)

4.3.1 Parameter sharing

Parameter (or weight) sharing or inheritance between network architectures is an approach that can reduce the computational demands of NAS methods. Parameter sharing can be performed by reusing the weights of previously optimised architectures or by sharing computations between different but related architectures.

Defining a search space that allows sampling sub-architectures from a super-network facilitates parameter sharing (Pham et al. 2018; Cai et al. 2018a; Elsken et al. 2018). ENAS (Pham et al. 2018), SMASH (Brock et al. 2018) and convolutional neural fabrics (Saxena and Verbeek 2016) use an over-parameterised super-network defined in a discrete search space (also referred to as a one-shot model), which is a superposition of all possible sub-architectures, and enforce parameter sharing between these sub-architectures. Once the weights of the super-network have been trained, the performance of all sub-architectures on the validation set can be ranked and compared without further training, by enabling and disabling edges (see Fig. 11 and the sketch below); the most promising sub-architectures are then retrained. Generating and training a super-network for a given search space is itself a complicated task. Muñoz et al. (2022) proposed a further approach, dubbed BootstrapNAS, that automates the generation and training of super-networks from pre-trained models.
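The ranking step can be sketched as follows, assuming a hypothetical `supernet` that accepts a binary mask selecting which operations are enabled and a hypothetical `validation_accuracy` helper: once the super-network is trained, candidate sub-architectures are compared simply by zeroing out the disabled parts and measuring validation performance, without any additional training.

```python
import torch

@torch.no_grad()
def rank_subarchitectures(supernet, candidate_masks, val_loader, validation_accuracy):
    """Rank sub-architectures of a trained one-shot model without retraining.

    Each mask is a binary vector indicating which operations of the
    super-network are enabled for the corresponding sub-architecture.
    """
    supernet.eval()
    scores = []
    for mask in candidate_masks:
        acc = validation_accuracy(supernet, val_loader, mask=mask)
        scores.append((acc, mask))
    # Highest validation accuracy first; the top candidates are then retrained.
    return sorted(scores, key=lambda s: s[0], reverse=True)
```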

Creating differentiable search spaces (see, e.g., Liu et al. 2019a, b; Xu et al. 2021b) is another way of facilitating parameter sharing, since all candidate sub-architectures share the weights of a single, jointly trained over-parameterised network.

NASLib is a Python library based on PyTorch that is mainly focused on supporting NAS researchers. It follows a modular and flexible approach that facilitates the reuse of search spaces and evaluation pipelines. NASLib covers cell-based and hierarchical search spaces, various search strategies (e.g., random search, Bayesian optimisation, evolutionary and gradient-based search), and performance evaluation approaches (e.g., one-shot, zero-cost, weight-sharing and learning-curve-based). Furthermore, it includes an extensive collection of NAS benchmarks that can help researchers evaluate and further develop performance prediction approaches.

Such tabular benchmarks are created by exhaustively training and evaluating all architectures in a small search space and storing the results for cheap lookup. HPO-Bench and NAS-Bench-101 (Ying et al. 2019) are among the first tabular benchmark datasets created in this manner. HPO-Bench covers 144 candidate architectures of a feed-forward neural network and can be used for the empirical evaluation of HPO methods; however, it is insufficient for evaluating NAS algorithms that focus on the automatic design of advanced deep-learning architectures. NAS-Bench-101 is the first public dataset suitable for benchmarking NAS research. It covers all 423,000 unique convolutional architectures of a cell-based search space defined by a DAG with seven nodes and three operations. Each architecture was trained multiple times on CIFAR-10, leading to a large dataset of over 5 million trained models. This benchmark can be used for comparing HPO methods as well as certain NAS algorithms that do not use parameter sharing or network morphisms.
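From a user's perspective, a tabular benchmark behaves like a lookup table: a candidate architecture (encoded, e.g., as an adjacency matrix plus operation labels) is mapped to precomputed training statistics, so NAS methods can be compared at negligible cost. The sketch below uses a hypothetical in-memory table with illustrative placeholder values rather than the actual APIs or data of these benchmarks.

```python
# Hypothetical tabular NAS benchmark: every architecture in the (small) search
# space has been trained in advance and its results stored for cheap lookup.
PRECOMPUTED_RESULTS = {
    # architecture encoding -> results averaged over repeated training runs
    # (illustrative placeholder numbers, not real benchmark entries)
    ("conv3x3", "maxpool", "skip"): {"val_acc": 0.91, "train_time_s": 1800.0},
    ("conv3x3", "conv1x1", "skip"): {"val_acc": 0.93, "train_time_s": 2100.0},
}

def query(architecture: tuple) -> dict:
    """Return precomputed metrics instead of training the architecture."""
    return PRECOMPUTED_RESULTS[architecture]
```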

NAS-Bench-201 (Dong and Yang 2020) is tailored towards the evaluation of a broader range of NAS algorithms on more image datasets, but within a smaller space based on a DAG with 4 nodes and 5 operations, resulting in 15,625 neural cell candidates in total. NAS-Bench-1Shot1 (Zela et al. 2020) reuses the NAS-Bench-101 dataset with some modifications tailored to the evaluation of one-shot NAS methods. While benchmarks such as NAS-Bench-101, HPO-Bench and NAS-Bench-1Shot1 focus solely on architecture topology, without considering architecture size, NATS-Bench (Dong et al. 2021) covers both factors by providing two sets of architectures: 15,625 neural cell candidates (4 nodes and 5 operations) for evaluating architecture topology, and 32,768 candidates for evaluating architecture size, which together can be used to evaluate all NAS methods. There have also been efforts to create more specialised benchmarks that assess additional properties of specific groups of NAS systems. For instance, BenchENAS (Xie et al. 2022b) focuses on evaluating EvNAS algorithms and provides a platform for running them in a fair manner, including a dedicated parallel-processing component for evaluating populations of candidate architectures. This benchmark currently covers 9 EvNAS algorithms, spanning fixed-length and variable-length encoding strategies as well as single- and multi-objective optimisation. NAS-Bench-NLP (Klyuchnikov et al. 2022) focuses on benchmarking NAS for natural language processing by using RNN cells, as opposed to the convolutional cells used in the previous NAS benchmarks. It covers 14,000 trained architectures, along with evaluation results using metrics from language modelling and relevant downstream tasks (e.g., sentence, similarity and paraphrase tasks).

The above-mentioned tabular NAS benchmarks rely on the computationally expensive task of exhaustively evaluating all architectures within a given search space. To remain feasible, their search spaces are therefore limited to very small architectural spaces compared to those usually considered in the NAS literature, which may compromise the generalisability of results obtained using existing tabular NAS benchmarks. To enlarge the search space, NAS-Bench-301 (Zela et al. 2022) provides a surrogate NAS benchmark that covers \(10^{18}\) architectures and is thus much larger than previous benchmarks. The underlying idea is to estimate the performance of candidate architectures using a surrogate model rather than actual evaluations. The authors of NAS-Bench-301 created the benchmark by training and evaluating roughly \(60\,000\) architectures sampled from this much more complex search space and fitting the surrogate model to the resulting data. While the degree to which the resulting benchmark is realistic depends on the accuracy of the surrogate model, empirical results by Zela et al. (2022) demonstrate that a surrogate can yield a better model of architecture performance than a tabular benchmark. The reason is that tabular benchmarks typically contain only a few training runs per architecture, each obtained with a stochastic mini-batch procedure and therefore subject to variance; a tabular benchmark thus provides only a simple point estimate of the true performance of an architecture. A surrogate model trained on a smaller subset of networks has the potential to yield a considerably more accurate estimate by taking into account additional information, such as the similarity between architectures.
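A surrogate benchmark in this spirit can be approximated with a standard regression model: fit a regressor on (architecture encoding, observed performance) pairs for the architectures that were actually trained, then use its predictions for the rest of the space. The sketch below uses scikit-learn's RandomForestRegressor purely as a stand-in for the surrogate models studied by Zela et al. (2022); the encoding scheme is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_surrogate(encodings: np.ndarray, val_accuracies: np.ndarray):
    """Fit a surrogate mapping architecture encodings to validation accuracy.

    `encodings` is an (n_architectures, n_features) array, e.g., flattened
    adjacency matrices concatenated with one-hot operation choices.
    """
    surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
    surrogate.fit(encodings, val_accuracies)
    return surrogate

def predicted_performance(surrogate, encoding: np.ndarray) -> float:
    """Query the surrogate instead of training the architecture."""
    return float(surrogate.predict(encoding.reshape(1, -1))[0])
```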