Introduction

Current optimization algorithms achieve good results on low-dimensional problems that are smooth and have wide basins of attraction. Examples of smooth manifolds with wide basins of attraction within materials science include process- and recipe-optimization problems such as tuning perovskite manufacturing variables to achieve higher efficiency1, optimizing microfluidics flow parameters to achieve ideal droplet formation2, optimizing silver nanoparticle recipes for optical properties3, and tuning perovskite compositions with physics-based constraints to maximize stability4. Optimization techniques like Bayesian optimization (BO) are well-suited to model these simple manifolds using a Gaussian Process (GP) surrogate5,6,7,8,9. However, the performance of BO with a GP surrogate breaks down as the manifold complexity increases. Material property optimization problems that have high technological significance, such as discovering materials with rare properties or materials with a specific combination of properties, have search space manifolds that more closely resemble a Needle-in-a-Haystack10, shown in Fig. 1b, than a smooth or convex space.

Fig. 1: Archetypal Manifolds in Materials Science Optimization.
figure 1

a In process optimization, there often exists a real and continuous path between each condition. This 3D projected manifold is from a perovskite process optimization process, where X1 is spray flow rate, X2 is plasma voltage, and f(X) is cell efficiency1. b However, in materials optimization, there are often only discrete combinations of properties that define real materials, resulting in a rough topology with extreme outliers. For example, Li2NbF6 and Li2ZrF6 lie close to each other in space because they have similar density, formation energy, and structure; however, they have vastly different target properties: Li2NbF6 has a Poisson’s ratio of − 1.7 while Li2ZrF6 has a Poisson’s ratio of 0.321. Extreme outliers, such as Li2NbF6, constitute only a small fraction of the manifold hypervolume, giving rise to a Needle-in-a-Haystack regime. This 3D projected manifold is obtained from the 6D Poisson’s ratio optimization problem presented in this paper, where X1 is density, X2 is formation energy, and f(X) is negative Poisson’s ratio20.

This Needle-in-a-Haystack (NiaH) problem arises when only a few optimum conditions exist within the entire dataset, resulting in an extreme imbalance. Interpolating the parameter space of an imbalanced dataset with an estimation function, such as a GP, results in smoothing over the optimum or over-predicting the properties of the materials found near the optimum11,12,13. Examples of NiaH materials optimization problems include discovering auxetic materials (i.e., materials that have a highly negative Poisson’s ratio, ν) for energy absorptive medical devices or protective armor14,15,16 and discovering materials that have a combination of high electrical conductivity and low thermal conductivity (i.e., a highly positive thermoelectric figure of merit, ZT) used for improving sensor technology to enable ubiquitous solid-state cooling17,18,19. Optimization of these rare material properties illustrates cases where an extreme data imbalance exists in the dataset because only a fraction of the total number of materials exhibit these rare properties14,20,21,22,23. This NiaH optimization challenge of extremely imbalanced datasets is largely applicable to many fields, not just materials science, including the fields of ecological resource management24,25, fraud detection26,27, and rare diseases27,28.

Several challenges exist within the current landscape of computational tools that inhibit effective optimization of these complex NiaH problems. Firstly, the “needle" makes up only a small percentage of the total manifold search space, resulting in a weak correlation between the measured input parameters and the target property of interest, inhibiting discovery of the region containing the needle11,29,30. This challenge requires the development of an algorithm that can more quickly determine the plausible region of the manifold where the needle exists. The second challenge for algorithms such as BO is the tendency of the acquisition function to pigeonhole sampling into local minima because of the narrowness of the needle’s basin of attraction31,32. Standard BO acquisition functions, including expected improvement (EI)33 and lower confidence bound (LCB)7,12, are static sampling techniques that only adjust sampling based on the output of the surrogate model, which smooths over the needle5,6,11. To overcome this challenge, active learning-based tuning of the acquisition function hyperparameters can be implemented to improve the sampling quality and avoid pigeonholing. Lastly, there exists a computing challenge for NiaH problems where, typically, several thousand samples must be observed to find an optimum when using an algorithm that is poorly suited to tackle NiaH manifolds10. The compute time of BO using a GP surrogate scales with complexity O(n³), where n is the number of experiments sampled; hence, the compute time of traditional BO grows rapidly as more data are required to find the optimum5,6,34,35,36,37,38. To solve this computing challenge, an algorithm must be designed that both efficiently optimizes the space in as few experiments as possible and reduces the effect of compounding compute times over the length of the optimization procedure.

In recent literature, algorithms have been developed to address some of these challenges individually, but not all of them together. The first class of solutions bounds the search space using a trust region approach to sample regions with a higher probability of containing the optimum. Eriksson et al. develop TuRBO39, which compiles a set of independent model runs, using separate GP surrogate models to compute a new, smaller search region narrowed in on the target optimum. Regis develops TRIKE40, which utilizes maximization of the EI acquisition function to bound a trust region containing the global optimum. Diouane et al. develop TREGO41, which interleaves sampling between global and local search regions, where the local search regions are defined by the single best historical experiment sampled. Although these methods offer solutions to one of the three challenges presented, each method has its drawbacks when optimizing NiaH problems. For example, TuRBO requires the computation of several GP model runs, which increases compute time, and it also does not guarantee that the needle will be resolved due to interpolation effects; TRIKE is inflexible to the use of other acquisition functions as it locks the user into using only EI, which may pigeonhole into local minima; TREGO uses only the best sampled experiment to define its search regions, which yields inconsistent or sub-optimal results when the needle occupies only a small fraction of the manifold and a single point is unlikely to land in its basin of attraction. The second class of solutions to the challenges presented in this paper is designed to decrease the computing time required to run an optimization procedure. A common method for reducing the compute time of BO with a GP surrogate is to introduce a sparse GP5,37,42. A sparse GP uses a small subset of pseudo data, often denoted as m, to reduce the GP time complexity from O(n³) to O(nm²)43. However, the process of selecting a useful subset requires minimizing the Kullback-Leibler divergence between the sparse GP and the true posterior GP, which is often a computationally intensive procedure involving variational inference44. In addition to sparse GPs, algorithms have been developed in the literature to improve the compute time of optimization in various ways. Van Stein et al. develop MiP-EGO45, which parallelizes the function evaluations of efficient global optimization (EGO) to discover optima faster and in fewer experiments using derivative-free computation46. Joy et al.47 use directional derivatives to accelerate hyperparameter tuning by 100× and achieve higher accuracy than the FABOLAS baseline by Klein et al.48. Zhang et al. develop FLASH49 to achieve optimization speed-ups of 50% by using a linear parametric model to guide algorithm search within high-dimensional spaces. Snoek et al.13 design a neural network-based parametric model that reduces the overall time complexity of BO to O(n) compared to the O(n³) complexity of standard BO with a GP surrogate model. These existing methods for accelerating compute time generally introduce external models to perform the optimization, such as neural networks, variational inference, or parametric models. While these external models do speed up compute time, they often lack the predictive capabilities to capture the weak correlation between measured input parameters and the target property of interest in NiaH problems.
We illustrate this mechanism later in the paper by comparing, on two materials science NiaH problems, the optimization results of the fast algorithm MiP-EGO with those of TuRBO, an algorithm better suited for discovering optima within narrow basins of attraction.

Although these methods from existing literature address some of the challenges in optimizing NiaH problems, none of them have been designed specifically to quickly and efficiently discover a needle-like optimum within a haystack of sub-optimal points, resulting in all of them falling short of a full solution. Therefore, in this paper, we design an algorithm that addresses all three of the challenges faced when optimizing NiaH problems by (1) zooming in the manifold search bounds iteratively and independently for each dimension based on the m best memory points to quickly converge to the plausible region containing the global optimum needle, (2) relieving compute utilization by pruning the low-performing and redundant memory points not being used to zoom in the search bounds, and (3) avoiding pigeonholing into local minima by using actively learned acquisition function hyperparameters to tune the exploitation-to-exploration ratio. The proposed algorithm, entitled [Zo]oming [M]emory-[B]ased [I]nitialization (ZoMBI), combines these three contributions into a method that optimizes NiaH problems quickly and efficiently. Figure 2 demonstrates the accelerated convergence ability of the proposed ZoMBI algorithm compared to standard BO. In essence, this process of scanning broadly and then focusing in on points of interest based on memory was inspired by the way we humans solve similar problems, but stands in contrast to the way standard BO methods with static acquisition functions solve problems. We demonstrate the performance of this algorithm on three vastly different NiaH problems in materials science and ecological resource management: (1) discovery of materials with negative Poisson’s ratio, (2) discovery of materials with both high electrical conductivity and low thermal conductivity, and (3) detection of environmental conditions conducive to sustaining wildfires. The performance of the proposed ZoMBI algorithm is compared against standard BO with static acquisition functions as well as against three further algorithms: (1) HEBO, the winning submission of the NeurIPS 2020 Black-Box Optimization Challenge50, and one algorithm from each of the two classes of partial NiaH solutions, (2) TuRBO (bounded search space)39 and (3) MiP-EGO (faster compute)45. Finally, we stress-test the proposed ZoMBI algorithm across 174 additional datasets varying the optimum needle width, optimum distance to edges, dimensionality, and initialization conditions.

Fig. 2: Accelerated Convergence to True Target using ZoMBI.
figure 2

Using a standard Bayesian optimization procedure, the discovery of a Needle-in-a-Haystack condition does not progress significantly after 10 additional experiments from the initial GP guess. However, using ZoMBI to zoom the bounds inward and prune redundant memory points, the needle-like optimum region is resolved to be accurately aligned with the true target. a The true target to optimize, which is a slice from the 6D Poisson’s Ratio dataset. b The initial guess of the target function using a GP surrogate with 20 randomly sampled experiments. c (top) The estimated target resolved by standard BO after 10 additional experiments sampled using a greedy LCB acquisition function (β = 0.1); (bottom) the estimated target resolved by ZoMBI after 10 additional experiments sampled using the same greedy LCB acquisition function. The red memory points do not assist in resolving this target after zooming in the bounds, hence, they are pruned from memory by ZoMBI.

Results

Zooming in the search bounds on the manifold addresses challenge number one of optimizing NiaH problems: finding the general hypervolume region that contains the needle-like optimum. Figure 3 illustrates how the ZoMBI algorithm iteratively zooms in the search bounds based on the number of activations, α. An Ackley function is used as a simulated example due to its non-convexity and needle-like global optimum51,52. For each activation, the m prior points that achieved the lowest target values, y, are retained in memory and used to zoom in the search bounds. This zooming occurs independently across each dimension and is based on the minimum and maximum values of the m memory points along each dimension, as shown in Equation (2). The red and orange rectangles illustrate the evolution of the bounds over space and time. Initially, sampling occurs across the entire manifold for ϕ forward experiments per activation, shown by the black markers. However, by using the best-performing memory points to zoom in the search bounds, pigeonholing into local minima can also be avoided as the search bounds are pulled away from these trap minima and move closer towards the global minimum basin of attraction. The iterative zooming of ZoMBI does not guarantee convergence on the global optimum, but if a sufficient initialization set is obtained, convergence often gets close to the global optimum, as shown across several examples in Figs. 5 and 8–10. Furthermore, we comprehensively demonstrate the performance limitations of ZoMBI where initializations miss extreme needle-like optima in Fig. 6 and where optima are near the edges of a manifold in Supplementary Figure 4.
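To make this per-dimension zooming step concrete, the snippet below computes new bounds from the m best memory points. It is a minimal numpy sketch under assumed array shapes, not the released ZoMBI implementation; the toy objective and the value of m are illustrative placeholders.

```python
import numpy as np

def zoom_bounds(X, y, m=10):
    """Return per-dimension [lower, upper] bounds spanned by the m best memory points."""
    best = np.argsort(y)[:m]        # indices of the m lowest target values (minimization)
    X_best = X[best]                # retained memory points, shape (m, d)
    return X_best.min(axis=0), X_best.max(axis=0)   # independent bounds for each dimension

# Toy usage: 200 random samples in a 4D unit cube with a synthetic objective whose
# minimum sits near 0.7 in every dimension; the bounds contract around that point.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = np.linalg.norm(X - 0.7, axis=1)
print(zoom_bounds(X, y, m=10))
```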

Fig. 3: Zooming Search Bounds.
figure 3

For every activation of ZoMBI, the search bounds are zoomed inward based on the prior best-performing memory points. A 4D Ackley function manifold is projected in 2D. The bounding regions of each 2D slice are illustrated by the red and orange boxes. The ϕ forward experiments sampled for each activation, α, are illustrated as black markers. The global optimum is indicated by the red region of the heatmap.

As more experiments are amassed and committed to memory to run traditional BO by computing the GP surrogate, the compute time increases polynomially, following the O(n³) time complexity of GP matrix inversion5,6,34,37,38,53. This complexity is unfavorable as it leads to compounding compute times as more experiments are run. Therefore, we implement a memory pruning feature into the ZoMBI algorithm that iteratively selects which prior data points to keep and which to prune from the memory during each activation, α. Memory pruning is demonstrated to remove redundant data points during the optimization procedure. Figure 2 illustrates how ZoMBI accelerates the convergence of a GP prediction to the precise location of the true target. However, only data within the newly computed bounds of ZoMBI are used for prediction of the true target; hence, all data outside this boundary become redundant and are pruned to decrease compute time.

Through memory pruning, the number of experiments used to train the GP surrogate varies between [i, i + ϕ] for every α, rather than being proportional to n, where the number of initialization samples is fixed at i = 5. In this paper, we use ϕ ∈ [0, 10], i.e., once ϕ = 10, the activation is complete and resets to ϕ = 0. This is computationally favorable because {Xi} ∪ {Xϕ} ⊆ {Xn}. Thus, for a single α, the time complexity is O((i+ϕ)³) ≈ O(ϕ³), since i is fixed. Furthermore, since the range of ϕ is capped, a non-increasing sawtooth pattern in compute time is exhibited, illustrated in Fig. 4. Therefore, the compute complexity of ZoMBI trends towards O(1) for α > 1 as a result of the efficient memory pruning process. After collecting 1000 experiments, the compute time of traditional BO trends towards > 400 seconds per experiment, whereas for ZoMBI the compute time maintains a constant trend of approximately 1 second per experiment. Therefore, the memory pruning feature of ZoMBI accelerates the optimization compute time by over 400× at n = 1000 and achieves further relative acceleration as n increases.
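The back-of-the-envelope sketch below illustrates this scaling argument; the operation counts are nominal cubic-cost proxies for GP fitting, not measured wall-clock times, and the values of i and ϕ follow those used in this paper.

```python
# Why memory pruning yields a roughly constant per-experiment cost: the GP is refit on at
# most i + phi points per activation instead of all n points, so the cubic fitting cost is
# bounded by (i + phi)^3 regardless of how many experiments have been collected in total.
i, phi = 5, 10
for n in [50, 100, 500, 1000]:
    standard_bo_cost = n ** 3            # grows without bound as experiments accumulate
    zombi_cost = (i + phi) ** 3          # capped by the pruned memory, independent of n
    print(f"n={n:5d}  standard O(n^3) ~ {standard_bo_cost:.1e}   "
          f"ZoMBI O((i+phi)^3) ~ {zombi_cost:.1e}")
```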

Fig. 4: Wall-clock Compute Time.
figure 4

The compute time per experiment is illustrated for traditional BO with a GP surrogate (orange) and for ZoMBI with a GP surrogate (blue) with the y-axis in log-scale. Four independent trials of each method were run to optimize a 6D Ackley function with a narrow basin of attraction using an NVIDIA Tesla Volta V100 GPU72. Each trial of standard BO and ZoMBI is run using one of the four acquisition functions: LCB, LCB Adaptive, EI, and EI Abrupt. The averages of the trials are shown as solid orange and blue lines while the shaded regions indicate the maximum and minimum compute time bounds. The red dashed line indicates the trend of the ZoMBI compute times. The measured compute time includes the time to compute the GP surrogate model and the time to acquire an experiment from the surrogate.

Pigeonholing into the local minima of a function occurs when an optimization algorithm has insufficient learned knowledge of the manifold topology to continue exploring potentially profitable regions or when the algorithm’s hyperparameters are improperly tuned, leading to overly exploitative tendencies1,9. The ZoMBI algorithm’s anti-pigeonholing capabilities are two-fold: (1) the zooming search bounds help the acquisition function to quickly stop sampling local minima once a better performing data point is found and (2) actively learned acquisition function hyperparameters use knowledge about the domain to help exit a local minimum. Figure 5 demonstrates the anti-pigeonholing capabilities of ZoMBI on optimizing a 6D Ackley function with both static and dynamic acquisition functions, compared to that of traditional BO.

Fig. 5: Acquisition Function Sampling Density.
figure 5

The colored heatmaps indicate the regions of a 2D slice from a 6D Ackley function where sampling density is high for each respective acquisition function: a LCB, b LCB Adaptive, c EI, and d EI Abrupt. The contour lines indicate the manifold topology with local minima as the circular and pointed regions of the contours. The red “x" indicates the global minimum. For each acquisition function, the left panel shows the sampling density after n = {20, 40, 80} evaluated experiments without the use of ZoMBI while the right panel shows the sampling density after n = {20, 40, 80} evaluated experiments with the use of ZoMBI.

The needle-like global minimum is indicated by the red “x" and the local minima are indicated by the circular and pointed regions of the contour lines. The sampling density of each acquisition function is illustrated by the heatmap, where the darker colors indicate higher sampling density regions. The goal is to achieve high sampling density near the red “x". It is shown that without ZoMBI being activated, the LCB, LCB Adaptive, and EI acquisition functions all end up pigeonholing into local minima. However, EI Abrupt initially pigeonholes into a local minimum but then switches from an exploitative to an explorative mode to jump out of the local minimum and converge closer to the global minimum. Conversely, when running the optimization procedure with ZoMBI active, all of the acquisition functions except the most exploitative, EI, converge onto the global minimum quickly. LCB Adaptive and EI are shown to initially start sampling towards a local minimum, but as ZoMBI is iteratively activated, the search bounds zoom in closer to the global minimum. Thus, with the combination of dynamic acquisition functions and zooming search bounds, pigeonholing into sub-optimal local minima can be more readily avoided while optimizing NiaH problems, although avoidance is not guaranteed, as shown by the sampling density of EI. The combination of the three foundational features of ZoMBI, (1) zooming bounds, (2) memory pruning, and (3) anti-pigeonholing, drives fast optimization of NiaH problems and, in most cases, does not sacrifice the ability to converge on the global optimum.

Before assessing the performance of ZoMBI on the three real-world datasets, we use 144 permutations of the Ackley function to stress-test the capability of ZoMBI to discover the global optimum basin of attraction, given two varying dataset hyperparameters: (1) basin of attraction width and (2) dimensionality. The basin of attraction hypervolume is determined by both the width of the basin and the dimensionality of the manifold; hence, as the basin becomes narrower in width and as the dimensionality increases, the percentage of hypervolume space taken up by the basin decreases, i.e., the optimum becomes more needle-like. The Ackley permutations have basin hypervolumes varying from 0.001% to 100% and manifold dimensionalities varying from 2D to 10D. For this experiment, we aim to determine the types of manifold topologies that ZoMBI best optimizes while quantifying those limits with the Pareto front.
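For reference, a standard form of the Ackley test function is sketched below. The basin-narrowing knob shown here (the b parameter) is an assumption made for illustration only; the exact permutations used in this work are detailed in the Supplementary Information.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Standard d-dimensional Ackley function with global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    term1 = -a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
    term2 = -np.exp(np.sum(np.cos(c * x)) / d)
    return term1 + term2 + a + np.e

# Increasing b steepens the function around the optimum, i.e., the basin of attraction
# narrows; the same nearby point then evaluates farther from the minimum value of 0.
for b in [0.2, 1.0, 5.0]:
    print(b, ackley(np.full(6, 0.05), b=b))
```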

Figure 6 shows the results of this large-scale optimization experiment of 48 independent trials of ZoMBI across each of the 144 unique permutations of the Ackley function dataset with varying optimum hypervolumes and dimensionality. All points below the grey-shaded region fall within the optimum basin of attraction. The red trace of the Pareto front indicates the narrowest optimum hypervolume and dimensionality conditions of a dataset that result in the best minimum function value being discovered. We show that with an initialization set of i = 5, ZoMBI can reliably discover the global minimum region for needles as narrow as 0.05% of the total hypervolume space. Moreover, as the optimum becomes narrower than 0.05% of the total hypervolume, the initialization set is no longer sufficient and ZoMBI gets trapped in local minima, as indicated by the greyed-out region. Conversely, as the optimum becomes wider than 5% of the total hypervolume, the manifold becomes flatter, exposing the greedy nature of ZoMBI, which falsely zooms inward onto less ideal function values than it would for narrower optimum conditions. This experiment quantifies ZoMBI’s Goldilocks zone to be between 0.05% and 5% optimum hypervolume. Therefore, for ideal performance, ZoMBI is best used on datasets where the optimum conditions make up between 0.05% and 5% of the total number of conditions. This optimum hypervolume trade-off of ZoMBI is further assessed relative to other optimization methods in Supplementary Figure 3.

Fig. 6: Varying Optimum Hypervolume.
figure 6

(left) Depiction of decreasing optimum basin of attraction hypervolume in 1D. (right) The Pareto-optimal dataset hyperparameters for usage with the ZoMBI algorithm over 144 analytical datasets with 48 independent trials each: 12 trials for each of the four acquisition functions, LCB, LCB Adaptive, EI, and EI Abrupt, for a total of 6912 independent trials. Each analytical dataset is a permutation of the Ackley function with a different optimum basin of attraction width and manifold dimensionality. Hypervolume percent makeup is synthetically decreased both by decreasing the basin of attraction width and by increasing the manifold dimensionality. Each scatter point represents the median final minimum function evaluation after 1000 experiments across the 48 independent trials initialized with a fixed set of i = 5 samples. The color of each scatter point represents the dimensionality of the manifold tested and the error bars represent the variance across the 48 trials. The possible function values for every dataset vary between [0, 25]; hence, for the Ackley function as further detailed in the Supplementary Information, trials achieving minimum function values < 10 are considered to have found the optimum basin of attraction, while trials with function values ≥ 10 after 1000 experiments are considered to be trapped in local minima. Both the x- and y-axes are in log-scale.

Three real-world datasets are optimized using ZoMBI—each of these datasets has an extreme data imbalance, illustrated in Fig. 7, that falls within the specified ideal range of ZoMBI performance. The 6D Poisson’s Ratio dataset has an imbalance of 0.82% optimum conditions, the 6D Thermoelectric Figure of Merit dataset has an imbalance of roughly 1.32% optimum conditions, and the 11D wildfire detection dataset has an imbalance of 4.16% optimum conditions. This ideal performance range of ZoMBI, between 0.05% and 5% optimum hypervolume, is facilitated by the initialization set. Hence, to improve performance for narrower optima, either the number of initialization samples must be increased or the initialization conditions should be adjusted. Additional initialization-condition experiments for ZoMBI are shown in the Supplementary Information.

Fig. 7: Data Distributions of Real-world Needle-in-a-Haystack Datasets.
figure 7

(top) The histogram distributions of the full real-world datasets with callouts for optimum conditions: a Poisson’s Ratio with 146k materials in the dataset and \({\nu }_{\min }=\{-1.7,-1.2\}\), b Thermoelectric Figure of Merit with 1k materials in the dataset computed by BoltzTraP57 and ZT\({}_{\max }=\{1.4,1.9\}\), c Wildfire Detection with 128k meteorological conditions collected over 33 months from January 2018 to September 2020 from CIMIS62 and ψ < 0 conditions indicating those with a high likelihood of wildfire outbreaks. (bottom) The noisy, non-convex manifold topologies of each dataset generated by a random forest regression with 500 trees. Each manifold is a projected 3D slice of higher dimensional space with the z-axis and colorbar indicating the target property, where a X1 is density and X2 is formation energy, b X1 is formation energy and X2 is band gap, c X1 is evapotranspiration and X2 is precipitation.

The first experimental dataset is 6-dimensional and consists of 146k materials from the publicly available Materials Project database with different mechanical properties, described by Poisson’s Ratio, ν20. Only 0.82% of the total 146k materials have a negative Poisson’s Ratio, ν < 014,15,20,21. Hence, for this experiment, we aim to minimize ν. A positive Poisson’s ratio, ν > 0, describes a material that expands in the direction orthogonal to an applied compressive load54,55. Conversely, a negative ν < 0 describes a material that contracts rather than expands in the orthogonal direction when compressed, denoted as an auxetic material—a rare phenomenon14,23. Auxetic materials with highly negative Poisson’s ratios have energy-absorptive properties that make them ideal for wearable medical devices and protective armor that must absorb the energy of large impacts to keep bones from shifting or to inhibit penetration of the protective layer15,16.
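To make the sign convention concrete, the small sketch below applies the standard definition ν = −ε_transverse / ε_axial to a 1% axial compression; the strain values are illustrative assumptions, and the dataset itself reports ν directly.

```python
# Illustration of the Poisson's ratio sign convention: nu = -eps_transverse / eps_axial.
def transverse_strain(nu, eps_axial):
    return -nu * eps_axial

eps_axial = -0.01                            # 1% axial compression
print(transverse_strain(0.3, eps_axial))     # +0.003: ordinary material bulges outward
print(transverse_strain(-1.7, eps_axial))    # -0.017: auxetic material (nu of Li2NbF6) contracts
```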

Figure 8 demonstrates the optimization performance of ZoMBI on the Poisson’s Ratio dataset compared to MiP-EGO, TuRBO, and HEBO. The ZoMBI algorithm is run with each of the four acquisition functions: LCB, LCB Adaptive, EI, and EI Abrupt. In under 100 evaluated experiments, LCB and LCB Adaptive discover the global minimum NiaH material, Li2NbF6 (ν ≈ − 1.7). The variance of ν values for the final experiment across all ensemble runs is illustrated as a KDE plot for each method to highlight the sampling density and general rate of success. HEBO discovers the global minimum after ZoMBI with LCB and LCB Adaptive; however, the spread of runs for ZoMBI is narrower than that of HEBO, which indicates that, for this problem, ZoMBI more consistently discovers the minimum, which is 3× lower than the minima discovered by MiP-EGO and TuRBO. Furthermore, the rate of convergence on Needle 1 is faster for ZoMBI than for HEBO.

Fig. 8: Discovery of Rare Negative Poisson’s Ratio Materials.
figure 8

The optimization objective is to find the material with the minimum Poisson’s ratio in 100 experiments from the dataset presented in Fig. 7a. The green, blue, red, and orange lines indicate the median best running evaluated sample of ZoMBI using the LCB, LCB Adaptive, EI, and EI Abrupt acquisition functions, respectively. The pink, black, and teal lines indicate the median best running evaluated sample of the methods MiP-EGO, TuRBO, and HEBO respectively. Random sampling is illustrated as a dashed grey line for benchmarking. The median for each method is taken over the best 12 independent model runs. The shaded regions indicate the variance between model runs. The cross-hatched region indicates the space discovered by standard BO methods, without the use of ZoMBI, which use the same hyperparameters. The distribution across all 12 model runs of the final sampled experiment for each method is shown as a kernel density estimation (KDE) along the y-axis. The y-values for the needle-like optima are indicated by dashed black lines.

Figure 7a illustrates the distribution of ν values within the full dataset. The ground truth “needle" materials with the lowest ν values are Li2NbF6 with ν ≈ −1.7 and Na2CO3 with ν ≈ −1.2. ZoMBI with the LCB and LCB Adaptive acquisition functions and HEBO discover Li2NbF6, while ZoMBI with the EI Abrupt acquisition function discovers Na2CO3.

The second experimental dataset is 6-dimensional and consists of 1k materials with different thermal and electrical properties, described by the Thermoelectric Figure of Merit, ZT. Since ZT values are always positive, there is no clear cutoff for what “optimum" conditions are, but with a threshold of ZT > 0.8, 1.32% of the total 1k materials are considered optimum. A higher ZT indicates that the material is better able to convert a thermal gradient into an electrical current56. Hence, for this experiment, we aim to maximize ZT. Unlike Poisson’s Ratio, the Thermoelectric Figure of Merit is determined by a combination of several variables rather than a single variable56:

$${{{\rm{ZT}}}}=\frac{{S}^{2}\sigma }{\kappa }T,$$
(1)

where S is the Seebeck coefficient, σ is the electrical conductivity, T is the average temperature, and κ is the thermal conductivity. The ZT is computed for each material with valid thermal and electrical properties in the Materials Project database using BoltzTraP57. ZT is a common figure of merit used to describe the thermal-to-electrical or electrical-to-thermal conversion efficiency of thermoelectric materials58,59,60,61. Materials with high ZT values have a range of applications, from solid-state cooling devices to sensors that produce an electrical signal when heated17,18,19.
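As a sanity check on Equation (1), the helper below evaluates ZT from the four quantities; the example property values are illustrative placeholders, not entries from the BoltzTraP-computed dataset.

```python
# Direct transcription of Equation (1): ZT = S^2 * sigma * T / kappa (dimensionless).
def figure_of_merit(seebeck_V_per_K, electrical_conductivity_S_per_m,
                    thermal_conductivity_W_per_mK, temperature_K):
    return (seebeck_V_per_K ** 2) * electrical_conductivity_S_per_m * temperature_K \
        / thermal_conductivity_W_per_mK

# e.g., S = 200 uV/K, sigma = 1e5 S/m, kappa = 1.5 W/(m K), T = 300 K  ->  ZT = 0.8
print(figure_of_merit(200e-6, 1e5, 1.5, 300))
```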

Figure 9 demonstrates the optimization performance of ZoMBI on the Thermoelectric Figure of Merit dataset compared to MiP-EGO, TuRBO, and HEBO. In this experiment, although none of the tested methods discover the maximum needle, LCB Adaptive discovers the second-highest needle-in-a-haystack material, Na4Al3Ge3IO12 (ZT ≈ 1.4), in under 100 experiments. None of HEBO, TuRBO, or MiP-EGO is capable of discovering any needle-like ZT optima, and MiP-EGO performs worse than random sampling in this experiment. The wide variance across runs for ZoMBI and HEBO, shown in the KDE plots, indicates that both methods operate relatively exploratively to discover maxima in this topology. Ultimately, this experiment demonstrates that ZoMBI can optimize material objective functions that have a complex combination of variables (Equation (1)) with roughly 2× better performance than HEBO.

Fig. 9: Discovery of Rare Positive Thermoelectric Figure of Merit Materials.
figure 9

The optimization objective is to find the material with the maximum Thermoelectric Figure of Merit in 100 experiments from the dataset presented in Fig. 7b. The green, blue, red, and orange lines indicate the median best running evaluated sample of ZoMBI using the LCB, LCB Adaptive, EI, and EI Abrupt acquisition functions, respectively. The pink, black, and teal lines indicate the median best running evaluated sample of the methods MiP-EGO, TuRBO, and HEBO respectively. Random sampling is illustrated as a dashed grey line for benchmarking. The median for each method is taken over the best 12 independent model runs. The shaded regions indicate the variance between model runs. The cross-hatched region indicates the space discovered by standard BO methods, without the use of ZoMBI, which use the same hyperparameters. The distribution across all 12 model runs of the final sampled experiment for each method is shown as a kernel density estimation (KDE) along the y-axis. The y-values for the needle-like optima are indicated by dashed black lines.

Figure 7b illustrates the distribution of ZT values within the full dataset. The ground truth “needle" materials with the highest ZT values are Sr4Al6SO12 with ZT ≈ 1.9 and Na4Al3Ge3IO12 with ZT ≈ 1.4. ZoMBI with the LCB Adaptive acquisition function is the only method that discovers one of these needles, Na4Al3Ge3IO12.

The third experimental dataset is 11-dimensional and consists of 128k meteorological conditions and an index, ψ, that determines whether the set of conditions has a high likelihood of generating or sustaining a wildfire in the state of California—publicly available from the California Irrigation Management Information System (CIMIS) weather stations62. Only 4.16% of the total 128k meteorological conditions have a negative wildfire detection index, ψ < 0. A highly negative ψ indicates a high risk of wildfires. Hence, for this experiment, we aim to minimize ψ to best detect meteorological conditions at high risk of wildfires. The dataset spans over two years of data collected from 2018 to 2020, during which over 2500 wildfires occurred, burning over 24 million acres of land63. In California, temperature and precipitation alone are poor indicators of wildfire outbreaks (see Supplementary Fig. 1), leading researchers to instead use computer-vision methods or convolutions of many meteorological variables to reliably detect wildfire conditions25,63. Thus, there is a strong need for algorithmic support to aid humans in early wildfire detection.

Figure 10 demonstrates the optimization performance of ZoMBI on the Wildfire Detection dataset compared to MiP-EGO, TuRBO, and HEBO. In this experiment, LCB Adaptive, EI, and HEBO discover the lowest index value, ψ ≈ − 3.5, for detecting wildfires based on a high-dimensional convolution of ten meteorological variables. TuRBO and MiP-EGO also discover a low index value, ψ ≈ − 2.5, however, these methods have widely distributed variances, as shown by the KDE plots, indicating inconsistent optimization results given only 100 sampled experiments. Similarly, HEBO has high variance across model runs while the LCB Adaptive and EI ZoMBI methods have a tight distribution, indicating more reliable optimization results with a higher rate of success. Furthermore, ZoMBI methods achieve a faster rate of convergence than HEBO onto the Needle 1 optimum, similar to the optimization results on the Poisson’s Ratio dataset.

Fig. 10: Detection of Environmental Conditions with Wildfire Risk.
figure 10

The optimization objective is to find the meteorological conditions with the minimum wildfire detection index, ψ, in 100 experiments from the dataset presented in Fig. 7c. Conditions with ψ < 0 have the highest risk of sustaining wildfire. The green, blue, red, and orange lines indicate the median best running evaluated sample of ZoMBI using the LCB, LCB Adaptive, EI, and EI Abrupt acquisition functions, respectively. The pink, black, and teal lines indicate the median best running evaluated sample of the methods MiP-EGO, TuRBO, and HEBO respectively. Random sampling is illustrated as a dashed grey line for benchmarking. The median for each method is taken over the best 12 independent model runs. The shaded regions indicate the variance between model runs. The cross-hatched region indicates the space discovered by standard BO methods, without the use of ZoMBI, which use the same hyperparameters. The distribution across all 12 model runs of the final sampled experiment for each method is shown as a kernel density estimation (KDE) along the y-axis. The y-values for the needle-like optima are indicated by dashed black lines.

Figure 7c illustrates the distribution of ψ values within the full dataset. The ground truth “needle" conditions for detecting wildfires are those with the most negative detection index values, ψ. Although ZoMBI with the LCB Adaptive and EI acquisition functions as well as HEBO discover the lowest needle-like ψ conditions after 100 sampled experiments, none of the tested methods are able to find the global \({\psi }_{\min }\approx -12\). These results imply that, even for ZoMBI, with a narrow enough needle-like optimum, an LHS initialization of i = 5 experiments may not be sufficient. Supplementary Fig. 4 demonstrates that extending the bounds of the LHS initialization improves the performance of ZoMBI on certain manifold topologies.

Discussion

In this paper, we proposed the [Zo]oming [M]emory-[B]ased [I]nitialization (ZoMBI) algorithm that builds on the principles of Bayesian optimization to accelerate the optimization of Needle-in-a-Haystack problems in two ways: first, by requiring fewer experiments than the existing MiP-EGO45, TuRBO39, and HEBO50 methods to reach a better optimum on a variety of real-world applications, and second, by pruning the memory of low-performing historical experiments to speed up compute time. The ZoMBI algorithm converges quickly onto narrow and sharp optima in Needle-in-a-Haystack datasets by (1) using the values of the m best performing previously sampled memory points to iteratively zoom in the search bounds of the manifold uniquely on each dimension and (2) implementing two custom acquisition functions, LCB Adaptive and EI Abrupt, that adapt their hyperparameters to tune sampling of new experimental conditions based on learned information from the surrogate model. The main contributions of this algorithm solve three fundamental challenges of optimizing non-convex Needle-in-a-Haystack problems: (1) the challenge of locating the hypervolume region of the manifold containing the narrow global optimum basin of attraction11,29,30 is alleviated by introducing iterative search bounds based on learned knowledge of the manifold; (2) the challenge of polynomially increasing compute times of BO using a GP surrogate5,6,34,35,36,37,38 is addressed by actively pruning the retained memory of the algorithm after each activation, α, in turn reducing the time complexity from O(n³) to O(ϕ³) for ϕ forward experiments per activation, which trends towards a constant O(1) when α > 1; (3) unwanted pigeonholing into local minima5,6,31,32 is avoided by both the zooming mechanics of ZoMBI and the two acquisition functions developed in this paper, LCB Adaptive and EI Abrupt, which tune their hyperparameters through adaptive learning. By developing the ZoMBI algorithm to solve these challenges, it becomes possible to quickly and efficiently find optimal solutions to complex Needle-in-a-Haystack problems in fewer experiments.

Solving a Needle-in-a-Haystack problem that arises from extremely imbalanced data is a significant challenge that has important implications in science and engineering, especially within the field of materials science10,29. In this paper, we use ZoMBI to discover the optimum materials in two real-world materials science Needle-in-a-Haystack datasets where only a small fraction of the entire search space consists of the target optimum conditions. For breadth, we also extend our analysis to a third real-world dataset for ecological resource management with the objective of discovering the environmental conditions that have a high likelihood of sustaining wildfires for early detection. In the first materials dataset, we discover a material with a highly negative Poisson’s ratio, ν20,21; in the second materials dataset, we discover a material with a highly positive thermoelectric figure of merit, ZT20,57, both rare material properties; and in the third dataset for ecological resource management, we discover a set of environmental conditions with a highly negative wildfire detection index, ψ25,62,63. For the first dataset, both the ZoMBI algorithm with the LCB and LCB Adaptive custom acquisition functions and HEBO50 discover the material with the minimum ν ≈ −1.7; however, the ZoMBI methods converge on this minimum in only 70 experiments while HEBO takes 90 experiments. TuRBO39 and MiP-EGO45 only discover materials with ν ≈ − 0.55 and ν ≈ − 0.20, respectively. For the second dataset, the ZoMBI algorithm with the LCB Adaptive custom acquisition function discovers the material with the maximum ZT ≈ 1.4, while HEBO50, TuRBO39, and MiP-EGO45 only discover ZT ≈ 0.78, ZT ≈ 0.65, and ZT ≈ 0.45, respectively. For the third dataset, the ZoMBI algorithm with all acquisition functions and HEBO50 discover a minimum ψ ≈ − 3, while TuRBO39 and MiP-EGO45 both only discover ψ ≈ − 2. However, the ZoMBI methods converge on the minimum faster and with less variance. In general, we note that HEBO50 outperforms the other benchmark methods, TuRBO39 and MiP-EGO45. Thus, for future investigation, we believe the performance of ZoMBI may be further improved by running optimization within the latent space of a variational autoencoder, similar to HEBO64,65. Overall, these results demonstrate that the ZoMBI algorithm is better suited to tackle various real-world Needle-in-a-Haystack optimization problems than current methods; however, ZoMBI has performance limitations for extremely narrow optima when instantiated with an insufficient initialization set. Therefore, to assess these limitations, we stress-tested ZoMBI on an additional 174 analytical datasets with varying optimum needle widths, optimum distance to edges, dimensionality, and initialization conditions. These results show that, with a fixed initialization set of 5 samples, ZoMBI performs best on datasets with needle-like optima consisting of between 0.05% and 5% of the total hypervolume space. Furthermore, by extending the range of the initialization set, ZoMBI is capable of discovering global minima that lie on the absolute edge of a manifold’s limits. Thus, in certain cases, convergence to a global optimum using ZoMBI is not guaranteed, but with slight modifications based on some a priori domain knowledge of the optimization landscape, ZoMBI produces high-performance and low-variance results.

Ultimately, the significance of developing the ZoMBI algorithm is to quickly and efficiently tackle difficult Needle-in-a-Haystack optimization problems in extremely imbalanced datasets. In this paper, we showcased the ability of the developed algorithm to discover rare materials and conditions with highly-optimized properties in a short period of time using few experiments. Discovering rare materials quickly and efficiently enables widespread access to a new range of materials applications, from engineering high-performance medical devices to ubiquitous solid-state cooling systems10,15,16,17,18,19. However, the application space for ZoMBI to accelerate the efficient discovery of highly-optimized solutions extends past materials science and is generally applicable to many Needle-in-a-Haystack problems, including those found in ecological resource management24,25, fraud detection26,27, and rare disease prediction27,28. We aim for this contribution to support the elimination of the time and resource barriers previously inhibiting the throughput of optimizing complex and challenging Needle-in-a-Haystack problems across a broad range of application spaces.

Methods

In this paper, we develop two major contributions: (1) the ZoMBI algorithm and (2) adaptive learning acquisition functions. Through the combination of these two contributions, the optimum region of a NiaH manifold can be quickly discovered in fewer experiments without pigeonholing into local minima. Thus, the three challenges of optimizing NiaH problems are addressed: (1) the challenge of finding a hypervolume within the manifold that contains the needle-like optimum11,29,30, (2) the challenge of the polynomially increasing compute times of BO using a GP surrogate5,6,35,36,37,38, and (3) the challenge of avoiding pigeonholing into local minima1,9,31,32. We demonstrate the implementation of ZoMBI on a 6D analytical Ackley function, a 6D dataset of materials with Poisson’s ratios, a 6D dataset of thermoelectric materials, and an 11D dataset for wildfire detection, all of which exhibit an extreme data imbalance and a NiaH regime, and compare the performance to that of MiP-EGO45, TuRBO39, and HEBO50. For each problem, the objective is to find the target value, y, with either the lowest or highest value depending on whether the problem is minimization or maximization. This optimum y-value resembles a needle for each problem because it is located within a narrow and steep basin of attraction. Precisely, the needle optimum for each problem has a value of y = 0 for the Ackley function (minimization), y = −1.7 for the Poisson’s ratio dataset (minimization), y = 1.9 for the thermoelectric figure of merit dataset (maximization), and y = −12 for the wildfire detection dataset (minimization). To extend the applicability of the ZoMBI optimization performance results to a wider array of applications, additional stress tests are conducted on 174 analytical datasets. First, a set of 144 analytical datasets is optimized to assess the failure and success conditions of ZoMBI on problems with extremely narrow optima and few initialization data points. Then, in the Supplementary Information, a set of 30 analytical datasets is optimized to assess the failure and success conditions of ZoMBI on problems with insufficient initialization data and cases where the global optimum is near the edge of the manifold.

The ZoMBI algorithm has two key features: (1) iterative inward bounding of successive search spaces using the m best-performing memory points within the prior search space and (2) iterative pruning of low-performing historical search space memory. The newly computed search space bounds are unique for each dimension, such that the optimum basin of attraction of complex, non-convex NiaH manifolds can be discovered. The algorithm leverages these two key features to guide the acquisition of new data towards more optimal regions while only fitting the surrogate within the suggested optimum region to resolve more detail of the space of interest, as shown in Figs. 2 and 3. This process subsequently reduces the compute time significantly compared to computing a GP over all historical data in a standard BO procedure, as shown in Fig. 4.

Algorithm 1

Zooming Memory-Based Initialization (ZoMBI)

We define m as the number of retained memory points during an activation of ZoMBI. The m memory points are saved while all other data are erased from memory. These are the historical data points that achieve the m lowest (for minimization) target values, y, and they are used to zoom in the search bounds. Using these memory points, the multi-dimensional upper and lower bounds of the zoomed search space are computed for each dimension, d. Let X ≔ {X1, X2, …, Xn} be a set of data points, where \({X}_{j}\in {{\mathbb{R}}}^{d}\). Let \(f:{{\mathbb{R}}}^{d}\to {\mathbb{R}}\) be the objective function. We first assume that the points in X are in general position so that f(X) contains unique elements. Then, for each m ≤ n, define X(m) = {Xπ(1), …, Xπ(m)}, where π is a permutation on {1, …, n} such that {f(Xπ(j))} is in ascending order. If f(X) contains repeated elements, we may first remove the points with repeated f values and apply the definition above. Then, for each d, the bounds are defined as:

$$\begin{array}{l}{{{{\mathcal{B}}}}}_{d}^{l}\,=\,\mathop{\min }\limits_{X\in {{{{\bf{X}}}}}^{(m)}}\{{X}_{d}\}\\ {{{{\mathcal{B}}}}}_{d}^{u}\,=\,\mathop{\max }\limits_{X\in {{{{\bf{X}}}}}^{(m)}}\{{X}_{d}\},\end{array}$$
(2)

where \({{{{\mathcal{B}}}}}_{d}^{l}\) and \({{{{\mathcal{B}}}}}_{d}^{u}\) are the computed lower and upper bounds for each dimension, d, respectively. The bounds \([{{{{\mathcal{B}}}}}_{d}^{l},{{{{\mathcal{B}}}}}_{d}^{u}]\) constrain the proceeding acquisition of new data as well as the computation of a GP, such that sampling cannot occur outside of the bounded region. This constraining process operates independently for each dimension, such that each dimension has a unique lower and upper bound. To initialize the algorithm with data from the constrained space, i data points are sampled from the bounded region using Latin Hypercube Sampling (LHS). LHS splits a d-dimensional space into i*d equally spaced strata, where i is the number of points to sample uniformly over d dimensions with low variability, unlike random sampling, which has high sampling variability66. A GP surrogate model is trained on these i LHS points sampled from the constrained space and then retrained for every subsequent experiment sampled from the space, denoted as a forward experiment. Thus, the GP is only trained on information within the constrained region, and as the constrained region iteratively zooms inward and decreases in hypervolume, so does the region computed by the GP. This process allows for more information to be resolved within regions plausibly containing the global optimum basin of attraction. Up to ϕ forward experiments are sampled in serial, where {Xi} ∪ {Xϕ} ⊆ {Xn}. These forward experiments are sampled by maximizing an acquisition value, a ∈ [0, 1], computed by a user-selected acquisition function from one of the four functions EI, EI Abrupt, LCB, and LCB Adaptive, described in the Methods. Once i + ϕ experiments are sampled, the bounds are re-constrained using the m best-performing experiments, i new experiments are sampled from the zoomed-in space using LHS, and then the memory is pruned. The process of collecting ϕ forward experiments is repeated. A complete constraining-resetting iteration is denoted as an activation, α. This iterative zooming and pruning process over several α significantly speeds up compute time. Implementation of ZoMBI is shown in Algorithm 1.
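The sketch below is one plausible reading of Algorithm 1 using off-the-shelf scipy/scikit-learn components, not the authors' released implementation: each activation draws i LHS samples inside the current bounds, acquires ϕ forward experiments with a simple LCB-style rule, then zooms the bounds on the m best memory points (Equation (2)) and prunes the rest. The objective, candidate-sampling scheme, and hyperparameter values are placeholders.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(X):                                      # placeholder needle-like objective
    return np.linalg.norm(X - 0.8, axis=1)

def zombi(dims=4, activations=3, i=5, phi=10, m=10, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.zeros(dims), np.ones(dims)       # initial search bounds
    mem_X, mem_y = np.empty((0, dims)), np.empty(0)    # retained memory across activations
    for alpha in range(activations):
        # initialize each activation with i LHS samples drawn inside the current bounds
        lhs = qmc.LatinHypercube(d=dims, seed=seed + alpha).random(n=i)
        X_act = qmc.scale(lhs, lower, upper)
        y_act = objective(X_act)
        for _ in range(phi):                           # phi forward experiments per activation
            gp = GaussianProcessRegressor().fit(X_act, y_act)   # GP sees only i..i+phi points
            cand = rng.uniform(lower, upper, size=(512, dims))  # candidates within the bounds
            mu, sigma = gp.predict(cand, return_std=True)
            x_next = cand[np.argmin(mu - beta * sigma)]         # LCB-style pick (minimization)
            X_act = np.vstack([X_act, x_next])
            y_act = np.append(y_act, objective(x_next[None]))
        mem_X = np.vstack([mem_X, X_act]); mem_y = np.append(mem_y, y_act)
        best = np.argsort(mem_y)[:m]                   # m best-performing memory points
        lower, upper = mem_X[best].min(axis=0), mem_X[best].max(axis=0)  # zoom bounds (Eq. 2)
        mem_X, mem_y = mem_X[best], mem_y[best]        # prune all other memory
    return mem_X[np.argmin(mem_y)], mem_y.min()

print(zombi())
```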

Traditional BO acquisition functions, such as EI67 and LCB68, use the computed means and variances from a surrogate model to compute an acquisition value; maximizing this acquisition value guides sampling of the manifold7,12,33. However, these traditional acquisition functions are static, such that they do not actively use any information about the performance of previously sampled experiments to guide sampling. Hence, we implement an adaptive learning approach into the acquisition functions to develop two functions, EI Abrupt and LCB Adaptive, that dynamically adapt their sampling based on the quantity and quality of previously sampled experiments. In contrast to a static acquisition function, these adaptive acquisition functions are initialized with an initial set of hyperparameter values to guide their search but then tune these values as sampling progresses. The developed EI Abrupt and LCB Adaptive functions are used within the ZoMBI framework to further accelerate optimization and avoid pigeonholing, see line 9 of Algorithm 1.

LCB Adaptive builds on previous work that also tunes sampling based on the number of experiments collected, n69,70,71. In this paper, we design LCB Adaptive to tune its hyperparameters to become less explorative as more samples are collected. For example, as n increases, LCB Adaptive decays its exploration weight, ϵⁿβ, to become less explorative and more exploitative. Specifically, this information feedback received by the function determines the contribution of both μ(X) and σ(X) to the acquisition value, a. Similar to EI Abrupt, LCB Adaptive computes an acquisition value, a ∈ [0, 1], for a given X, wherein the X with the highest a is selected by the acquisition function as the next suggested experiment to measure. LCB Adaptive is implemented for a minimization problem as:

$${a}_{{{{\rm{LCB}}}}{{{\rm{Adaptive}}}}}(X,n;\beta ,\epsilon )=\mu (X)-{\epsilon }^{n}\beta \sigma (X),$$
(3)

where n is the number of experiments sampled, and β = 3 and ϵ = 0.9 are hand-tuned initialization hyperparameters selected based on a priori domain knowledge of the function’s performance on a variety of different problems. Having a large β and an ϵ close to 1 supports a gradual decay from very explorative to very exploitative, rather than a rapid decay. The dynamic EI Abrupt and LCB Adaptive are shown to both discover optima faster and avoid pigeonholing into local minima better than their static counterparts by actively balancing the ratio of exploitation to exploration using learned information about the quality and quantity of previously sampled experiments.
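A minimal sketch of Equation (3) is shown below; μ and σ are assumed to be posterior means and standard deviations from the current GP over candidate points, and the selection convention in the example (taking the lowest value for minimization) is a simplification of the paper's normalized, maximized acquisition values.

```python
import numpy as np

def lcb_adaptive(mu, sigma, n, beta=3.0, epsilon=0.9):
    """Adaptive lower confidence bound value for each candidate point (Equation (3))."""
    # The decaying epsilon**n factor shifts weight from the sigma (exploration) term
    # toward the mu (exploitation) term as more experiments are sampled.
    return mu - (epsilon ** n) * beta * sigma

mu = np.array([0.3, 0.5, 0.1]); sigma = np.array([0.2, 0.6, 0.05])
for n in [1, 20, 100]:
    # early on the high-variance candidate (index 1) wins; later the low-mean one (index 2)
    print(n, np.argmin(lcb_adaptive(mu, sigma, n)))
```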

EI Abrupt is an acquisition function that flips between the exploitative EI67 and explorative LCB68 acquisition functions based on the computed finite differences of recently evaluated experiments. For example, if the evaluated experiment y-values plateau for three or more experiments in a row, EI Abrupt will abruptly switch from a greedy sampling policy to a more explorative sampling policy. Specifically, this information feedback received by the function determines if the current round of sampling should exploit the surrogate mean values, μ(X), or explore the surrogate variances, σ(X). EI Abrupt computes an acquisition value, a ∈ [0, 1], for a given X, wherein the X with the highest a is selected by the acquisition function as the next suggested experiment to measure. EI Abrupt is implemented for a minimization problem as:

$$\begin{array}{l}{a}_{{{{\rm{EI}}}}{{{\rm{Abrupt}}}}}(X,y;\beta ,\xi ,\eta )\,=\,\left\{\begin{array}{ll}\left(\mu (X)-{y}^{* }-\xi \right)\,\Phi \,(Z)+\sigma (X)\psi (Z),& {{\rm{if}}}\,\,| \Delta \{{y}_{n-3...n}\}| \le \eta \\ \mu (X)-\beta \sigma (X),& {{\rm{otherwise}}}\,\end{array}\right.\\ \qquad \qquad \qquad \qquad\,\,\, Z\,=\,\dfrac{\mu (X)\,-\,{y}^{* }\,-\,\xi }{\sigma (X)},\end{array}$$
(4)

where y* is the lowest measured target value thus far (i.e., the running minimum), Φ( ⋅ ) is the cumulative density function of the normal distribution, ψ( ⋅ ) is the probability density function of the normal distribution, and ∣Δ{yn−3...n}∣ is the absolute value of the finite differences of the set of target values of the last three sampled experiments. Moreover, β = 0.1, ξ = 0.1, and η = 0 are hand-tuned initialization hyperparameters used for the rest of the paper for EI Abrupt. Additionally, for standard LCB and EI, the hyperparameters β = 1 and ξ = 0.1 are used, respectively. These hyperparameters were selected based on a priori domain knowledge of EI Abrupt performance on a variety of different problems. The most important hyperparameter for efficient sampling is β, whose ideal value is non-obvious, but it is found that β = 0.1 allows EI Abrupt to switch into an explorative sampling policy while still having a strong weight on the surrogate means, implying that exploration does not veer far.
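The sketch below transcribes Equation (4) directly; y_history is assumed to hold previously evaluated targets with the most recent last, and the plateau check compares each finite difference of the last three targets against η, which is one plausible reading of the condition.

```python
import numpy as np
from scipy.stats import norm

def ei_abrupt(mu, sigma, y_history, beta=0.1, xi=0.1, eta=0.0):
    """EI Abrupt acquisition values over candidate points (Equation (4))."""
    y_star = np.min(y_history)                      # running minimum (best observed target)
    deltas = np.diff(y_history[-3:])                # finite differences of the last 3 targets
    if np.all(np.abs(deltas) <= eta):               # plateau detected: EI-style branch
        z = (mu - y_star - xi) / sigma
        return (mu - y_star - xi) * norm.cdf(z) + sigma * norm.pdf(z)
    return mu - beta * sigma                        # otherwise: LCB-style branch

mu = np.array([0.3, 0.5, 0.1]); sigma = np.array([0.2, 0.6, 0.05])
print(ei_abrupt(mu, sigma, y_history=[0.9, 0.9, 0.9]))   # plateaued history -> EI branch
print(ei_abrupt(mu, sigma, y_history=[0.9, 0.6, 0.3]))   # improving history -> LCB branch
```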