
1 Introduction

Many scientific and engineering problems involve determining the maximum value of a function f over a region \(\varOmega \). However, some applications instead require identifying a large subregion of \(\varOmega \) where the function under consideration exceeds a given threshold t. This problem of super-level set estimation arises naturally in the context of safety control, signal coverage, and environmental monitoring [1].

Formally, we consider the problem of finding the region where a function is above some threshold t with probability at least \(\delta \):

$$\begin{aligned} \{\mathbf x \in \varOmega \mid P \left( f(\mathbf x)>t \right) >\delta \}. \end{aligned}$$
(1)

We assume that we only have access to a noise-corrupted version of the function and that function evaluations are costly. In order to fully specify the set in Eq. (1), a probabilistic model P for the function must be assumed. In this work, we use Gaussian processes [2], a standard model that directly provides confidence intervals and can easily incorporate new information from the samples.

The problem is closely related to Bayesian optimization, but the associated techniques are not directly transferable to the problem of level set estimation. A major issue is that it is unclear whether one should focus on identifying points around the threshold for better separation, or aim at points far from the threshold to accelerate the discovery of regions of interest. Similar to other exploration-exploitation trade-offs in active learning, we address both goals by proposing Robust Maximum Improvement for Level-set Estimation (RMILE), an algorithm that maximizes the expected volume of the domain where the function exceeds the threshold with high probability, robustified by an exploration-driven variance term.

Furthermore, we discuss a criterion for establishing convergence of a generic acquisition function on finite grids that is robust to misspecification of the models. In particular, we show how this criterion applies to our algorithm.

Related Work. Some relevant techniques for addressing costly function evaluations are found in the field of Bayesian optimization, which has received growing attention in recent years [3]. Bayesian optimization aims at finding the maximum or minimum of a black-box function by repeatedly updating prior beliefs over the function in a Bayesian fashion. This framework is particularly well suited for the global optimization of costly functions because it makes effective use of all the information (i.e., the samples) acquired during the process by carefully selecting query points according to an acquisition function. Examples of acquisition functions include the probability of improvement [4], expected improvement [5], the Gaussian process upper confidence bound [6], and information-based policies [7].

In the literature of level set estimation, when level sets are estimated from an existing dataset, several approaches are available [8,9,10]. In the context of active learning, a topic of growing interest [11,12,13], one technique is known as the Straddle heuristic [14], where the expected value and variance given by a Gaussian process are combined to characterize the uncertainty at each candidate point, based on which the next query point is chosen. A refinement of the Straddle heuristic was suggested as the LSE algorithm, which is an online classification method based on confidence intervals with some information-theoretic convergence guarantees [1]. This idea is further developed in [15], with theoretical guarantees that offer a unifying framework of Bayesian optimization and level set estimation. In a different direction, a Gaussian process-based algorithm addressing time-varying level set estimation was proposed [16], where a global expected error estimate is adopted as the acquisition function.

A problem similar to ours was considered by [17], where the authors partition the space using a predefined coarse grid and the goal is to find sub-regions with an average “score” above some threshold. The computation of the scores relies on Bayesian quadrature, and no theoretical guarantee on the algorithm is provided. This approach was later extended to find regions matching general patterns not restricted to the excess of some threshold [18]. Although some preliminary analysis of the exploration-exploitation trade-off is given, there is no discussion of the limit behavior of the algorithm.

In some applications, one is directly concerned with the online estimation of the volume of the super-level set with some given threshold [19, 20]. However, existing methods are still focused on uncertainty reduction mechanisms similar to the Straddle heuristic.

This paper is structured as follows. Section 2 provides the necessary preliminaries for our work. Section 3 discusses the relation between super-level set estimation and binary classification, and proposes the RMILE algorithm. The asymptotic behavior of RMILE is discussed in Sect. 4, followed by various numerical experiments in Sect. 5 and some conclusive remarks in Sect. 6.

2 Preliminaries

We assume that we only have noisy measurements of the function \(f(\mathbf x)\). In other words, we have access to \(f(\mathbf x) + \epsilon \), where the noise \(\epsilon \) is normally distributed, \(\epsilon \sim \mathcal {N}(0,\sigma ^2_{\epsilon })\), and is independent of the sampling location and the function value. We consider a discrete domain \(\varOmega \) with finitely many points where we would like to classify the function as either above the threshold or below it, with some degree of confidence. We assume that we are allowed to query the function at only a few points with noisy evaluations, in which case the unseen regions of the domain can only be classified with some probability, for example, \(f(\mathbf x)>t\) with probability at least \(\delta \). In the design of our algorithm, we assume that the function \(f(\mathbf x)\) is a sample from a Gaussian process (GP). However, our asymptotic analysis in Sect. 4 is model-independent, i.e., it holds without any additional probabilistic assumptions on the function measurements.

2.1 Gaussian Processes

A GP \(\{f(\mathbf x)\;|\;\mathbf x\in \varOmega \}\) is a collection of random variables, any finite subset of which is distributed according to a multivariate Gaussian specified by the mean function \(\mu (\mathbf x)\) and the kernel \(k(\mathbf x,\mathbf x')\). Suppose that we have a prior mean \(\mu _0(\mathbf x)\) and kernel \(k_0(\mathbf x,\mathbf x')\) for the GP, and n (noisy) measurements \(\{(\mathbf x_i,y_i)\}_{i=1}^n\), where \(y_i=f(\mathbf x_i)+\epsilon _i\) and \(\epsilon _i\sim N(0,\sigma _{\epsilon }^2)\) for \(i=1,\dots ,n\). The posterior of \(\{f(\mathbf x)\;|\;\mathbf x\in \varOmega \}\) is still a GP, and its mean and kernel functions can be computed analytically as follows:

$$\begin{aligned} \begin{aligned}&\mu _n(\mathbf x)=\mu _{\mathbf x_{1:n},y_{1:n}}({\mathbf x})=\mu _0(\mathbf x)+k_n(\mathbf x)^T(K_n+\sigma _{\epsilon }^2I)^{-1}(y_{1:n}-\mu _0(\mathbf x_{1:n})),\\&k_n(\mathbf x,\mathbf x')=k_{\mathbf x_{1:n}}({\mathbf x},{\mathbf x'})=k_0(\mathbf x,\mathbf x')-k_n(\mathbf x)^T(K_n+\sigma _{\epsilon }^2I)^{-1}k_n(\mathbf x'), \end{aligned} \end{aligned}$$
(2)

where \(k_n(\mathbf x)=[k_0(\mathbf x,\mathbf x_1),\dots ,k_0(\mathbf x,\mathbf x_n)]^T\), \(K_n=[k_0(\mathbf x_i,\mathbf x_j)]_{i,j=1}^n\). In particular, the posterior variance at \(\mathbf x\) is \(\sigma _n^2(\mathbf x)=k_n(\mathbf x,\mathbf x)\). Intuitively, as the number of measurements increases, the actual f will be gradually revealed.
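As a concrete illustration of Eq. (2), the following minimal NumPy sketch computes the posterior mean and covariance on a finite grid; the function and argument names (gp_posterior, mu0, k0) are illustrative placeholders rather than anything prescribed by the paper.

```python
import numpy as np

def gp_posterior(X_grid, X_obs, y_obs, mu0, k0, sigma_eps):
    """Posterior mean and covariance of the GP on X_grid given n noisy samples (Eq. 2).

    X_grid: (m, d) grid points; X_obs: (n, d) sampled locations; y_obs: (n,) noisy
    observations; mu0: prior mean function; k0: prior kernel, k0(A, B) -> (|A|, |B|)
    matrix; sigma_eps: noise standard deviation.
    """
    K_n = k0(X_obs, X_obs) + sigma_eps**2 * np.eye(len(X_obs))  # K_n + sigma_eps^2 I
    k_xn = k0(X_grid, X_obs)                                    # rows are k_n(x)^T
    alpha = np.linalg.solve(K_n, y_obs - mu0(X_obs))
    mu_post = mu0(X_grid) + k_xn @ alpha                        # posterior mean mu_n(x)
    K_post = k0(X_grid, X_grid) - k_xn @ np.linalg.solve(K_n, k_xn.T)  # posterior kernel
    return mu_post, K_post                                      # diag(K_post) = sigma_n^2(x)
```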

2.2 Framework for Super-Level Set Estimation

Algorithm 1 is the conceptual framework of the algorithms for level set estimation, which is adopted in most, if not all, of the related literature. Here the last step in Algorithm 1 follows Eq. (1), in which \(P_{GP}\) is the probability measure defined according to the posterior Gaussian process (i.e., conditioned on the filtration \(\{(\mathbf x_i,y_i)\}_{i=1,...,n}\)). The estimated super-level set is denoted \(I_{GP}\). We remark that one can also decide the membership of the estimated level set in an online fashion, as is done in [1, 15].

[Algorithm 1: level set estimation framework; pseudocode figure omitted]
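Since the pseudocode figure is not reproduced here, the following hedged Python sketch captures the framework described above: repeatedly select a query point with an acquisition function, observe a noisy evaluation, update the GP, and finally classify the grid points according to Eq. (1). The interfaces (acquisition, gp_update, gp.predict) are illustrative assumptions, not part of the paper.

```python
from scipy.stats import norm

def level_set_framework(grid, f_noisy, acquisition, gp_update, gp_prior, t, delta, budget):
    """Conceptual loop of Algorithm 1 (sketch): query, update the posterior, classify."""
    gp = gp_prior
    for _ in range(budget):
        x_next = acquisition(gp, grid)       # choose the next sampling location
        y_next = f_noisy(x_next)             # costly, noisy function evaluation
        gp = gp_update(gp, x_next, y_next)   # condition the GP on the new sample
    mu, sigma = gp.predict(grid)             # posterior mean and std on the grid
    prob_above = norm.cdf((mu - t) / sigma)  # P_GP(f(x) > t) under the Gaussian posterior
    return grid[prob_above > delta]          # estimated super-level set I_GP, Eq. (1)
```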

2.3 Notation and Assumptions

At timestep n, we have observed \(\{(\mathbf x_i,y_i) \}_{i=1,...,n}\), where \(\mathbf x_i\) is the ith sampling location and \(y_i\) is the resulting noisy observation. We denote with the subscript GP (e.g., \(\mu _{GP}\), \(\sigma _{GP}\) and \(k_{GP}\)) the quantities conditioned on the filtration \(\{(\mathbf x_i,y_i) \}_{i=1,...,n}\). We use the subscript \({GP^+}\) to denote such quantities still conditioned on the filtration \(\{(\mathbf x_i,y_i) \}_{i=1,...,n}\) and additionally on the sampling location denoted \(\mathbf x^+\). Notice that, while \(\mu _{GP}(\mathbf x)\) is a deterministic quantity that can be computed with the predictive Eq. (2), \(\mu _{GP^+}(\mathbf x)\) depends on the random outcome \(y^+\) at \(\mathbf x^+\) and is therefore a random variable.

Unless otherwise specified, we always restrict ourselves to a finite fixed grid \({\mathbf z_1},\dots ,{\mathbf z_m}\in \varOmega \) as the set of all candidate sampling locations in \(\varOmega \). Here \(\mathbf {z}_{1:m}\) are all distinct. We will then slightly abuse notation to use \(\varOmega \) to denote the set of grid points \(\mathbf {z}_{1:m}\), with \(|\varOmega |=m\). We also assume without loss of generality that the prior kernel \(k_0\) is positive definite.

3 Super-Level Set Estimation

This section begins with some remarks on the relation between super-level set estimation and binary classification, and then describes our RMILE algorithm.

3.1 Relation to Binary Classification

A GP uniquely specifies a probability distribution for the unseen examples \(\mathbf x \in \varOmega \) and can be used to infer the region where the threshold is exceeded with probability at least \(\delta \). At any point \(\mathbf x\), the posterior distribution of the random variable \(f(\mathbf x)\) is still normal [2]. Let \(\mu _{GP}(\mathbf x)\) and \(\sigma ^2_{GP}(\mathbf x)\) denote its posterior mean and variance, respectively. Then the condition \(P(f(\mathbf x)>t)>\delta \) in Eq. (1) can be reformulated as follows:

$$\begin{aligned} \mu _{GP}(\mathbf x) - \beta \sigma _{GP}(\mathbf x) >t, \end{aligned}$$
(3)

where \(\beta \) is a fixed coefficient that depends on \(\delta \), namely \(\beta = \varPhi ^{-1}(\delta )\), the \(\delta \)-quantile of the standard normal distribution; for example, \(\delta = 97.5\%\) gives \(\beta \approx 1.96\). The user is free to select \(\delta \) according to the application. If human safety is at stake, \(\delta \) should be sufficiently high to avoid misclassification, although this will also result in a smaller \(I_{GP}\).
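As a quick numerical check of the relation between \(\delta \) and \(\beta \), one can evaluate the inverse normal CDF (the choice of \(\delta \) values below is purely illustrative):

```python
from scipy.stats import norm

for delta in (0.90, 0.95, 0.975, 0.99):
    # Rule (3) classifies x as above the threshold iff mu - beta * sigma > t,
    # where beta is the delta-quantile of the standard normal distribution.
    print(f"delta = {delta:.3f}  ->  beta = {norm.ppf(delta):.3f}")
# delta = 0.975 gives beta ~ 1.960, matching the value quoted in the text.
```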

Define FP as the set of false positives, that is, the points \(\mathbf x\) such that \(f(\mathbf x) \le t\) but which are classified as belonging to \(I_{GP}\), and FN as the set of false negatives, that is, the points \(\mathbf x\) such that \(f(\mathbf x) > t\) but which are classified as not belonging to \(I_{GP}\). The following is a straightforward observation:

Lemma 1

The classification rule identified by Eq. (3) minimizes the expected weighted misclassification error:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left( \delta \mathbbm {1}\{\mathbf x \in \text {FP}\} + (1-\delta ) \mathbbm {1}\{\mathbf x \in \text {FN}\} \right) , \end{aligned}$$
(4)

among all deterministic classification rules under the posterior probability measure given by the Gaussian processes. Here, \(\mathbf x\in \varOmega \) is arbitrary and fixed.

In the above expression, the expectation is conditioned on the filtration, i.e., on the observed samples \(\{(\mathbf x_i,y_i)\}_{i=1,...,n}\). We can see that rule (3) penalizes false positives much more than false negatives when \(\delta \) is close to 1. As a result, it is relatively conservative in including points in the estimated super-level set, and thus balances our “radical” acquisition function to be introduced below.
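A one-line sketch of why Lemma 1 holds: writing \(p = P_{GP}(f(\mathbf x)>t)\), the two possible deterministic decisions at a fixed \(\mathbf x\) incur expected losses

$$\begin{aligned} \mathbf x \in I_{GP}:\;\; \delta \,(1-p), \qquad \qquad \mathbf x \notin I_{GP}:\;\; (1-\delta )\, p, \end{aligned}$$

and the first is smaller exactly when \(p > \delta \), which is rule (3).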

3.2 Robust Maximum Improvement in Level-Set Estimation

Our idea is to develop a method that aims to find the largest possible area where the function exceeds a given threshold with high probability.

Let \(I_{GP}\) be the set of points currently classified as above the threshold using the posterior GP. If we could sample at an arbitrary \(\mathbf x^+\) and incorporate the feedback \(y(\mathbf x^+) = f(\mathbf x^+) + \epsilon \) into the GP, then we would obtain a new Gaussian process, GP\(^+\), which is a function of the (random) outcome \(y(\mathbf x^+)\) and the sampling location \(\mathbf x^+\). Then GP\(^+\) can be used to infer an updated classification \(I_{GP^+}\). We thus consider maximizing the volume of \(I_{GP^+}\), i.e., \(|I_{GP^+}| := \sum _{\mathbf x \in \varOmega } \mathbbm {1} \left\{ P_{GP^+}\left( f(\mathbf x)>t \right) > \delta \right\} \) to find the “optimal” sampling location. Equivalently, we would like to find the point \(\mathbf x^+\) that would yield the maximum improvement \(|I_{GP^+}| -|I_{GP}| \) in expectation among all candidate sampling locations. Formally, the next sampling point is chosen as the solution to:

$$\begin{aligned} \mathop {\mathrm {arg\,max}}\limits _{\mathbf x^+\in \varOmega } {{\,\mathrm{\mathbb {E}}\,}}_{y^+} \left| I_{GP^+}\right| - \left| I_{GP}\right| , \end{aligned}$$
(5)

where the expectation is taken with respect to the random outcome \(y^+\) (which is shorthand for \(y(\mathbf x^+)\)) resulting from sampling at \(\mathbf x^+\), and is conditioned on the filtration \(\{(\mathbf x_i,y_i)\}_{i=1,...,n}\). This criterion is similar to the expected improvement developed in the context of Bayesian optimization [5], and is also closely related to the criteria of [17, 18]. However, their frameworks differ from ours, and they all suffer from a potential lack of exploration. In particular, [17, 18] focus on region-level detection instead of the point-wise detection that we consider here. Although Eq. (5) is a special case of the acquisition function proposed in [18] when a single point is chosen as the “linear functional” there, their argument for exploration inside each region becomes vacuous, as each region then contains only a single point. As a result, no convergence guarantees have been established for these algorithms, despite their empirical success on certain problems. In particular, it remains an open issue how an intrinsic exploration strategy can be included. To this end, we modify criterion (5) by introducing a trade-off with the posterior variance, which we prove ensures a certain asymptotic convergence, as discussed below.

While Eq. (5) defines a reasonable acquisition function that seeks improvement in the discovery of points lying in the super-level set with high probability, it may suffer from potential model misspecification and lack of exploration. Consider an extreme case when all the points are classified as above the threshold by the chosen prior at the beginning. Then Eq. (5) may lead the procedure to stall and repeatedly sample at locations with the largest function values to maintain the largest super-level set specified by the prior. To remedy this issue, we modify the criterion in Eq. (5) so that the algorithm cannot “get stuck” indefinitely. To achieve this goal, we guarantee a minimum positive exploration bonus everywhere by introducing a marginal variance term. Let \(|I_{GP}^\epsilon | = \sum _{\mathbf x \in \varOmega } \mathbbm {1} \left\{ P_{GP}\left( f(\mathbf x)>t - \epsilon \right) > \delta \right\} \) for \(\epsilon > 0\), which is essentially a shift of the threshold from t to \(t-\epsilon \) (the shift is mostly for technical reasons to simplify the analysis). Our final acquisition function is then defined as:

$$\begin{aligned} E_{GP}(\mathbf x^+) = \max \left\{ {{\,\mathrm{\mathbb {E}}\,}}_{y^+} \left| I_{GP^+}\right| - \left| I_{GP}^\epsilon \right| , \gamma \sigma _{GP}(\mathbf x^+)\right\} , \end{aligned}$$
(6)

for some small constants \(\epsilon> 0, \gamma > 0\). Intuitively, the additional variance term ensures that the algorithm moves to a region with a higher variance when the expected improvement is sufficiently reduced at the current point.

3.3 Efficient Implementation

At each candidate sample point \(\mathbf x^+\in \varOmega \), evaluating Eq. (6) would in principle require sampling \(f(\mathbf x^+)\) from the current GP, computing the resulting posterior classification, and repeating this procedure in a Monte Carlo fashion to estimate the expected improvement.
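A naive Monte Carlo version of this procedure is sketched below, assuming a gp object that exposes a posterior predictor and a conditioning operation; these interfaces, and the names mc_expected_volume and condition_on, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def mc_expected_volume(gp, grid, x_plus, t, beta, sigma_eps, n_mc=200, rng=None):
    """Monte Carlo estimate of E_{y+} |I_{GP+}| when sampling at x_plus (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_plus, sigma_plus = gp.predict(x_plus)              # current posterior at x_plus
    volumes = []
    for _ in range(n_mc):
        # Draw a hypothetical outcome y+ ~ N(mu_GP(x+), sigma_GP^2(x+) + sigma_eps^2).
        y_plus = rng.normal(mu_plus, np.sqrt(sigma_plus**2 + sigma_eps**2))
        gp_plus = gp.condition_on(x_plus, y_plus)         # "fantasy" posterior GP+
        mu, sigma = gp_plus.predict(grid)
        volumes.append(np.sum(mu - beta * sigma > t))     # |I_{GP+}| under rule (3)
    return float(np.mean(volumes))
```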

Nevertheless, it is possible to avoid sampling from the GP here. To see this, notice that by Eq. (2), the variance \(\sigma ^2_{GP^+}(\mathbf x)\) is unaffected by a new observation \(y^+\), as it only depends on the sampling location \(\mathbf x^+\). On the other hand, the posterior mean \(\mu _{GP^+}(\mathbf x)\) depends linearly on the sample \(y^+\) (to indicate this dependency we rewrite it as \(\mu _{GP^+}(\mathbf x;y^+)\)). Therefore, it is possible to compute the outcome \(y^+\) that would change the classification for the point \(\mathbf x\) under consideration; that is, we only need to determine the “limit” value for the new sample \(y^+\) that turns the indicator \(\mathbbm {1} \left\{ P_{GP^+}\left( f(\mathbf x)>t \right) > \delta \right\} \) on or off in the computation of \({{\,\mathrm{\mathbb {E}}\,}}_{y^+}|I_{GP^+}|\). As a result, we obtain the following expression:

Lemma 2

\( {{\,\mathrm{\mathbb {E}}\,}}_{y^+} \left| I_{GP^+}\right| \) obtained by sampling at \(\mathbf x^+\) can be computed analytically as follows:

$$\begin{aligned} \sum _{\mathbf x \in \varOmega } \varPhi \left( \frac{\sqrt{\sigma ^2_{GP}(\mathbf x^+) + \sigma ^2_{\epsilon }}}{|Cov _{GP}(f(\mathbf x), f(\mathbf x^+))|} \times \left( \mu _{GP}(\mathbf x) - \beta \sigma _{GP^+}(\mathbf x) -t \right) \right) \end{aligned}$$
(7)

where \(\varPhi (\cdot )\) is the cumulative distribution function (CDF) of the standard normal distribution, and

$$\begin{aligned} \sigma ^2_{GP^+}(\mathbf x)=Cov _{GP^+}(f(\mathbf x),f(\mathbf x)) = \sigma ^2_{GP}(\mathbf x) - \dfrac{Cov _{GP}^2(f(\mathbf x), f(\mathbf x^+))}{\sigma ^2_{GP}(\mathbf x^+) + \sigma ^2_{\epsilon }}, \end{aligned}$$
(8)

and \(Cov _{GP}(f(\mathbf x), f(\mathbf x^+))=k_{GP}(\mathbf x,\mathbf x^+)\) is the (current) posterior covariance between \(f(\mathbf x)\) and \(f(\mathbf x^+)\).

In the above derivation, we are implicitly assuming that \(f(\mathbf x)\) can be modeled as a sample from the GP. The posterior covariance \(Cov _{GP}(f(\mathbf x), f(\mathbf x^+))\) can be calculated, for example, by using Eq. (2). For a fixed grid, such computation can be rearranged so that the full posterior covariance matrix is stored and updated at each iteration through rank-one updates. Different trade-offs between computational and memory complexity are also possible. Lemma 2 is also shown in [17, 18], but to reduce notational confusion in cross-referencing, we provide a different and more direct proof in the appendix (see supplementary material).
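The closed form in Lemma 2 makes the acquisition cheap to evaluate once the posterior mean vector and covariance matrix on the grid are available. The sketch below implements Eqs. (6)-(8) under that representation; the treatment of a zero covariance (where the classification of \(\mathbf x\) is unaffected by \(y^+\)) is a small implementation choice of ours, not something specified in the paper.

```python
import numpy as np
from scipy.stats import norm

def rmile_acquisition(mu, K, idx_plus, t, beta, sigma_eps, eps, gamma):
    """RMILE acquisition E_GP(x+) for the candidate grid point indexed by idx_plus.

    mu: (m,) posterior means on the grid; K: (m, m) posterior covariance matrix;
    sigma_eps: noise std; eps, gamma: robustification constants of Eq. (6).
    """
    sigma = np.sqrt(np.diag(K))
    cov = K[:, idx_plus]                                  # Cov_GP(f(x), f(x+))
    denom = K[idx_plus, idx_plus] + sigma_eps**2          # sigma_GP^2(x+) + sigma_eps^2
    sigma_plus = np.sqrt(np.maximum(sigma**2 - cov**2 / denom, 0.0))   # Eq. (8)
    # Eq. (7): expected size of I_{GP+}. Where cov == 0, the ratio below is +inf and
    # the normal CDF collapses to the deterministic indicator of rule (3).
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.sqrt(denom) / np.abs(cov) * (mu - beta * sigma_plus - t)
    expected_volume = np.sum(norm.cdf(z))
    # |I_GP^eps|: current super-level set with the threshold shifted from t to t - eps.
    current_volume = np.sum(mu - beta * sigma > t - eps)
    # Eq. (6): robustified improvement with a minimum exploration bonus gamma * sigma(x+).
    return max(expected_volume - current_volume, gamma * sigma[idx_plus])
```

The SelectPoint step then simply evaluates this quantity at every candidate \(\mathbf x^+\) on the grid and returns the maximizer.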

The RMILE algorithm follows the super-level set estimation framework described in Algorithm 1. At each iteration, it calls SelectPoint (Algorithm 2). For a fixed sampling location \(\mathbf x^+\), the algorithm computes the acquisition function (6) using Eq. (7).

Although it is possible to identify the level set at any time during the execution, we do not enforce an online classification scheme as in [1, 15]. Instead, the classification is done offline using all available information, which makes the algorithm work better in practice, as can be seen in the numerical experiments.

[Algorithm 2: SelectPoint; pseudocode figure omitted]

3.4 Connection to Uncertainty/Variance Reduction

So far we have assumed that the threshold is known a priori. In fact, the design process in engineering typically involves several “iterations” where the conceptual idea is revised, leading to changes in the requirements or the threshold. In such cases, one would like to obtain a model of the function that can later be used to identify different regions, say \(I^{(t_1)}_{GP},I^{(t_2)}_{GP}, I^{(t_3)}_{GP}\) corresponding to different thresholds \(t_1,t_2,t_3\) (this notation should not be confused with \(I_{GP}^{\epsilon }\) previously used to indicate a shift in the threshold). For three thresholds, this can be easily done by redefining the objective function to maximize:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{y^+}\left( |I^{(t_1)}_{GP^+}|+|I^{(t_2)}_{GP^+}|+|I^{(t_3)}_{GP^+}|\right) - \left( |I^{(t_1-\epsilon )}_{GP}|+|I^{(t_2-\epsilon )}_{GP}|+|I^{(t_3-\epsilon )}_{GP}|\right) , \end{aligned}$$
(9)

so that the algorithm is biased towards identifying all three of them. If, for example, \(f(\mathbf x)>\text {max}(t_1,t_2,t_3)\), then that point will contribute three times as much to the objective function.

We can naturally extend the idea of Eq. (9) and look for all thresholds in a given range [ab]. That is, the finite summation in Eq. (9) can be replaced by an integral over all possible thresholds: \({{\,\mathrm{\mathbb {E}}\,}}_{y^+}\int _{-\infty }^{\infty } |I^{(t)}_{GP^+}|-|I_{GP}^{(t-\epsilon )}| dt\). Interestingly, if one considers the extreme case when \(\epsilon =0\), then our algorithm reduces to a type of variance minimization:

Lemma 3

If the acquisition function is redefined as:

$$\begin{aligned} E_{GP}^{var}(\mathbf x^+):={{\,\mathrm{\mathbb {E}}\,}}_{y^+}\int _{-\infty }^{\infty } |I^{(t)}_{GP^+}| - |I^{(t)}_{GP}| dt, \end{aligned}$$
(10)

then Algorithm 2 minimizes the \(l_1\)-norm of the posterior standard deviation, i.e., the next query point \(\mathbf x^+\) is selected as \(\mathbf x^+=\mathrm {arg\,min}_{\mathbf x^+ \in \varOmega } \sum _{\mathbf x \in \varOmega } \sigma _{GP^+}(\mathbf x)\).

Since the objective function can be recast, up to a positive constant, as \(\sum _{\mathbf x \in \varOmega } \sigma _{GP}(\mathbf x) -\sigma _{GP^+}(\mathbf x)\), the acquisition function (10) chooses the point that maximizes the reduction in the standard deviation across the domain. This is similar to the acquisition function used in [15] for an appropriate choice of the parameters.
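A brief derivation sketch of Lemma 3 (under the GP model): rule (3) classifies \(\mathbf x\) as above a threshold t exactly when \(t < \mu _{GP}(\mathbf x)-\beta \sigma _{GP}(\mathbf x)\), so integrating the difference of the two indicators over t gives

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{y^+}\int _{-\infty }^{\infty } |I^{(t)}_{GP^+}| - |I^{(t)}_{GP}|\, dt&= \sum _{\mathbf x \in \varOmega } {{\,\mathrm{\mathbb {E}}\,}}_{y^+}\left[ \mu _{GP^+}(\mathbf x)-\beta \sigma _{GP^+}(\mathbf x)\right] - \left[ \mu _{GP}(\mathbf x)-\beta \sigma _{GP}(\mathbf x)\right] \\&= \beta \sum _{\mathbf x \in \varOmega } \left( \sigma _{GP}(\mathbf x)-\sigma _{GP^+}(\mathbf x)\right) , \end{aligned}$$

since \({{\,\mathrm{\mathbb {E}}\,}}_{y^+}\mu _{GP^+}(\mathbf x)=\mu _{GP}(\mathbf x)\) and \(\sigma _{GP^+}(\mathbf x)\) does not depend on \(y^+\). Maximizing this expression over \(\mathbf x^+\) is therefore the same as minimizing \(\sum _{\mathbf x \in \varOmega }\sigma _{GP^+}(\mathbf x)\).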

4 Asymptotic Behavior on Finite Grids

In the absence of noise, a well-designed algorithm should avoid re-sampling at the same location since additional information is not acquired. In other words, on finite grids every point should be sampled at most once before an algorithm terminates.

In the case of noisy measurements, however, an algorithm may need to re-sample at the same location multiple times in order to get a more accurate estimate of the function value. Intuitively, so long as an algorithm samples each point of the grid infinitely often, the underlying function should be gradually revealed [2]. Below, we first formulate some reasonable assumptions that guarantee the asymptotic convergence of a generic acquisition function in Algorithm 1, and then show that RMILE satisfies these conditions.

Lemma 4

Let \(E_{GP}(\mathbf x^+)\) be an acquisition function that depends on the posterior GP at a potential query point \(\mathbf x^+\), such that for sufficiently small \(\sigma ^2_{GP}(\mathbf x^+)\), there exists a function \(u(\cdot )\) which only depends on the posterior variance \(\sigma ^2_{GP}(\mathbf x^+)\), with

$$\begin{aligned} E_{GP}(\mathbf x^+) \le u(\sigma _{GP}(\mathbf x^+)),\quad \lim _{\sigma (\mathbf x^+)\rightarrow 0^+} u(\sigma _{GP}(\mathbf x^+)) = 0. \end{aligned}$$
(11)

In addition, assume that there exists a global lower bound \(l(\cdot )\), such that

$$\begin{aligned} E_{GP}(\mathbf x^+)\ge l(\sigma _{GP}(\mathbf x^+)),\quad \lim _{\sigma (\mathbf x^+)\rightarrow 0^+} l(\sigma _{GP}(\mathbf x^+)) = 0, \end{aligned}$$
(12)

and assume that \(l(\sigma _{GP}(\mathbf x^+))\) is strictly increasing in \(\sigma _{GP}(\mathbf x^+)\).

If Algorithm 1 selects the next query point as \(\mathrm {arg\,max}_{\mathbf x^+} E_{GP}(\mathbf x^+) \) and is run without termination, then there cannot be a point in the grid that is sampled only finitely many times.

The lemma does not assume that the true function can be represented as a sample from a Gaussian process; only the upper and lower bounds on the acquisition function are needed as a function of the posterior marginal variance. The intuition is that \(E_{GP}(\mathbf x^+)\) can fluctuate as the sampling process progresses; however, as the variance \(\sigma _{GP}(\mathbf x^+)\) of a point is progressively reduced, one more sample at \(\mathbf x^+\) should bring less and less improvement as measured by \(E_{GP}(\mathbf x^+)\). This implies that the algorithm will move to a location where \(E_{GP}(\cdot )\) is higher. The proof can be found in the appendix (see supplementary material).

We are now ready to verify the robustness of our algorithm: as we show next, it satisfies the assumption of Lemma 4. Let \(\overline{\sigma }^2:=\max _{i=1,\dots ,m}\sigma _{0}^2(\mathbf z_i)\), where \(\mathbf z_i\) are the grid points in \(\varOmega \).

Lemma 5

For the acquisition function (6) with \(\gamma > 0\), \(\epsilon > 0\), we have:

  • \(u(\sigma _{GP}(\mathbf x^+)) = \max \left( |\varOmega | \varPhi \left( \frac{\sigma _{\epsilon }}{\bar{\sigma } \sigma _{GP}(\mathbf x^+)} \left( -\epsilon /2 \right) \right) ,\gamma \sigma _{GP}(\mathbf x^+) \right) \)

  • \(l(\sigma _{GP}(\mathbf x^+)) = \gamma \sigma _{GP}(\mathbf x^+)\)

Also the lower bound \( l(\sigma _{GP}(\mathbf x^+))\) is monotonically increasing and

$$\lim _{\sigma (\mathbf x^+)\rightarrow 0^+} u(\sigma (\mathbf x^+)) = \lim _{\sigma (\mathbf x^+)\rightarrow 0^+} l(\sigma (\mathbf x^+)) = 0.$$

The roles of \(\epsilon \) and \(\gamma \) are important in terms of the asymptotic behavior. More precisely, the modification that leads to Eq. (6) ensures a minimum exploration bonus given by \(\gamma \sigma _{GP}(\mathbf x^+)\) and is a crucial difference compared to [17, 18].

5 Numerical Experiments

This section empirically assesses the proposed procedure on numerical experiments. We use a standard squared exponential kernel

$$\begin{aligned} k(\mathbf x, \mathbf x') = \sigma ^2_{ker} \exp (-\Vert \mathbf x - \mathbf x'\Vert _2^2/(2l^2)) \end{aligned}$$
(13)
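For reference, a minimal NumPy implementation of the kernel in Eq. (13), written to act on arrays of grid points (the function name is ours):

```python
import numpy as np

def squared_exponential_kernel(A, B, sigma_ker, length_scale):
    """Squared exponential kernel of Eq. (13) between point sets A (p, d) and B (q, d)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)                       # pairwise ||x - x'||_2^2
    return sigma_ker**2 * np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * length_scale**2))
```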

We start by examining the effectiveness of the robust modifications, and then proceed to compare our proposed approach with state-of-the-art algorithms. Although in principle the model noise level \(\sigma _{\epsilon }\) can differ from the algorithm noise level (which we also denote by \(\sigma _{\epsilon }\) with a slight abuse of notation), we typically take them to be the same, as is conventional in the literature, unless otherwise stated (e.g., in the next subsection). We also emphasize that, to make the comparisons fair, the performance of all the algorithms is evaluated with classification criterion (3), instead of the criteria proposed in the original papers (e.g., posterior mean).

5.1 Robustification Effects

This section shows how the robust adjustment parameters \(\epsilon \) and \(\gamma \) in Eq. (6) help improve the performance of the algorithm. We compare two sets of parameters: (a) \(\epsilon =\gamma = 10^{-8}\); (b) \(\epsilon =0\), \(\gamma =-\infty \). Notice that with parameter set (b), Eq. (6) reduces to Eq. (5), i.e., the one without guaranteed convergence.

To stabilize the performance, we first sample at 3 points chosen uniformly at random as seeds, and compute the resulting posterior distribution as the prior. To showcase the robustness of our algorithm, we keep the prior mean at 0, with \(\ln (\sigma _{ker})=4\) and \(\ln (l)=1\), throughout the experiments in this subsection. For each problem, we run 25 simulations of our algorithm.

We consider the negative Himmelblau’s function (Fig. 2) defined in \([-5,5]\times [-5,5]\), a commonly used multi-modal function. We take a uniform grid of \(30\times 30\) points, and the threshold is set to \(t=-50\). Here we consider two sets of noise levels: (1) a small noise setting with both model and algorithm noise levels \(\sigma _{\epsilon }=0.1\); (2) a misspecified large noise setting, with model noise level 30 and algorithm noise level 3. The results are shown in Fig. 1. Here we label parameter set (a) as “RMILE” and parameter set (b) as “MILE”. We can see that in both cases, the robust version outperforms the vanilla one, and the difference is more dramatic in the second (harder) case.

Fig. 1. Himmelblau’s function. Left: small noise. Right: large misspecified noise.

Our algorithm is quite robust to the parameter choices of \(\epsilon \) and \(\gamma \), so long as they are positive. In particular, we obtained almost the same performance as above when setting \(\epsilon =\gamma = 10^{-2}\).

For simplicity, hereafter we set \(\epsilon = 10^{-12}\) and \(\gamma = 10^{-10}\). We compare our approach against the Straddle heuristic [14] and the LSE algorithm [1], which are the most relevant algorithms to our work. Another relevant approach is the TruVaR algorithm, which has been found to perform similarly to LSE in numerical experiments for level-set estimation [15].

5.2 2D Synthetic Examples

We consider the sinusoidal function \(\sin (10x_1) + \cos (4x_2) - \cos (3x_1x_2)\), defined on the box \([0,1]\times [0,2]\), whose contours are plotted in Fig. 4. We superimpose a uniformly spaced grid of \(30 \times 60\) points and run 25 simulations of our algorithm RMILE along with LSE and the Straddle heuristic. The normally distributed noise has standard deviation \(\ln (\sigma _{\epsilon }) = -1\), and the prior has uniform mean 0 with \(\ln (\sigma _{ker}) = 1.0 \) and \(\ln (l) = -1.5\). The threshold is set to \(t=1\). These values are meant to be representative of the prior knowledge that the user may have about the function at hand, but they are not necessarily the hyper-parameters that maximize the likelihood of some held-out data under the Gaussian process.

Similarly, we also consider Himmelblau’s function again. We run 25 simulations on a \(50 \times 50\) grid for our algorithm, the LSE algorithm, and the Straddle heuristic. We assume that the true Himmelblau’s function can be evaluated with normally distributed noise with mean zero and standard deviation \(\ln (\sigma _{\epsilon }) = 2.0\). The threshold chosen for the experiment is \(t = -100\), with a prior mean of \(-100\), prior standard deviation \(\ln (\sigma _{ker}) = 4\), and \(\ln (l) = 0 \).
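For reproducibility, the two synthetic test functions can be written as follows; the standard form of Himmelblau’s function, \((x_1^2+x_2-11)^2+(x_1+x_2^2-7)^2\), is assumed here since the paper does not restate it, and the sign convention follows the “negative Himmelblau” usage of Sect. 5.1.

```python
import numpy as np

def sinusoidal(x):
    """Sinusoidal test function on [0, 1] x [0, 2]; x has shape (..., 2)."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.sin(10 * x1) + np.cos(4 * x2) - np.cos(3 * x1 * x2)

def neg_himmelblau(x):
    """Negative Himmelblau's function on [-5, 5] x [-5, 5] (standard form assumed)."""
    x1, x2 = x[..., 0], x[..., 1]
    return -((x1**2 + x2 - 11)**2 + (x1 + x2**2 - 7)**2)
```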

The advantage of the procedure with respect to the state of the art is demonstrated by the \(F_1\)-score on the sinusoidal and Himmelblau’s function (Fig. 3). We also show precision and recall separately for the numerical experiments on Himmelblau’s function in Fig. 3 bottom left and bottom right, respectively.

In both numerical experiments it is relatively easy to find an initial point above the threshold. Our algorithm then proceeds by expanding \(I_{GP}\) as much as possible at each step (Fig. 4). This is in contrast to the Straddle heuristic and LSE, which seek to reduce the variance and thus tend to sample more widely in the initial phase. Notice that Straddle and LSE maximize similar objective functions for the selection of the next point, the “Straddle score” and the “ambiguity”, respectively. However, at least in the initial exploration phase, these metrics have fairly uniformly high values (Fig. 2) across the domain because the variance given by the Gaussian process is initially high. In contrast, RMILE gives higher scores to points in \(\varOmega \) that are likely to improve \(I_{GP}\) the most, and therefore chooses to expand the current region above the threshold as much as possible before exploring regions far away from the current samples.

Thus, our algorithm is more suitable especially when a very limited exploration budget is available and one cannot afford to reconstruct a good model of the function. Although we use the well-established \(F_1\)-score for comparison, it may not always be the most appropriate and fair metric for our proposed problem. In particular, as we noted in Lemma 1, our classification rule penalizes false positives far more than false negatives.

Fig. 2. Top left: true contours of the Himmelblau’s function. Top right: value of the expected improvement for RMILE. Bottom left: Straddle score. Bottom right: ambiguity for the LSE algorithm. The locations of the first 11 samples for the first run are superimposed.

Fig. 3. \(F_1\)-score on the sinusoidal function (top left), and Himmelblau’s function (top right). The means and confidence intervals of precision (bottom left) and recall (bottom right) refer to the Himmelblau’s function experiment.

Fig. 4. Top left: true contours of the sinusoidal function. Location of the first 15 samples along with the contours given by the GP for \(\mu _{GP}(\mathbf x) - 1.96\sigma _{GP}(\mathbf x)\) for RMILE (top right), Straddle (bottom left) and LSE (bottom right).

5.3 Simulation Experiments: Aircraft Collision Avoidance

We evaluate our method in the task of estimating the sensor requirements for an aircraft collision avoidance system. We consider pairwise encounters between aircraft, whose behavior is dictated by a joint policy produced by modeling the problem as a Markov decision process and solving for optimal actions using value iteration. Observation noise is applied over two state variables, the relative angle and heading between the two aircraft. The noise for each variable is sampled independently from a normal distribution with mean zero and a standard deviation that depends on the assumed sensor precision. For each sample, 500 pairwise encounters are simulated, and the estimated probability of a near mid-air collision (NMAC) is returned. We apply the negative logit transformation to the output to map it to the real line. The threshold is set to \(t = 1\), and the origin is given as a seed. Again, RMILE samples in a more structured way, progressively expanding \(I_{GP}\) while balancing the reduction of the variance in the promising region with some exploration (Fig. 5).

Fig. 5. Aircraft collision avoidance. Contours for \(\mu _{GP}(\mathbf x) - 1.96\sigma _{GP}(\mathbf x)\) given by the posterior Gaussian process. Left: RMILE algorithm. Middle: LSE algorithm. Right: Straddle heuristic. The yellow region is the area of interest. (Color figure online)

Fig. 6. Auto example: required sensor precision. (Color figure online)

5.4 Simulation Experiments: Required Sensor Precision

We assess our method on estimating actuator performance requirements in an automotive setting. We seek to determine the necessary precision for longitudinal and lateral acceleration maneuvers of simulated vehicles such that the likelihood of hard braking events is below a threshold. In these experiments, we simulate a single, five-second scenario involving twenty vehicles for 100 steps. The vehicles are propagated according to a bicycle model, with longitudinal behavior generated by the Intelligent Driver Model [21] and lane changing behavior dictated by the MOBIL [22] model. The two input parameters are sampled from a normal distribution, the standard deviation of which models the actuator precision.

We model the (estimated) probability of hard braking using the negative logit function \(-\log \frac{y}{1-y}\), which maps the outcome \(y \in [0,1]\) from the simulator to the real line. This is not strictly needed, but it ensures that the Gaussian process is consistent with the type of output. We run an exploratory simulation with a budget of 20 points for Straddle, LSE, and our algorithm. While the underlying function has some random noise due to the Monte Carlo simulation, we fix the seed so that the different algorithms receive the same point-wise responses from the simulator. We select a threshold of 1.0 and again choose the origin as the initial seed. In Fig. 6a we plot the contours for \(\mu _{GP}(\mathbf x) - 1.96\sigma _{GP}(\mathbf x)\). It can be seen that LSE and the Straddle heuristic both try to reduce the uncertainty by spreading out the sample points. Crucially, our algorithm places more samples together to compensate for the noise and reduce the variance (Fig. 6c) in the promising region (Fig. 6b) above \(t = 1.0\). This allows the classifier to use the posterior GP to make a more confident prediction and identify the area of interest with high confidence (yellow region in Fig. 6a).

6 Conclusions

We have considered the problem of level set estimation where only a noise-corrupted version of the function is available with a very limited exploration budget. The aim is to discover as rapidly as possible a region where the threshold is exceeded with high probability. We propose to select the next query point that maximizes the expected volume of the domain of points above the threshold in a one-step lookahead procedure, and we derive analytical formulae to compute this quantity in closed form. We give a simple criterion for verifying the convergence of generic acquisition functions and show that our algorithm satisfies its requirements. Our algorithm also compares favorably with the state of the art in numerical experiments. In particular, it uses the information gained from a few samples more effectively, making it suitable when a very limited exploration budget is available. At the same time, it retains asymptotic convergence guarantees, making it especially compelling in the case of misspecified models.