Augmenting Stationary Covariance Functions with a Smoothness Hyperparameter and Improving Gaussian Process Regression Using a Structural Similarity Index

Chlingaryan, Anna; Leung, Raymond; Melkumyan, Arman

doi:10.1007/s11004-023-10095-5

Open access
Published: 06 September 2023

Volume 56, pages 605–637, (2024)

Mathematical Geosciences


Abstract

Gaussian process (GP) regression provides a probabilistic framework for modeling geochemistry in mineral resource estimation and environmental monitoring applications. An issue with this approach is that the kernel hyperparameters obtained by maximizing the log-marginal likelihood (LML) often produce GP posterior mean estimates that are overly smooth. This motivates the development of augmented kernels that are more capable of capturing the variability inherent in geological/geochemical processes than the existing stationary covariance functions like the Matérn kernels. This paper makes two contributions. First, it describes an extended class of stationary kernels that contain an extra smoothness hyperparameter ($\alpha $) which can be learned from the input data. Valid intervals for $\alpha $ that lead to positive semi-definiteness are determined. Second, it uses a statistical measure called the structural similarity index (SSIM) to quantify smoothness and the spatial fidelity of GP solutions with respect to the input samples. This provides a new way for validating and optimizing GP models. Statistical and spectral analyses provide insights into the behavior of $\alpha $ in the augmented kernels which retain useful properties such as sparsity. Results on the Northern Great Basin geochemical dataset demonstrate that, all things being equal, (1) adjusting $\alpha $ increases spatial fidelity in GP regression; (2) SSIM is a more reliable spatial quality measure than LML; and (3) the optimal $\alpha $ value obtained is correlated with the Wiener entropy of the random process, which indicates the spectral flatness of the chemical signal in the Fourier domain. For GP regression, function smoothness may be defined using Sobolev space. The results show that $\alpha $ regulates over-smoothing by moderating the rate of decay in the power spectrum of the equivalent kernel.

1 Introduction

Gaussian processes (GPs) provide a useful tool for regression in supervised machine learning (Rasmussen and Williams 2006). The range of applications includes geophysics (Ray and Myer 2019), mining (Leung et al. 2022), hydrology (Yang et al. 2018), ecological monitoring (Grbić et al. 2013), robotics (Deisenroth et al. 2013), multi-sensor fusion (Osborne et al. 2008; Melkumyan et al. 2011) and remote sensing (Chlingaryan et al. 2016). In standard GP models, stationary covariance functions are generally used. This means that the covariance between any two points depends only on the lag, or Euclidean distance, irrespective of the location. A potential limitation with stationary GPs is that they may fail to adapt to variable smoothness in the function of interest (Gibbs 1998; MacKay 1998). According to Paciorek and Schervish (2003), this may be encountered in geophysical and geographical applications where domain knowledge suggests that the function may vary more quickly in some parts of the input space than in others. For example, in mountainous areas, environmental variables are likely to be much less smooth than in flat regions. For mining purposes, kriging is usually used. The kriged surface will basically be as smooth as possible given the constraints of the data—in many cases, probably smoother than the true surface (Bohling 2005). Spatial statistics researchers have made some progress in defining nonstationary covariance structures for kriging, a form of GP regression. The nonstationary covariance structure of Higdon et al. (1999)—for which Gibbs (1998) gives a special case—has been extended to a class of nonstationary covariance functions by Paciorek and Schervish (2003). However, nonstationary models are only as powerful as sample sufficiency allows. In data-deficient regions—which are commonly encountered in mining because assay sampling is sparse and costly—reliable estimation of the parameters may not even be possible. 
For this reason, this paper focuses on extending the capability of existing stationary covariance functions to aptly capture the inherent variability of geological/geochemical processes and produce high-quality GP regression results.

As motivation, Fig. 1a shows the chemical concentration of iron (Fe) at a test site. Figure 1b illustrates the posterior mean distribution obtained using GP and the standard Matérn 3/2 kernel, where the length scale and noise parameters indicated are found by maximizing the log-marginal likelihood. The main observation is the blurriness in the mean distribution. For geochemical data, GP regression results are often excessively smooth. Figure 1c highlights the main proposition of this paper: augmenting the standard kernels with a smoothness parameter $\alpha $ enables a more fitting solution to be found, one that is vastly better at preserving spatial structures. Using a different $\alpha $, Fig. 1c appears much sharper even though the remaining hyperparameters are identical. Indeed, the details missing in (b) relative to (c) can be seen from the residual image in Fig. 1d. In essence, this work introduces a new family of stationary covariance functions ($K_\alpha $) that are more capable of capturing the inherent variability of geochemical random processes than the existing stationary covariance functions (K). The baseline family of K includes standard covariance functions such as squared exponential (SE), exponential and Matérn. In contrast to most covariance functions, $K_\alpha $ has the added flexibility of a parameter $\alpha $ that controls the differentiability (level of smoothness) of sample functions drawn from the GP distribution. This parameter can be learned automatically from the input data.

This article is organized as follows. Section 2 briefly introduces Gaussian processes as a framework for modeling geochemical distributions. It explains the concept of over-smoothing and defines the smoothness of functions in Sobolev space. Section 3 formulates the new stationary covariance functions $K_\alpha $ and considers the valid intervals of $\alpha $ that make the resultant kernels positive semi-definite. Section 4 describes the data used in the experiments, the learning and inference procedures, and how performance is evaluated. It reviews a statistical measure called the structural similarity index (SSIM) which will be used to validate and quantify the spatial fidelity of GP regression models. Section 5 presents an analysis of the results. Section 6 explores some connections from a Fourier perspective and delves deeper into the results. Finally, concluding remarks are given in Sect. 7.

2 Background

2.1 Gaussian Processes: A Probabilistic Framework

Gaussian processes (GPs) represent a nonparametric technique for building a probabilistic model of a continuous function given a set of observations at known locations. In this paper, the quantities of interest are the concentrations of chemical elements such as lead or zinc, mostly measured in parts per million (ppm). The GP perspective treats any single function value (i.e. any point on the function) as Gaussian distributed. Hence, each function value is completely characterized by a mean and a variance. This viewpoint sets GPs apart from deterministic interpolation, such as a fitted surface obtained by least-squares optimization.

Mathematically, a GP is an infinite collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams 2006). Machine learning using GPs consists of two steps: training and inference. GPs usually contain initially unknown hyperparameters, and the training step is aimed at optimizing those hyperparameters to produce a probabilistic model that best represents the training data. The hyperparameters define the function only when considered together with the training data. The hyperparameters used in this work are the length scale, which describes the rate of change of the output; the amplitude; and an additional hyperparameter that describes the level of smoothness of the predictive model. Once the optimal hyperparameters are found, they can be used during the inference step to predict the values of the function of interest at new locations. When creating a GP model, a covariance function must be chosen to describe the relationship between the inputs and outputs. The covariance function also determines the number and type of hyperparameters that are needed for the model.

Using $x_i\in \mathbb {R}^d$ and $y_i\in \mathbb {R}$ to denote the input and output, respectively, the supervised learning problem uses a given N-point training set $T=\{x_i,y_i\}_{i=1}^N$ to compute the predictive distribution $f(x_*)$ at a new test point $x_*$. A vector notation, $\textbf{x}_*$, is used to represent a collection of test points. Since the GP model places a multivariate Gaussian distribution over the space of function variables $f(\textbf{x})$, the GP is fully specified by its mean function $m(\textbf{x})$ and covariance function $k(\textbf{x},\textbf{x}'):\,f(\textbf{x})\sim GP(m(\textbf{x}),k(\textbf{x},\textbf{x}'))$. Suppose the function values are unknown and to be determined at the test points $T_*=\{x_{*i},y_{*i}\}_{i=1}^M$. The joint Gaussian distribution with zero mean function and covariance function K is

$$\begin{aligned} \begin{bmatrix}\textbf{f} \\ \textbf{f}_*\end{bmatrix} \sim \mathcal {N}\left( \textbf{0},\, \begin{bmatrix} K(X,X) &{} K(X,X_*)\\ K(X_*,X) &{} K(X_*,X_*) \end{bmatrix}\right) , \end{aligned}$$

(1)

where $\textbf{f}$ and $\textbf{f}_*$ are noise-free values of the function, $\mathcal {N}(\mu ,\Sigma )$ is a multivariate Gaussian distribution with mean $\mu $ and covariance $\Sigma $, and K is used to denote the covariance matrix computed between all points in the set. In particular, the covariance matrix between observed and unobserved locations has the form

$$\begin{aligned} K(X,X_*) = \begin{bmatrix} k(x_1,x_{*1}) &{} \ldots &{} k(x_1,x_{*M})\\ \vdots &{} \ddots &{} \vdots \\ k(x_N,x_{*1}) &{} \ldots &{} k(x_N,x_{*M}) \end{bmatrix} = K(X_*,X)^T. \end{aligned}$$

(2)

If we assume observations with Gaussian noise $\varepsilon $ and noise variance $\nu ^2$ such that $y=f(x)+\varepsilon $, then the joint distribution becomes

$$\begin{aligned} \begin{bmatrix}\textbf{y} \\ \textbf{f}_*\end{bmatrix} \sim \mathcal {N}\left( \textbf{0},\, \begin{bmatrix} K(X,X)\!+\!\nu ^2 I &{} \,K(X,X_*)\\ K(X_*,X) &{} \,K(X_*,X_*) \end{bmatrix}\right) . \end{aligned}$$

(3)

In this model, the measurement noise is assumed to be unbiased, independent and identically distributed. The key predictive equations for GP regression can be obtained by conditioning on the observed training points (Murphy et al. 2014). The resulting predictive distribution for the points being estimated can be obtained as

$$\begin{aligned} p(\textbf{f}_*\!\mid \!(X_*,X),\textbf{y})=\mathcal {N}(\mu _*,\Sigma _*), \end{aligned}$$

(4)

where the predictive mean and covariance are given via the formulas

$$\begin{aligned} \mu _*&=K(X_*,X)\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y}, \end{aligned}$$

(5)

$$\begin{aligned} \Sigma _*&=K(X_*,X_*)-K(X_*,X)\left[ K(X,X)+\nu ^2 I\right] ^{-1}K(X,X_*)+\nu ^2 I. \end{aligned}$$

(6)

The predicted mean value $\mu _*$ in Eq. 5 is the main outcome of the GP regression. It also represents a solution to kernel ridge regression, where $\nu ^2$ serves to alleviate overfitting (Akian et al. 2022). The diagonal of the matrix $\Sigma _*$ gives the variances that represent the uncertainty of those predictions.
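As an illustration of Eqs. 5 and 6, the following minimal NumPy sketch computes the predictive mean and covariance for a small one-dimensional toy problem. The function names (`matern32`, `gp_predict`), the toy data and the hyperparameter values are assumptions made for this example only; they are not taken from the paper. The covariance follows Eq. 6 as written, including the $\nu ^2 I$ term.

```python
import numpy as np

def matern32(XA, XB, length_scale=1.0, sigma_f=1.0):
    """Matern 3/2 kernel matrix between the rows of XA (n, d) and XB (m, d)."""
    diff = XA[:, None, :] - XB[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1)) / length_scale
    return sigma_f ** 2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def gp_predict(X, y, X_star, kernel, noise_var):
    """Predictive mean (Eq. 5) and covariance (Eq. 6) at the test points X_star."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X_star, X)
    # Factorize once and solve against the Cholesky factor instead of
    # forming the explicit inverse of K + nu^2 I.
    L = np.linalg.cholesky(K)
    w = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ w
    V = np.linalg.solve(L, K_s.T)
    cov = kernel(X_star, X_star) - V.T @ V + noise_var * np.eye(len(X_star))
    return mean, cov

# Toy data (hypothetical): eight noise-free samples of a sine curve.
X = np.linspace(0.0, 5.0, 8)[:, None]
y = np.sin(X[:, 0])
X_star = np.linspace(0.0, 5.0, 50)[:, None]
mean, cov = gp_predict(X, y, X_star, matern32, noise_var=1e-6)
std = np.sqrt(np.diag(cov))  # pointwise predictive uncertainty
```

With a very small noise variance, the predicted mean nearly interpolates the training values, which is the behavior expected from Eq. 5.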

Training a GP model is equivalent to learning the hyperparameters of the covariance function from a dataset. In the Bayesian framework, this can be performed by maximizing the log of the marginal likelihood (LML) with respect to $\theta $

$$\begin{aligned} \log p(\textbf{y}\mid X,\theta ){=}{-}\frac{1}{2}\textbf{y}^T\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y} {-} \frac{1}{2}\log \left| K(X,X){+}\nu ^2 I\right| {-}\frac{N}{2}\log 2\pi , \end{aligned}$$

(7)

where $\left| \,\cdot \,\right| $ denotes the determinant. The log-marginal likelihood in Eq. 7 contains three terms that represent (from left to right) the data fit, the complexity penalty (which embodies the Occam’s razor principle) and a normalization constant. Only the first two terms depend on the values of the hyperparameters. As the marginal likelihood is a non-convex function of the hyperparameters, only local maxima can be guaranteed (Melkumyan and Ramos 2009). In this work, local maxima were obtained via gradient-based optimization using multiple starting points.
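The three terms of Eq. 7 can be evaluated stably through a Cholesky factorization. The sketch below is one common way of doing this, not necessarily the implementation used by the authors; the function name `log_marginal_likelihood` and the demo data are assumptions for illustration.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Evaluate Eq. 7 for a kernel matrix K = K(X, X) and targets y."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    w = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ w
    # log|A| = 2 * sum(log(diag(L))) when A = L L^T, so this is -0.5 * log|A|.
    complexity = -np.log(np.diag(L)).sum()
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant

# Hypothetical demo: exponential kernel on six 1-D locations.
x_demo = np.linspace(0.0, 4.0, 6)
K_demo = np.exp(-np.abs(x_demo[:, None] - x_demo[None, :]))
y_demo = np.cos(x_demo)
lml = log_marginal_likelihood(K_demo, y_demo, noise_var=0.1)
```

In a training loop, the negative of this quantity would typically be passed to a gradient-based minimizer that is restarted from several initial hyperparameter values.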

2.2 Connections

The smoothing phenomenon seen in Fig. 1 has been observed in kriging and reported by Journel et al. (2000) and Yamamoto (2005), among others. The problem manifests as conditional bias, and more specifically, a tendency to underestimate large values and overestimate small values. This smoothing effect is as relevant for kriging interpolation (unbiased linear estimators) as it is for Gaussian process regression, since the GP mean in (5) may be expressed as a linear combination of kernel functions centered on the training points, as in (8), where the weight $\alpha _i$ is the $i$th element of the vector $\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y}$. These weights are distinct from the smoothness hyperparameter $\alpha $ introduced in Sect. 3.

$$\begin{aligned} \mu _*&=\sum _{i=1}^N \alpha _i k(\textbf{x}_*,\textbf{x}_i). \end{aligned}$$

(8)

For kriging, this smoothing effect is explained by a deficit of variance, $\text {Var}\{Z(\textbf{x})\}\!-\!\text {Var}\{Z_K^*(\textbf{x})\}\!=\!\sigma _K^2\ge 0$, which increases as the distance increases between the estimated location and known data (Journel et al. 2000). This discrepancy is usually corrected during post-processing with a cross-validation procedure that aims to reproduce the semivariogram. In Olea and Pawlowsky (1996), the estimation error is compensated using linear regression, whereas in Yamamoto (2005), cross-validation is used to estimate the kriging interpolation variance and estimation error. Following a different path, Yao (1998) proposed a conditional simulation approach which imparts structural information (the correct covariance model) on the kriging estimate using Fourier coefficients in the spectral domain. This iterative approach removes smoothing artifacts, albeit at the cost of local accuracy (a poorer estimate vs. true value correlation). Readers are referred to Journel et al. (2000) for a discourse on the limitations of kriging, conditional bias, uncertainty assessment, and the conflicting objectives of retaining global accuracy (reproducing texture, structures or covariance) and preserving local accuracy (minimization of error variance).

Outside of geostatistics, kriging is sometimes seen through the lens of GP in the machine learning community. In particular, kriging models the residual component using a stationary GP with zero mean (Shekaramiz et al. 2019). For Gaussian processes, it is instructive to consider “smoothing” as a smoothness misspecification that yields a flatter GP mean function, $\mu _*$, relative to the target function, f. Formally, the smoothness of a function can be measured in terms of the number of derivatives in Sobolev space $W_p^k(\mathbb {R}^d)$ (Wynne et al. 2021). For the case of $L^p$ norm with $p=2$,

$$\begin{aligned} W_2^k(\mathbb {R}^d)=\left\{ f\in L^2(\mathbb {R}^d):\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2 :=\int _{\mathbb {R}^d}\left( 1+\Vert \omega \Vert _2^2\right) ^k |\hat{f}(\omega )|^2 d\omega < \infty \right\} , \end{aligned}$$

(9)

where $\hat{f}(\omega )$ is the Fourier transform of f, $\Vert \cdot \Vert _2$ denotes the Euclidean norm, and $k>d/2$. A function f has smoothness $m_0$ in the Sobolev sense if it is integrable and has a finite norm, $\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2$, as specified in (9) for all real $k<m_0$, where $m_0=\text {sup}\left\{ k\ge 0: f\in W_2^k(\mathbb {R}^d)\right\} $. The Fourier transforms of the true auto-correlation function, $\Phi $, and the GP kernel-estimated correlation function, $\Psi _\theta $, are assumed to satisfy the following bounds (Wang and **g 2022)

$$\begin{aligned}&c_1\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_0}\le \mathcal {F}(\Phi )(\omega ) \le c_2\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_0}, \forall \omega \in \mathbb {R}^d, \end{aligned}$$

(10)

$$\begin{aligned}&c_3\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_\theta }\le \mathcal {F}(\Psi _\theta )(\omega ) \le c_4\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_\theta }, \forall \omega \in \mathbb {R}^d. \end{aligned}$$

(11)

The $\theta $ subscript emphasizes that the auto-correlation estimate is obtained using a kernel with parameters $\theta $. Over-smoothing occurs when $m_\theta > m_0$. This has a clear interpretation based on the Wiener–Khintchine theorem (for a wide-sense stationary process, the power spectrum of f is given by the Fourier transform of its autocorrelation function, $\mathcal {F}(\Phi )$): the kernel power spectrum decays with spatial frequency at a faster rate than that of the target function. $\mathcal {F}(\Psi )$ may be rewritten in product form as shown in (12). Applying the convolution theorem and conjugate property, this is equivalent to convolving the true autocorrelation function with an equivalent kernel (interpolator) in the spatial domain.

$$\begin{aligned} \mathcal {F}(\Psi )(\omega ) = \mathcal {F}(\Phi )(\omega )|Q_{K_\theta }(\omega )|^2. \end{aligned}$$

(12)

This expression shows the unknown power spectrum $\mathcal {F}(\Phi )(\omega )$ is shaped by an equivalent kernel frequency response, $Q_{K_\theta }(\omega )$, which depends on the kernel hyperparameters $\theta $. $Q_{K_\theta }(\omega )$ represents a decreasing function that decays at a rate of $1/(1+\Vert \omega \Vert ^2)^s$. The main source of smoothing considered in this paper is a significant mismatch in the rate of decay, namely $m_\theta \gg m_0$, which contributes to over-smoothing in the GP mean function estimate. The proposed $\alpha $ mechanism allows this decay rate to be adjusted to compensate for over-smoothing.

3 Formulation

This section considers the properties of a covariance function that impact the smoothness of models. The formula for the augmented stationary covariance functions $K_\alpha $ is presented, and the valid intervals for the smoothness parameter ($\alpha $) are determined.

3.1 Kernel Attributes that Affect the Smoothness of Models

The level of smoothness of the stochastic process $f(\textbf{x})$ generated by the GP impacts the level of smoothness of the models produced by GP inference. The smoothness of the process $f(\textbf{x})$ depends on the smoothness of its covariance function $k(\textbf{x},\textbf{x}')$, where $\textbf{x},\textbf{x}'\in \mathbb {R}^d$. The following result on mean square (MS) continuity and differentiability is given in Rasmussen and Williams (2006). For stationary processes, if the $2n$th-order partial derivative $\partial ^{2n}{k(\textbf{x})}/\partial ^2 x_{i_1}\ldots \partial ^2 x_{i_n}$ exists and is finite at $\textbf{x}=\textbf{0}$, then the $n$th-order partial derivative $\partial ^{n}{f(\textbf{x})}/\partial x_{i_1}\ldots \partial x_{i_n}$ exists for all $\textbf{x}\in \mathbb {R}^d$ as a mean square limit. It is the properties of the kernel $k(\textbf{x})$ around $\textbf{0}$ that determine the smoothness properties (MS differentiability) of a stationary process.

Therefore, the behavior of a stationary covariance function at the origin affects the smoothness of the GP predictive model. In one dimension, these differences in behavior can be seen in Fig. 2 for the exponential, squared exponential, Matérn and sparse (raised-cosine) kernels. This sparse kernel, $h(d)=\sigma _f^2 \left[ \frac{1}{3}\left( 2+\cos (\frac{2d}{l})\right) \left( 1-\frac{d}{l\pi }\right) +\frac{1}{2\pi }\sin \left( \frac{2d}{l}\right) \right] $ for $d\le l\pi $ where $d=\left| x-x'\right| $, and $h(d)=0$ otherwise, is described in Melkumyan and Ramos (2009).
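For readers who wish to experiment with the sparse kernel, a short NumPy transcription of $h(d)$ is sketched below. The function name `sparse_kernel` is an assumption. The support is taken to end at $d=l\pi $, where the factor $1-d/(l\pi )$ vanishes; for $l=1$ this coincides with $d\le \pi $.

```python
import numpy as np

def sparse_kernel(d, length_scale=1.0, sigma_f=1.0):
    """Raised-cosine sparse kernel h(d); exactly zero beyond d = length_scale * pi."""
    d = np.abs(np.asarray(d, dtype=float))
    u = d / length_scale
    inside = ((2.0 + np.cos(2.0 * u)) / 3.0 * (1.0 - u / np.pi)
              + np.sin(2.0 * u) / (2.0 * np.pi))
    # Assumed cutoff: the bracketed expression reaches zero at u = pi.
    return sigma_f ** 2 * np.where(u <= np.pi, inside, 0.0)

lags = np.linspace(0.0, 5.0, 11)
values = sparse_kernel(lags)  # entries for lags above pi are exactly zero
```

Because the kernel is identically zero beyond the cutoff, covariance matrices built from it contain exact zeros for distant pairs of points, which is the sparsity referred to in the text.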

3.2 Stationary Covariance Functions with Variable Smoothness ($\alpha $)

To modify the behavior of stationary covariance functions at the origin, a new covariance function is proposed

$$\begin{aligned} K_{\alpha }(l,\sigma _f,\alpha )=\sigma _f^2\left( 1-\left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha }\right) . \end{aligned}$$

(13)

The proposed covariance function has the following hyperparameters: $l=[l_x,l_y]$ is a two-dimensional length scale vector, $\sigma _f$ is the amplitude hyperparameter and $\alpha $ is interpreted as a smoothness parameter. All these hyperparameters can be learned from the input data based on Bayesian methods using the marginal likelihood in Eq. 7.

The proposed covariance function in Eq. 13 can adjust the smoothness of the base covariance function $K_\text {cov}$ by changing the value of $\alpha $. When $\alpha $ equals 1, the base covariance function is reproduced, that is, $K_\alpha (l,\sigma _f,\alpha \!=\!1)=K_\text {cov}(l,\sigma _f)$. Any stationary covariance function can be chosen as the base covariance function $K_\text {cov}$; this includes the exponential, Matérn and sparse kernels for instance. When $K_\alpha $ is applied to a base covariance function, it preserves the level of sparsity in $K_\text {cov}$. In fact, $K_\alpha (l,\sigma _f,\alpha )=0$ wherever $K_\text {cov}\!=\!0$ for all $\alpha $.
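Equation 13 can be written as a small wrapper around any base kernel. The sketch below uses an exponential base kernel purely as an example; the names `exponential_kernel` and `augment` are assumptions, and $\alpha =0.5$ is an arbitrary illustrative value whose validity should be checked against the intervals in Table 1.

```python
import numpy as np

def exponential_kernel(d, length_scale=1.0, sigma_f=1.0):
    """Base covariance K_cov as a function of the lag d."""
    return sigma_f ** 2 * np.exp(-np.abs(d) / length_scale)

def augment(base_kernel, alpha):
    """Return K_alpha of Eq. 13 built on top of the given base kernel."""
    def k_alpha(d, length_scale=1.0, sigma_f=1.0):
        ratio = base_kernel(d, length_scale, sigma_f) / sigma_f ** 2
        return sigma_f ** 2 * (1.0 - (1.0 - ratio) ** alpha)
    return k_alpha

lags = np.linspace(0.0, 5.0, 51)
k_base = exponential_kernel(lags)
k_one = augment(exponential_kernel, 1.0)(lags)   # alpha = 1 reproduces the base kernel
k_half = augment(exponential_kernel, 0.5)(lags)  # alpha < 1 narrows the kernel
```

Wherever the base kernel is zero, the ratio is zero and $K_\alpha $ evaluates to $\sigma _f^2(1-1^\alpha )=0$, which is why the sparsity pattern of $K_\text {cov}$ carries over unchanged.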

To find the hyperparameters, the partial derivatives of the marginal likelihood with respect to the hyperparameters are needed. Differentiating Eq. 7 yields (Rasmussen and Williams 2006)

$$\begin{aligned} \frac{\partial }{\partial \theta _j}\log p(\textbf{y}\mid X,\theta )&\,\,\,\, =\frac{1}{2}\textbf{y}^T K^{-1}\frac{\partial K}{\partial \theta _j}K^{-1}\textbf{y}-\frac{1}{2} \text {tr}\left( K^{-1}\frac{\partial K}{\partial \theta _j}\right) \\&{\mathop {=}\limits ^{K\leftarrow K_\alpha }}\frac{1}{2}\text {tr}\left( (\textbf{w}\textbf{w}^T-K_\alpha ^{-1})\frac{\partial K_\alpha }{\partial \theta _j}\right) ,\text { where }\textbf{w}=K_\alpha ^{-1}\textbf{y}.\nonumber \end{aligned}$$

(14)

The gradient of the proposed covariance function, $K_\alpha $, can be analytically expressed via the gradient of the base covariance function as follows

$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial l}&={\left\{ \begin{array}{ll}\alpha \left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha -1}\frac{\partial K_\text {cov}(l,\sigma _f)}{\partial l}, &{} \text {if }x\ne x'\\ 0, &{} \text {if }x=x'\end{array}\right. }, \end{aligned}$$

(15)

$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial \sigma _f}&=\frac{2}{\sigma _f}K_\alpha (l,\sigma _f,\alpha ), \end{aligned}$$

(16)

$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial \alpha }&={\left\{ \begin{array}{ll}-\sigma _f^2\left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha }\text {ln}\left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] , &{} \text {if }x\ne x'\\ 0, &{} \text {if }x=x'\end{array}\right. }. \end{aligned}$$

(17)

These partial derivatives are utilized by constrained optimization solvers (gradient descent algorithms) to find appropriate kernel hyperparameters.
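A quick way to gain confidence in Eqs. 15 to 17 is to compare them with central finite differences. The sketch below does this for an exponential base kernel at nonzero lags (the $x\ne x'$ case); all function names, the step size and the test values are assumptions made for illustration.

```python
import numpy as np

def k_cov(d, l, sigma_f):
    return sigma_f ** 2 * np.exp(-d / l)

def dk_cov_dl(d, l, sigma_f):
    return sigma_f ** 2 * np.exp(-d / l) * d / l ** 2

def k_alpha(d, l, sigma_f, alpha):
    return sigma_f ** 2 * (1.0 - (1.0 - k_cov(d, l, sigma_f) / sigma_f ** 2) ** alpha)

def grad_l(d, l, sigma_f, alpha):       # Eq. 15, for d > 0
    base = 1.0 - k_cov(d, l, sigma_f) / sigma_f ** 2
    return alpha * base ** (alpha - 1.0) * dk_cov_dl(d, l, sigma_f)

def grad_sigma(d, l, sigma_f, alpha):   # Eq. 16
    return 2.0 / sigma_f * k_alpha(d, l, sigma_f, alpha)

def grad_alpha(d, l, sigma_f, alpha):   # Eq. 17, for d > 0
    base = 1.0 - k_cov(d, l, sigma_f) / sigma_f ** 2
    return -sigma_f ** 2 * base ** alpha * np.log(base)

d = np.array([0.3, 1.0, 2.5])
l0, s0, a0, eps = 1.2, 1.5, 0.7, 1e-6

# Central finite differences with respect to each hyperparameter.
fd_l = (k_alpha(d, l0 + eps, s0, a0) - k_alpha(d, l0 - eps, s0, a0)) / (2 * eps)
fd_s = (k_alpha(d, l0, s0 + eps, a0) - k_alpha(d, l0, s0 - eps, a0)) / (2 * eps)
fd_a = (k_alpha(d, l0, s0, a0 + eps) - k_alpha(d, l0, s0, a0 - eps)) / (2 * eps)
```

The analytic expressions and the finite-difference estimates should agree to several significant digits for this base kernel.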

3.3 Intuition Behind $\alpha $

Qualitatively, the smoothness parameter changes the shape of the base covariance function as shown in Fig. 3. Generally speaking, decreasing $\alpha $ from 1 towards 0 reduces the span of the kernel; this in turn makes the interpolating function more spatially localized. As a corollary, decreasing $\alpha $ increases the bandwidth of the kernel frequency response in the Fourier domain. This can be seen clearly from the power density spectra in Fig. 4, which show that the base kernel $K_{\alpha =1}$ significantly attenuates high-frequency content that corresponds to edge structures and geochemical discontinuities in the spatial domain. This behavior explains why the base kernels tend to introduce more blurriness into GP regression results. The effect of reducing $\alpha $ is to relax this cutoff frequency and allow more structural information to pass through. Hence, the augmented kernel $K_\alpha $ provides a frequency tuning mechanism.

3.4 Positive Semi-Definiteness and Valid Intervals for $\alpha $

A continuous translation-invariant function f of vector variable $\textbf{x}\in \mathbb {R}^d$ is said to be positive semi-definite (psd) if

$$\begin{aligned} \sum _{m,n}^N c_m c_n f(\textbf{x}_m-\textbf{x}_n) = \textbf{c}^T F\textbf{c}\ge 0 \end{aligned}$$

(18)

for any $\textbf{x}_1,\ldots ,\textbf{x}_N\in \mathbb {R}^d$, any $c_1,\ldots ,c_N\in \mathbb {R}$ and any $N\in \mathbb {N}$. This is equivalent to requiring F, whose elements are given by $F_{m,n}=f(\textbf{x}_m-\textbf{x}_n)$ for $1\le m,n\le N$, to be a Gram matrix which is Hermitian psd. In particular, f is a symmetric positive semi-definite function if and only if F corresponds to the covariance of a GP. However, not all symmetric functions are valid covariance kernels. According to Bochner’s theorem, a continuous shift-invariant kernel $K_\alpha $ is psd if and only if its Fourier transform, $\hat{k}_\alpha (\omega )$, is non-negative for all $\omega $. The Fourier transforms of symmetric, real-valued kernels are also symmetric and real-valued. Hence, the integral $\hat{k}_\alpha (\omega )=\frac{1}{2\pi }\int _{-\infty }^{\infty }K_\alpha (t) e^{-i\omega t} dt \propto \int _0^{\infty }K_\alpha (t) \cos (\omega t) dt$ is computed numerically for various $\alpha $ to determine the valid interval for which the proposed covariance function $K_\alpha $ is positive semi-definite. Figure 5(top) shows a series of $\hat{k}_\alpha (\omega )$ obtained by varying $\alpha $ over the valid intervals. Figure 5(bottom) shows a few cases of $\alpha $ for which the non-negativity requirement is violated.
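The numerical check described above can be sketched as follows for an exponential base kernel. This is only one possible discretization: the truncation length, grid resolution, frequency range and function names are all assumptions, and the paper does not state which quadrature was used. For $\alpha =1$ the result can be compared with the known closed form $\int _0^\infty e^{-t}\cos (\omega t)\,dt=1/(1+\omega ^2)$.

```python
import numpy as np

def k_alpha_exponential(t, alpha, length_scale=1.0):
    """Augmented exponential kernel of Eq. 13 with unit amplitude."""
    return 1.0 - (1.0 - np.exp(-t / length_scale)) ** alpha

def cosine_transform(kernel_values, t, omegas):
    """Trapezoidal estimate of the integral of K(t) cos(omega t) over the sampled t range."""
    dt = t[1] - t[0]
    integrand = kernel_values[None, :] * np.cos(omegas[:, None] * t[None, :])
    return dt * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))

# Assumed discretization: the kernel tail beyond t = 40 is negligible here.
t = np.linspace(0.0, 40.0, 20001)
omegas = np.linspace(0.0, 20.0, 101)
spectrum_base = cosine_transform(k_alpha_exponential(t, 1.0), t, omegas)

def min_spectrum(alpha):
    """Smallest sampled spectral value; a clearly negative result signals a psd violation."""
    return cosine_transform(k_alpha_exponential(t, alpha), t, omegas).min()
```

Scanning `min_spectrum` over a range of $\alpha $ values gives a rough, resolution-limited indication of where non-negativity breaks down; small negative values of the order of the quadrature error should not be over-interpreted.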

The valid intervals of $\alpha $ that guarantee a positive semi-definite $K_\alpha $ are obtained; these are shown in Table 1.

Table 1 Valid interval of $\alpha $ for which $K_\alpha $ is positive semi-definite

Based on these results, GP regression results produced by the standard kernels ($K_\text {cov}\equiv K_{\alpha =1}$) and augmented kernels ($K_\alpha $) were compared. Figure 6 shows the expected smoothing behavior (slow transition and Gibbs oscillation) when the base kernels, with $\alpha $ set to 1, respond to a step change. This is consistent with the observations made in Sect. 3.3 whereby $\alpha =1$ is responsible for a wider kernel span and narrower frequency passband. As $\alpha $ is reduced, the GP mean prediction responds almost instantaneously to the step change. The interpolation becomes less distorted as higher spatial frequencies are better preserved.

4 Materials and Methods

This section describes the data and procedures used in the experiments.

4.1 The Northern Great Basin (NGB) Geochemical Dataset

This public-domain multi-element geochemical dataset was compiled by the US Geological Survey and other agencies for mineral and environmental assessments. It contains 10,261 measurements of surficial materials (stream sediment and soil samples) from a period that predates large-scale mining, covering northern Nevada, south-eastern Oregon and the north-eastern tip of California (see Fig. 7). In Coombs et al. (2002), the Mesozoic cratonal margin and the approximate extent of Tertiary volcanic cover are overlaid to highlight potential correlations between geological features and geochemistry. The majority of the samples are obtained by the inductively coupled plasma (ICP) analysis method which involves dissolving a sample in a series of acids and analyzing the resultant solution by inductively coupled plasma/atomic emission spectroscopy. Chemical concentrations are expressed in ppm (not as a percentage) unless otherwise stated.

4.2 Learning Kernel Hyperparameters and Inference

In this study, length scale parameters for $K_\alpha $ are learned by maximizing the marginal likelihood in Eq. 7 for two augmented kernels which derive from the Matérn 3/2 and squared exponential base kernels. A set of hyperparameters $\theta =[l_x,l_y,\sigma _f,\nu ,\alpha ]$ is found for each of the fifteen chosen chemicals: Al, As, Ba, Be, Ce, Fe, Mg, Mn, Na, Ni, P, Pb, Sb, V and Zn. For this dataset, it was appropriate to apply a logarithmic transformation to the chemical measurements, $\textbf{z}$, before GP processing. This has the effect of making the geochemical distributions less skewed (see Fig. 8).

Thus, the $\textbf{y}$ vector is related to the raw measurements via $\textbf{y}=g^{-1}(\textbf{z})$, where $g^{-1}\equiv \log $. As a consequence, the GP posterior mean and variance predictions with respect to $\textbf{z}$ are obtained from the moment estimates of Y using Taylor expansion for moments of functions of random variables. Writing $\mu _Y\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )$ and $\sigma _Y\equiv \sigma _*(\textbf{x}_*;\theta ,\alpha )$,

$$\begin{aligned} \mathbb {E}[Z]&=\mathbb {E}[g(Y)]\approx g(\mu _Y) + \frac{1}{2}g''(\mu _Y) \sigma _Y^2, \end{aligned}$$

(19)

$$\begin{aligned} \text {var}[Z]&=\text {var}[g(Y)]\approx \left( g'(\mu _Y)\right) ^2 \sigma _Y^2 - \frac{1}{4}\left( g''(\mu _Y)\right) ^2 \sigma _Y^4, \end{aligned}$$

(20)

where $g(\cdot )\!\equiv \exp (\cdot )$. Without loss of generality, the data Y are assumed to be centered.
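Since $g=g'=g''=\exp $, Eqs. 19 and 20 reduce to $\mathbb {E}[Z]\approx e^{\mu _Y}(1+\sigma _Y^2/2)$ and $\text {var}[Z]\approx e^{2\mu _Y}(\sigma _Y^2-\sigma _Y^4/4)$. A minimal sketch of this back-transformation is given below; the function name `back_transform_moments` and the sample inputs are assumptions for illustration.

```python
import numpy as np

def back_transform_moments(mu_y, sigma_y):
    """Second-order Taylor approximations (Eqs. 19 and 20) for Z = exp(Y)."""
    mu_y = np.asarray(mu_y, dtype=float)
    sigma_y = np.asarray(sigma_y, dtype=float)
    g = np.exp(mu_y)  # g, g' and g'' coincide for the exponential
    mean_z = g + 0.5 * g * sigma_y ** 2
    var_z = g ** 2 * sigma_y ** 2 - 0.25 * g ** 2 * sigma_y ** 4
    return mean_z, var_z

# Hypothetical GP outputs in log space at three grid locations.
mu_log = np.array([0.0, 1.0, 2.0])
sd_log = np.array([0.1, 0.2, 0.3])
mean_ppm, var_ppm = back_transform_moments(mu_log, sd_log)
```

The variance approximation in Eq. 20 turns negative once $\sigma _Y>2$, so it is only meaningful when the predictive standard deviation in log space is moderate.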

4.3 Measuring Spatial Fidelity Using the Structural Similarity Index

In this study, the set of locations $X_*=\{x_{*i}\}_{i=1}^{M}$ for which f is unknown are defined over a dense uniform grid that covers a two-dimensional modeling region. To evaluate the quality of the GP regression for $X_*$, the predicted mean $\mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M$ obtained using the hyperparameters $\theta $ and $\alpha $ is compared with a reference scalar field, $v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M$, defined by the training samples T. The computational details of $v_\text {ref}(\textbf{x}_*)$ are described in Sect. 4.4. For validation, an established measure known as structural similarity index (SSIM) is chosen to indicate how well GP regression preserves spatial structure in the underlying distribution. When $\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )$ is compared with $\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)$, the structural similarity index proposed by Wang et al. (2004) is computed from the statistical moments of $\textbf{u}$ and $\textbf{v}$, and defined as a product of three similarity terms

$$\begin{aligned} \text {luminance (mean) as }&l(\textbf{u},\textbf{v})=\frac{2\mu _u \mu _v+C_1}{\mu _u^2+\mu _v^2+C_1}, \end{aligned}$$

(21)

$$\begin{aligned} \text {contrast (variance) as }&c(\textbf{u},\textbf{v})=\frac{2\sigma _u \sigma _v+C_2}{\sigma _u^2+\sigma _v^2+C_2}, \end{aligned}$$

(22)

$$\begin{aligned} \text {shape (correlation) as }&s(\textbf{u},\textbf{v})=\frac{\sigma _{u,v}+C_3}{\sigma _u\sigma _v+C_3}, \end{aligned}$$

(23)

where $\mu _u$, $\mu _v$, $\sigma _u$, $\sigma _v$ and $\sigma _{u,v}$ represent the local means, local standard deviations, and local covariance of u and v, respectively; $C_1$, $C_2$ and $C_3= \frac{C_2}{2}$ represent three small positive constants. The local mean, for instance $\mu _v$, may be estimated by convolving $v(x_{*})$ with a Gaussian window function with a standard deviation of 1.5 times the $X_*$ grid spacing. The local covariance is estimated in a similar manner, by convolving $(u(x_{*})-\hat{\mu }_u)(v(x_{*})-\hat{\mu }_v)$ with the same Gaussian low-pass filter. Hence, the values of $\mu _u$, $\mu _v$, $\sigma _u$, $\sigma _v$ and $\sigma _{u,v}$ are location-dependent. These signal processing concepts are described in Oppenheim et al. (2001). The structural similarity measure is finally given by

$$\begin{aligned} \text {SSIM}(\textbf{u},\textbf{v})&=\left| l(\textbf{u},\textbf{v})\right| \cdot \left| c(\textbf{u},\textbf{v})\right| \cdot \left| s(\textbf{u},\textbf{v})\right| \nonumber \\&=\frac{(2\mu _u\mu _v + C_1)(2\sigma _{u,v} + C_2)}{(\mu _u^2+\mu _v^2+C_1)(\sigma _u^2+\sigma _v^2+C_2)}\in \mathbb {R}_{+}^M. \end{aligned}$$

(24)

For our purpose, SSIM may be understood as a spatial degradation (or quality) measure. It satisfies the symmetry, boundedness and unique maximum properties, namely $\text {SSIM}(\textbf{u},\textbf{v})\!=\!\text {SSIM}(\textbf{v},\textbf{u})$, $\text {SSIM}(\textbf{u},\textbf{v})\!\le \!1$, and $\text {SSIM}(\textbf{u},\textbf{v})\!=\!1$ if and only if $\textbf{u}\!=\!\textbf{v}$. A related spatial distortion metric can be derived from SSIM. This metric, shown in Eq. 25, is referred to as the normalized root mean square error (NRMSE)

$$\begin{aligned} \text {NRMSE}(\textbf{u},\textbf{v})=\sqrt{1-c(\textbf{u},\textbf{v})s(\textbf{u},\textbf{v})}. \end{aligned}$$

(25)

Brunet et al. (2011) have shown that this metric satisfies quasi-convexity. In general, this is a useful property for nonlinear optimization, as it ensures the existence of a global minimum on any convex subset of the function domain.
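The local SSIM and NRMSE maps of Eqs. 24 and 25 may be sketched as follows, assuming the fields are stored as two-dimensional NumPy arrays on the $X_*$ grid. The function names, the window radius of 5 grid cells, the reflective edge padding, and the default constants `c1` and `c2` are illustrative assumptions and are not taken from the original study; only the Gaussian window with a standard deviation of 1.5 grid spacings follows the text.

```python
import numpy as np

def gaussian_smooth(img, sigma=1.5, radius=5):
    """Separable Gaussian low-pass filter with reflective edge padding."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    conv = lambda a: np.convolve(a, kernel, mode="valid")
    rows = np.apply_along_axis(conv, 1, padded)   # filter along each row
    return np.apply_along_axis(conv, 0, rows)     # then along each column

def ssim_nrmse_maps(u, v, c1=1e-4, c2=9e-4):
    """Local SSIM (second line of Eq. 24) and NRMSE (Eq. 25) maps."""
    mu_u, mu_v = gaussian_smooth(u), gaussian_smooth(v)
    var_u = np.maximum(gaussian_smooth(u * u) - mu_u ** 2, 0.0)
    var_v = np.maximum(gaussian_smooth(v * v) - mu_v ** 2, 0.0)
    cov_uv = gaussian_smooth(u * v) - mu_u * mu_v
    lum = (2 * mu_u * mu_v + c1) / (mu_u ** 2 + mu_v ** 2 + c1)
    cs = (2 * cov_uv + c2) / (var_u + var_v + c2)  # c * s, using C3 = C2 / 2
    ssim = lum * cs
    nrmse = np.sqrt(np.clip(1.0 - cs, 0.0, None))
    return ssim, nrmse
```

Averaging the returned SSIM map over the grid gives a single mean SSIM score for a pair of fields.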

4.4 Experiment 1: Evaluating the Effects of $\alpha $

The objective of the first experiment is to demonstrate both subjective and objective improvement when the augmented covariance functions, $K_\alpha $, are used in GP regression. In Sect. 3.4, this has been shown to be true for a signal that resembles a step function. Based on kernel frequency response arguments, the expectation is that this trend will hold for chemical distributions found in the Northern Great Basin (NGB) geochemical dataset.

The hyperparameters are learned by maximizing the log-marginal likelihood (LML) as described previously, with lower bounds [0.01, 0.01, 0.001, 0.01, 0.1] and upper bounds $[0.5,0.5,\infty ,\text {percentile}(\Delta \textbf{z},95)\!\times \!2,1]$ imposed on $[l_x,l_y,\sigma _f,\nu ,\alpha ]$, where $\Delta \textbf{z}$ represents the first-order difference performed on a sorted sequence of chemical measurements, $\text {sort}(\textbf{z})$. For computational efficiency, GP training is performed on $T_s$, an $L(\!=\!2000)$ point random subset of the supplied data, $T=\{(x_i,y_i\!=\!\log (z_i))\}_{i=1}^{N=10261}$, that minimizes the KL divergence.
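The box constraints above may be assembled as in the following sketch. The function name `hyperparameter_bounds` is hypothetical, and the percentile is computed with NumPy's default linear interpolation, which is an assumption since the text does not specify the estimator.

```python
import numpy as np

def hyperparameter_bounds(z):
    """Box constraints for [l_x, l_y, sigma_f, nu, alpha] as stated in the text."""
    # First-order difference of the sorted chemical measurements, sort(z)
    dz = np.diff(np.sort(np.asarray(z, dtype=float)))
    nu_max = 2.0 * np.percentile(dz, 95)
    lower = [0.01, 0.01, 0.001, 0.01, 0.1]
    upper = [0.5, 0.5, np.inf, nu_max, 1.0]
    return list(zip(lower, upper))
```

The returned list of (lower, upper) pairs is in a form accepted by common bounded optimizers; which optimizer was used in the study is not stated here.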

The results presented in Sect. 5.1 will consist of visual inspection and quantitative analysis. For quantitative analysis, the $\chi ^2$ statistic and the mean structural similarity index (see SSIM in Sect. 4.3) averaged over the modeled region will provide a measure of the spatial fidelity of the GP mean estimates $\mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M$ obtained using K and $K_\alpha $ with respect to the reference scalar field, $v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M$, computed from the full dataset T. In practice, the normalized statistic $\bar{\chi }^2\!=\!\tfrac{1}{M}\chi ^2(\textbf{u},\textbf{v})$ is used, where $\chi ^2(\textbf{u},\textbf{v})=\sum _{i=1}^M \frac{(u_i-v_i)^2}{v_i}$. With $\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M$ and $\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M$, $u_i$ denotes an observed regression value obtained using $K_\alpha $, and $v_i$ denotes the expected value obtained from the reference scalar field, $v_\text {ref}$.
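A minimal sketch of the normalized statistic $\bar{\chi }^2$ defined above is given below. The function name is illustrative, and the code assumes that every reference value $v_i$ is nonzero.

```python
def normalized_chi2(u, v):
    """Normalized chi-square statistic: (1/M) * sum_i (u_i - v_i)^2 / v_i."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    # u_i: GP mean estimates; v_i: reference values (assumed nonzero)
    return sum((ui - vi) ** 2 / vi for ui, vi in zip(u, v)) / len(v)
```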

In terms of implementation, specifically in relation to $v_\text {ref}(\textbf{x}_*)$, at each query location $x_{*}$, the reference value $v_\text {ref}(x_{*})$ is computed by interpolating the 16 nearest samples from $T=\{(x_i,z_i)\}_{i=1}^N$ using inverse distance weights with exponent 3. This choice is guided by what the USGS used for these data (Coombs et al. 2002). For this study, the reference should be seen as synthesized ground truth, rather than an attempt to reconstruct the underlying random function according to some optimality criteria. Hence, $v_\text {ref}$ only needs to satisfy $y_i=v_\text {ref}(x_i)$ at the sampled locations and exhibit realistic variation that is meaningful at spatial scales of interest.

For the inverse distance weights, the minimum separating distance is capped at 0.1 min of latitude (about 185 m), which equates to ten times the resolution (or one tenth the spacing) of the inference grid ($X_*$). As an overview, Fig. 9 shows the location and value of samples, $T=\{(x_i,z_i)\}_{i=1}^N$, for various chemicals. The corresponding reference scalar fields, which are computed from T and evaluated on a regular grid $X_*$, are shown in Fig. 10.
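The inverse distance weighting described above may be sketched as follows. The function name and data layout are hypothetical, and the code assumes that the distance cap is applied as a floor (distances shorter than `d_min` are replaced by `d_min`) and that coordinates and `d_min` share the same units.

```python
import math

def idw_reference(query, samples, k=16, power=3.0, d_min=0.1):
    """Inverse-distance-weighted value at `query` from the k nearest samples.

    query:   (x, y) location
    samples: list of ((x, y), z) pairs
    """
    nearest = sorted(
        (math.hypot(query[0] - x, query[1] - y), z) for (x, y), z in samples
    )[:k]
    # Floor the distance at d_min so that the weights remain finite
    weights = [max(d, d_min) ** -power for d, _ in nearest]
    total = sum(w * z for w, (_, z) in zip(weights, nearest))
    return total / sum(weights)
```

A brute-force sort is used here for clarity; for a dense grid and thousands of samples, a spatial index would normally replace it.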

Although the reference fields might suffice for certain applications that rely solely on visual interpretation, it is worth emphasizing that GP regression also computes the posterior variance and provides a mathematical framework for quantifying uncertainty of a random process which may be important for quantitative risk assessment. Figure 11 provides a compelling example of this. In Fig. 11a, the curves show the magnitude of the predicted mean and standard deviation (s.d.), sorted by $\mu _*$ in ascending order. Figure 11b shows that the predicted s.d. captures mainly the epistemic uncertainty associated with the sampling in this example. An advantage of using GP is that it allows lower and upper bounds with arbitrary confidence, for example, $\mu _*-2\sigma _*$ and $\mu _*+2\sigma _*$, to be computed and used for probabilistic reasoning. This is shown in Fig. 11c, d.

4.5 Experiment 2: Validation Using SSIM and NRMSE

The second experiment determines whether there is a consistent relationship between the log-marginal likelihood (LML) and the structural similarity index measure (SSIM), whose properties are well established in the field of image processing. For instance, in Veras and Collins (2019), SSIM is found to be capable of capturing discernible differences that relate to spatial structures and patterns. A pertinent question is whether the $\alpha $ found by minimizing the negative LML across all hyperparameters (henceforth referred to as the “NLML-optimized” alpha) is optimal in the SSIM sense. In practice, two issues may arise: first, the hyperparameters (solution) may become stuck at a local minimum during gradient descent; second, the LML measure may not treat regression errors in a manner that reflects visual degradation to spatial structures or geochemical features.

To answer this question, we keep all the parameters other than alpha, namely $\theta \backslash \alpha =\{l_x,l_y,\sigma _f,\nu \}$, fixed for each chemical. GP regression is performed for each $\alpha $ between $\alpha _{\min }=0.1$ and $\alpha _{\max }=1$ in increments of 0.1. The mean SSIM, NRMSE and LML statistics are computed as a function of $\alpha $. These measures (discrete values parameterized by $\alpha $) are interpolated using a spline function, and the $\alpha $ values corresponding to the peak SSIM and LML are recorded.
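Locating the peak of a curve sampled at $\alpha =0.1,0.2,\ldots ,1$ may be sketched as below. Note that this sketch substitutes a local parabolic fit through the best grid point and its two neighbors for the spline interpolation used in the study; the function name `refine_peak` is hypothetical, and a uniform $\alpha $ grid is assumed.

```python
def refine_peak(alphas, scores):
    """Estimate the maximizer of a sampled curve with a three-point parabola."""
    i = max(range(len(scores)), key=scores.__getitem__)
    if i == 0 or i == len(scores) - 1:
        return alphas[i]                  # peak on the boundary: no refinement
    y0, y1, y2 = scores[i - 1], scores[i], scores[i + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature >= 0.0:
        return alphas[i]                  # flat or degenerate neighborhood
    h = alphas[i + 1] - alphas[i]         # assumes a uniform alpha grid
    return alphas[i] + 0.5 * h * (y0 - y2) / curvature
```

The same routine can be applied to the mean SSIM and LML values to obtain the two peak locations that are compared in the analysis.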

4.6 Experiment 3: Establishing Relationships in the Spectral Domain

The third experiment investigates if there is any plausible connection between $\alpha $ (a property of the augmented stationary kernel $K_\alpha $) and spectral properties of the random process (geochemical distribution) in the frequency domain. The details of this are deferred until Sects. 5.3 and 6.1, where these issues are further discussed.

5 Results and Analysis

5.1 Evaluating the Effects of $\alpha $

To demonstrate an improvement in GP regression when augmented covariance functions are used, Fig. 12 provides a visual comparison of the posterior mean estimates, $\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )$, obtained using the Matérn 3/2 base kernel ($K_\text {cov}=K_{\alpha =1}$) and augmented kernel ($K_\alpha $ with variable $\alpha $) for four chemicals: As, Ce, Na and V.

Significant smoothing and blocking artifacts can be seen in V and Na from results in the top row which correspond to base kernels. Higher spatial fidelity is observed from results in the bottom row which correspond to augmented kernels. With $\alpha <1$, spatial structures in the chemical distribution are more faithfully preserved.

To objectively analyze the GP regression results, $\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )$ are compared with the reference $\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)$ obtained from the full dataset, as described in Sect. 4.4. Table 2 measures discrepancies between $\textbf{u}$ and $\textbf{v}$ using the normalized $\bar{\chi }^2(\textbf{u},\textbf{v})$ statistic, and similarities between $\textbf{u}$ and $\textbf{v}$ using the SSIM. Henceforth, the subscripts “base” and $\alpha $ indicate whether the results are obtained with the base kernel, $K_\text {cov}$, or augmented kernel, $K_\alpha $.

A comparison of $\text {SSIM}_\text {base}$ versus $\text {SSIM}_{\alpha }$ shows that the proposed kernel $K_\alpha $ consistently produces mean estimates that are closer to the reference than those produced by the base kernel $K_\text {cov}$. Out of 15 chemical distributions, the only exception is Sb. The reason for this will be investigated in the next subsection. Similarly, a comparison of the $\bar{\chi }^2_\text {base}$ and $\bar{\chi }^2_{\alpha }$ columns reveals that the base kernel leads to higher spatial distortion. This evidence shows that even if the LML-optimized $\alpha $ parameters are suboptimal, they still produce higher-quality GP regression results relative to the baseline kernels.

Table 2 Comparison of $\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )$ with respect to $\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)$ for the base kernels and augmented kernels (with variable $\alpha $)


5.2 Validation Using SSIM and NRMSE

The goal is to establish whether there is a consistent relationship between the log-marginal likelihood (LML) and the structural similarity index (SSIM). To facilitate this, the SSIM($\alpha $), NRMSE($\alpha $) and LML($\alpha $) curves are all plotted as a function of $\alpha $ following the procedure described in Sect. 4.5. Notation-wise, the alpha value found during GP training—which optimizes all hyperparameters $\theta $ jointly by maximizing the LML—is denoted as $\alpha _0$. The alpha values corresponding to the peak of the SSIM and LML curves (obtained by fixing $\theta \backslash \alpha $) are denoted as $\alpha _*$ and $\alpha _1$, respectively. These alpha values are plotted along the horizontal axis in Fig. 13. Two difference terms are defined, $\Delta \alpha _{0,*}=\alpha _0\!-\!\alpha _*$ and $\Delta \alpha _{1,*}=\alpha _1\!-\!\alpha _*$. Similarly, changes in SSIM are defined as $\Delta S_{0,*}=\text {SSIM}(\alpha _0)\!-\!\text {SSIM}(\alpha _*)$ and $\Delta \text {S}_{1,*}=\text {SSIM}(\alpha _1)\!-\!\text {SSIM}(\alpha _*)$, respectively. These represent a quality gap along the vertical axis.

Table 3 Efficacy of SSIM and LML as a spatial fidelity measure for GP regression


Looking at Table 3, the $\alpha _0$s—the smoothing parameter that forms part of hyperparameters $\theta =[l_x,l_y,\sigma _f,\nu ,\alpha ]$ that maximizes the LML—are generally smaller than the optimal value $\alpha _*$ obtained under the SSIM criterion. The more revealing finding is that even if one performs a line search with the other hyperparameters $\theta \backslash \alpha $ held constant, the $\alpha _1$’s that correspond to the peak on the LML curve often disagree with $\alpha _*$ (see bold figures in the $\Delta \alpha _{1,*}$ column). This suggests LML and SSIM target different aspects of the regression errors. LML is arguably less efficient at preserving sharp geochemical features, whereas SSIM prioritizes information that encodes spatial structures.

In terms of impact, the results may be dissected into three categories. For Al, As, Ba, Mn and Ni, there are no significant differences between the LML-optimal $\alpha _1$ and SSIM-optimal $\alpha _*$. For Be, Ce, Fe and P, $\left| \Delta \alpha _{1,*}\right| $ is in the range of 0.13 to 0.18; however, as $\alpha _1$ is relatively close to the plateau of the SSIM curve, its impact on the SSIM score is only moderate, with $\Delta S_{1,*}$ restricted to $-$0.05. For Mg, Na, Pb, Sb, V and Zn, the shift $\left| \Delta \alpha _{1,*}\right| $ is greater than 0.3; it also results in significant structural degradation relative to choosing the SSIM-optimal $\alpha _*$. These trends can be seen in Fig. 13. An interesting fact is that both SSIM($\alpha $) and NRMSE($\alpha $) are convex.

The bottom panels in Fig. 13 show instances where the LML is relatively flat for $\alpha $ in the [0.1,0.5] range. This indicates that the LML is insensitive to the changes detected by SSIM, which amount to significant visual differences (see Fig. 14). At opposite ends, these $\alpha $ values dictate whether spatial structures are well preserved (when $\alpha \approx 0.5$) or lost (when $\alpha \approx 0.1$). To appreciate the improvements possible, some GP regression results obtained with NLML-optimized $\alpha _0$ are contrasted with those obtained with SSIM-optimal $\alpha _*$ in Fig. 14. These visual comparisons show that the SSIM-optimal $\alpha _*$ indeed produces sharper and higher-quality results.

Indeed, as $\alpha $ approaches 0.1, the peaks and troughs in the geochemical distribution become less distinct. This may be explained by revisiting the magnitude response of the augmented kernels. Figure 15 illustrates the “leaky” nature of $\hat{k}_\alpha (\omega )$ as $\alpha $ is decreased. Due to inadequate suppression of higher frequencies (for $\omega \gtrapprox 0.2$), the observation noise power injected into the GP predictive mean in the form of a white noise spectrum (see $K(X,X)+\nu ^2 I$ term in Eq. 5) will be less attenuated than it would otherwise be, if $\alpha $ had increased. This effectively reduces the signal-to-noise ratio.

The extent of this deterioration is dependent on the spectral properties of each geochemical distribution or random process. This trade-off is hinted at by the graphs in Fig. 13. Selecting a suitable alpha amounts to balancing two extremes: being too aggressive ($\alpha \rightarrow 0.1$ admits more higher spatial frequencies, perhaps at a cost of noise pollution) and not being aggressive enough ($\alpha \rightarrow 1$ shrinks the low-frequency passband which may blur or distort the GP mean estimate).

5.3 Quality Maps and Spectral Perspective

The results in this section bring together the concept of structural integrity (SSIM, NRMSE) and notion of smoothness defined in terms of Sobolev space. As a demonstration, the reference geochemistry $v_\text {ref}(\textbf{x}_*)$ and GP mean function estimates $\mu _*(\textbf{x}_*;\theta ,\alpha \!=\!1)$ and $\mu _*(\textbf{x}_*;\theta ,\alpha \!<\!1)$ for Al, Ce, Mn and V are shown in Fig. 16.

Recall from Sect. 4.3 that spatial degradation from over-smoothing can be measured using SSIM (structural similarity) and the NRMSE (normalized root mean square error) metric. Figures 17 and 18 show that the augmented kernels $K_\alpha $ increase the spatial fidelity of GP mean predictions under the condition that other hyperparameters are fixed. For SSIM, brighter patches indicate local structures are better preserved. Conversely, darker regions in NRMSE indicate lower distortion. These maps provide an objective assessment of spatial quality (local accuracy) for GP regression.

The second aspect is concerned with reproducing the autocorrelation of the underlying random process, or reducing smoothness misspecification from the viewpoint of functional analysis. The concept of equivalent interpolating kernel (defined in Sect. 2.2) provides the intuition that adjusting $\theta $, but especially $\alpha $, changes the rate of decay in the frequency response of the equivalent kernel $Q_{K_\theta }(\omega )$ in (12). It therefore influences the smoothness of the GP mean function estimate in the Sobolev sense. This is demonstrated in Fig. 19, which shows that the $\alpha $ value obtained from LML optimization (7), or $\alpha _*$ if fine-tuned using SSIM, moderates the decay rate of the interpolating kernel in the Fourier domain. This compensates for over-smoothing and minimizes the $\Phi $ versus $\Psi $ discrepancies in a global sense. Although the effectiveness of smoothness compensation can be diminished by variability in the estimated hyperparameters, Monte Carlo simulation shows that the SSIM-optimized $\alpha _*$ values are particularly robust with respect to changes in the length scale parameters. These results are reported in Appendix B.

6 Discussion

The preceding analysis suggests it would be reasonable to hypothesize a spectral connection between the kernel and geochemical distribution (random process) which governs how $\alpha $ is selected. To explore this further, the concept of spectral flatness (Dubnov 2004) will now be introduced. Spectral flatness $\gamma _V^2$—also known as Wiener entropy—represents a ratio of the geometric and arithmetic means of a power spectrum, $S_V(\omega )$. It measures how fast the power density decays with spatial frequency in the spectral domain. The property $\gamma _V^2\le 1$ always holds, with equality if and only if $S_V(\omega )$ is flat.

6.1 Establishing Relationships in the Spectral Domain

In our application, the two-dimensional discrete Fourier transform (DFT) (Taubman and Marcellin 2002) is first computed for the reference scalar field, $v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^{m_1\times m_2}$, to produce $V_{k_1,k_2}\in \mathbb {C}^{n\times n}$, where $n\ge \max \{m_1,m_2\}$. The DFT is computed using the $n$-point fast Fourier transform (FFT) algorithm. The DFT coefficients are given by $V_{k_1,k_2}=\frac{1}{n}\sum \limits _{p_1=0}^{n-1}\sum \limits _{p_2=0}^{n-1}v_{p_1,p_2}e^{-i\tfrac{2\pi }{n}(p_1 k_1+p_2 k_2)}$ for $0\le k_1,k_2 < n$. This first step is shown in Fig. 20(left) for two different chemical distributions, Al and Pb. Subsequently, a one-dimensional power density spectrum, $S_V(\omega _r)=\tfrac{2}{\pi }\int _{0}^{\pi /2} \left| V_{\omega _r,\theta }\right| d\theta $, is obtained by averaging the magnitude response over angles in the first quadrant $0\le k_1,k_2 < n/2$, with radial frequencies $\omega _r=\Vert \omega \Vert =\sqrt{k_1^2+k_2^2}$. This second step produces the power density spectra shown in Fig. 20(right).
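The two steps above may be sketched with NumPy as follows. The function name `radial_spectrum` is hypothetical, and the angular average is approximated here by grouping first-quadrant coefficients into integer radial bins, which is an assumption about the discretization rather than the procedure used in the study.

```python
import numpy as np

def radial_spectrum(v_ref, n=None):
    """Zero-padded n x n DFT followed by a radial average of |V| over the
    first quadrant 0 <= k1, k2 < n/2."""
    m1, m2 = v_ref.shape
    if n is None:
        n = max(m1, m2)
    V = np.fft.fft2(v_ref, s=(n, n)) / n        # 1/n scaling as in the text
    half = n // 2
    mag = np.abs(V[:half, :half])
    k1, k2 = np.meshgrid(np.arange(half), np.arange(half), indexing="ij")
    r = np.rint(np.hypot(k1, k2)).astype(int)   # integer radial bin per coefficient
    sums = np.bincount(r.ravel(), weights=mag.ravel())
    counts = np.bincount(r.ravel())
    radii = np.arange(half)                     # keep radii spanned along both axes
    return radii, sums[radii] / counts[radii]
```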

It is worth noting that for three cases, Pb, Zn and Sb, where LML optimization failed to produce sensible $\alpha _1$ or $\alpha _0$ (see category 3 in Fig. 14, where the results of $\alpha _*$ and $\alpha _0$ diverge), all have relatively large higher-frequency contents. A common thread is that all three have a “nuggety” chemical distribution with sharp spatial gradients around many clusters. Thus, one would expect them to be characterized by a large spectral flatness value $\gamma _V^2$, and this can be confirmed in Table 4. Concretely, if $\Omega =\{\omega _r\}_r$ defines the set of sampled frequency points $\omega _r$, spectral flatness (the Wiener entropy associated with the random process) is computed as

$$\begin{aligned} \gamma _V^2=\frac{\exp \left( \tfrac{1}{|\Omega |}\sum _r \log _e S_V(\omega _r)\right) }{\tfrac{1}{|\Omega |}\sum _r S_V(\omega _r)}. \end{aligned}$$

(26)
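Equation 26 may be evaluated as in the following sketch, where the function name is illustrative and the sampled spectrum is assumed to be strictly positive so that the logarithm is defined.

```python
import math

def spectral_flatness(power):
    """Ratio of the geometric to the arithmetic mean of a sampled spectrum (Eq. 26)."""
    if any(s <= 0.0 for s in power):
        raise ValueError("all spectral samples must be strictly positive")
    log_mean = sum(math.log(s) for s in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))
```

A flat spectrum gives a value of 1, and the value falls toward 0 as the power becomes concentrated in fewer frequency samples.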

Table 4 Spectral flatness (Wiener entropy) for different chemical processes


Utilizing the results obtained for both the Matérn and SE augmented kernels, Fig. 21(left) shows that the SSIM-optimal $\alpha _*$ values are correlated with log(spectral flatness). The correlation coefficient $\rho (\alpha _*,\log (\gamma _V^2))$ is 0.718. Motivated by the concept of equivalent kernels (Sollich and Williams 2004), we define the $-20$ dB point as the cutoff frequency $\omega _c$ (or equivalent bandwidth) of the proposed covariance functions, $K_{\alpha _*}$. Figure 21(right) shows that this kernel bandwidth ($\omega _c$, for which $|\hat{k}_\alpha (\omega _c)|/|\hat{k}_\alpha (0)|\approx e^{-1}$) is linearly correlated with log(spectral flatness). The correlation coefficient $\rho (\omega _c(\alpha _*),\log (\gamma _V^2))$ is 0.834.

The implication is that in addition to (1) obtaining $\alpha _0$ directly by finding hyperparameters $\theta =\{l_x,l_y,\sigma _f,\nu ,\alpha \}$ that jointly maximize the log-marginal likelihood, and (2) performing a line search with the remaining hyperparameters $\theta \backslash \alpha $ fixed, finding an $\alpha _*$ that maximizes the structural similarity index SSIM($\alpha $)—this may require a few GP evaluations and interpolation of a convex function—there is a third option. First, spectral flatness ($\gamma _V^2$) may be computed on $v_\text {ref}$ using Eq. 26 for a chemical distribution of interest. Then, $\omega _c$ may be inferred using the linear relationship between $\log (\gamma _V^2)$ and kernel bandwidth. Finally, the $\tilde{\alpha }_\text {sf}$ (here, the subscript “sf” denotes spectral flatness) that generates a kernel $K_\alpha $ with the closest equivalent bandwidth is chosen as a proxy for $\alpha _*$. This has a time complexity of $O(n\log n)$ as opposed to $O(n^3)$ for GP regression. Thus, explicit SSIM optimization of $\alpha $ (option 2) may be deferred and performed as a refinement step only if $\alpha _0$ deviates significantly from $\tilde{\alpha }_\text {sf}$, say, $|\alpha _0-\tilde{\alpha }_\text {sf}|> \tau $ for some threshold $\tau $. When $\text {SSIM}(\tilde{\alpha }_\text {sf})\gg \text {SSIM}(\alpha _0)$, $\tilde{\alpha }_\text {sf}$ may be used instead of $\alpha _0$. Alternatively, two neighboring points, say, $\text {SSIM}(\tilde{\alpha }_\text {sf}-\delta )$ and $\text {SSIM}(\tilde{\alpha }_\text {sf}+\delta )$, may be evaluated to find the best $\alpha $ value in the $\left[ \tilde{\alpha }_\text {sf}-\delta ,\tilde{\alpha }_\text {sf}+\delta \right] $ range using a spline interpolator.
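The third option and the deferral rule may be sketched as follows. Everything in this sketch is hypothetical: the function names, the fitted line coefficients (`slope`, `intercept`), and the lookup table mapping candidate $\alpha $ values to kernel bandwidths would all need to be derived from the data and kernels at hand.

```python
def alpha_from_flatness(log_flatness, slope, intercept, bandwidth_table):
    """Pick the alpha whose kernel bandwidth is closest to the bandwidth
    predicted from log(spectral flatness) by a fitted line.

    bandwidth_table: dict mapping candidate alpha -> cutoff frequency omega_c
    """
    omega_c = slope * log_flatness + intercept
    return min(bandwidth_table, key=lambda a: abs(bandwidth_table[a] - omega_c))

def needs_refinement(alpha_0, alpha_sf, tau):
    """True when the LML estimate deviates from the proxy by more than tau,
    in which case an explicit SSIM line search may be worthwhile."""
    return abs(alpha_0 - alpha_sf) > tau
```

In use, `alpha_from_flatness` would supply $\tilde{\alpha }_\text {sf}$, and `needs_refinement` would decide whether the more expensive SSIM evaluation is triggered.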

7 Concluding Remarks

A new class of stationary covariance functions ($K_\alpha $) with a control parameter $\alpha $ has been proposed to address the issue of excessive smoothing commonly observed in Gaussian process regression. To satisfy the positive semi-definite requirement, $\alpha $ intervals were determined to ensure that $K_\alpha $ remained a valid kernel. Partial derivatives with respect to the kernel hyperparameters were obtained and expressed in terms of the gradient of the base covariance functions. This allowed all the hyperparameters $\theta $ (including $\alpha _0$) to be learned from the data by maximizing the log-marginal likelihood (LML). Using the $\bar{\chi }^2$ formula and structural similarity index (SSIM), the spatial fidelity of GP mean predictions was evaluated against a reference informed by geochemical samples from the dataset. The performance of the proposed kernel $K_\alpha $ (with variable $\alpha $) was compared with the base kernel $K_\text {cov}$ (with $\alpha $ fixed at 1). Visual inspection and quantitative analysis both demonstrated consistent improvement from using $K_{\alpha }$ relative to the baseline. Experiments also revealed that the $\alpha _1$ values which maximize the LML differed from the optimal value $\alpha _*$ found by SSIM 65% of the time. Among these, 60% experienced significant quality degradation whereby the GP regression distortion $\Delta S_{1,*}$ due to $\alpha _1$ varied from $-$0.1595 (lower quartile) to $-$0.3031 (upper quartile) relative to a perfect SSIM score of 1.

This study described how changes in $\alpha $ affect the frequency response of the proposed kernel. Setting $\alpha $ close to 1 shrinks the low-frequency passband, which may blur or distort the GP mean estimate; the extent depends on the spectral properties of the signal. Conversely, setting $\alpha $ close to 0.1 allows more structural information in the form of higher spatial frequencies to pass through; geochemical features are generally better preserved at the risk of white noise amplification. The optimal setting usually represents a balance between these two extremes. Formally, smoothness was defined in terms of having a finite norm $\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2$ for all $k<m_0$ in the Sobolev space. The results in Sect. 5.3 demonstrated that over-smoothing occurs when the equivalent kernel frequency response, $|Q_{K_\theta (\alpha )}(\omega )|^2$, decays at a much faster rate than the target function, $|\hat{f}(\omega )|^2$. From a spectral perspective, the $K_\alpha $ covariance function thus provides a mechanism for compensating over-smoothing by regulating this rate of decay. The final contribution was demonstrating a linear dependence of $\alpha _*$ on the log-spectral flatness of the power spectrum. The correlation coefficient for the kernel bandwidth and Wiener entropy of the geochemical process was 0.834. The significance of this result is that an approximate value of $\alpha _*$ can potentially be found using this relationship at a cost of $O(n\log n)$.

To the best of the authors’ knowledge, the concepts of structural similarity, Sobolev smoothness and spectral flatness have not been used previously in geostatistics, in relation to GP regression on soil samples. From a complementary perspective, the spatial information embedded in geochemical processes may also be analyzed using spatial statistics such as principal coordinates of neighbor matrices (PCNMs) (Borcard et al. 2004) or Moran’s eigenvector maps (MEMs) (Griffith and Peres-Neto 2006). This decomposition allows multiscale spatial variations to be visualized as geographical maps (Dray 2020). For analysis, significant spatial structures may be represented by significant PCNM variables with large eigenvalues. Since the eigenvalues correspond to Moran’s coefficient of spatial autocorrelation, a connection with the local power spectrum is conceivable (Messerschmitt 2006). These relationships and how they relate to geospatial patterns and kernel bandwidth may be investigated in future work.

References

Akian JL, Bonnet L, Owhadi H, Savin É (2022) Learning “best’’ kernels from data in Gaussian process regression with application to aerodynamics. J Comput Phys 470:111595. https://doi.org/10.1016/j.jcp.2022.111595
Article Google Scholar
Bohling G (2005) Kriging, C &PE940. Technical report, Kansas Geological Survey. https://issuu.com/igeography/docs/kriging
Borcard D, Legendre P, Avois-Jacquet C, Tuomisto H (2004) Dissecting the spatial structure of ecological data at multiple scales. Ecology 85(7):1826–1832
Article Google Scholar
Brunet D, Vrscay ER, Wang Z (2011) On the mathematical properties of the structural similarity index. IEEE Trans Image Process 21(4):1488–1499
Article Google Scholar
Chlingaryan A, Melkumyan A, Murphy RJ, Schneider S (2016) Automated multi-class classification of remotely sensed hyperspectral imagery via Gaussian processes with a non-stationary covariance function. Math Geosci 48(5):537–558. https://doi.org/10.1007/s11004-015-9622-x
Article CAS Google Scholar
Coombs MJ, Kotlyar BB, Ludington S, Folger HW, Mossottil VG (2002) Multielement geochemical dataset of surficial materials for the Northern Great Basin. (Open-File Report 02-227). Technical report, US Geological Survey. https://pubs.usgs.gov/of/2002/0227/grids.html
Deisenroth MP, Fox D, Rasmussen CE (2013) Gaussian processes for data-efficient learning in robotics and control. IEEE Trans Pattern Anal Mach Intell 37(2):408–423
Article Google Scholar
Dray S (2020) Moran’s Eigenvector maps and related methods for the spatial multiscale analysis of ecological data. R Tutorial (Multivar Multiscale Spat Anal) 1(97):56
Google Scholar
Dubnov S (2004) Generalization of spectral flatness measure for non-gaussian linear processes. IEEE Signal Process Lett 11(8):698–701
Article Google Scholar
Gibbs MN (1998) Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge
Grbić R, Kurtagić D, Slišković D (2013) Stream water temperature prediction based on Gaussian process regression. Expert Syst Appl 40(18):7407–7414
Article Google Scholar
Griffith DA, Peres-Neto PR (2006) Spatial modeling in ecology: the flexibility of eigenfunction spatial analyses. Ecology 87(10):2603–2613
Article Google Scholar
Higdon D, Swall J, Kern J (1999) Non-stationary spatial modeling Bayesian. Statistics 6(1):761–768
Google Scholar
Journel AG, Kyriakidis PC, Mao S (2000) Correcting the smoothing effect of estimators: a spectral postprocessor. Math Geol 32:787–813
Article Google Scholar
Leung R, Balamurali M, Lowe A (2022) Surface war** incorporating machine learning assisted domain likelihood estimation: a new paradigm in mine geology modeling and automation. Math Geosci 54(3):533–572
Article Google Scholar
MacKay DJ et al (1998) Introduction to Gaussian processes. NATO ASI Ser F Comput Syst Sci 168:133–166
Google Scholar
Melkumyan A, Ramos FT (2009) A sparse covariance function for exact Gaussian process inference in large datasets. In: 21st International joint conference on artificial intelligence
Melkumyan A, Hatherly P, Zhou H (2011) Fusion of drill monitoring data with geological borehole assays. In: 12th ISRM Congress, OnePetro
Messerschmitt D (2006) Autocorrelation matrix eigenvalues and the power spectrum, Technical report no UCB/EECS-2006-90. EECS Dept, Univ of California, Berkeley
Murphy RJ, Chlingaryan A, Melkumyan A (2014) Gaussian processes for estimating wavelength position of the ferric iron crystal field feature at $\sim $ 900 nm from hyperspectral imagery acquired in the short-wave infrared (1002–1355 nm). IEEE Trans Geosci Remote Sens 53(4):1907–1920
Article Google Scholar
Olea RA, Pawlowsky V (1996) Compensating for estimation smoothing in kriging. Math Geol 28:407–417
Article Google Scholar
Oppenheim AV, Buck JR, Schafer RW (2001) Discrete-time signal processing. Prentice Hall, Upper Saddle River
Google Scholar
Osborne MA, Roberts SJ, Rogers A, Ramchurn SD, Jennings NR (2008) Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In: 2008 International conference on information processing in sensor networks (IPSN 2008). IEEE, pp 109–120
Paciorek C, Schervish M (2003) Nonstationary covariance functions for Gaussian process regression. Adv Neural Inf Process Syst 16:32
Google Scholar
Rasmussen CE, Williams CK (2006) Gaussian processes for machine learning, vol 2. MIT Press, Cambridge
Google Scholar
Ray A, Myer D (2019) Bayesian geophysical inversion with trans-dimensional Gaussian process machine learning. Geophys J Int 217(3):1706–1726
Article Google Scholar
Shekaramiz M, Moon TK, Gunther JH (2019) A note on Kriging and Gaussian processes. Technical report, Information Dynamics Laboratory, Electrical and Computer Engineering Department, Utah State University. https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=1213 &context=ece_facpub
Sollich P, Williams C (2004) Using the equivalent kernel to understand Gaussian process regression. Adv Neural Inf Process Syst 17:23
Taubman DS, Marcellin MW (2002) JPEG2000: image compression fundamentals, standards and practice. Kluwer Academic Publishers, Boston
Veras R, Collins C (2019) Discriminability tests for visualization effectiveness and scalability. IEEE Trans Visual Comput Graph 26(1):749–758
Wang W, Jing BY (2022) Gaussian process regression: optimality, robustness, and relationship with kernel ridge regression. J Mach Learn Res 23(193):1–67
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Wynne G, Briol FX, Girolami M (2021) Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness. J Mach Learn Res 22(1):5468–5507
Yamamoto JK (2005) Correcting the smoothing effect of ordinary kriging estimates. Math Geol 37:69–94
Yang J, Jakeman A, Fang G, Chen X (2018) Uncertainty analysis of a semi-distributed hydrologic model based on a Gaussian process emulator. Environ Modell Softw 101:289–300
Yao T (1998) Automatic covariance modeling and conditional spectral simulation with fast Fourier Transform. PhD thesis, Stanford University. https://searchworks.stanford.edu/view/3952866


Acknowledgements

This work was supported by the Australian Centre for Field Robotics and the Rio Tinto Centre for Mine Automation.

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions.

Author information

Authors and Affiliations

Australian Centre for Field Robotics, The University of Sydney, Sydney Robotics Hub J18, Sydney, NSW, 2006, Australia
Anna Chlingaryan, Raymond Leung & Arman Melkumyan

Authors

Anna Chlingaryan
Raymond Leung
Arman Melkumyan

Contributions

AC: conceptualization (kernel formulation), methodology (compute partial derivatives and valid intervals), validation (validate kernels using real and synthetic data; learn and verify hyperparameters), writing—original draft (focused on mathematical properties and GP framework). RL: conceptualization (evolution of research goals and aims; introduce the concepts of SSIM and Wiener entropy), methodology (design experiments and evaluation procedures), software (implement SSIM, spectral analysis and non-GP-related code), validation, investigation, formal analysis, writing—original draft (focused on overall structure, materials and methods, results and interpretations), visualization. AM: conceptualization, methodology, supervision, review and feedback.

Corresponding author

Correspondence to Raymond Leung.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendices

Appendix A: Kernel Hyperparameters

Table 5 reports the Matérn and SE kernel hyperparameters obtained by maximizing the log-marginal likelihood for each chemical in the Northern Great Basin geochemical dataset. An interesting observation is that $\alpha $ approaches the lower bound ($\alpha \rightarrow 0.1$) when $\nu \ll \sigma _f$ in the case of Mg, Na, Sb and Zn. These $\alpha $ estimates are suboptimal compared with the SSIM-optimized $\alpha _*$ values reported in Table 3.

Table 5 Learned hyperparameters for the proposed kernels, $K_\alpha $


Table 6 Monte Carlo results focused on length scales and smoothness variation


Appendix B: Monte Carlo Experiments

In the proposed method, and in GP parameter estimation more broadly, different combinations of length scales and smoothness parameters may correspond to multiple local minima of the negative LML objective. To evaluate the stability of the parameter estimates, Monte Carlo experiments were conducted, comprising $N\!=\!10$ runs (each determined by a random seed). In each run, 100 random subsets were generated, each by selecting 1,000 samples without replacement from the 10,261 measurements. The random subset that minimizes the KL divergence with respect to the full training data was chosen and used to learn the GP hyperparameters in that run. The results for Al and V are presented in Table 6.
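
The subset-selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the text does not state how the KL divergence was estimated or in which direction, so the sketch assumes a histogram-based estimate of $D(\text{subset}\,\Vert \,\text{full})$ over the measured values of a single element. The function names, the bin count and the synthetic log-normal data are all hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) between two histograms (assumed estimator)."""
    p = p / p.sum()
    q = q / q.sum()
    # Bins with p == 0 contribute zero; eps guards against log(0).
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def select_subset(values, subset_size=1000, n_candidates=100, n_bins=50, seed=0):
    """Draw n_candidates random subsets and keep the one closest in KL to the full data."""
    rng = np.random.default_rng(seed)
    # Shared bin edges so the subset and full-data histograms are comparable.
    edges = np.histogram_bin_edges(values, bins=n_bins)
    full_hist, _ = np.histogram(values, bins=edges)
    best_idx, best_kl = None, np.inf
    for _ in range(n_candidates):
        # Sampling without replacement, as described in the text.
        idx = rng.choice(values.size, size=subset_size, replace=False)
        sub_hist, _ = np.histogram(values[idx], bins=edges)
        kl = kl_divergence(sub_hist.astype(float), full_hist.astype(float))
        if kl < best_kl:
            best_idx, best_kl = idx, kl
    return best_idx, best_kl


# Synthetic stand-in for one element's 10,261 measurements (not the NGB data).
values = np.random.default_rng(42).lognormal(mean=0.0, sigma=1.0, size=10_261)

# N = 10 Monte Carlo runs, each determined by its own seed.
runs = [select_subset(values, seed=s) for s in range(10)]
```

Each entry of `runs` holds the indices of the selected subset and its KL score; in a full experiment those indices would select the training samples used to learn the GP hyperparameters for that run.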

The $l_x$ and $l_y$ columns show that the length scale parameters can vary between the (KL divergence-optimal) random subsets from different runs. However, the dispersion around the mean (CV=$\sigma /\mu $) is relatively small (between 0.09 and 0.23). The $\alpha _0$ column shows that the smoothness parameter obtained by maximizing the LML is generally more sensitive to the training data (CV between 0.22 and 0.25). This leads to fluctuations in the power density estimates, $\Psi _{\alpha _0}^{(i)}(\omega )$, for individual runs, which can be seen in Fig. 22. Nevertheless, the overall impact on spectral estimation is small, since $\Psi _{\alpha _0}^{(i)}(\omega )$ still maintains a sizable gap above $\Psi ^{(i)}(\omega )$. Under the Sobolev-space interpretation of smoothness, slower spectral decay helps reduce over-smoothing. For reference, these ideas were described in Sect. 4.3, and the corresponding results without Monte Carlo simulation were presented in Sect. 5.3.
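
For readers who wish to reproduce the dispersion statistic, the coefficient of variation CV=$\sigma /\mu $ can be computed as in the sketch below. The source does not say whether the sample or the population standard deviation was used, so the choice of `ddof=1` is an assumption, and the numbers in `alpha_runs` are made-up placeholders rather than values from Table 6.

```python
import numpy as np


def coefficient_of_variation(samples, ddof=1):
    """Return sigma / mu for a 1-D collection of per-run estimates.

    ddof=1 (sample standard deviation) is an assumption; use ddof=0
    for the population standard deviation.
    """
    samples = np.asarray(samples, dtype=float)
    return float(np.std(samples, ddof=ddof) / np.mean(samples))


# Hypothetical smoothness estimates from 10 Monte Carlo runs (illustrative only).
alpha_runs = [0.42, 0.51, 0.38, 0.47, 0.60, 0.44, 0.36, 0.55, 0.49, 0.41]
cv_alpha = coefficient_of_variation(alpha_runs)
```

A smaller CV indicates that the estimate changes little between runs, which is the sense in which the appendix compares the length scales, $\alpha _0$ and $\alpha _*$.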

The joint effects of length scale and $\alpha _0$ variations on the equivalent kernel power spectrum can be seen in Fig. 23. Previous results from Fig. 19 showed that $|Q_{K_{\alpha _0}}(\omega )|^2$ had a larger bandwidth than $|Q_{K_{\theta }}(\omega )|^2$. In Fig. 23, this observation continues to hold. Hence, structural details in the target function continue to be well preserved in spite of changes in $\alpha _0$.

Returning to Table 6, the $\alpha _*$ column shows a high degree of consensus among the SSIM-optimized $\alpha $ estimates. In particular, the CV ranges roughly from 0.02 to 0.04.

The following figures (Figs. 24 and 25) are the counterparts to Figs. 22 and 23. Instead of $\alpha _0$, the SSIM-optimized $\alpha _*$ values are used whenever the augmented kernels $K_\alpha $ are used for GP mean prediction. A comparison of the power spectral densities $\Psi _{\alpha _*}$ in Fig. 24 with $\Psi _{\alpha _0}$ in Fig. 22, and similarly of $|Q_{K_{\alpha _*}}(\omega )|^2$ in Fig. 25 with $|Q_{K_{\alpha _0}}(\omega )|^2$ in Fig. 23, shows that the $\alpha _*$ ensemble forms an even tighter cluster.

The main finding is that SSIM produces consistent $\alpha _*$ estimates that are robust to length-scale variations. It also achieves consistent improvement and alleviates over-smoothing in the GP posterior mean estimates. This can be seen from the SSIM($\alpha _*$) column, where the CV ranges from 0.01 to 0.035.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Chlingaryan, A., Leung, R. & Melkumyan, A. Augmenting Stationary Covariance Functions with a Smoothness Hyperparameter and Improving Gaussian Process Regression Using a Structural Similarity Index. Math Geosci 56, 605–637 (2024). https://doi.org/10.1007/s11004-023-10095-5


Received: 11 December 2022
Accepted: 02 August 2023
Published: 06 September 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s11004-023-10095-5

