Abstract
We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted, with the neighbourhood determined by a distance measure based on lengthscales estimated from an initial fit. Because dimensions are weighted according to their lengthscales, variables of little relevance are downweighted, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented and empirical results indicate significant speedups.
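The adaptive neighbourhood idea can be illustrated with a short sketch. This is not the authors' exact procedure; the function name and the use of inverse lengthscales as per-dimension weights are illustrative assumptions, showing only how lengthscale weighting produces automatic downweighting of irrelevant variables.

```python
import numpy as np

def weighted_neighbourhood(X, x_star, lengthscales, k):
    """Return indices of the k training points closest to x_star under a
    lengthscale-weighted Euclidean distance. Dimensions with large estimated
    lengthscales (low relevance) contribute little to the distance."""
    # Scale each dimension by the inverse of its lengthscale from an initial fit.
    diffs = (X - x_star) / lengthscales
    d2 = np.sum(diffs ** 2, axis=1)
    return np.argsort(d2)[:k]
```

For example, a point that is far away only in a dimension with a very large lengthscale will still be selected as a neighbour, since that dimension is effectively ignored.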
*(Figures 1–11 appear at this point in the published article; the images are not reproduced here.)*
References
Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)
Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Laskey, K., Prade, H. (eds.) Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 21–30. Morgan Kaufmann, San Francisco, CA (1999)
Attias, H.: A variational Bayesian framework for graphical models. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.) Advances in Neural Information Processing Systems 12, pp. 209–215. MIT Press, Cambridge, MA (2000)
Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006)
Boughton, W.: The Australian water balance model. Environ. Model. Softw. 19, 943–956 (2004)
Braun, M., McAuliffe, J.: Variational inference for large-scale models of discrete choice. J. Am. Stat. Assoc. 105, 324–335 (2010)
Gelman, A.: Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1, 515–533 (2006)
Gramacy, R.B., Apley, D.W.: Local Gaussian process approximation for large computer experiments. J. Comput. Gr. Stat. To appear (2014)
Haas, T.C.: Local prediction of a spatio-temporal process with an application to wet sulfate deposition. J. Am. Stat. Assoc. 90, 1189–1199 (1995)
Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell. 18, 607–616 (1996)
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14, 1303–1347 (2013)
Honkela, A., Valpola, H., Karhunen, J.: Accelerating cyclic update algorithms for parameter estimation by pattern searches. Neural Process. Lett. 17, 191–203 (2003)
Huang, H., Yang, B., Hsu, C.: Triple jump acceleration for the EM algorithm. In: Han, J., Wah, B.W., Raghavan, V., Wu, X., Rastogi, R. (eds.) Proceedings of the 5th IEEE International Conference on Data Mining, pp. 649–652. IEEE Computer Society, Washington, DC, USA (2005)
Kim, H.-M., Mallick, B.K., Holmes, C.C.: Analyzing nonstationary spatial data using piecewise Gaussian processes. J. Am. Stat. Assoc. 100, 653–668 (2005)
Knowles, D.A., Minka, T.P.: Non-conjugate variational message passing for multinomial and binary regression. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 1701–1709. Curran Associates, Inc., Red Hook, NY (2011)
Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.: Sparse spectrum Gaussian process regression. J. Mach. Learn. Res. 11, 1865–1881 (2010)
Lázaro-Gredilla, M., Titsias, M.K.: Variational heteroscedastic Gaussian process regression. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning, pp. 841–848. Omnipress, Madison, WI, USA (2011)
Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc. Ser. B 73, 423–498 (2011)
Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, UK (1988)
Nguyen-Tuong, D., Seeger, M., Peters, J.: Model learning with local Gaussian process regression. Adv. Robot. 23, 2015–2034 (2009)
Nott, D.J., Tan, S.L., Villani, M., Kohn, R.: Regression density estimation with variational methods and stochastic approximation. J. Comput. Gr. Stat. 21, 797–820 (2012)
Ormerod, J.T., Wand, M.P.: Explaining variational approximations. Am. Stat. 64, 140–153 (2010)
Park, S., Choi, S.: Hierarchical Gaussian process regression. In: Sugiyama, M., Yang, Q. (eds.) Proceedings of the 2nd Asian Conference on Machine Learning, pp. 95–110 (2010)
Qi, Y., Jaakkola, T.S.: Parameter expanded variational Bayesian methods. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1097–1104. MIT Press, Cambridge (2006)
Quinlan, R.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243. Morgan Kaufmann, Amherst, MA (1993)
Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)
Ren, Q., Banerjee, S., Finley, A.O., Hodges, J.S.: Variational Bayesian methods for spatial data analysis. Comput. Stat. Data Anal. 55, 3197–3217 (2011)
Salakhutdinov, R., Roweis, S.: Adaptive overrelaxed bound optimization methods. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the 20th International Conference on Machine Learning, pp. 664–671. AAAI Press, Menlo Park, CA (2003)
Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Weiss, Y., Schölkopf, B., Platt, J. (eds.) Advances in Neural Information Processing Systems 18, pp. 1257–1264. MIT Press, Cambridge, MA (2006)
Snelson, E., Ghahramani, Z.: Local and global sparse Gaussian process approximations. In: Meila, M., Shen, X. (eds) JMLR Workshop and Conference Proceedings, vol. 2: AISTATS 2007, pp. 524–531 (2007)
Stan Development Team: RStan: the R interface to Stan, Version 2.5.0. http://mc-stan.org/rstan.html (2014)
Stein, M.L., Chi, Z., Welty, L.J.: Approximating likelihoods for large spatial data sets. J. R. Stat. Soc. Ser. B 66, 275–296 (2004)
Tan, L.S.L., Nott, D.J.: Variational inference for generalized linear mixed models using partially non-centered parametrizations. Stat. Sci. 28, 168–188 (2013)
Tan, L.S.L., Nott, D.J.: A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Bayesian Anal. 9, 963–1004 (2014). doi:10.1214/14-BA885
Titsias, M.K.: Variational learning of inducing variables in sparse Gaussian processes. In: van Dyk, D., Welling, M. (eds.) Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pp. 567–574 (2009)
Urtasun, R., Darrell, T.: Sparse probabilistic regression for activity-independent human pose inference. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
Vecchia, A.V.: Estimation and model identification for continuous spatial processes. J. R. Stat. Soc. Ser. B 50, 297–312 (1988)
Walder, C., Kim, K.I., Schölkopf, B.: Sparse multiscale Gaussian process regression. In: McCallum, A., Roweis, S. (eds.) Proceedings of the 25th International Conference on Machine Learning, pp. 1112–1119. ACM Press, New York (2008)
Wand, M.P., Ormerod, J.T., Padoan, S.A., Frühwirth, R.: Mean field variational Bayes for elaborate distributions. Bayesian Anal. 6, 847–900 (2011)
Wand, M.P.: Fully simplified multivariate normal updates in non-conjugate variational message passing. J. Mach. Learn. Res. 15, 1351–1369 (2014)
Wang, B., Titterington, D.M.: Inadequacy of interval estimates corresponding to variational Bayesian approximations. In: Cowell, R. G., Ghahramani, Z. (eds.) Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 373–380. Society for Artificial Intelligence and Statistics (2005)
Wang, B., Titterington, D.M.: Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 3, 625–650 (2006)
Winn, J., Bishop, C.M.: Variational message passing. J. Mach. Learn. Res. 6, 661–694 (2005)
Acknowledgments
We thank Lucy Marshall for supplying the rainfall-runoff data set. Linda Tan was partially supported as part of the Singapore Delft Water Alliance’s tropical reservoir research programme. David Nott, Ajay Jasra and Victor Ong’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112). We also thank the referees and associate editor for their comments, which have helped improve the manuscript.
Appendices
Appendix 1: Derivation of \(E_q(Z)\) and \(E_q(Z^TZ)\)
Lemma 1
Suppose \(\lambda \sim N(\mu ,{\varSigma })\) and \(t_1\), \(t_2\) are fixed vectors of the same length as \(\lambda \). Let \(t_{12}^-=t_1-t_2\) and \(t_{12}^+=t_1+t_2\). Then
\[
E[\cos (\lambda ^Tt_1)\cos (\lambda ^Tt_2)]=\tfrac{1}{2}\left\{ \exp (-\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-)\cos (\mu ^Tt_{12}^-)+\exp (-\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+)\cos (\mu ^Tt_{12}^+)\right\} ,
\]
\[
E[\sin (\lambda ^Tt_1)\sin (\lambda ^Tt_2)]=\tfrac{1}{2}\left\{ \exp (-\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-)\cos (\mu ^Tt_{12}^-)-\exp (-\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+)\cos (\mu ^Tt_{12}^+)\right\} ,
\]
\[
E[\sin (\lambda ^Tt_1)\cos (\lambda ^Tt_2)]=\tfrac{1}{2}\left\{ \exp (-\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-)\sin (\mu ^Tt_{12}^-)+\exp (-\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+)\sin (\mu ^Tt_{12}^+)\right\} .
\]
By setting \(t_2=0\) in the first and third expressions, we get
\[
E[\cos (\lambda ^Tt_1)]=\exp (-\tfrac{1}{2}t_1^T{\varSigma } t_1)\cos (\mu ^Tt_1), \qquad E[\sin (\lambda ^Tt_1)]=\exp (-\tfrac{1}{2}t_1^T{\varSigma } t_1)\sin (\mu ^Tt_1).
\]
Proof
\(E[\exp \{i\lambda ^T(t_1-t_2)\}]=\exp \{i \mu ^T (t_1-t_2)- \tfrac{1}{2}(t_1-t_2)^T {\varSigma } (t_1-t_2)\}\) implies
\[
E[\cos \{\lambda ^T(t_1-t_2)\}]=\exp (-\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-)\cos (\mu ^Tt_{12}^-) \quad (17)
\]
and
\[
E[\sin \{\lambda ^T(t_1-t_2)\}]=\exp (-\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-)\sin (\mu ^Tt_{12}^-). \quad (18)
\]
Replacing \(t_2\) by \(-t_2\), we get
\[
E[\cos \{\lambda ^T(t_1+t_2)\}]=\exp (-\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+)\cos (\mu ^Tt_{12}^+) \quad (19)
\]
and
\[
E[\sin \{\lambda ^T(t_1+t_2)\}]=\exp (-\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+)\sin (\mu ^Tt_{12}^+). \quad (20)
\]
Since \(2\cos a\cos b=\cos (a-b)+\cos (a+b)\), \(2\sin a\sin b=\cos (a-b)-\cos (a+b)\) and \(2\sin a\cos b=\sin (a-b)+\sin (a+b)\), (17) + (19) gives the first equation of the lemma, (17) − (19) gives the second and (18) + (20) gives the third. \(\square \)
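Lemma 1 can also be checked numerically. The sketch below compares a Monte Carlo estimate of \(E[\cos (\lambda ^Tt_1)\cos (\lambda ^Tt_2)]\) against the closed form in the first identity; the particular values of \(\mu \), \({\varSigma }\), \(t_1\), \(t_2\) are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of the first identity of Lemma 1 (illustrative values).
rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
Sigma = 0.2 * np.eye(d)               # any positive definite covariance works
t1, t2 = rng.normal(size=d), rng.normal(size=d)

# Monte Carlo estimate of E[cos(lam' t1) cos(lam' t2)] under lam ~ N(mu, Sigma).
lam = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(np.cos(lam @ t1) * np.cos(lam @ t2))

# Closed form from the lemma, with t- = t1 - t2 and t+ = t1 + t2.
tm, tp = t1 - t2, t1 + t2
closed = 0.5 * (np.exp(-0.5 * tm @ Sigma @ tm) * np.cos(mu @ tm)
                + np.exp(-0.5 * tp @ Sigma @ tp) * np.cos(mu @ tp))
```

With 200,000 draws the two quantities should agree to roughly two decimal places.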
Using Lemma 1, we have
where
![](http://media.springernature.com/full/springer-static/image/art%3A10.1007%2Fs11222-015-9600-7/MediaObjects/11222_2015_9600_Equ46_HTML.gif)
and \(t_{ir}=s_r \odot x_i\) for \(i=1,\dots ,n\), \(r=1,\dots ,m\). We also have \(E_q(Z^TZ)=\sum _{i=1}^n E_q(Z_iZ_i^T)\), where \(E_q(Z_iZ_i^T)=\left[ \begin{matrix} P_i & Q_i^T \\ Q_i & R_i \end{matrix}\right] \) and \(P_i\), \(Q_i\), \(R_i\) are \(m\times m\) matrices with \((r,l)\) entries
\[
(P_i)_{rl}=E_q[\cos (\lambda ^Tt_{ir})\cos (\lambda ^Tt_{il})],\quad (Q_i)_{rl}=E_q[\sin (\lambda ^Tt_{ir})\cos (\lambda ^Tt_{il})],\quad (R_i)_{rl}=E_q[\sin (\lambda ^Tt_{ir})\sin (\lambda ^Tt_{il})],
\]
each evaluated using Lemma 1 with \(t_{irl}^-=t_{ir}-t_{il}\), \(t_{irl}^+=t_{ir}+t_{il}\) for \(r=1,\dots ,m\), \(l=1,\dots ,m\).
Appendix 2: Derivation of lower bound
From (6), the lower bound is given by
where
The terms in the lower bound can be evaluated as follows:
Putting these terms together and making use of the updates in steps 5 and 6 of Algorithm 1 gives the lower bound in (12).
Appendix 3: Derivation of simplified updates in Algorithm 2
It can be shown (see Wand 2014; Tan and Nott 2013) that the natural parameter of \(q(\lambda )=N(\mu _\lambda ^q,{\varSigma }_\lambda ^q)\) is
\[
\begin{bmatrix} {{\varSigma }_\lambda ^q}^{-1}\mu _\lambda ^q \\[2pt] -\tfrac{1}{2} D_d^T \text {vec}({{\varSigma }_\lambda ^q}^{-1}) \end{bmatrix},
\]
where \(D_d\) is the unique \(d^2 \times \tfrac{1}{2}d(d+1)\) matrix that transforms \(\text {vech}(A)\) into \(\text {vec}(A)\) for any \(d \times d\) symmetric matrix A, that is, \(D_d\text {vech}(A)=\text {vec}(A)\). We use \(\text {vech}(A)\) to denote the \(\tfrac{1}{2}d(d+1) \times 1\) vector obtained from \(\text {vec}(A)\) by eliminating all supradiagonal elements of A. Magnus and Neudecker (1988) is a good reference for the matrix differential calculus involved in the derivation below. From (13) and (Tan and Nott 2013, pg. 7), we have
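The duplication matrix \(D_d\) can be built explicitly, which is useful when implementing the natural parameter updates. The sketch below uses column-major (Fortran) ordering for \(\text {vec}\); the function name is illustrative.

```python
import numpy as np

def duplication_matrix(d):
    """Build the d^2 x d(d+1)/2 duplication matrix D_d satisfying
    D_d @ vech(A) == vec(A) for any d x d symmetric A, where vec stacks
    the columns of A and vech stacks the columns of its lower triangle."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    col = 0
    for j in range(d):
        for i in range(j, d):          # lower triangle, including diagonal
            D[j * d + i, col] = 1.0    # position of a_ij in vec(A)
            D[i * d + j, col] = 1.0    # its symmetric counterpart a_ji
            col += 1
    return D
```

For a diagonal entry (\(i=j\)) both assignments address the same element of \(D\), so each row of \(D_d\) contains exactly one 1.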
where \(\dfrac{\partial S_a}{\partial \text {vec}({\varSigma }_\lambda ^q)}\) and \(\dfrac{\partial S_a}{\partial \mu _\lambda ^q}\) are evaluated at
Let
The first line of (21) simplifies to
The second line of (21) gives
Cite this article
Tan, L.S.L., Ong, V.M.H., Nott, D.J. et al. Variational inference for sparse spectrum Gaussian process regression. Stat Comput 26, 1243–1261 (2016). https://doi.org/10.1007/s11222-015-9600-7