Variational inference for sparse spectrum Gaussian process regression

Abstract

We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted, and the neighbourhood is determined using a measure based on lengthscales estimated from an initial fit. Because dimensions are weighted according to these lengthscales, variables of little relevance are downweighted, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented, and empirical results indicate significant speedups.
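The adaptive neighbourhood idea in the second contribution can be illustrated with a short sketch. The function below is hypothetical (the exact distance measure used in the paper may differ): it selects the k training points nearest to a prediction point under a metric that downweights dimensions with large estimated lengthscales.

```python
import numpy as np

def lengthscale_weighted_neighbours(x_star, X, lengthscales, k):
    """Indices of the k training inputs closest to x_star under a
    lengthscale-weighted distance (illustrative sketch only).

    X            : (n, d) training inputs
    lengthscales : (d,) lengthscales from an initial GP fit; a large
                   lengthscale means the dimension matters little
    """
    w = 1.0 / np.asarray(lengthscales)            # downweight irrelevant dims
    d2 = np.sum(((X - x_star) * w) ** 2, axis=1)  # squared weighted distances
    return np.argsort(d2)[:k]
```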

References

  • Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)

  • Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Laskey, K., Prade, H. (eds.) Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 21–30. Morgan Kaufmann, San Francisco, CA (1999)

  • Attias, H.: A variational Bayesian framework for graphical models. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.) Advances in Neural Information Processing Systems 12, pp. 209–215. MIT Press, Cambridge, MA (2000)

  • Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006)

  • Boughton, W.: The Australian water balance model. Environ. Model. Softw. 19, 943–956 (2004)

  • Braun, M., McAuliffe, J.: Variational inference for large-scale models of discrete choice. J. Am. Stat. Assoc. 105, 324–335 (2010)

  • Gelman, A.: Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1, 515–533 (2006)

  • Gramacy, R.B., Apley, D.W.: Local Gaussian process approximation for large computer experiments. J. Comput. Gr. Stat. To appear (2014)

  • Haas, T.C.: Local prediction of a spatio-temporal process with an application to wet sulfate deposition. J. Am. Stat. Assoc. 90, 1189–1199 (1995)

  • Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell. 18, 607–616 (1996)

  • Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14, 1303–1347 (2013)

  • Honkela, A., Valpola, H., Karhunen, J.: Accelerating cyclic update algorithms for parameter estimation by pattern searches. Neural Process. Lett. 17, 191–203 (2003)

  • Huang, H., Yang, B., Hsu, C.: Triple jump acceleration for the EM algorithm. In: Han, J., Wah, B.W., Raghavan, V., Wu, X., Rastogi, R. (eds.) Proceedings of the 5th IEEE International Conference on Data Mining, pp. 649–652. IEEE Computer Society, Washington, DC, USA (2005)

  • Kim, H.-M., Mallick, B.K., Holmes, C.C.: Analyzing nonstationary spatial data using piecewise Gaussian processes. J. Am. Stat. Assoc. 100, 653–668 (2005)

  • Knowles, D.A., Minka, T.P.: Non-conjugate variational message passing for multinomial and binary regression. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 1701–1709. Curran Associates, Inc., Red Hook, NY (2011)

  • Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.: Sparse spectrum Gaussian process regression. J. Mach. Learn. Res. 11, 1865–1881 (2010)

  • Lázaro-Gredilla, M., Titsias, M.K.: Variational heteroscedastic Gaussian process regression. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning, pp. 841–848. Omnipress, Madison, WI, USA (2011)

  • Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. J. R. Stat. Soc. Ser. B 73, 423–498 (2011)

  • Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, UK (1988)

  • Nguyen-Tuong, D., Seeger, M., Peters, J.: Model learning with local Gaussian process regression. Adv. Robot. 23, 2015–2034 (2009)

  • Nott, D.J., Tan, S.L., Villani, M., Kohn, R.: Regression density estimation with variational methods and stochastic approximation. J. Comput. Gr. Stat. 21, 797–820 (2012)

  • Ormerod, J.T., Wand, M.P.: Explaining variational approximations. Am. Stat. 64, 140–153 (2010)

  • Park, S., Choi, S.: Hierarchical Gaussian process regression. In: Sugiyama, M., Yang, Q. (eds.) Proceedings of the 2nd Asian Conference on Machine Learning, pp. 95–110 (2010)

  • Qi, Y., Jaakkola, T.S.: Parameter expanded variational Bayesian methods. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1097–1104. MIT Press, Cambridge (2006)

  • Quinlan, R.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243. Morgan Kaufmann, University of Massachusetts, Amherst (1993)

  • Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)

  • Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)

  • Ren, Q., Banerjee, S., Finley, A.O., Hodges, J.S.: Variational Bayesian methods for spatial data analysis. Comput. Stat. Data Anal. 55, 3197–3217 (2011)

  • Salakhutdinov, R., Roweis, S.: Adaptive overrelaxed bound optimization methods. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the 20th International Conference on Machine Learning, pp. 664–671. AAAI Press, Menlo Park, CA (2003)

  • Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Weiss, Y., Schölkopf, B., Platt, J. (eds.) Advances in Neural Information Processing Systems 18, pp. 1257–1264. MIT Press, Cambridge, MA (2006)

  • Snelson, E., Ghahramani, Z.: Local and global sparse Gaussian process approximations. In: Meila, M., Shen, X. (eds) JMLR Workshop and Conference Proceedings, vol. 2: AISTATS 2007, pp. 524–531 (2007)

  • Stan Development Team: RStan: the R interface to Stan, Version 2.5.0. http://mc-stan.org/rstan.html (2014)

  • Stein, M.L., Chi, Z., Welty, L.J.: Approximating likelihoods for large spatial data sets. J. R. Stat. Soc. Ser. B 66, 275–296 (2004)

  • Tan, L.S.L., Nott, D.J.: Variational inference for generalized linear mixed models using partially non-centered parametrizations. Stat. Sci. 28, 168–188 (2013)

  • Tan, L.S.L., Nott, D.J.: A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Bayesian Anal. 9, 963–1004 (2014). doi:10.1214/14-BA885

  • Titsias, M.K.: Variational learning of inducing variables in sparse Gaussian processes. In: van Dyk, D., Welling, M. (eds.) Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pp. 567–574 (2009)

  • Urtasun, R., Darrell, T.: Sparse probabilistic regression for activity-independent human pose inference. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)

  • Vecchia, A.V.: Estimation and model identification for continuous spatial processes. J. R. Stat. Soc. Ser. B 50, 297–312 (1988)

  • Walder, C., Kim, K.I., Schölkopf, B.: Sparse multiscale Gaussian process regression. In: McCallum, A., Roweis, S. (eds.) Proceedings of the 25th International Conference on Machine Learning, pp. 1112–1119. ACM Press, New York (2008)

  • Wand, M.P., Ormerod, J.T., Padoan, S.A., Frühwirth, R.: Mean field variational Bayes for elaborate distributions. Bayesian Anal. 6, 847–900 (2011)

  • Wand, M.P.: Fully simplified multivariate normal updates in non-conjugate variational message passing. J. Mach. Learn. Res. 15, 1351–1369 (2014)

  • Wang, B., Titterington, D.M.: Inadequacy of interval estimates corresponding to variational Bayesian approximations. In: Cowell, R. G., Ghahramani, Z. (eds.) Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 373–380. Society for Artificial Intelligence and Statistics (2005)

  • Wang, B., Titterington, D.M.: Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 1, 625–650 (2006)

  • Winn, J., Bishop, C.M.: Variational message passing. J. Mach. Learn. Res. 6, 661–694 (2005)

Acknowledgments

We thank Lucy Marshall for supplying the rainfall-runoff data set. Linda Tan was partially supported as part of the Singapore Delft Water Alliance’s tropical reservoir research programme. David Nott, Ajay Jasra and Victor Ong’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112). We also thank the referees and associate editor for their comments, which have helped improve the manuscript.

Corresponding author

Correspondence to Linda S. L. Tan.

Appendices

Appendix 1: Derivation of \(E_q(Z)\) and \(E_q(Z^TZ)\)

Lemma 1

Suppose \(\lambda \sim N(\mu ,{\varSigma })\) and \(t_1\), \(t_2\) are fixed vectors of the same length as \(\lambda \). Let \(t_{12}^-=t_1-t_2\) and \(t_{12}^+=t_1+t_2\). Then

$$\begin{aligned}&E\left\{ \cos \left( t_1^T\lambda \right) \cos \left( t_2^T\lambda \right) \right\} =\tfrac{1}{2}\left[ \exp \left( -\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-\right) \right. \\&\quad \left. \cdot \cos \left( {t_{12}^-}^T\mu \right) +\exp \left( -\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+\right) \cos \left( {t_{12}^+}^T\mu \right) \right] \\&E\left\{ \sin \left( t_1^T\lambda \right) \sin \left( t_2^T \lambda \right) \right\} =\tfrac{1}{2}\left[ \exp \left( -\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-\right) \right. \\&\quad \left. \cdot \cos \left( {t_{12}^-}^T\mu \right) -\exp \left( -\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+\right) \cos \left( {t_{12}^+}^T\mu \right) \right] \\&E\left\{ \sin \left( t_1^T\lambda \right) \cos \left( t_2^T \lambda \right) \right\} =\tfrac{1}{2}\left[ \exp \left( -\tfrac{1}{2}{t_{12}^-}^T{\varSigma } t_{12}^-\right) \right. \\&\quad \left. \cdot \sin \left( {t_{12}^-}^T\mu \right) + \exp \left( -\tfrac{1}{2}{t_{12}^+}^T{\varSigma } t_{12}^+\right) \sin \left( {t_{12}^+}^T\mu \right) \right] \end{aligned}$$

By setting \(t_2=0\) in the first and third expressions, we get

$$\begin{aligned}&E\left\{ \cos \left( t_1^T\lambda \right) \right\} = \exp \left( -\tfrac{1}{2}t_1^T{\varSigma } t_1\right) \cos \left( t_1^T\mu \right) \;\; \text {and} \\&E\left\{ \sin \left( t_1^T\lambda \right) \right\} =\exp \left( -\tfrac{1}{2}t_1^T{\varSigma } t_1\right) \sin \left( t_1^T\mu \right) . \end{aligned}$$

Proof

\(E[\exp \{i\lambda ^T(t_1-t_2)\}]=\exp \{i \mu ^T (t_1-t_2)- \tfrac{1}{2}(t_1-t_2)^T {\varSigma } (t_1-t_2)\}\) implies

$$\begin{aligned} E\left[ \cos \left\{ \lambda ^T\left( t_1-t_2\right) \right\} \right]= & {} E\left\{ \cos \left( t_1^T\lambda \right) \cos \left( t_2^T\lambda \right) \right. \nonumber \\&\left. +\sin \left( t_1^T\lambda \right) \sin \left( t_2^T\lambda \right) \right\} \nonumber \\= & {} \exp \left\{ -\tfrac{1}{2}(t_1-t_2)^T {\varSigma } (t_1-t_2)\right\} \nonumber \\&\cdot \cos \left\{ \mu ^T (t_1-t_2)\right\} \end{aligned}$$
(17)

and

$$\begin{aligned} E\left[ \sin \left\{ \lambda ^T\left( t_1-t_2\right) \right\} \right]= & {} E\left\{ \sin \left( t_1^T\lambda \right) \cos \left( t_2^T\lambda \right) \right. \nonumber \\&\left. -\cos \left( t_1^T\lambda \right) \sin \left( t_2^T\lambda \right) \right\} \nonumber \\= & {} \exp \left\{ -\tfrac{1}{2}(t_1-t_2)^T {\varSigma } (t_1-t_2)\right\} \nonumber \\&\cdot \sin \left\{ \mu ^T (t_1-t_2)\right\} . \end{aligned}$$
(18)

Replacing \(t_2\) by \(-t_2\), we get

$$\begin{aligned} E\left[ \cos \left\{ \lambda ^T\left( t_1+t_2\right) \right\} \right]= & {} E\left\{ \cos \left( t_1^T\lambda \right) \cos \left( t_2^T\lambda \right) \right. \nonumber \\&\left. -\sin \left( t_1^T\lambda \right) \sin \left( t_2^T\lambda \right) \right\} \nonumber \\= & {} \exp \left\{ -\tfrac{1}{2}(t_1+t_2)^T {\varSigma } (t_1+t_2)\right\} \nonumber \\&\cdot \cos \left\{ \mu ^T (t_1+t_2)\right\} \end{aligned}$$
(19)

and

$$\begin{aligned} E[\sin \{\lambda ^T(t_1+t_2)\}]= & {} E\left\{ \sin \left( t_1^T\lambda \right) \cos \left( t_2^T\lambda \right) \right. \nonumber \\&\left. +\cos \left( t_1^T\lambda \right) \sin \left( t_2^T\lambda \right) \right\} \nonumber \\= & {} \exp \left\{ -\tfrac{1}{2}(t_1+t_2)^T {\varSigma } (t_1+t_2)\right\} \nonumber \\&\cdot \sin \left\{ \mu ^T (t_1+t_2)\right\} . \end{aligned}$$
(20)

(17) + (19) gives the first equation of the lemma, (17) – (19) gives the second and (18) + (20) gives the third. \(\square \)
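Lemma 1 is easy to check numerically. The sketch below (with arbitrary \(\mu \), \({\varSigma }\), \(t_1\), \(t_2\), not taken from the paper) compares a Monte Carlo estimate of \(E\{\cos (t_1^T\lambda )\cos (t_2^T\lambda )\}\) with the closed form.

```python
import numpy as np

# Monte Carlo check of the first identity in Lemma 1 (illustrative values).
rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                      # a valid covariance matrix
t1, t2 = rng.normal(size=d), rng.normal(size=d)

lam = rng.multivariate_normal(mu, Sigma, size=200_000)   # draws of lambda
empirical = np.mean(np.cos(lam @ t1) * np.cos(lam @ t2))

tm, tp = t1 - t2, t1 + t2
closed_form = 0.5 * (np.exp(-0.5 * tm @ Sigma @ tm) * np.cos(tm @ mu)
                     + np.exp(-0.5 * tp @ Sigma @ tp) * np.cos(tp @ mu))

print(empirical, closed_form)   # the two values agree up to Monte Carlo error
```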

Using Lemma 1, we have

$$\begin{aligned} E_q(Z)=[E_q(Z_1), \dots , E_q(Z_n)]^T, \end{aligned}$$

where

and \(t_{ir}=s_r \odot x_i\) for \(i=1,\dots ,n\), \(r=1,\dots ,m\). We also have \(E_q(Z^TZ)=\sum _{i=1}^n E_q(Z_iZ_i^T)\) where \(E_q(Z_iZ_i^T)=\left[ {\begin{matrix} P_i &{} Q_i^T \\ Q_i &{} R_i \end{matrix}}\right] \), where \(P_i\), \(Q_i\), \(R_i\) are all \(m\times m\) matrices and

$$\begin{aligned} {P_i}_{rl}= & {} \frac{1}{2}\left\{ \exp \left( -\frac{1}{2}{t_{irl}^-}^T{\varSigma }_\lambda ^qt_{irl}^-\right) \cos \left( {t_{irl}^-}^T\mu _\lambda ^q\right) \right. \\&\left. + \exp \left( -\frac{1}{2}{t_{irl}^+}^T{\varSigma }_\lambda ^qt_{irl}^+\right) \cos \left( {t_{irl}^+}^T\mu _\lambda ^q\right) \right\} ,\\ {Q_i}_{rl}= & {} \frac{1}{2}\left\{ -\exp \left( -\frac{1}{2}{t_{irl}^-}^T{\varSigma }_\lambda ^qt_{irl}^-\right) \sin \left( {t_{irl}^-}^T\mu _\lambda ^q\right) \right. \\&\left. + \exp \left( -\frac{1}{2}{t_{irl}^+}^T{\varSigma }_\lambda ^qt_{irl}^+\right) \sin \left( {t_{irl}^+}^T\mu _\lambda ^q\right) \right\} ,\\ {R_i}_{rl}= & {} \frac{1}{2}\left\{ \exp \left( -\frac{1}{2}{t_{irl}^-}^T{\varSigma }_\lambda ^qt_{irl}^-\right) \cos \left( {t_{irl}^-}^T\mu _\lambda ^q\right) \right. \\&\left. - \exp \left( -\frac{1}{2}{t_{irl}^+}^T{\varSigma }_\lambda ^qt_{irl}^+\right) \cos \left( {t_{irl}^+}^T\mu _\lambda ^q\right) \right\} , \end{aligned}$$

\(t_{irl}^-=t_{ir}-t_{il}\), \(t_{irl}^+=t_{ir}+t_{il}\) for \(r=1,\dots ,m\), \(l=1,\dots ,m\).
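For concreteness, the blocks \(P_i\), \(Q_i\), \(R_i\) above translate directly into code. The sketch below assumes the spectral frequencies \(s_r\) are stored as the rows of an \(m \times d\) array; the function name and argument layout are illustrative, not the authors' implementation.

```python
import numpy as np

def Eq_ZiZiT_blocks(x_i, S, mu_lam, Sig_lam):
    """P_i, Q_i, R_i of E_q(Z_i Z_i^T) as given in Appendix 1.

    x_i     : (d,) input point
    S       : (m, d) rows are the spectral frequencies s_r
    mu_lam  : (d,) variational mean of lambda
    Sig_lam : (d, d) variational covariance of lambda
    """
    T = S * x_i                                   # rows t_ir = s_r (elementwise) x_i
    m = T.shape[0]
    P, Q, R = (np.empty((m, m)) for _ in range(3))
    for r in range(m):
        for l in range(m):
            tm, tp = T[r] - T[l], T[r] + T[l]     # t_irl^- and t_irl^+
            em = np.exp(-0.5 * tm @ Sig_lam @ tm)
            ep = np.exp(-0.5 * tp @ Sig_lam @ tp)
            P[r, l] = 0.5 * ( em * np.cos(tm @ mu_lam) + ep * np.cos(tp @ mu_lam))
            Q[r, l] = 0.5 * (-em * np.sin(tm @ mu_lam) + ep * np.sin(tp @ mu_lam))
            R[r, l] = 0.5 * ( em * np.cos(tm @ mu_lam) - ep * np.cos(tp @ mu_lam))
    return P, Q, R
```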

Appendix 2: Derivation of lower bound

From (6), the lower bound is given by

$$\begin{aligned} \mathcal {L}=E_q\{\log p(y,\theta )\}-E_q\{\log q(\theta )\} \end{aligned}$$

where

$$\begin{aligned} E_q\{\log p(y,\theta )\}= & {} E_q\{\log p(y|\alpha ,\lambda ,\gamma )\} + E_q\{\log p(\alpha |\sigma )\} \\&+E_q\{\log p(\lambda )\}+E_q\{\log p(\sigma )\} \\&+E_q\{\log p(\gamma )\}, \end{aligned}$$
$$\begin{aligned} E_q\{\log q(\theta )\}= & {} E_q\{\log q(\alpha )\} +E_q\{\log q(\lambda )\} \\&+E_q\{\log q(\sigma )\}+E_q\{\log q(\gamma )\}. \end{aligned}$$

The terms in the lower bound can be evaluated as follows:

$$\begin{aligned}&E_q\{\log p(y|\alpha ,\beta ,\lambda ,\gamma )\}=-\frac{n}{2}\log (2\pi )-\frac{n}{2}E_q(\log \gamma ^2) \\&\quad -\frac{1}{2}\big [y^Ty-2y^TE_q(Z)\mu _\alpha ^q+\text {tr}\{(\mu _\alpha ^q{\mu _\alpha ^q}^T+{\varSigma }_\alpha ^q)E_q(Z^TZ)\}\big ] \\&\quad \cdot {\mathcal {H}(n,C_\gamma ^q,A_\gamma ^2)}/{\mathcal {H}(n-2,C_\gamma ^q,A_\gamma ^2)}\\&E_q\{\log p(\alpha |\sigma )\}=-m\log (2\pi ) -mE_q\{\log \sigma ^2\} \\&\quad +m\log m -\frac{m}{2}\frac{\mathcal {H}(2m,C_\sigma ^q,A_\sigma ^2)}{\mathcal {H}(2m-2,C_\sigma ^q,A_\sigma ^2)}\{{\mu _\alpha ^q}^T\mu _\alpha ^q+\text {tr}({\varSigma }_\alpha ^q)\}\\&E_q\{\log p(\lambda )\}=-\frac{d}{2}\log (2\pi ) -\quad \frac{1}{2}\log |{\varSigma }_\lambda ^0| \\&\quad -\frac{1}{2}(\mu _\lambda ^q-\mu _\lambda ^0)^T{{\varSigma }_\lambda ^0}^{-1}(\mu _\lambda ^q-\mu _\lambda ^0)-\frac{1}{2}\text {tr}({{\varSigma }_\lambda ^0}^{-1}{\varSigma }_\lambda ^q)\\&E_q\{\log p(\sigma )\}=\log (2A_\sigma )-\log \pi -E_q\{\log (A_\sigma ^2+\sigma ^2)\}\\&E_q\{\log p(\gamma )\}=\log (2A_\gamma )-\log \pi -E_q\{\log (A_\gamma ^2+\gamma ^2)\}\\&E_q\{\log q(\alpha )\}=-m\log (2\pi )-\frac{1}{2}\log |{\varSigma }_\alpha ^q|-m\\&E_q\{\log q(\lambda )\}=-\frac{d}{2}\log (2\pi ) -\frac{1}{2}\log |{\varSigma }_\lambda ^q|-\frac{d}{2}\\&E_q\{\log q(\sigma )\}=-C_\sigma ^q \frac{\mathcal {H}(2m,C_\sigma ^q,A_\sigma ^2)}{\mathcal {H}(2m-2,C_\sigma ^q,A_\sigma ^2)} - 2mE_q\{\log \sigma \}\\&\quad -\log \mathcal {H}(2m-2,C_\sigma ^q,A_\sigma ^2) -E_q\{\log (A_\sigma ^2+\sigma ^2)\}\\&E_q\{\log q(\gamma )\}=-C_\gamma ^q {\mathcal {H}(n,C_\gamma ^q,A_\gamma ^2)} /{\mathcal {H}(n-2,C_\gamma ^q,A_\gamma ^2)} \\&\quad -\log \mathcal {H}(n-2,C_\gamma ^q,A_\gamma ^2)-nE_q\{\log \gamma \}-E_q\{\log (A_\gamma ^2+\gamma ^2)\} \end{aligned}$$

Putting these terms together and making use of the updates in steps 5 and 6 of Algorithm 1 gives the lower bound in (12).

Appendix 3: Derivation of simplified updates in Algorithm 2

It can be shown (see Wand 2014; Tan and Nott 2013) that the natural parameter of \(q(\lambda )=N(\mu _\lambda ^q,{\varSigma }_\lambda ^q)\) is

$$\begin{aligned} \eta _\lambda = \left[ {\begin{matrix} -\tfrac{1}{2}D_d^T\,\text {vec}\left( {{\varSigma }_\lambda ^q}^{-1}\right) \\ {{\varSigma }_\lambda ^q}^{-1}\mu _\lambda ^q \end{matrix}} \right] , \end{aligned}$$

where \(D_d\) is the unique \(d^2 \times \tfrac{1}{2}d(d+1)\) matrix that transforms \(\text {vech}(A)\) into \(\text {vec}(A)\) for any \(d \times d\) symmetric matrix \(A\), that is, \(D_d\text {vech}(A)=\text {vec}(A)\). We use \(\text {vech}(A)\) to denote the \(\tfrac{1}{2}d(d+1) \times 1\) vector obtained from \(\text {vec}(A)\) by eliminating all supradiagonal elements of \(A\). Magnus and Neudecker (1988) is a good reference for the matrix differential calculus involved in the derivation below. From (13) and Tan and Nott (2013, p. 7), we have

$$\begin{aligned} \left[ {\begin{matrix} -\tfrac{1}{2}D_d^T\,\text {vec}\left( {{{\varSigma }_\lambda ^q}^{(t)}}^{-1}\right) \\ {{{\varSigma }_\lambda ^q}^{(t)}}^{-1}{\mu _\lambda ^q}^{(t)} \end{matrix}} \right] = (1-a_t) \left[ {\begin{matrix} -\tfrac{1}{2}D_d^T\,\text {vec}\left( {{{\varSigma }_\lambda ^q}^{(t-1)}}^{-1}\right) \\ {{{\varSigma }_\lambda ^q}^{(t-1)}}^{-1}{\mu _\lambda ^q}^{(t-1)} \end{matrix}} \right] \nonumber \\ +\, a_t \left[ {\begin{matrix} D_d^T &{} 0 \\ -2\left( {{\mu _\lambda ^q}^{(t-1)}}^T \otimes I\right) D_d^{+T}D_d^T &{} I \end{matrix}} \right] \sum _{a \in N(\lambda )} \left[ {\begin{matrix} \dfrac{\partial S_a}{\partial \text {vec}({\varSigma }_\lambda ^q)} \\ \dfrac{\partial S_a}{\partial \mu _\lambda ^q} \end{matrix}} \right] , \nonumber \\ \end{aligned}$$
(21)

where \(\dfrac{\partial S_a}{\partial \text {vec}({\varSigma }_\lambda ^q)}\) and \(\dfrac{\partial S_a}{\partial \mu _\lambda ^q}\) are evaluated at

$$\begin{aligned} {\varSigma }_\lambda ^q={{\varSigma }_\lambda ^q}^{(t-1)} \;\; \text {and} \;\; \mu _\lambda ^q={\mu _\lambda ^q}^{(t-1)}. \end{aligned}$$
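As a side check on the notation, the identity \(D_d\,\text {vech}(A)=\text {vec}(A)\) used in (21) is easy to verify numerically; the sketch below builds \(D_d\) explicitly under the usual column-major convention for \(\text {vec}\).

```python
import numpy as np

def vech(A):
    """Stack the lower-triangular part of A (including the diagonal) column by column."""
    d = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(d)])

def duplication_matrix(d):
    """The d^2 x d(d+1)/2 matrix D_d with D_d vech(A) = vec(A) for symmetric A."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    k = 0
    for j in range(d):
        for i in range(j, d):
            D[j * d + i, k] = 1.0        # entry A[i, j] in column-major vec(A)
            if i != j:
                D[i * d + j, k] = 1.0    # its symmetric counterpart A[j, i]
            k += 1
    return D

# Verify the identity on a random symmetric matrix.
d = 4
B = np.random.default_rng(1).normal(size=(d, d))
A = B + B.T
assert np.allclose(duplication_matrix(d) @ vech(A), A.flatten(order="F"))
```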

Let

$$\begin{aligned} \sum _{a \in N(\lambda )} \frac{\partial S_a}{\partial \text {vec}({\varSigma }_\lambda ^q)} = -\frac{1}{2}\text {vec}(G). \end{aligned}$$

The first line of (21) simplifies to

$$\begin{aligned}&{{{\varSigma }_\lambda ^q}^{(t)}}^{-1} = (1-a_t) {{{\varSigma }_\lambda ^q}^{(t-1)}}^{-1} + a_t G \\&\Rightarrow {{\varSigma }_\lambda ^q}^{(t)} =\{(1-a_t) {{{\varSigma }_\lambda ^q}^{(t-1)}}^{-1} + a_t G\}^{-1}. \end{aligned}$$

The second line of (21) gives

$$\begin{aligned} {{{\varSigma }_\lambda ^q}^{(t)}}^{-1} {\mu _\lambda ^q}^{(t)}= & {} (1-a_t) {{{\varSigma }_\lambda ^q}^{(t-1)}}^{-1} {\mu _\lambda ^q}^{(t-1)} \\&+ a_t G {\mu _\lambda ^q}^{(t-1)} + a_t \sum _{a \in N(\lambda )}\frac{\partial S_a}{\partial \mu _\lambda ^q} \\= & {} {{{\varSigma }_\lambda ^q}^{(t)}}^{-1} {\mu _\lambda ^q}^{(t-1)} + a_t \sum _{a \in N(\lambda )}\frac{\partial S_a}{\partial \mu _\lambda ^q} \\ \Rightarrow {\mu _\lambda ^q}^{(t)}= & {} {\mu _\lambda ^q}^{(t-1)} + a_t {{\varSigma }_\lambda ^q}^{(t)} \sum _{a \in N(\lambda )}\frac{\partial S_a}{\partial \mu _\lambda ^q}. \end{aligned}$$
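In code, the two simplified updates amount to the following minimal sketch (not the authors' implementation); \(G\) and the gradient with respect to \(\mu _\lambda ^q\) are assumed to be supplied by the model, and \(a_t\) is the adaptive step size, with \(a_t=1\) recovering the undamped update.

```python
import numpy as np

def ncvmp_step(mu_prev, Sig_prev, G, grad_mu, a_t):
    """One damped nonconjugate VMP update for q(lambda) = N(mu, Sigma).

    G       : matrix defined by sum_a dS_a/dvec(Sigma) = -1/2 vec(G)
    grad_mu : sum_a dS_a/dmu, evaluated at the previous iterate
    a_t     : step size in (0, 1]
    """
    Sig_new = np.linalg.inv((1.0 - a_t) * np.linalg.inv(Sig_prev) + a_t * G)
    mu_new = mu_prev + a_t * Sig_new @ grad_mu
    return mu_new, Sig_new
```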

About this article

Cite this article

Tan, L.S.L., Ong, V.M.H., Nott, D.J. et al. Variational inference for sparse spectrum Gaussian process regression. Stat Comput 26, 1243–1261 (2016). https://doi.org/10.1007/s11222-015-9600-7
