Comprehensive study of variational Bayes classification for dense deep neural networks

  • Original Paper
Statistics and Computing

Abstract

Although Bayesian deep neural network models are ubiquitous in classification problems, their Markov chain Monte Carlo based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative that overcomes some of these computational issues. This paper focuses on the variational Bayesian deep neural network estimation methodology and discusses the related statistical theory and algorithmic implementations in the context of classification. For dense deep neural network-based classification, the paper compares and contrasts the consistency and contraction rates of the true posterior with those of the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), the paper provides an assessment of the loss in classification accuracy due to the use of VB and guidelines on the characterization of the prior distributions and the variational family. The difficulty of the numerical optimization required to obtain the variational Bayes solution is also quantified as a function of the complexity of the DNN. The development is motivated by an important biomedical engineering application, namely building predictive tools for the transition from mild cognitive impairment to Alzheimer’s disease. The predictors are multi-modal and may involve complex interactive relations.


Availability of data and materials

The data is publicly available.

Code availability

The computational code is available.

References

  • Bai, J., Song, Q., Cheng, G.: Efficient variational inference for sparse deep learning with theoretical guarantee. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 466–476. Curran Associates, Inc. (2020)

  • Barron, A., Schervish, M.J., Wasserman, L.: The consistency of posterior distributions in nonparametric problems. Ann. Stat. 27(2), 536–561 (1999)

  • Bhattacharya, S., Maiti, T.: Statistical foundation of variational Bayes neural networks. Neural Netw. 137, 151–173 (2021)

  • Bishop, C.M.: Bayesian neural networks. J. Braz. Comput. Soc. 4(1), 61–68 (1997)

  • Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  • Blei, D.M., Lafferty, J.D.: A correlated topic model of science. Ann. Appl. Stat. 1(1), 17–35 (2007)

  • Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: Proceedings of Machine Learning Research, vol. 37, pp. 1613–1622. PMLR (2015)

  • Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(4), 959–1035 (2017)

  • Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems (2015)

  • Chérief-Abdellatif, B.-E.: Convergence rates of variational inference in sparse deep learning. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, pp. 1831–1842. PMLR (2020)

  • Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(61), 2121–2159 (2011)

  • Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT’93, pp. 5–13. ACM Press (1993)

  • Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009)

  • Graves, A.: Practical variational inference for neural networks. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 2348–2356. Curran Associates, Inc. (2011)

  • Graves, A.: Generating sequences with recurrent neural networks (2014). arXiv:1308.0850

  • Gurney, K.: An Introduction to Neural Networks. Taylor & Francis Inc., USA (1997). (ISBN 1857286731)

  • Hinton, G., Srivastava, N., Swersky, K.: Lecture 6a: Overview of mini-batch gradient descent (2012). http://www.cs.toronto.edu/hinton/coursera/lecture6/lec6.pdf

  • Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  • Hubin, A., Storvik, G., Frommlet, F.: Deep Bayesian regression models (2018). arXiv:1806.02160

  • Javid, K., Handley, W., Hobson, M.P., Lasenby, A.: Compromise-free Bayesian neural networks (2020). arXiv:2004.12211

  • Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 2575–2583. Curran Associates, Inc. (2015)

  • Korolev, I.: Alzheimer’s disease: a clinical and basic science review. Med. Stud. Res. J. 4(1), 24–33 (2014)

  • Korolev, I.O., Symonds, L.L., Bozoki, A.C., Initiative, A.D.N.: Predicting progression from mild cognitive impairment to Alzheimer’s dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLoS ONE 11(2), e0138866 (2016)

  • Lampinen, J., Vehtari, A.: Bayesian approach for neural networks-review and case studies. Neural Netw. Off. J. Int. Neural Netw. Soc. 14(3), 257–274 (2001)

  • Lee, H.K.H.: Consistency of posterior distributions for neural networks. Neural Netw. 13(6), 629–642 (2000)

  • Li, X., Li, C., Chi, J., Ouyang, J.: Variance reduction in black-box variational inference by adaptive importance sampling. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, pp. 2404–2410 (2018)

  • Liang, F., Li, Q., Zhou, L.: Bayesian neural networks for selection of drug sensitive genes. J. Am. Stat. Assoc. 113(523), 955–972 (2018)

  • Liu, Z., Maiti, T., Bender, A.: A role for prior knowledge in statistical classification of the transition from MCI to Alzheimer’s disease. Unpublished report (2020)

  • Matthews, A.G. de G., Hron, J., Rowland, M., Turner, R.E., Ghahramani, Z.: Gaussian process behaviour in wide deep neural networks. In: International Conference on Learning Representations (2018)

  • McKinney, W.: Data structures for statistical computing in Python. In: van der Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference, pp. 56–61 (2010)

  • McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(90), 1–50 (2017)

  • Mullachery, V., Khera, A., Husain, A.: Bayesian neural networks (2018). arXiv:1801.07710

  • Nagapetyan, T., Duncan, A.B., Hasenclever, L., Vollmer, S.J., Szpruch, L., Zygalakis, K.: The true cost of stochastic gradient Langevin dynamics (2017). arXiv:1706.02692

  • Neal, R.M.: Bayesian training of backpropagation networks by the hybrid Monte-Carlo method (1992). https://www.cs.toronto.edu/~radford/ftp/bbp.pdf

  • Paisley, J., Blei, D., Jordan, M.: Variational Bayesian inference with stochastic search. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp. 1363–1370. ACM Press (2012)

  • Pati, D., Bhattacharya, A., Yang, Y.: On statistical optimality of variational Bayes. In: Storkey, A., Perez-Cruz, F. (eds.) Proceedings of Machine Learning Research, vol. 84, pp. 1579–1588. PMLR (2018)

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  • Petersen, R.C., Roberts, R.O., Knopman, D.S., Boeve, B.F., Geda, Y.E., Ivnik, R.J., Smith, G.E., Jack, C.R.: Mild cognitive impairment: ten years later. Arch. Neurol. 66(12), 1447–1455 (2009). https://doi.org/10.1001/archneurol.2009.266

  • Pollard, D.: Empirical processes: theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2, i–86 (1990)

  • Polson, N.G., Ročková, V.: Posterior concentration for sparse deep learning. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)

  • Ranganath, R., Gerrish, S., Blei, D.M.: Black box variational inference (2013). arXiv:1401.0118

  • Ross, S.M.: Simulation, 5th edn. Academic Press (2013). (ISBN 9780124158252)

  • Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)

  • Singh, B., De, S., Zhang, Y., Goldstein, T., Taylor, G.: Layer-specific adaptive learning rates for deep networks (2015). arXiv:1510.04609

  • Sun, S., Chen, C., Carin, L.: Learning structured weight uncertainty in Bayesian neural networks. In: Proceedings of Machine Learning Research, vol. 54, pp. 1283–1292. PMLR (2017)

  • Sun, S., Zhang, G., Shi, J., Grosse, R.B.: Functional variational Bayesian neural networks. In: 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net (2019)

  • Sun, Y., Song, Q., Liang, F.: Consistent sparse deep learning: theory and computation. J. Am. Stat. Assoc. 1–42 (2021)

  • Taghia, J.: Lecture Notes. Part III: black-box variational inference (2018). http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/VILectuteNotesPart3.pdf

  • Sell, T., Singh, S.S.: Trace-class Gaussian priors for Bayesian learning of neural networks with MCMC. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 85(1), 46–66 (2023)

  • van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York (1996)

  • Wan, R., Zhong, M., Xiong, H., Zhu, Z.: Neural control variates for variance reduction (2018). arXiv:1806.00159

  • Wang, Y., Blei, D.M.: Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 114(527), 1147–1161 (2019)

  • Welling, M., Teh, Y.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 681–688. ACM Press (2011)

  • Wong, W.H., Shen, X.: Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Stat. 23(2), 339–362 (1995)

  • Wu, A., Nowozin, S., Meeds, E., Turner, R.E., Hernández-Lobato, J.M., Gaunt, A.L.: Deterministic variational inference for robust Bayesian neural networks (2019). https://openreview.net/forum?id=B1l08oAct7

  • Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)

  • Yang, K., Maiti, T.: Statistical aspects of high-dimensional sparse artificial neural network models. Mach. Learn. Knowl. Extr. 2(1), 1–19 (2020)

  • Yang, Y., Pati, D., Bhattacharya, A.: \(\alpha \)-variational inference with statistical guarantees. Ann. Stat. 48(2), 886–905 (2020)

  • Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of clinical scores in Alzheimer’s disease. In: Liu, T., Shen, D., Ibanez, L., Tao, X. (eds.) Multimodal Brain Image Analysis, pp. 60–67. Springer, Berlin, Heidelberg (2011)

  • Zhang, D., Shen, D., Initiative, A.D.N.: Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLoS ONE 7(3), e0033182 (2012)

  • Zhang, F., Gao, C.: Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)

  • Zhu, C., Cheng, Y., Gan, Z., Huang, F., Liu, J., Goldstein, T.: Adaptive learning rates with maximum variation averaging (2020). arXiv:2006.11918

Funding

This work is partially supported by the grants NSF-1924724, NSF-1952856, and NSF-2124605.

Author information

Contributions

Equal contribution from all three authors.

Corresponding author

Correspondence to Zihuan Liu.

Ethics declarations

Conflict of interest

None.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (zip 147 KB)

Appendices

Appendix A Algorithms of variational implementation.

Algorithm 3: BBVI-RMS

Algorithm 4: BBVI-CV-RMS

With q and p as in (12) and (10) respectively,

$$\begin{aligned} d_{\textrm{KL}}(q,p)= & {} \sum _{j=1}^{K_n} \left( \log \frac{\zeta _{jn}}{s_{jn}}+\frac{s_{jn}^2}{2\zeta _{jn}^2}+\frac{(m_{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}-\frac{1}{2}\right) \\ \nabla _{m_{jn}}d_{\textrm{KL}}(q,p)= & {} \frac{(m_{jn}-\mu _{jn})}{2\zeta _{jn}^2} \\ \nabla _{s_{jn}}d_{\textrm{KL}}(q,p)= & {} -\frac{1}{s_{jn}}+\frac{s_{jn}}{\zeta _{jn}^2}\\ \nabla _{m_{jn}}\mathcal {L}_{\mathcal {V}_q}= & {} E_{q(.|\mathcal {V}_q)}\left( \left( \frac{\theta _{jn}-m_{jn}}{s_{jn}^2}\right) \log L(\varvec{\theta }_{n})\right) \\ \nabla _{s_{jn}}\mathcal {L}_{\mathcal {V}_q}= & {} E_{q(.|\mathcal {V}_q)}\\{} & {} \left( \left( \frac{(\theta _{jn}-m_{jn})^2}{s_{jn}^3}-\frac{1}{s_{jn}}\right) \log L(\varvec{\theta }_{n})\right) \end{aligned}$$
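
For concreteness, the following Python sketch illustrates a single pass of the black-box variational inference scheme underlying Algorithms 3 and 4: the gradient of the expected log-likelihood is estimated with the score-function estimator displayed above, the KL term between the mean-field Gaussian variational family and the Gaussian prior is differentiated in closed form, and the step sizes are adapted in RMSprop fashion. This is a minimal illustration only, not the authors' implementation; the log-likelihood log_lik, the Monte Carlo sample size S and all tuning constants are placeholder choices, and no control variate is used (Algorithm 4, BBVI-CV-RMS, would additionally subtract a baseline from the log-likelihood weights to reduce the variance of the estimator).

```python
import numpy as np

# Sketch of a BBVI update with an RMSprop-style step size (cf. Algorithms 3 and 4).
# q(theta) = prod_j N(m_j, s_j^2) is the variational family, p(theta) = prod_j N(mu_j, zeta_j^2)
# the prior; log_lik is a placeholder for the DNN log-likelihood log L(theta).

rng = np.random.default_rng(0)
K = 10                                    # number of network parameters
mu, zeta = np.zeros(K), np.ones(K)        # prior means and standard deviations
m, s = np.zeros(K), 0.1 * np.ones(K)      # variational means and standard deviations
acc_m, acc_s = np.zeros(K), np.zeros(K)   # RMSprop accumulators
lr, decay, jitter, S = 1e-2, 0.9, 1e-8, 50

def log_lik(theta):
    return -0.5 * np.sum(theta ** 2)      # placeholder; replace with the model's log-likelihood

for _ in range(200):
    eps = rng.standard_normal((S, K))
    theta = m + s * eps                                   # draws from q
    ll = np.array([log_lik(t) for t in theta])
    score_m = (theta - m) / s ** 2                        # d log q / d m
    score_s = (theta - m) ** 2 / s ** 3 - 1.0 / s         # d log q / d s
    g_m = (score_m * ll[:, None]).mean(axis=0)            # score-function estimate of grad E_q[log L]
    g_s = (score_s * ll[:, None]).mean(axis=0)
    g_m -= (m - mu) / zeta ** 2                           # closed-form gradient of d_KL(q, p) in m
    g_s -= s / zeta ** 2 - 1.0 / s                        # closed-form gradient of d_KL(q, p) in s
    acc_m = decay * acc_m + (1 - decay) * g_m ** 2        # RMSprop accumulation
    acc_s = decay * acc_s + (1 - decay) * g_s ** 2
    m = m + lr * g_m / (np.sqrt(acc_m) + jitter)          # ascend the ELBO
    s = np.maximum(s + lr * g_s / (np.sqrt(acc_s) + jitter), 1e-4)
```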

Appendix B Preliminaries

Definition 1

\(MVN(\varvec{\mu },\varvec{\Sigma })\) is used to denote the density function of multivariate normal distribution with mean \(\varvec{\mu }\) and variance covariance matrix \(\varvec{\Sigma }\).

Definition 2

For a vector \(\varvec{\alpha }\) and a function g,

  1. 1.

    \(||\varvec{\alpha }||_1=\sum _i |\alpha _i|\), \(||\varvec{\alpha }||_2=\sqrt{\sum _i \alpha _i^2}\), \(||\varvec{\alpha }||_\infty =\max _i |\alpha _i|\).

  2. 2.

    \(||g||_1=\int _{\varvec{x}\in \chi } |g(\varvec{x})|d\varvec{x}\), \(||g||_2=\sqrt{\int _{\varvec{x}\in \chi } g(\varvec{x})^2d\varvec{x}}\), \(||g||_\infty =\sup _{\varvec{x}\in \chi } |g(\varvec{x})|\)
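
As a quick numerical illustration of Definition 2 (added here for orientation; it is not part of the original text), the vector norms and grid approximations of the function norms on \(\chi =[0,1]\) can be computed as follows; the test function and grid size are arbitrary choices.

```python
import numpy as np

# Vector norms of Definition 2(1) and grid approximations of the function norms of
# Definition 2(2) on chi = [0, 1]; the test function g and grid size are arbitrary.

alpha = np.array([1.0, -2.0, 0.5])
l1, l2, linf = np.abs(alpha).sum(), np.sqrt((alpha ** 2).sum()), np.abs(alpha).max()

g = lambda x: np.sin(2.0 * np.pi * x)
x = np.linspace(0.0, 1.0, 10_001)
gx = g(x)
g1 = np.abs(gx).mean()               # approximates ||g||_1 on [0, 1]
g2 = np.sqrt((gx ** 2).mean())       # approximates ||g||_2
ginf = np.abs(gx).max()              # approximates ||g||_infty

print(l1, l2, linf)                  # 3.5, 2.2913..., 2.0
print(g1, g2, ginf)                  # about 2/pi, 1/sqrt(2), 1.0
```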

Definition 3

(Bracketing number and entropy) For any two functions l and u, define the bracket [lu] as the set of all functions f such that \(l\le f\le u\) pointwise. Let ||.|| be a metric. Define an \(\varepsilon -\)bracket as a bracket with \(||u-l||\le \varepsilon \). Define the bracketing number of a set of functions \(\mathcal {F}^*\) as the minimum number of \(\varepsilon -\)brackets needed to cover \(\mathcal {F}^*\), and denote it by \(N_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\). Finally, the Hellinger bracketing entropy, denoted by \(H_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\), is the natural logarithm of the bracketing number (Pollard 1990).

Definition 4

(Covering number and entropy) Let (V, ||.||) be a normed space, and \(\mathcal {F} \subset V\). \(\{V_1,\ldots , V_n \}\) is an \(\varepsilon -\)covering of \(\mathcal {F}\) if \(\mathcal {F} \subset \cup _{i=1}^N B(V_i,\varepsilon )\), or equivalently, \(\forall \) \(\theta \in \mathcal {F}\), \(\exists \) i such that \(||\theta -V_i||<\varepsilon \). The covering number of \(\mathcal {F}\) denoted by \(N(\varepsilon ,\mathcal {F},||.||)=\min \{n: \exists \, \varepsilon -\text { covering over }\mathcal {F}\text { of size } n \}\). Finally, the Hellinger covering entropy, denoted by \(H(\varepsilon , \mathcal {F},||.||)\), is the natural logarithm of the covering number (Pollard 1990).
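
As a small worked illustration of Definitions 3 and 4 (added for orientation; it is not part of the original development), consider the cube \(\mathcal {F}=[-C,C]^{K}\) under the \(||.||_\infty \) norm. Placing centers on a grid of mesh \(2\varepsilon \) in each coordinate gives, for \(\varepsilon \le C\),

$$\begin{aligned} N(\varepsilon ,[-C,C]^{K},||.||_\infty )\le \left\lceil \frac{C}{\varepsilon }\right\rceil ^{K}\le \left( \frac{3C}{\varepsilon }\right) ^{K}, \end{aligned}$$

which is the type of bound quoted from lemma 4.1 of Pollard (1990) in the proof of Lemma 12. Moreover, if \(\{f_1,\ldots ,f_N\}\) is an \(\varepsilon \)-cover of a class \(\mathcal {F}^*\) of functions on \([0,1]^{p_n}\) in \(||.||_\infty \), the brackets \([f_i-\varepsilon ,f_i+\varepsilon ]\) satisfy \(||u-l||_2\le 2\varepsilon \), so that \(N_{[]}(2\varepsilon ,\mathcal {F}^*,||.||_2)\le N(\varepsilon ,\mathcal {F}^*,||.||_\infty )\); relations of this kind (refined via theorem 2.7.11 of van der Vaart and Wellner 1996) are what turn covering bounds on the parameter space into the bracketing entropy bounds for \(\widetilde{\mathcal {F}}_n\) used later.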

Lemma 5 gives a bound on the Hellinger bracketing entropy integral. Lemma 6 shows that the prior gives negligible probability outside the sieve \(\mathcal {F}_n\). Lemma 7 shows that if the prior gives sufficient mass to the KL neighborhoods of the true density, then the marginal density is well bounded. Lemma 8 shows that if the parameters of two neural networks are close, then so are the neural networks themselves. Lemmas 5, 6, 7 and 8 will serve as important tools towards the proof of consistency of the true posterior.

Lemma 5

With \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3, if \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\le K_n\log (M_n/u)\), then

$$\begin{aligned} \int _0^{\varepsilon }\sqrt{H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)}du\lesssim \varepsilon \sqrt{K_n(\log M_n-\log \varepsilon )} \end{aligned}$$

Proof

See proof of lemma 7.14 in Bhattacharya and Maiti (2021). \(\square \)

Lemma 6

Suppose \(\int _{\mathcal {F}_n^c} p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \le e^{-n\varepsilon }\), \(n \rightarrow \infty \), for any \(\varepsilon >0\). Then, for every \(\tilde{\varepsilon }<\varepsilon \),

$$\begin{aligned} P_0^n\left( \int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}\frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\ge e^{-n\tilde{\varepsilon } }\right) \le e^{-n(\varepsilon -\tilde{\varepsilon })} \end{aligned}$$

Proof

See proof of lemma 7.16 in Bhattacharya and Maiti (2021). \(\square \)

Lemma 7

Suppose \(\mathcal {N}_\varepsilon =\{\varvec{\theta }_{n}: d_{\text {KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})<\varepsilon \}\) and \( \int _{\mathcal {N}_\varepsilon } p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\ge e^{-n\varepsilon }, n\rightarrow \infty \) then for any \(\nu >0\),

$$\begin{aligned} P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\right| \ge n \nu \right) \le \frac{2\varepsilon }{\nu } \end{aligned}$$

Proof

See proof of lemma 7.12 in Bhattacharya and Maiti (2021). \(\square \)

Lemma 8

Let \(\eta _{\varvec{\theta }_{n}^*}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be a fixed neural network. Let \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\) be a neural network such that

$$|\theta _{jn}-\theta ^*_{jn}|\le \frac{\varepsilon }{\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}}$$

where \(\tilde{k}_{vn}=k_{vn}+1\). Then,

$$\int _{\varvec{x}\in [0,1]^{p_n}} |\eta _{\varvec{\theta }_{n}}(\varvec{x})- \eta _{\varvec{\theta }^*_n}(\varvec{x})|dx\le \varepsilon $$

Proof

In the proof, we suppress the dependence on n. Define the projection \(P_V\) as \(P_V \eta _{\varvec{\theta }}(\varvec{x})=\varvec{b}_{V}+\varvec{A}_{V} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0\varvec{x})))\). We claim that

$$\begin{aligned} |P_V \eta _{\varvec{\theta }}(\varvec{x})[s]-P_V \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\le \frac{\varepsilon \sum _{v=0}^V \tilde{k}_v\prod _{v'=v+1}^L a^*_{v'}}{\sum _{v=0}^L \tilde{k}_v\prod _{v'=v+1}^L a^*_{v'}}\nonumber \\ \end{aligned}$$
(B2)

We prove this by induction, starting with \(V=1\). Let \(\tilde{\varepsilon }=\varepsilon /\sum _{v=0}^L \tilde{k}_v\prod _{v'=v+1}^L a^*_{v'}\); then

$$\begin{aligned}&|P_1 \eta _{\varvec{\theta }}(\varvec{x})[s]-P_1 \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\\&\quad \le |\varvec{b}_{1}-\varvec{b}_{1}^*[s]|+|{\varvec{A}_1[s]}^\top \psi (\varvec{b}_0+\varvec{A}_0\varvec{x})\\&\qquad -{\varvec{A}_1^*[s]}^\top \psi (\varvec{b}_0^*+\varvec{A}_0^*\varvec{x})|\\&\quad \le \tilde{\varepsilon }+||\varvec{A}_1[s] -\varvec{A}_1^*[s]||_1\\&\qquad +\sum _{s'=0}^{k_1} |\varvec{A}_{1}^*[s][s'](\psi (\varvec{b}_{0}[s]+\varvec{A}_{0}[s]^\top \varvec{x})\\&\qquad -\psi (\varvec{b}_{0}^*[s]+{\varvec{A}_{0}^*[s]}^\top \varvec{x}))|\\&\quad =\tilde{\varepsilon }+k_{1}\tilde{\varepsilon }+\tilde{\varepsilon }\sum _{s'=0}^{k_{1}}|\varvec{A}_{1}^*[s][s']|(k_{0}+1)\\&\quad =\tilde{\varepsilon }(1+k_1+a^*_{1}(p_n+1))\le \tilde{\varepsilon } (\tilde{k}_1+a^*_{1} \tilde{k}_0) \end{aligned}$$

where the second line holds since \(\psi (u)\le 1\) and the third step is shown next. Let \(u=-\varvec{b}_{0}[s]-\varvec{A}_{0}[s]^\top \varvec{x}\) and \(u_\delta =\varvec{b}_{0}[s]+\varvec{A}_{0}[s]^\top \varvec{x}-\varvec{b}_{0}^*[s]-{\varvec{A}_{0}^*[s]}^\top \varvec{x}\), then for \(|u_\delta |<1\)

$$\begin{aligned} |\psi (u)-\psi (u+u_\delta )|&=\left| \frac{e^{u+u_\delta }-e^{u}}{(1+e^{u+u_\delta })(1+e^{u})}\right| \nonumber \\&\le \left| \frac{e^u(e^{u_\delta }-1)}{(1+e^u)(1+e^{u+u_\delta })}\right| \nonumber \\&\le \frac{e^u |e^{u_\delta }-1|}{(1+e^u)(1+e^{u-1})}\le |u_\delta | \end{aligned}$$
(B3)

since \(e^u/((1+e^u)(1+e^{u-1}))\le 1/2\) and \(|e^{u_\delta }-1|\le 2|u_\delta |\) for \(|u_\delta |<1\). Now, \(|u_\delta |=|\varvec{b}_{0}[s] -\varvec{b}_{0}^*[s]|+\sum _{s'=0}^{p_n} |\varvec{A}_0[s][s']-\varvec{A}_0^*[s][s']| \le (p_n+1)\tilde{\varepsilon }<1\).

Suppose the result holds for \(V-1\); we show it for \(V\) as follows:

$$\begin{aligned}&|P_V \eta _{\varvec{\theta }}(\varvec{x})[s]-P_V \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\\&\quad \le |\varvec{b}_{V}[s]-\varvec{b}_{V}^*[s]|+|{\varvec{A}_{V}[s]}^\top \psi (P_{V-1} \eta _{\varvec{\theta }}(\varvec{x}))\\&\qquad -{\varvec{A}_{V}^*[s]}^\top \psi (P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x}))|\\&\quad \le \tilde{\varepsilon }+||\varvec{A}_{V}[s] -{\varvec{A}_{V}^*[s]}^\top ||_1\\&\qquad +\sum _{s'=0}^{k_V} |\varvec{A}_{V}^*[s][s'](\psi (P_{V-1} \eta _{\varvec{\theta }}(\varvec{x})[s])\\&\qquad -\psi (P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x})[s]))|\\&\quad \le \tilde{\varepsilon }+||\varvec{A}_{V}[s] -{\varvec{A}_{V}^*[s]}^\top ||_1\\&\qquad +\sum _{s'=0}^{k_V} |\varvec{A}_{V}^*[s][s'](P_{V-1} \eta _{\varvec{\theta }}(\varvec{x})[s])-\psi (P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x})[s])| \end{aligned}$$

where the second step follows since \(\psi (u)\le 1\) and the third step follows by relation (B3) provided \(|P_{V-1} \eta _{\varvec{\theta }}(\varvec{x})[s]-P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\le 1\). But this holds using relation (B2) with \(v=V-1\).

Thus proceeding further we get

$$\begin{aligned}&|P_V \eta _{\varvec{\theta }}(\varvec{x})[s]-P_V \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\\&\quad \le \tilde{\varepsilon } (1+k_{V})+2\tilde{\varepsilon }\sum _{s'=0}^{k_{V}}|\varvec{A}_{V}^*[s][s']|\sum _{v=0}^{V-1} \tilde{k}_v \prod _{v'=v+1}^{V-1}a^*_{v'}\\&\quad \le \tilde{\varepsilon }\tilde{k}_V+\tilde{\varepsilon } \sum _{v=0}^{V-1} \tilde{k}_v \prod _{v'=v+1}^V a^*_{v'}=\tilde{\varepsilon } \sum _{v=0}^{V} \tilde{k}_{v} \prod _{v'=v+1}^V a^*_{v'} \end{aligned}$$

This completes the proof. \(\square \)

Lemma 9 shows that if the expected KL distance between two densities is small, then the expected log-likelihood ratio between the two densities is also well bounded, where the expectation is taken with respect to the variational member q. Lemma 10 shows that if two functions are close, then so is the logistic loss between them. Lemmas 8, 9 and 10 together will serve as tools towards establishing that the variational and true posteriors are close in KL-distance.

Lemma 9

Suppose q satisfies \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\le \varepsilon ,\) then for any \(\nu >0\),

$$\begin{aligned} P_0^n\left( \left| \int q(\varvec{\theta }_{n}) \log \frac{L(\varvec{\theta }_{n})}{L_0}d\varvec{\theta }_{n}\right| \ge n\nu \right) \le \frac{\varepsilon }{\nu } \end{aligned}$$

Proof

See proof of lemma 7.13 in Bhattacharya and Maiti (2021). \(\square \)

Lemma 10

If \(|\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})|\le \varepsilon \), then \(|h_{\varvec{\theta }_{n}}(\varvec{x})|\le 2\varepsilon \) where

$$\begin{aligned} h_{\varvec{\theta }_{n}}(\varvec{x})= & {} \sigma (\eta _0(\varvec{x}))(\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x}))\\{} & {} + \log (1-\sigma (\eta _0(\varvec{x}))) -\log (1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))) \end{aligned}$$

Proof

Note that,

$$\begin{aligned} |h_{\varvec{\theta }_{n}}(\varvec{x})|&\le |\sigma (\eta _0(\varvec{x}))| |\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})|\\&\quad +|\log (1-\sigma (\eta _0(\varvec{x}))-\log (1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))|\\&\le |\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})| \\&\quad +\left| \log \left( 1+\sigma (\eta _0(\varvec{x}))(e^{\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})}-1)\right) \right| \\&\le 2|\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})| \end{aligned}$$

where the second step follows by using \(\sigma (x)=e^{x}/(1+e^x) \le 1\) and the proof of the third step is shown below. \(\square \)

Let \(p=\sigma (\eta _0(\varvec{x}))\), so that \(0\le p \le 1\), and let \( r=\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})\). Then

$$\begin{aligned}&\left| \log \left( 1+\sigma (\eta _0(\varvec{x}))(e^{\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})}-1)\right) \right| \\&\quad =\left| \log \left( 1+p(e^r-1)\right) \right| \\&r>0:\hspace{2mm}|\log (1+p(e^r-1))|=\log (1+p(e^r-1))\\&\quad \le \log (1+(e^r-1))=r\\&r<0:\hspace{2mm} |\log (1+p(e^r-1))|=-\log (1+p(e^r-1))\\&\quad \le -\log (1+(e^r-1))=-r \end{aligned}$$

Lemma 11 gives a bound on the first order derivatives of a neural network. Lemma 12 gives a bound on the Hellinger entropy under the sieve \(\mathcal {F}_n\). Lemmas 11 and 12 will serve as tools to bound the Hellinger entropy of the functional sieve space \(\widetilde{\mathcal {F}}_n\) based on \(\mathcal {F}_n\).

Lemma 11

For \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\),

$$\begin{aligned} \sup _{j=1, \ldots , K_n} \nabla _{\theta _j} \eta _{\varvec{\theta }_{n}}(\varvec{x}) \le \prod _{v'=1}^{L_n} a_{v'n} \end{aligned}$$

where \(a_{v'n}=\sup _{v=0, \ldots , k_{(v'+1)n}} ||\varvec{A}_{v'}[v]||_1\).

Proof

We suppress the dependence on n. Let \(P_{V}=\varvec{b}_V+\varvec{A}_V\psi (\cdots \varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\). Define \(G_{V,V}=\varvec{1}_{k_V+1}\) and for \(V=0,\ldots , L\), \(V'=0,\ldots , V-1\), let

$$\begin{aligned} G_{V',V}&=\varvec{A}_{V}(\psi '(P_{V-1})\odot \varvec{A}_{V-1}(\psi '(P_{V-2})\odot \cdots \varvec{A}_{V+1}(\psi '(P_{V'})))) \end{aligned}$$

where \(\odot \) denotes component wise multiplication.

With \(\psi (P_{-1})=\varvec{x}\), we define

$$\begin{aligned} {\left\{ \begin{array}{ll} \nabla _{\varvec{b}_{v}} \eta _{\varvec{\theta }}(\varvec{x})&{}=G_{v,L} \varvec{1}_{k_{v+1}} \\ \nabla _{\varvec{A}_{v}} \eta _{\varvec{\theta }}(\varvec{x})&{}=G_{v,L}\varvec{1}_{k_{v+1}}\psi (P_{v-1})^\top \end{array}\right. } \end{aligned}$$

By the above form and the fact that \(\psi (u),\psi '(u),|x_i|\le 1\), it can easily be checked by induction that \(|G_{v,L}|\le \prod _{v'=v+1}^L a_{v'}\), which completes the proof. \(\square \)
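
The bound in Lemma 11 can be checked numerically on a toy network. The sketch below (an illustration with arbitrary dimensions, weights and seed; it is not part of the paper's code) compares a finite-difference gradient of a one-hidden-layer sigmoid network at a point \(\varvec{x}\in [0,1]^p\) with the product of maximal row-wise \(\ell _1\) norms, which for a single hidden layer reduces to \(a_1\).

```python
import numpy as np

# Numerical illustration of Lemma 11 for eta(x) = b1 + A1 sigmoid(b0 + A0 x), x in [0,1]^p.
# Dimensions, the seed and the way the weights are drawn are arbitrary illustration choices.

rng = np.random.default_rng(1)
p, k = 3, 4
A0, b0 = rng.normal(size=(k, p)), rng.normal(size=k)
A1, b1 = rng.uniform(0.5, 1.5, size=(1, k)), rng.normal(size=1)
x = rng.uniform(size=p)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def pack(A0_, b0_, A1_, b1_):
    return np.concatenate([A0_.ravel(), b0_, A1_.ravel(), b1_])

def eta(theta):
    i = 0
    A0_ = theta[i:i + k * p].reshape(k, p); i += k * p
    b0_ = theta[i:i + k]; i += k
    A1_ = theta[i:i + k].reshape(1, k); i += k
    b1_ = theta[i:]
    return (b1_ + A1_ @ sigmoid(b0_ + A0_ @ x)).item()

theta = pack(A0, b0, A1, b1)
h = 1e-6
grad = np.array([(eta(theta + h * e) - eta(theta - h * e)) / (2 * h)
                 for e in np.eye(theta.size)])

a1 = np.abs(A1).sum(axis=1).max()    # a_1: largest row-wise l1 norm of A_1 (L = 1 here)
print(f"max |d eta / d theta_j| = {np.abs(grad).max():.4f} <= bound = {a1:.4f}")
```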

Lemma 12

Let \(\widetilde{\mathcal {F}}_n=\{\sqrt{\ell }: \ell _{\varvec{\theta }_{n}}(y,\varvec{x}), \varvec{\theta }_{n} \in \mathcal {F}_n\}\), where \(\ell _{\varvec{\theta }_{n}}(y,\varvec{x})\) is given by

$$\begin{aligned} \ell _{\varvec{\theta }_{n}}(y,\varvec{x})=\exp \left( y \eta _{\varvec{\theta }_{n}}(\varvec{x})-\log \left( 1+e^{ \eta _{\varvec{\theta }_{n}}(\varvec{x})}\right) \right) \end{aligned}$$
(B4)

and \(\mathcal {F}_n\) is given by

$$\begin{aligned} \mathcal {F}_n=\Big \{\varvec{\theta }_{n}:|\theta _{jn}|\le C_n, j=1,\ldots , K_n\Big \} \end{aligned}$$
(B5)

Then, with \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3,

$$\begin{aligned}{} & {} \int _{\varepsilon ^2/8}^{\sqrt{2}\varepsilon }\sqrt{H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)}du\\{} & {} \quad \lesssim \varepsilon \sqrt{K_n((L_n+1) \log K_n+(L_n+2)\log C_n-\log \varepsilon )} \end{aligned}$$

Proof

In this proof, we suppress the dependence on n. By lemma 4.1 in Pollard (1990),

$$\begin{aligned} N(\varepsilon ,\mathcal {F}_n,||.||_\infty )\le \left( \frac{3C}{\varepsilon }\right) ^{K}. \end{aligned}$$

For \(\varvec{\theta }_1, \varvec{\theta }_2 \in \mathcal {F}\), let \(\widetilde{\ell }(u)=\sqrt{\ell _{u\varvec{\theta }_1+(1-u)\varvec{\theta }_2}(\varvec{x},y)}\). Following Equation (52) in Bhattacharya and Maiti (2021),

$$\begin{aligned} \sqrt{\ell _{\varvec{\theta }_1}(\varvec{x},y)}-\sqrt{\ell _{\varvec{\theta }_2}(\varvec{x},y)}&\le K\sup _{j} \Big |\frac{\partial {\widetilde{\ell }}}{\partial {\theta _j}}\Big |||\varvec{\theta }_1-\varvec{\theta }_2||_{\infty }\nonumber \\&\le F(\varvec{x},y)||\varvec{\theta }_1-\varvec{\theta }_2||_{\infty } \end{aligned}$$
(B6)

where the upper bound is \(F(\varvec{x},y)=(CK)^L\). This is because \(|\partial \widetilde{\ell }/\partial \theta _j|\), the derivative of \(\sqrt{\ell }\) with respect to \(\theta _j\), is bounded above by \(|\partial \eta _{\varvec{\theta }}(\varvec{x})/\partial \theta _j|\), as shown below.

$$\begin{aligned} \left| \frac{\partial {\widetilde{\ell }}}{\partial {\theta _j}}\right|&=\left| \frac{1}{2}\frac{\partial {\eta _{\varvec{\theta }}(\varvec{x})}}{\partial {\theta _j}}\left( y-\frac{e^{\eta _{\varvec{\theta }}(\varvec{x})}}{1+e^{\eta _{\varvec{\theta }}(\varvec{x})}}\right) \right. \\&\left. \sqrt{e^{(y\eta _{\varvec{\theta }}(\varvec{x})-\log (1+e^{\eta _{\varvec{\theta }}(\varvec{x})}))}}\right| \\&\le \left| \frac{1}{2} \frac{\partial {\eta _{\varvec{\theta }}(\varvec{x})}}{\partial {\theta _j}}\right| \left( \frac{e^{\eta _{\varvec{\theta }}(\varvec{x})}}{1+e^{\eta _{\varvec{\theta }}(\varvec{x})}}\right) ^{1/2}\left( \frac{1}{1+e^{\eta _{\varvec{\theta }}(\varvec{x})}}\right) ^{1/2}\\&\le \frac{1}{4}\left| \frac{\partial {\eta _{\varvec{\theta }}(\varvec{x})}}{\partial {\theta _j}}\right| \end{aligned}$$

Thus, using \(e^{\eta _{\varvec{\theta }}(\varvec{x})}/(1+e^{\eta _{\varvec{\theta }}(\varvec{x})})\le 1\) and Lemma 11, we get

$$\begin{aligned} \sup _{j=0, \ldots , K_n}\left| \frac{\partial {\eta _{\varvec{\theta }}(\varvec{x})}}{\partial {\theta _j}}\right| \le \prod _{v=1}^{L} a^*_{v}=\prod _{v=1}^L k_v C\le (KC)^L \end{aligned}$$

In view of (B6) and theorem 2.7.11 in van der Vaart and Wellner (1996), we have

$$\begin{aligned}{} & {} N_{[]}(\varepsilon , \widetilde{\mathcal {F}}_n, ||.||_2) \le \left( \frac{3K^{L+1}C^{L+2}}{2\varepsilon }\right) ^{K}\\{} & {} \quad \implies H_{[]}(\varepsilon , \widetilde{\mathcal {F}}_n, ||.||_2)\lesssim K \log \frac{K^{L+1}C^{L+2} }{\varepsilon } \end{aligned}$$

where \(N_{[]}\) and \(H_{[]}\) denote the bracketing number and bracketing entropy as in Definition 3. Using Lemma 5 with \(M=K^{L+1}C^{L+2}\), we get

$$\begin{aligned}{} & {} \int _0^{\varepsilon } \sqrt{H_{[]}(u, \widetilde{\mathcal {F}}_n, ||.||_2)} du\\{} & {} \quad \lesssim \varepsilon \sqrt{K((L+1)\log K +(L+2)\log C-\log \varepsilon )} \end{aligned}$$

Therefore,

$$\begin{aligned}&\int _{\varepsilon ^2/8}^{\sqrt{2}\varepsilon } \sqrt{H_{[]}(u, \widetilde{\mathcal {F}}_n, ||.||_2)} du\le \int _{0}^{\sqrt{2}\varepsilon } \sqrt{H_{[]}(u, \widetilde{\mathcal {F}}_n, ||.||_2)} du\\&\quad \lesssim \sqrt{2}\varepsilon \sqrt{K((L+1) \log K+(L+2)\log C-\log \sqrt{2} \varepsilon )} \end{aligned}$$

The proof follows by noting \(\log \sqrt{2}\varepsilon \ge \log \varepsilon \). \(\square \)

Proposition 13 establishes a bound on the log-likelihood ratio when the neural network lies outside the Hellinger neighborhood of the true density function. Proposition 14 shows that the prior gives negligible probability outside the sieve. Proposition 15 shows that the prior gives sufficiently large probability to KL-neighborhoods of the true density function. Propositions 13, 14 and 15 taken together will be used to establish the posterior consistency of the true posterior.

Proposition 13

Let \(n\epsilon _n^2\rightarrow \infty \). Suppose \(K_n\log n =o(n^b\epsilon _n^2)\), for some \(0<b<1\), \(L_n\sim \log n\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Then for every \(\varepsilon >0\),

$$\begin{aligned} \log \int _{\mathcal {U}_{\varepsilon \epsilon _n}^c} \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\le \log 2-\varepsilon ^2 n\epsilon _n^2 + o_{P_0^n}(1) \end{aligned}$$

Proof

It suffices to show

$$\begin{aligned} P_0^n\left( \int _{\mathcal {U}_{\varepsilon \epsilon _n}^c} \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}> 2e^{-\varepsilon ^2 n\epsilon _n^2}\right) \rightarrow 0,\,\, n \rightarrow \infty \nonumber \\ \end{aligned}$$
(B7)

The expression on the left above is bounded above by

$$\begin{aligned}&P_0^n\left( \int _{\mathcal {U}_{\varepsilon \epsilon _n}^c \cap \mathcal {F}_n} \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}> e^{-\varepsilon ^2 n\epsilon _n^2}\right) \\&\quad +P_0^n\left( \int _{\mathcal {F}_n^c} \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}> e^{-\varepsilon ^2 n\epsilon _n^2}\right) \end{aligned}$$

Using Lemma 12 with \(\varepsilon \) replaced by \(\varepsilon \epsilon _n\) and \(C_n=e^{n^b\epsilon _n^2/K_n}\),

$$\begin{aligned}&\int _{\varepsilon ^2\epsilon _n^2/8}^{{\sqrt{2}\varepsilon \epsilon _n}}\sqrt{H_{[]}(u, \widetilde{\mathcal {F}}_n, ||.||_2)} du\\&\quad \lesssim \epsilon _n \varepsilon \sqrt{K_n((L_n+1) \log K_n+(L_n+2)\log C_n-\log \varepsilon \epsilon _n)}\\&\quad \le \varepsilon \epsilon _n O(\max (\sqrt{K_n(L_n+1)\log K_n}, \\&\quad \sqrt{K_n(L_n+2)\log C_n}, \sqrt{-\log \epsilon _n}))\\&\quad \le \varepsilon \epsilon _n \max (o(\epsilon _n\sqrt{n^b \log n}),\\&\quad O(\epsilon _n\sqrt{n^b \log n} ),O(\sqrt{\log n}))\le \varepsilon ^2 \epsilon _n^2 \sqrt{n} \end{aligned}$$

where \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) is as in Definition 3. The first inequality in the third step follows because \(L_n\sim \log n\), \(K_n\log n=o(n^b\epsilon _n^2)\) and \(K_n\log C_n =n^b \epsilon _n^2\), \( -\log \epsilon _n^2\le \log n\). The second inequality in the third step is by \((n^b \log n)/n=o(1)\).

By theorem 1 in Wong and Shen (1995), for some constant \(C>0\), we have

$$\begin{aligned}&P_0^n\left( \int _{\varvec{\theta }_{n}\in \mathcal {U}_{\varepsilon \epsilon _n}^c \cap \mathcal {F}_n } \frac{L(\varvec{\theta }_{n})}{L_0 }p(\varvec{\theta }_{n})d\varvec{\theta }_{n}> e^{-\varepsilon ^2n\epsilon _n^2}\right) \nonumber \\&\quad \le P_0^n\left( \sup _{\varvec{\theta }_{n}\in \mathcal {U}_{\varepsilon \epsilon _n}^c \cap \mathcal {F}_n } \frac{L(\varvec{\theta }_{n})}{L_0}> e^{-\varepsilon ^2n\epsilon _n^2}\right) \nonumber \\&\quad \le 4\exp (-C\varepsilon ^2 n\epsilon _n^2)=o(n\epsilon _n^2) \end{aligned}$$
(B8)

Using Proposition 14 with \(\varepsilon \) replaced by \(2\varepsilon ^2\), we have

$$\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c} p(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le e^{-2 n \varepsilon ^2 {\epsilon _n}^2}$$

Therefore, using Lemma 6 with \(\varepsilon =2\varepsilon ^2\epsilon _n^2\) and \(\tilde{\varepsilon }={\varepsilon }^2 \epsilon _n^2\), we have

$$\begin{aligned} P_0^n\left( \int _{\mathcal {F}_n^c} \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}> e^{-\varepsilon ^2 n\epsilon _n^2}\right) \le e^{-\varepsilon ^2 n\epsilon _n^2} \rightarrow 0.\nonumber \\ \end{aligned}$$
(B9)

Combining (B8) and (B9), (B7) follows. \(\square \)

Proposition 14

Let \(n\epsilon _n^2\rightarrow \infty \). Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Suppose for some \(0<b<1\), \(K_n\log n=o(n^b\epsilon _n^2)\), then for \(C_n=e^{n^b \epsilon _n^2/K_n}\) and \(\mathcal {F}_n\) as in (33), for any \(\varepsilon >0\),

$$\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\le e^{- n \varepsilon \epsilon _n^2}, n \rightarrow \infty $$

Proof

Let \(\mathcal {F}_{jn}=\{\theta _{jn}: |\theta _{jn}|\le C_n\}\), then \(\mathcal {F}_n=\cap _{j=1}^{K_n} \mathcal {F}_{jn}\implies \mathcal {F}_n^c= \cup _{j=1}^{K_n}\mathcal {F}_{jn}^c\). This implies \(\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\le \sum _{j=1}^{K_n}\int _{\mathcal {F}_{jn}^c}(e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}/\sqrt{2\pi \zeta _{jn}^2})d\theta _{jn}\). Thus,

$$\begin{aligned}&\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \sum _{j=1}^{K_n}\int _{-\infty }^{-C_n}\frac{1}{\sqrt{2\pi \zeta _{jn}^2}}e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}d\theta _{jn}\\&\qquad +\sum _{j=1}^{K_n}\int _{C_n}^{\infty }\frac{1}{\sqrt{2\pi \zeta _{jn}^2}}e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}d\theta _{jn}\\&\quad =\sum _{j=1}^{K_n}\left( 1-\Phi \left( \frac{C_n-\mu _{jn}}{\zeta _{jn}}\right) \right) \\&\qquad +\sum _{j=1}^{K_n}\left( 1-\Phi \left( \frac{C_n+\mu _{jn}}{\zeta _{jn}}\right) \right) \end{aligned}$$

Since \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\), we have \(||\varvec{\mu }_n||_\infty =o(\sqrt{n}\epsilon _n)\). Further, \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) implies that for some \(M>0\), \(d\ge 1\),

$$\begin{aligned} \min \left( \frac{|C_n-\mu _{jn}|}{\zeta _{jn}},\frac{|C_n+\mu _{jn}|}{\zeta _{jn}}\right)&\ge \frac{(C_n-\sqrt{n})}{n^dM}\nonumber \\&\ge e^{\log C_n-(d+1)\log n}\nonumber \\&\quad -\frac{1}{n^{d-1/2}M}\nonumber \\&\sim e^{R_n\log n} \rightarrow \infty \end{aligned}$$
(B10)

where the last convergence holds since \(K_n\log n=o(n^b \epsilon _n^2)\). This further implies \(R_n=(n^b \epsilon _n^2)/(K_n\log n)-(d+1) \rightarrow \infty \). Thus, using Mill’s ratio, we get:

$$\begin{aligned} \int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}&=O\left( \sum _{j=1}^{K_n}\frac{\zeta _{jn}}{C_n-\mu _{jn}}e^{-\frac{(C_n-\mu _{jn})^2}{2\zeta _{jn}^2}}\right. \\&\quad \left. +\sum _{j=1}^{K_n}\frac{\zeta _{jn}}{C_n+\mu _{jn}}e^{-\frac{(C_n+\mu _{jn})^2}{2\zeta _{jn}^2}}\right) \\&\le 2K_ne^{-\frac{(C_n-\sqrt{n})^2}{2n^2M^2}}\le e^{-\varepsilon n\epsilon _n^2} \end{aligned}$$

where the last asymptotic inequality holds because

$$\begin{aligned}&\frac{(C_n-\sqrt{n})^2}{2n^d M^2}-\log 2K_n\sim \frac{1}{2}e^{2R_n\log n} -2\log K_n \\&\quad \ge n\left( \frac{e^{2R_n}}{2}-\frac{2\log n}{n}\right) \ge \varepsilon n\epsilon _n^2 \end{aligned}$$

In the above step, the first asymptotic equivalence is by (B10), the second inequality holds since \(K_n\le n\). The last inequality is by \(R_n \rightarrow \infty \) and \((\log n)/n\rightarrow 0\). \(\square \)
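
For completeness, the Gaussian tail bound referred to above as Mill’s ratio (and used again in the proof of Proposition 17) is the elementary inequality (with \(\phi \) the standard normal density)

$$\begin{aligned} 1-\Phi (t)=\int _t^{\infty }\phi (u)du\le \int _t^{\infty }\frac{u}{t}\phi (u)du=\frac{\phi (t)}{t},\qquad t>0, \end{aligned}$$

so that \(1-\Phi (t)\) decays like \(e^{-t^2/2}/t\); this is what makes the terms \(1-\Phi ((C_n\mp \mu _{jn})/\zeta _{jn})\) negligible once their arguments grow as in (B10).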

Proposition 15

Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) with \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\), \(n\epsilon _n^2 \rightarrow \infty \). Define,

$$\begin{aligned} d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})= & {} \int _{\varvec{x}\in [0,1]^{p_n}} \left( \sigma (\eta _0(\varvec{x}))(\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x}))\right. \nonumber \\{} & {} \quad \left. + \log \frac{1-\sigma (\eta _0(\varvec{x}))}{1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))}\right) d\varvec{x}\nonumber \\ \mathcal {N}_\varepsilon= & {} \left\{ \varvec{\theta }_{n}:d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})<\varepsilon \right\} \end{aligned}$$
(B11)

If \(K_n\log n =o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\),

$$ \int _{\varvec{\theta }_{n}\in N_{\varepsilon \epsilon _n^2}} p(\varvec{\theta }_{n})d\varvec{\theta }_{n} \ge e^{-n\epsilon _n^2\nu } \hspace{3mm}\forall \,\, \nu >0$$

Proof

Let \(\eta _{\varvec{\theta }^*_n}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be the neural network such that

$$\begin{aligned} ||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \frac{\varepsilon \epsilon _n^2}{4} \end{aligned}$$
(B12)

Such a neural network exists by the assumption \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\) of the proposition. Next define \(\mathcal {M}_{\varepsilon \epsilon _n^2}\) as:

$$\begin{aligned} \mathcal {M}_{\varepsilon \epsilon _n^2}&=\Big \{\varvec{\theta }_{n}:|{\theta }_{jn}-{\theta }^*_{jn}|<\frac{\varepsilon \epsilon _n^2}{2\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}}, \\&\quad j=1,\ldots , K_n\Big \} \end{aligned}$$

where \(\tilde{k}_{vn}=k_{vn}+1\). For every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\), by Lemma 8, we have

$$\begin{aligned} ||\eta _{\varvec{\theta }_{n}}-\eta _{\varvec{\theta }^*_n}||_1 \le \frac{\varepsilon \epsilon _n^2}{2} \end{aligned}$$
(B13)

Combining (B12) and (B13), we get for \(\varvec{\theta }_{n}\in \mathcal {M}_{\varepsilon \epsilon _n^2}\), \(||\eta _{\varvec{\theta }_{n}}-\eta _{0}||_1 \le \varepsilon \epsilon _n^2/2\).

This, in view of Lemma 10, gives \(d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) \le \varepsilon \epsilon _n^2\), and hence \(\varvec{\theta }_{n} \in \mathcal {N}_{\varepsilon \epsilon _n^2}\) for every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\). Therefore,

$$\begin{aligned} \int _{\varvec{\theta }_{n} \in \mathcal {N}_{\varepsilon \epsilon _n^2}} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\ge \int _{\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}} p(\varvec{\theta }_{n})d\varvec{\theta }_{n} \end{aligned}$$

Let \(\delta _n=\varepsilon \epsilon _n^2/(2\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\), then

$$\begin{aligned}&\int _{\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}=\prod _{j=1}^{K_n}\int _{\theta _{jn}^*-\delta _{n}}^{\theta _{jn}^*+\delta _{n}}\frac{1}{\sqrt{2\pi \zeta _{jn}^2}}e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}d\theta _{jn}\nonumber \\&\quad = \prod _{j=1}^{K_n}\frac{2\delta _{n}}{\sqrt{2\pi \zeta _{jn}^2}}e^{-\frac{(\widehat{\theta }_{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}},\,\, \widehat{\theta }_{jn}\in [\theta _{jn}^*-\delta _{n},\theta _{jn}^*+\delta _{n}]\nonumber \\&\quad =\prod _{j=1}^{K_n}e^{-\left( -\frac{1}{2}\log \frac{2}{\pi }-\log \delta _n+\log \zeta _{jn}+\frac{(\widehat{\theta }_{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}\right) } \end{aligned}$$
(B14)

where the second last equality holds by the mean value theorem.

Note that \(\widehat{\theta }_{jn} \in [\theta _{jn}^*-1,\theta _{jn}^*+1]\) since \(\delta _n \rightarrow 0\), therefore

$$\begin{aligned}&\frac{(\widehat{\theta }_{jn}-\mu _{jn})^2}{2\zeta _{jn}^2} \le \frac{\max ( (\theta _{jn}^*-\mu _{jn}-1)^2,(\theta _{jn}^*-\mu _{jn}+1)^2)}{2\zeta _{jn}^2}\\&\quad \le \frac{(\theta _{jn}^*-\mu _{jn})^2}{\zeta _{jn}^2}+\frac{1 }{\zeta _{jn}^2} \end{aligned}$$

where the last inequality follows since \((a+b)^2\le 2(a^2+b^2)\). Also,

$$\begin{aligned} \sum _{j=1}^{K_n}\frac{(\widehat{\theta }_{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}&\le 2\sum _{j=1}^{K_n}\frac{{\theta _{jn}^*}^2}{\zeta _{jn}^2}+ 2\sum _{j=1}^{K_n}\frac{\mu _{jn}^2}{\zeta _{jn}^2}+\sum _{j=1}^{K_n}\frac{1}{\zeta _{jn}^2}\nonumber \\&\le 2 (||\varvec{\theta }^*_n||_2^2+||\varvec{\mu }_n||_2^2+1) ||\varvec{\zeta }_n^*||_\infty \le n\nu \epsilon _n^2 \end{aligned}$$
(B15)

since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\zeta }_n^*||_\infty =O(1)\) and \(n\epsilon _n^2 \rightarrow \infty \). Also,

$$\begin{aligned} -\log \delta _n +\log \zeta _{jn}&=\log 2 +\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\\&\quad -\log \varepsilon \epsilon _n^2\\&\le \log 2 +\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\\&\quad +\log \zeta _{jn}-\log \varepsilon \\&\quad -2\log \epsilon _n\\&\le \log 2 +O(\log n)+O(\log n)\\&\quad -\log \varepsilon +O(\log n) \end{aligned}$$

where the last inequality follows since \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\) and \(1/n\epsilon _n^2=o(1)\), which implies \(-2\log \epsilon _n=o(\log n)\).

$$\begin{aligned} \sum _{j=1}^{K_n}\left( -\frac{1}{2}\log \frac{2}{\pi }-\log \delta _n +\log \zeta _{jn}\right) =O(K_n\log n)=o(n\epsilon _n^2) \end{aligned}$$
(B16)

where the last step follows since \(K_n\log n=o(n\epsilon _n^2)\).

Combining (B15) and (B16) and substituting into (B14), the proof follows. \(\square \)

Proposition 16 establishes that under a suitable choice of the variational family q and the prior p, the KL distance between p and q is suitably bounded. Proposition 17 shows that the integral of the logistic loss between the neural network model and the true model with respect to the variational family q is small. Propositions 15, 16 and 17 taken together will be used to establish that the KL-distance between the true posterior and the variational posterior is suitably bounded.

Proposition 16

For \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\), let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\). Let \(K_n\log n\) \(=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(n\epsilon _n^2 \rightarrow \infty \), then for any \(\nu >0\),

$$\begin{aligned} d_{\textrm{KL}}(q,p)\le n\epsilon _n^2\nu \end{aligned}$$

Proof

$$\begin{aligned}&d_{\textrm{KL}}(q,p)=\sum _{j=1}^{K_n}\left( \log \sqrt{n^{1+d}}\zeta _{jn}+\frac{1}{n^{1+d}\zeta _{jn}^2}\right. \\&\qquad \left. +\frac{(\theta _{jn}^*-\mu _{jn})^2}{\zeta _{jn}^2}-\frac{1}{2}\right) \\&\quad \le \frac{K_n}{2}((d+1)\log n-1)+\sum _{j=1}^{K_n}\log \zeta _{jn}\\&\qquad +\frac{1}{n^{1+d}}\sum _{j=1}^{K_n}\frac{1}{\zeta _{jn}^2}+2\sum _{j=1}^{K_n}\frac{{\theta _{jn}^*}^2}{\zeta _{jn}^2}+2\sum _{j=1}^{K_n}\frac{\mu _{jn}^2}{\zeta _{jn}^2}\\&\quad \le K_n\Big ((d+1)\frac{\log n}{2}+\log ||\varvec{\zeta }_n||_\infty \Big ) \\&\qquad +2\Big (\frac{K_n}{n}+||\varvec{\theta }^*_n||_2^2+||\varvec{\mu }_n||_2^2\Big )||\varvec{\zeta }_n^*||_\infty \\&\quad =o(n\epsilon _n^2) \end{aligned}$$

where the second last inequality uses \(\varvec{\zeta }^*_n=1/\varvec{\zeta }_n\). The last equality follows since \(\log ||\varvec{\zeta }_n||_{\infty }=O(\log n)\), \(||\varvec{\zeta }_n^*||_\infty =O(1)\), \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\). \(\square \)

Proposition 17

Let \(q(\varvec{\theta }_{n}) \sim MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) where \(d>d^*>0\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\). Define

$$\begin{aligned} h(\varvec{\theta }_{n})= & {} \int _{\varvec{x}\in [0,1]^{p_n}} \left( \sigma (\eta _0(\varvec{x}))(\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x}))\right. \\{} & {} \left. + \log \frac{1-\sigma (\eta _0(\varvec{x}))}{1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))}\right) d\varvec{x}\end{aligned}$$

Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\) where \(n\epsilon _n^2 \rightarrow \infty \). If \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\),

$$\begin{aligned} \int h(\varvec{\theta }_{n})q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \varepsilon \epsilon _n^2. \end{aligned}$$

Proof

Since \(h(\varvec{\theta }_{n})\) is a KL-distance, \(h(\varvec{\theta }_{n})>0\). We establish an upper bound:

$$\begin{aligned} \int h(\varvec{\theta }_{n})q(\varvec{\theta }_{n})d\varvec{\theta }_{n}&\le \int _{\varvec{x}\in [0,1]^{p_n}} |\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})|d\varvec{x}\nonumber \\&\le \int \int _{\varvec{x}\in [0,1]^{p_n}} |\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _{\theta _n^*}(\varvec{x})|d\varvec{x}q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad +||\eta _{\varvec{\theta }_{n}^*}-\eta _0||_1\nonumber \\&\le \int ||\eta _{\varvec{\theta }_{n}}-\eta _{\varvec{\theta }_{n}^*}||_1 q(\varvec{\theta }_{n})d\varvec{\theta }_{n}+\varepsilon \epsilon _n^2 \end{aligned}$$
(B17)

where the first inequality is a consequence of Lemma 10 and the last inequality follows since \(||\eta _{\varvec{\theta }_{n}^*}-\eta _0||_1=o(\epsilon _n^2)\).

Let \(S=\{\varvec{\theta }_{n}:\cap _{j=1}^{K_n}|\theta _{jn}-\theta _{jn}^*|\le \varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}) \}\), then

$$\begin{aligned}&\int ||\eta _{\varvec{\theta }_{n}}-\eta _{\varvec{\theta }_{n}^*}||_1 q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \nonumber \\&\quad =\int _S ||\eta _{\varvec{\theta }_{n}}-\eta _{\varvec{\theta }_{n}^*}||_1 q(\varvec{\theta }_{n})d\varvec{\theta }_{n}+\int _{S^c} ||\eta _{\varvec{\theta }_{n}}-\eta _{\varvec{\theta }_{n}^*}||_1 q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\quad \le \varepsilon \epsilon _n^2+\int _{S^c} |\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad +\int _{S^c}\sum _{s'=1}^{k_{L_n}}|\varvec{A}_L[s][s']-\varvec{A}_L^*[s][s']| q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad + \sum _{s=1}^{k_{L_n}}|\varvec{A}_L^*[1][s]| \int _{S^c} q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \end{aligned}$$
(B18)

Let \(S^c=\cup _{j=1}^{K_n}S_j^c\), \(S_j=\{|\theta _{jn}-\theta _{jn}^*|\le u_n\}\), \(u_n=\varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\). We first compute \(Q(S^c)\) as follows:

$$\begin{aligned} Q(S^c)&=Q(\cup _{j=1}^{K_n}S_j^c)\le \sum _{j=1}^{K_n}Q(S_j^c)\nonumber \\&=\sum _{j=1}^{K_n}\int _{|\theta _{jn}-\theta _{jn}^*|>u_n}q(\theta _{jn})d\theta _{jn}\nonumber \\&=2K_n\left( 1-\Phi \left( n^{1+d}u_n\right) \right) \end{aligned}$$
(B19)

Using (B19) in the last term of (B18), we get

$$\begin{aligned}&\sum _{s=1}^{k_{L_n}}|\varvec{A}_L^*[1][s]| \int _{S^c} q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \nonumber \\&\quad = Q(S^c)\sum _{s=1}^{k_{L_n}}|\varvec{A}_L^*[1][s]| = a^*_{L_n n}K_n(1-\Phi (n^{1+d}u_n))\nonumber \\&\quad =o(n\epsilon _n^2)O\left( n^d\frac{1}{n^{1+d}u_n}e^{-n^{2(1+d)}u_n^2}\right) =o(\epsilon _n^2) \end{aligned}$$
(B20)

where the second step follows by Mill’s ratio, \(K_n=o(n\epsilon _n^2)\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^d)\), which implies \(n^{1+d}u_n \rightarrow \infty \). The third step holds because

$$\begin{aligned}{} & {} \frac{n^{1+d}}{n^{1+d}u_n}e^{-n^{2(1+d)}u_n^2}\le e^{-n^{2(1+d)}u_n^2}\nonumber \\{} & {} \quad =e^{-\left( \frac{n^{2(1+d)}\varepsilon ^2 \epsilon _n^4}{\log n(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})^2}-(d+1)\right) }=o(1) \end{aligned}$$
(B21)

since \((\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n} )^2 \log n=O(n^{2d^*} \log n)=o(n^{2d})\).

For the second term in (B18), let \({S'}=\{|\varvec{b}_L[s]-\varvec{b}_L^*[s]|>u_n\}\)

$$\begin{aligned}&\int _{S^c}\left( |\varvec{b}_L[s]-\varvec{b}_L^*[s]|\right) q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\quad =\int _{S^c\cap {S'}} |\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad +\int _{S^c \cap {S'}^c}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\quad \le \int _{{S'}} |\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{b}_L[s])d\varvec{b}_L[s]\nonumber \\&\qquad +E_{q(\varvec{b}_L[s])}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|Q(\tilde{S}^c), \end{aligned}$$
(B22)

where \(\tilde{S}^c\) is the union of all \(S_j^c\), \(j=1, \ldots , K_n\), except the one corresponding to \(\varvec{b}_{L}[s]\).

$$\begin{aligned}&\int _{{S'}} |\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{b}_L[s])d\varvec{b}_L[s]\nonumber \\&\quad =\int _{|\varvec{b}_L[s]-\varvec{b}_L^*[s]|>n^{1+d}u_n}\sqrt{\frac{n^{2+2d}}{2\pi }}(\varvec{b}_L[s]-\varvec{b}_L^*[s])\nonumber \\&\quad e^{-\frac{n^{2+2d}}{2}(\varvec{b}_L[s]-\varvec{b}_L^*[s])^2}d\varvec{b}_L[s]\nonumber \\&\quad =\frac{2}{\sqrt{n^{2+2d}}}\int _{n^{1+d}u_n}^{\infty } \frac{u}{\sqrt{2\pi }}e^{-\frac{1}{2}u^2}du\le e^{-n^{1+d}u_n} \end{aligned}$$
(B23)

Also, \(E_{q(\varvec{b}_L[s])}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|=\sqrt{2/\pi }(1/n^{1+d})\). Thus

$$\begin{aligned}&E_{q(\varvec{b}_L[s])}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|Q(\tilde{S}^c)\nonumber \\&\quad = O\left( \frac{K_n}{n^{1+d}}\left( 1-\Phi \left( n^{1+d}u_n\right) \right) \right) \sim \frac{K_n}{n^{2(1+d)}u_n}e^{- n^{2(1+d)}u_n}\nonumber \\&\quad \le e^{-n^{2(1+d)}u_n^2 } \end{aligned}$$
(B24)

where the first equality in the above step follows by observing that \(Q(\tilde{S}^c)\) behaves analogously to \(Q(S^c)\), which was computed in (B19), and the second step follows by Mill’s ratio and \(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), which implies \(n^{1+d} u_n \rightarrow \infty \). The third inequality in the above step is a consequence of the fact that \(K_n\le n^{1+d}\).

Combining (B20), (B23) and (B24), we get

$$\begin{aligned} \int _{S^c}\left( |\varvec{b}_L[s]-\varvec{b}_L^*[s]|\right) q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le e^{-n^{1+d}u_n} \end{aligned}$$
(B25)

Note that the third term in (B18) can be handled similarly to the second term, and it can be shown that

$$\begin{aligned}&\int _{S^c} |\varvec{b}_L[s]-\varvec{b}_L^*[s]|q(\varvec{\theta }_{n})d\varvec{\theta }_{n}+\nonumber \\&\quad \int _{S^c}\sum _{s'=1}^{k_{L_n}}|\varvec{A}_L[s][s']-\varvec{A}_L^*[s][s']| q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\quad \le k_{L_n+1}K_ne^{-n^{1+d}u_n}=o((n\epsilon _n^2)^2)e^{-n^{1+d}u_n}\nonumber \\&\quad \le o(\epsilon _n^2)e^{-(n^{1+d}u_n-2\log n)}=o(\epsilon _n^2) \end{aligned}$$
(B26)

where the last equality in the second step follows by \(K_n=o(n \epsilon _n^2)\) and the argument in (B21) by which \(e^{-(n^{1+d}u_n-2\log n)}=o(1)\).

Combining (B20) and (B26) with (B18), the proof follows. \(\square \)

Using Propositions 15, 16 and 17, the following Proposition 18 establishes that the KL-distance between the true posterior and the variational posterior is suitably bounded.

Proposition 18

Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\).

  1.

    Let \(L_n=L\) and \(p_n=p\) be independent of \(n\). If \(K_n\log n=o(n)\) and \(||\varvec{\mu }_n||_2^2=o(n)\), then

    $$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n) \end{aligned}$$
    (B27)
  2.

    Let \(K_n\log n=o(n\epsilon _n^2)\), \(L_n \sim \log n\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). If there exists a neural network such that \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), then

    $$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n\epsilon _n^2) \end{aligned}$$
    (B28)

Proof

For any \(q \in \mathcal {Q}_n\),

$$\begin{aligned}&d_{\textrm{KL}}(q,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))\nonumber \\&\quad =\int q(\varvec{\theta }_{n})\log q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad -\int q(\varvec{\theta }_{n}) \log \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n}\nonumber \\&\quad =\int q(\varvec{\theta }_{n})\log q(\varvec{\theta }_{n})d\varvec{\theta }_{n}\nonumber \\&\qquad - \int q(\varvec{\theta }_{n}) \log \frac{L(\varvec{\theta }_{n})p(\varvec{\theta }_{n})}{\int L(\varvec{\theta }_{n})p(\varvec{\theta }_{n})d\varvec{\theta }_{n}} d\varvec{\theta }_{n}\nonumber \\&\quad =d_{\textrm{KL}}(q,p)-\int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\nonumber \\&\qquad +\log \int \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\nonumber \\&\quad \le d_{\textrm{KL}}(q,p)+\left| \int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| \nonumber \\&\qquad +\left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0} p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| \end{aligned}$$
(B29)

Since \(\pi ^*\) minimizes the KL-distance to \(\pi (.|\varvec{y}_{n},\varvec{X}_{n})\) over the family \(\mathcal {Q}_n\), for any \(\kappa >0\)

$$\begin{aligned}&P_0^n\left( d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))>\kappa \right) \nonumber \\&\quad \le P_0^n\left( d_{\textrm{KL}}(q,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))>\kappa \right) \end{aligned}$$
(B30)

\(\square \)
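
The chain of relations in (B29) is the familiar evidence lower bound identity \(d_{\textrm{KL}}(q,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=d_{\textrm{KL}}(q,p)-\int \log (L(\varvec{\theta }_{n})/L_0)q(\varvec{\theta }_{n})d\varvec{\theta }_{n}+\log \int (L(\varvec{\theta }_{n})/L_0)p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\), followed by the triangle inequality. The sketch below verifies this identity in closed form for a toy conjugate Gaussian-mean model in which every term is analytically available; the toy model and all names in it are illustrative and are not the network model analysed here.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for scalars."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# Toy model: y_i ~ N(theta, 1), prior theta ~ N(mu0, tau0^2), "true" value theta0.
theta0, mu0, tau0, n = 0.7, 0.0, 2.0, 50
y = rng.normal(theta0, 1.0, size=n)

# Exact posterior of theta (conjugate).
post_var = 1.0 / (1.0 / tau0**2 + n)
post_mean = post_var * (mu0 / tau0**2 + y.sum())

# An arbitrary Gaussian variational candidate q.
m_q, s_q = 0.5, 0.3

# Left-hand side of the identity: KL(q, posterior).
lhs = kl_gauss(m_q, s_q, post_mean, math.sqrt(post_var))

# Right-hand side pieces. log L(theta) = -n/2 log(2 pi) - 0.5 sum (y - theta)^2,
# so E_q log L(theta) is in closed form; L0 is the likelihood at theta0.
loglik = lambda th: -0.5 * n * math.log(2 * math.pi) - 0.5 * np.sum((y - th) ** 2)
e_q_loglik = -0.5 * n * math.log(2 * math.pi) - 0.5 * (np.sum((y - m_q) ** 2) + n * s_q**2)
log_evidence = (  # log int L(theta) p(theta) dtheta, conjugate closed form
    -0.5 * n * math.log(2 * math.pi)
    + 0.5 * math.log(post_var / tau0**2)
    - 0.5 * (np.sum(y**2) + mu0**2 / tau0**2 - post_mean**2 / post_var)
)
kl_q_prior = kl_gauss(m_q, s_q, mu0, tau0)
rhs = kl_q_prior - (e_q_loglik - loglik(theta0)) + (log_evidence - loglik(theta0))

print(lhs, rhs)   # the two sides agree up to floating-point error
```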

Proof of part 1

Note that \(K_n\log n=o(n)\), \(||\mu _n||_2^2 =o(n)\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/\sqrt{n})\), where \(\varvec{\theta }^*_n\) is defined next. For \(N\ge 1\), let \(\eta _{\varvec{\theta }^*_N}\) be a finite neural network approximation satisfying \(||\eta _{\varvec{\theta }^*_N}-\eta _0||_1 \le \varepsilon /4\). Since \(\eta _0\) is a continuous function defined on the compact set \([0,1]^p\), the existence of such a neural network is guaranteed by Theorem 2.1 in Hornik et al. (1989). Let \(\varvec{\theta }_{n}^*\) agree with \(\varvec{\theta }_N^*\) on all coefficients of this finite network and be zero on all remaining coefficients.

Step 1 (a): Using proposition 16, with \(\epsilon _n=1\), we get for any \(\nu >0\),

$$\begin{aligned} P_0^n(d_{\textrm{KL}}(q,p)>n\nu )=0 \end{aligned}$$
(B31)

where the above step follows since \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\).
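
As a rough numerical illustration of Step 1 (a), the KL-divergence between two diagonal Gaussians is available in closed form, and for a variational distribution concentrated at \(\varvec{\theta }^*_n\) it is driven by \(K_n\), \(||\varvec{\theta }^*_n-\varvec{\mu }_n||_2^2\) and the log-ratio of the two scale vectors. The sketch below uses hypothetical dimensions and scales (it is not the paper's construction) and simply reports the ratio of the resulting KL to \(n\).

```python
import numpy as np

def kl_diag_gauss(mu_q, sd_q, mu_p, sd_p):
    """KL( N(mu_q, diag(sd_q^2)) || N(mu_p, diag(sd_p^2)) ), all arguments vectors."""
    return np.sum(
        np.log(sd_p / sd_q) + (sd_q**2 + (mu_q - mu_p) ** 2) / (2 * sd_p**2) - 0.5
    )

n, K_n = 10_000, 200                                  # hypothetical sample size / #parameters
theta_star = np.random.default_rng(1).normal(0.0, 0.1, size=K_n)

mu_p, sd_p = np.zeros(K_n), np.full(K_n, float(n))    # prior scale O(n), as assumed for zeta_n
mu_q, sd_q = theta_star, np.full(K_n, n ** -0.5)      # q centred at theta*, small illustrative scale

kl = kl_diag_gauss(mu_q, sd_q, mu_p, sd_p)
print(kl, kl / n)   # KL is of order K_n * log n, well below n when K_n log n = o(n)
```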

Step 1 (b): Next, note that

$$\begin{aligned}&d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})\nonumber \\&\quad = \int _{\varvec{x}\in [0,1]^{p_n}} \left( \sigma (\eta _0(\varvec{x}))\log \frac{\sigma (\eta _0(\varvec{x}))}{\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))}\right. \nonumber \\&\quad \left. +(1-\sigma (\eta _0(\varvec{x})))\log \frac{1-\sigma (\eta _0(\varvec{x}))}{1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))}\right) d\varvec{x}\nonumber \\&\quad =\int _{\varvec{x}\in [0,1]^{p_n}} \left( \sigma (\eta _0(\varvec{x}))(\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x}))\right. \nonumber \\&\qquad \left. +\log \frac{1-\sigma (\eta _0(\varvec{x}))}{1-\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))}\right) d\varvec{x}\end{aligned}$$
(B32)

Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1 \le \varepsilon /4\), using proposition 17 with \(\epsilon _n=1\), we get \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \varepsilon \), which follows by noting that \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vN}\prod _{v'=v+1}^{L} a^*_{v'N})=O(\log n)\).

Therefore, by Lemma 9,

$$\begin{aligned} P_0^n\left( \left| \int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > n\nu \right) \le \frac{\varepsilon }{\nu }. \end{aligned}$$
(B33)
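
The second equality in (B32) is the elementary identity \(d_{\textrm{KL}}(\textrm{Bernoulli}(\sigma (\eta _0)),\textrm{Bernoulli}(\sigma (\eta )))=\sigma (\eta _0)(\eta _0-\eta )+\log \frac{1-\sigma (\eta _0)}{1-\sigma (\eta )}\), which the following standalone snippet checks numerically on a few arbitrary test values.

```python
import math

sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))

def bern_kl(eta0, eta):
    """KL( Bern(sigmoid(eta0)) || Bern(sigmoid(eta)) ), computed from the definition."""
    p0, p = sigmoid(eta0), sigmoid(eta)
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

def bern_kl_identity(eta0, eta):
    """The same quantity via the identity used in (B32)."""
    p0 = sigmoid(eta0)
    return p0 * (eta0 - eta) + math.log((1 - p0) / (1 - sigmoid(eta)))

for eta0, eta in [(0.0, 2.0), (-1.3, 0.4), (2.5, 2.6)]:
    print(bern_kl(eta0, eta), bern_kl_identity(eta0, eta))   # each pair agrees
```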

Step 1 (c): Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon /4\), using proposition 15 with \(\epsilon _n=1\) and \(\nu =\varepsilon \),

$$\begin{aligned} \int _{\varvec{\theta }_{n}\in \mathcal {N}_\varepsilon } p(\varvec{\theta }_{n})d\varvec{\theta }_{n}&\ge \exp (-n\varepsilon ) \end{aligned}$$

which follows by \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vn}\prod _{v'=v+1}^{L} a^*_{v'n})=O(\log n)\). Therefore, using Lemma 7, we get

$$\begin{aligned} P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > n\nu \right) \le \frac{2\varepsilon }{\nu } \end{aligned}$$
(B34)

Step 1 (d): From (B30) and (B29) we get

$$\begin{aligned}&P_0^n(d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))>3n\nu )\le P_0^n \left( d_{\textrm{KL}}(q,p)>n\nu \right) \nonumber \\&\qquad +P_0^n\left( \left| \int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right|> n\nu \right) \nonumber \\&\qquad +P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > n\nu \right) \nonumber \\&\quad \le \frac{3\varepsilon }{\nu } \end{aligned}$$
(B35)

where the last inequality is a consequence of (B31), (B33) and (B34).

Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)

Proof of part 2

Note that \(K_n\log n=o(n\epsilon _n^2)\), \(||\mu _n||_2^2 =o(n\epsilon _n^2)\), \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/n^{2+2d})\) with \(d>d^*\), where \(\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), \(d^*>0\). We next define \(\varvec{\theta }_{n}^*\) as follows:

Let \(\eta _{\varvec{\theta }^*_n}\) be a neural network satisfying

$$\begin{aligned} ||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4 \hspace{5mm} ||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2) \end{aligned}$$

The existence of such a neural network is guaranteed by the assumption that \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1=o(\epsilon _n^2)\).

Step 2 (a): Since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), by proposition 16,

$$\begin{aligned} P_0^n(d_{\textrm{KL}}(q,p)>\nu n\epsilon _n^2)=0 \end{aligned}$$
(B36)

Step 2 (b): By proposition 17, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \((\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\log n=o(n\epsilon _n^2)\), we have

$$\begin{aligned}\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \varepsilon \epsilon _n^2 \end{aligned}$$

Therefore, by Lemma 9,

$$\begin{aligned} P_0^n\left( \left| \int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > \nu n \epsilon _n^2\right) \le \frac{\varepsilon }{\nu }. \end{aligned}$$
(B37)

Step 2 (c): By proposition 15, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), we have

$$\begin{aligned} \int _{\varvec{\theta }_{n}\in \mathcal {N}_{\varepsilon \epsilon _n^2}} p(\varvec{\theta }_{n})d\varvec{\theta }_{n}&\ge \exp (-\varepsilon n\epsilon _n^2) \end{aligned}$$

Therefore, using Lemma 7, we get

$$\begin{aligned} P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > \nu n \epsilon _n^2\right) \le \frac{2\varepsilon }{\nu } \end{aligned}$$
(B38)

Step 2 (d): From (B30) and (B29) we get

$$\begin{aligned}&P_0^n(d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))>3\nu n\epsilon _n^2)\le P_0^n \left( d_{\textrm{KL}}(q,p)>\nu n\epsilon _n^2\right) \nonumber \\&\quad +P_0^n\left( \left| \int \log \frac{L(\varvec{\theta }_{n})}{L_0}q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right|> \nu n\epsilon _n^2\right) \nonumber \\&\quad +P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > \nu n\epsilon _n^2\right) \le \frac{3\varepsilon }{\nu } \end{aligned}$$
(B39)

where the last inequality is a consequence of (B36), (B37) and (B38).

Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)

Appendix C Consistency of the variational posterior

Proof of Theorem 1

We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).

By assumptions (A1) and (A2), the prior parameters satisfy

$$\begin{aligned}{} & {} ||\varvec{\mu }_n||_2^2=o(n), \,\,\log ||\varvec{\zeta }_n||_\infty =O(\log n),\\{} & {} ||\varvec{\zeta }_n^*||_\infty =O(1), \,\, \varvec{\zeta }^*_n=1/\varvec{\zeta }_n. \end{aligned}$$

Note that \(K_n \sim n^a\), \(0<a<1\), which implies \(K_n\log n=o(n)\). By proposition 18, part 1,

$$\begin{aligned} d_{\textrm{KL}} (\pi ^*, \pi (.|\varvec{y}_{n},\varvec{X}_{n}))= o_{P_0^n}(n). \end{aligned}$$
(C40)

By step 1 (c) in the proof of proposition 18

$$\begin{aligned} B_n= o_{P_0^n}(n) \end{aligned}$$
(C41)

Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b)\) for \(a<b<1\). Using proposition 13 with \(\epsilon _n=1\),

$$\begin{aligned}{} & {} -\pi ^*(\mathcal {U}_\varepsilon ^c)A_n \ge n\varepsilon ^2\pi ^*(\mathcal {U}_\varepsilon ^c)-\log 2+o_{P_0^n}(1)\nonumber \\{} & {} \quad =n\varepsilon ^2\pi ^*(\mathcal {U}_\varepsilon ^c)+O_{P_0^n}(1) \end{aligned}$$
(C42)

Thus, using (C40), (C41) and (C42) in (32), we get

$$\begin{aligned}{} & {} n\varepsilon ^2\pi ^*(\mathcal {U}_\varepsilon ^c)+O_{P_0^n}(1)\le o_{P_0^n}(n)+o_{P_0^n}(n) \\{} & {} \quad \implies \pi ^*(\mathcal {U}_\varepsilon ^c)=o_{P_0^n}(1) \end{aligned}$$

\(\square \)

Proof of Theorem 2

We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).

Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).

By assumptions (A1) and (A4), the prior parameters satisfy

$$\begin{aligned}{} & {} ||\varvec{\mu }_n||_2^2=o(n \epsilon _n^2), \,\,\log ||\varvec{\zeta }_n||_\infty =O(\log n),\\{} & {} ||\varvec{\zeta }_n^*||_\infty =O(1), \,\, \varvec{\zeta }^*_n=1/\varvec{\zeta }_n. \end{aligned}$$

Also by assumption (A3),

$$\begin{aligned}{} & {} ||\eta _0-\eta _{\varvec{\theta }^*_n}||_1=o(\epsilon _n^2), \,\, ||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2),\\{} & {} \log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n) \end{aligned}$$

By proposition 18, part 2,

$$\begin{aligned} d_{\textrm{KL}} (\pi ^*, \pi (.|\varvec{y}_{n},\varvec{X}_{n}))= o_{P_0^n}(n\epsilon _n^2). \end{aligned}$$
(C43)

By step 2 (c) in the proof of proposition 18

$$\begin{aligned} B_n= o_{P_0^n}(n \epsilon _n^2 ) \end{aligned}$$
(C44)

Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\). Using proposition 13, it follows that

$$\begin{aligned}{} & {} -\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n}^c )A_n \ge \varepsilon ^2 n \epsilon _n^2 \pi ^*(\mathcal {U}_{\varepsilon \epsilon _n}^c)-\log 2+o_{P_0^n}(1)\nonumber \\{} & {} \quad =\varepsilon ^2 n \epsilon _n^2 \pi ^*(\mathcal {U}_{\varepsilon \epsilon _n}^c)+O_{P_0^n}(1) \end{aligned}$$
(C45)

Thus, using (C43), (C44) and (C45) in (32), we get

$$\begin{aligned}{} & {} n\varepsilon ^2 \epsilon _n^2\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n}^c)+O_{P_0^n}(1)\le o_{P_0^n}(n\epsilon _n^2)+o_{P_0^n}(n\epsilon _n^2) \\{} & {} \quad \implies \pi ^*(\mathcal {U}_{\varepsilon \epsilon _n}^c)=o_{P_0^n}(1) \end{aligned}$$

\(\square \)

Proof of Corollary 1

Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\).

$$\begin{aligned} d_{\textrm{H}}(\hat{\ell }_n,\ell _0)&=d_{\textrm{H}}\left( \int \ell _{\varvec{\theta }_{n}} \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n},\ell _0\right) \\&\le \int d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n} \hspace{5mm} \text {Jensen's inequality}\\&=\int _{\mathcal {U}_\varepsilon } d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\\&\quad +\int _{\mathcal {U}_\varepsilon ^c} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\\&\le \varepsilon +o_{P_0^n}(1) \end{aligned}$$

Taking \(\varepsilon \rightarrow 0\), we get \(d_{\text {H}}(\hat{\ell }_n,\ell _0)=o_{P_0^n}(1)\). Let

$$\begin{aligned} \hat{\eta }(\varvec{x})=\sigma ^{-1}\left( \int \sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x})) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n} \right) \end{aligned}$$
(C46)

then, note that \(\hat{\eta }(\varvec{x})=\log ( \hat{\ell }_n(1,\varvec{x})/\hat{\ell }_n(0,\varvec{x}))\). Further,

$$\begin{aligned} 2d^2_{\textrm{H}}(\hat{\ell }_n,\ell _0) =2-2 \int _{\varvec{x}\in [0,1]^p} \sum _{y \in \{0,1\}} \sqrt{\hat{\ell }_n(y,\varvec{x})\ell _0(y,\varvec{x})} d\varvec{x}\end{aligned}$$

This implies

$$\begin{aligned}&2d^2_{\textrm{H}}(\hat{\ell }_n,\ell _0)\nonumber \\&\quad =2-2\int _{\varvec{x}\in [0,1]^p} \nonumber \\&\quad \sum _{y \in \{0,1\}} e^{\left\{ \frac{1}{2}\left( y\hat{\eta }(\varvec{x})-\log (1+e^{\hat{\eta }(\varvec{x})})+y\eta _0(\varvec{x})-\log (1+e^{\eta _0(\varvec{x})})\right) \right\} } d\varvec{x}\nonumber \\&\quad =2-2\int _{\varvec{x}\in [0,1]^p} \left( \sqrt{\sigma (\eta _0(\varvec{x}))\sigma (\hat{\eta }(\varvec{x}))}\right. \nonumber \\&\quad \left. +\sqrt{(1-\sigma (\eta _0(\varvec{x})))(1-\sigma (\hat{\eta }(\varvec{x})))}\right) d\varvec{x}\nonumber \\&\quad \ge 2-2 \int _{\varvec{x}\in [0,1]^p} \sqrt{1- (\sqrt{\sigma (\eta _0(\varvec{x}))}-\sqrt{\sigma (\hat{\eta }(\varvec{x}))})^2}d\varvec{x}\nonumber \\&\quad \ge \int _{\varvec{x}\in [0,1]^p} (\sqrt{\sigma (\eta _0(\varvec{x}))}-\sqrt{\sigma (\hat{\eta }(\varvec{x}))})^2d\varvec{x}\nonumber \\&\quad \ge \frac{1}{4}\int _{\varvec{x}\in [0,1]^p} (\sigma (\eta _0(\varvec{x}))-\sigma (\hat{\eta }(\varvec{x})))^2d\varvec{x}\end{aligned}$$
(C47)

In the above display, the second and the third inequalities hold because \(\sqrt{1-x}\le 1-x/2\) and \(|p_1-p_2|\le |\sqrt{p_1}+\sqrt{p_2}||\sqrt{p_1}-\sqrt{p_2}|\le 2|\sqrt{p_1}-\sqrt{p_2}|\), respectively. The first inequality holds because

$$\begin{aligned}&\left( \sqrt{p_1p_2}+\sqrt{(1-p_1)(1-p_2)}\right) ^2\\&\quad =2p_1p_2+1-p_1-p_2+2\sqrt{p_1p_2(1-p_1)(1-p_2)}\\&\quad =1-p_1-p_2+2\sqrt{p_1p_2}\left( \sqrt{p_1p_2}+\sqrt{(1-p_1)(1-p_2)}\right) \\&\quad \le 1-p_1-p_2+2\sqrt{p_1p_2}=1-(\sqrt{p_1}-\sqrt{p_2})^2 \end{aligned}$$

where the inequality uses \(\sqrt{p_1p_2}+\sqrt{(1-p_1)(1-p_2)}\le 1\).
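
The elementary facts invoked here, \(\sqrt{p_1p_2}+\sqrt{(1-p_1)(1-p_2)}\le 1\), \(\sqrt{1-x}\le 1-x/2\) and \(|p_1-p_2|\le 2|\sqrt{p_1}-\sqrt{p_2}|\) for \(p_1,p_2,x\in [0,1]\), are easy to spot-check on a grid; the snippet below is such a sanity check and nothing more.

```python
import numpy as np

p1, p2 = np.meshgrid(np.linspace(0.01, 0.99, 99), np.linspace(0.01, 0.99, 99))
x = np.linspace(0.0, 1.0, 101)

bc = np.sqrt(p1 * p2) + np.sqrt((1 - p1) * (1 - p2))            # Bhattacharyya coefficient
print(np.all(bc <= 1 + 1e-12))                                  # True
print(np.all(np.sqrt(1 - x) <= 1 - x / 2 + 1e-12))              # True
print(np.all(np.abs(p1 - p2) <= 2 * np.abs(np.sqrt(p1) - np.sqrt(p2)) + 1e-12))  # True
print(np.all(bc**2 <= 1 - (np.sqrt(p1) - np.sqrt(p2)) ** 2 + 1e-12))             # the bound above
```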

By (C47) and the Cauchy–Schwarz inequality,

$$\begin{aligned}&\int _{\varvec{x}\in [0,1]^p} |\sigma (\eta _0(\varvec{x}))-\sigma (\hat{\eta }(\varvec{x}))| d\varvec{x}\nonumber \\&\quad \le \left( \int _{\varvec{x}\in [0,1]^p} (\sigma (\eta _0(\varvec{x}))-\sigma (\hat{\eta }(\varvec{x})))^2d\varvec{x}\right) ^{1/2}\nonumber \\&\quad \le 2\sqrt{2} d_{\text {H}}(\hat{\ell }_n, \ell _0)=o_{P_0^n}(1) \end{aligned}$$
(C48)

The proof follows in view of (35). \(\square \)
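
In practice the variational predictive probability in (C46) is approximated by Monte Carlo: draw \(\varvec{\theta }_{n}\) from the mean-field Gaussian \(\pi ^*\), average \(\sigma (\eta _{\varvec{\theta }_{n}}(\varvec{x}))\) over the draws, and apply the logit transform to obtain \(\hat{\eta }(\varvec{x})\). The sketch below does this for a hypothetical one-hidden-layer network; the architecture, sizes and variational parameters are placeholders rather than the fitted model of the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
logit = lambda u: np.log(u / (1.0 - u))

p, hidden, draws = 4, 8, 2000          # hypothetical sizes

# Mean-field Gaussian variational posterior: one mean and one sd per weight/bias.
mean = {"W1": rng.normal(0, 0.3, (hidden, p)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.3, hidden),      "b2": np.zeros(1)}
sd = {k: 0.05 * np.ones_like(np.asarray(v, dtype=float)) for k, v in mean.items()}

def eta(theta, x):
    """Logit output of a one-hidden-layer network with sigmoid activations."""
    h = sigmoid(theta["W1"] @ x + theta["b1"])
    return float(theta["w2"] @ h + theta["b2"][0])

def predictive_eta(x):
    """Monte Carlo version of (C46): eta_hat(x) = logit( E_{pi*} sigmoid(eta_theta(x)) )."""
    probs = []
    for _ in range(draws):
        theta = {k: mean[k] + sd[k] * rng.standard_normal(np.shape(mean[k])) for k in mean}
        probs.append(sigmoid(eta(theta, x)))
    return logit(np.mean(probs))

x = rng.uniform(0, 1, p)
eta_hat = predictive_eta(x)
print(eta_hat, int(eta_hat > 0))       # plug-in classifier: predict 1 iff eta_hat > 0
```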

Proof of Corollary 2

We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).

Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).

Also, \(K_n\log n=o(n^b \epsilon _n^2)\), \(a+\delta<b<1\), which implies \(K_n\log n =o(n^b (\epsilon _n^2)^\kappa )\), \(0\le \kappa \le 1\). Thus, using proposition 13 with \(\epsilon _n\) replaced by \(\epsilon _n^{\kappa }\), we get

$$\begin{aligned}{} & {} -\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^{\kappa }}^c )A_n \ge \varepsilon ^2 n \epsilon _n^{2\kappa } \pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^{\kappa }}^c)-\log 2+o_{P_0^n}(1)\nonumber \\{} & {} \quad =\varepsilon ^2 n \epsilon _n^{2\kappa } \pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^{\kappa }}^c)+O_{P_0^n}(1) \end{aligned}$$
(C49)

This together with (C43), (C44) and (32) implies \(\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^\kappa }^c)=o_{P_0^n}(\epsilon _n^{2-2\kappa })\).

Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\), then

$$\begin{aligned} d_{\textrm{H}}(\hat{\ell }_n,\ell _0)&\le \int _{\mathcal {U}_{\varepsilon \epsilon _n^\kappa }} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\\&\quad +\int _{\mathcal {U}_{\varepsilon \epsilon _n^\kappa }^c} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\\&\le \varepsilon \epsilon _n^\kappa +o_{P_0^n}(\epsilon _n^{2-2\kappa }) \end{aligned}$$

Dividing by \(\epsilon _n^\kappa \) on both sides we get

$$\begin{aligned} \frac{1}{\epsilon _n^\kappa }d_{\textrm{H}}(\hat{\ell }_n,\ell _0)&=o_{P_0^n}(\epsilon _n^{2-3\kappa })+o_{P_0^n}(1)=o_{P_0^n}(1),\\&\quad 0\le \kappa \le 2/3. \end{aligned}$$

By (C48), for every \(0\le \kappa \le 2/3\),

$$\begin{aligned}{} & {} \frac{1}{\epsilon _n^\kappa }\int _{\varvec{x}\in [0,1]^{p_n}} |\sigma (\eta _0(\varvec{x}))-\sigma (\hat{\eta }(\varvec{x}))| d\varvec{x}\\{} & {} \quad \le \frac{1}{\epsilon _n^\kappa }2\sqrt{2}d_{\text {H}}(\hat{\ell }_n,\ell _0)=o_{P_0^n}(1). \end{aligned}$$

The proof follows in view of (35). \(\square \)

Appendix D Consistency of the true posterior

From (11), note that

$$\begin{aligned} \pi (\mathcal {U}_{\varepsilon }^c|\varvec{y}_{n},\varvec{X}_{n})&=\frac{\int _{\mathcal {U}_\varepsilon ^c}L(\varvec{\theta }_{n})p(\varvec{\theta }_{n})d\varvec{\theta }_{n}}{\int L(\varvec{\theta }_{n})p(\varvec{\theta }_{n})d\varvec{\theta }_{n}}\nonumber \\&=\frac{\int _{\mathcal {U}_\varepsilon ^c}(L(\varvec{\theta }_{n})/L_0)p(\varvec{\theta }_{n})d\varvec{\theta }_{n}}{\int (L(\varvec{\theta }_{n})/L_0)p(\varvec{\theta }_{n})d\varvec{\theta }_{n}} \end{aligned}$$
(D50)

Theorem 19

Suppose conditions of Theorem 1 hold. Then,

  1.
    $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon }^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
  2.
    $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon )\rightarrow 1,n \rightarrow \infty \end{aligned}$$

Proof

By assumptions (A1) and (A2), the prior parameters satisfy

$$\begin{aligned}{} & {} ||\varvec{\mu }_n||_2^2=o(n), \,\,\log ||\varvec{\zeta }_n||_\infty =O(\log n),\\{} & {} ||\varvec{\zeta }_n^*||_\infty =O(1), \,\, \varvec{\zeta }^*_n=1/\varvec{\zeta }_n. \end{aligned}$$

Note that \(K_n \sim n^a\), \(0<a<1\), which implies \(K_n\log n=o(n)\). Thus, the conditions of proposition 15 hold with \(\epsilon _n=1\).

$$\begin{aligned}&P_0^n\left( \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \le e^{-n\nu }\right) \nonumber \\&\quad \le P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > n\nu \right) \rightarrow 0, \end{aligned}$$
(D51)

as \(n \rightarrow \infty \), which follows from (B34) (see step 1 (c) in the proof of proposition 18). Since \(K_n\log n=o(n^b)\) for \(a<b<1\), proposition 13 applies with \(\epsilon _n=1\).

$$\begin{aligned} P_0^n\left( \int _{\mathcal {U}_\varepsilon ^c} \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \ge 2 e^{-n\varepsilon ^2 }\right) \rightarrow 0, n \rightarrow \infty \end{aligned}$$
(D52)

which follows from (B7) with \(\epsilon _n=1\) in the proof of proposition 13. Using (D51) and (D52) with (D50), we get

$$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon }^c|\varvec{y}_{n},\varvec{X}_{n})\ge 2e^{-n(\varepsilon ^2-\nu )}\right) \rightarrow 0, n \rightarrow \infty \end{aligned}$$

Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. Mimicking the steps in the proof of Corollary 1,

$$\begin{aligned} d_{\textrm{H}}(\hat{\ell }_n,\ell _0)&\le \int d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n} \\&\quad \text {Jensen's inequality}\\&=\int _{\mathcal {U}_\varepsilon } d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n}\\&\quad +\int _{\mathcal {U}_\varepsilon ^c} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n}\\&\le \varepsilon +2e^{-n\varepsilon ^2/2}\le 2\varepsilon ,\\&\quad \text {with probability tending to 1 as }n\rightarrow \infty \end{aligned}$$

where the second last inequality is a consequence of part 1 of Theorem 19. The remaining part of the proof follows by (C48) and (35). \(\square \)
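
Part 2 of Theorem 19 converts closeness of \(\sigma (\hat{\eta })\) to \(\sigma (\eta _0)\) into closeness of classification risks. A standard fact behind bounds of this type is that the excess risk of the plug-in classifier \(\hat{C}(\varvec{x})=\varvec{1}\{\sigma (\hat{\eta }(\varvec{x}))>1/2\}\) is at most \(2\int |\sigma (\eta _0(\varvec{x}))-\sigma (\hat{\eta }(\varvec{x}))|d\varvec{x}\) under a uniform design; the exact constant delivered through (35) may differ. The toy one-dimensional check below uses made-up functions \(\eta _0\) and \(\hat{\eta }\).

```python
import numpy as np

# Toy check on [0,1]: X ~ Uniform(0,1), Y | X ~ Bernoulli(sigmoid(eta0(X))).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
eta0 = lambda x: 3.0 * np.sin(6.0 * x)                               # made-up true logit
eta_hat = lambda x: 3.0 * np.sin(6.0 * x) + 0.4 * np.cos(9.0 * x)    # made-up estimate

x = np.linspace(0.0, 1.0, 200_001)
p0, p_hat = sigmoid(eta0(x)), sigmoid(eta_hat(x))

def risk(decision):
    """Misclassification risk E[ P(Y != C(X) | X) ]; decision is boolean (True = predict 1)."""
    pointwise = np.where(decision, 1.0 - p0, p0)
    return pointwise.mean()        # uniform design on [0,1]

excess = risk(p_hat > 0.5) - risk(p0 > 0.5)     # excess risk over the Bayes classifier
l1_gap = np.abs(p0 - p_hat).mean()
print(excess, 2 * l1_gap)          # the excess risk is bounded by twice the L1 gap
```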

Table 6 Clinical Features and Cognitive Assessment Score. Values are shown as mean ± standard deviation or percentage. Test statistics and P-values for differences between MCI-S and MCI-C are based on (a) t-test or (b) chi-square test. MCI-S = non-progressive MCI; MCI-C = progressive MCI; APOE = apolipoprotein E; MMSE = Mini-Mental State Examination. RAVLT = The Rey Auditory Verbal Learning Test (immediate: sum of 5 trials; learning: trial 5-trial 1; forgetting: trial 5-delayed; perc.forgetting: percent forgetting); DIGT = The Digit-Symbol Coding test; TRAB = Trail Making tests; CDRSB = Clinical Dementia Rating Sum of Boxes; FAQ = Activities of Daily Living score; ADAS = Alzheimer’s Disease Assessment Scale Cognitive sub-scale; mPACCdigit = the Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite
Table 7 Significant MRI Features. Values are shown as mean ± standard deviation or percentage. Test statistics and P-values for differences between MCI-C and MCI-S are based on t-test. MCI-S = non-progressive MCI; MCI-C = progressive MCI. HippoR = Right Hippocampus; HippoL = Left Hippocampus; flWMR = frontal lobe WM right; flWML = frontal lobe WM left; plWMR = parietal lobe WM right; plWML = parietal lobe WM left; tlWMR = temporal lobe WM right; tlWML = temporal lobe WM left; ACgCR = Right ACgG anterior cingulate gyrus; ACgCL = Left ACgG anterior cingulate gyrus; EntR = Right Ent entorhinal area; EntL = Left Ent entorhinal area; MCgCR = Right MCgG middle cingulate gyrus; MCgCL = Left MCgG middle cingulate gyrus; MFCR = Right MFC medial frontal cortex; MFCL = Left MFC medial frontal cortex; OpIFGR = Right OpIFG opercular part of the inferior frontal gyrus; OpIFGL = Left OpIFG opercular part of the inferior frontal gyrus; OrIFGR = Right OrIFG orbital part of the inferior frontal gyrus; OrIFGL = Left OrIFG orbital part of the inferior frontal gyrus; PCgCR = Right PCgG posterior cingulate gyrus; PCgCL = Left PCgG posterior cingulate gyrus; PCuR = Right PCu precuneus; PCuL = Left PCu precuneus; SPLR = Right SPL superior parietal lobule; SPLL = Left SPL superior parietal lobule

Theorem 20

Suppose conditions of Theorem 2 hold. Then,

  1.
    $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\epsilon _n^2\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
  2.
    $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon \epsilon _n)\rightarrow 1,n \rightarrow \infty \end{aligned}$$

Proof

By assumptions (A1) and (A4), the prior parameters satisfy

$$\begin{aligned}{} & {} ||\varvec{\mu }_n||_2^2=o(n \epsilon _n^2), \,\,\log ||\varvec{\zeta }_n||_\infty =O(\log n),\\{} & {} ||\varvec{\zeta }_n^*||_\infty =O(1), \,\, \varvec{\zeta }^*_n=1/\varvec{\zeta }_n. \end{aligned}$$

Also by assumption (A3),

$$\begin{aligned}{} & {} ||\eta _0-\eta _{\varvec{\theta }^*_n}||_1=o(\epsilon _n^2), \,\, ||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2),\\{} & {} \log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n) \end{aligned}$$

Note that \(K_n \sim n^a\), \(0<a<1\), and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\); thus \(K_n\log n=o(n\epsilon _n^2)\) and the conditions of proposition 15 hold.

$$\begin{aligned}&P_0^n\left( \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \le e^{-n\epsilon _n^2 \nu }\right) \nonumber \\&\quad \le P_0^n\left( \left| \log \int \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\right| > n\epsilon _n^2\nu \right) \nonumber \\&\quad \rightarrow 0 \end{aligned}$$
(D53)

as \( n \rightarrow \infty \), where the above convergence follows from (B38) in step 2 (c) in the proof of proposition 18. Also, \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\); thus the conditions of proposition 13 hold.

$$\begin{aligned} P_0^n\left( \int _{\mathcal {U}_{\varepsilon \epsilon _n}^c} \frac{L(\varvec{\theta }_{n})}{L_0}p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \ge 2 e^{-n \epsilon _n^2\varepsilon ^2 }\right) \rightarrow 0, n \rightarrow \infty \end{aligned}$$
(D54)

which follows from (B7) in the proof of proposition 13.

Using (D53) and (D54) with (D50), we get \(P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\ge 2e^{-n\epsilon _n^2(\varepsilon ^2-\nu )}\right) \rightarrow 0\) as \(n \rightarrow \infty \). Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. Mimicking the steps in the proof of Corollary 2,

$$\begin{aligned} d_{\textrm{H}}(\hat{\ell }_n,\ell _0)&\le \int d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n} \\&\quad \text {Jensen's inequality}\\&=\int _{\mathcal {U}_{\varepsilon \epsilon _n}} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n}\\&\quad +\int _{\mathcal {U}_{\varepsilon \epsilon _n}^c} d_{\text {H}}(\ell _{\varvec{\theta }_{n}},\ell _0) \pi (\varvec{\theta }_{n}|\varvec{y}_{n},\varvec{X}_{n})d\varvec{\theta }_{n}\\&\le \varepsilon \epsilon _n+2e^{-2n\epsilon _n^2\varepsilon ^2}\le 2\varepsilon \epsilon _n, \\&\quad \text {with probability tending to 1 as }n\rightarrow \infty \end{aligned}$$

where the second last inequality is a consequence of part 1 of Theorem 20 and the last inequality follows since \(\epsilon _n^2 \sim n^{-\delta }\). Dividing by \(\epsilon _n\) on both sides, we get

$$\begin{aligned} \epsilon _n^{-1}d_{\textrm{H}}(\hat{\ell }_n,\ell _0) \le 2\varepsilon ,\,\,\text {with probability tending to 1 as }n\rightarrow \infty \end{aligned}$$

The remaining part of the proof follows by (C48) and (35). \(\square \)

Appendix E Tables for real data
