1 Introduction

In this paper, we establish several PAC-Bayesian inequalities. PAC is an abbreviation for the Probably Approximately Correct learning model; PAC-Bayesian analysis was introduced over a decade ago (Shawe-Taylor and Williamson [1]; Shawe-Taylor et al. [2]) and has contributed significantly to the analysis and development of supervised learning methods, the stochastic multiarmed bandit problem, and other areas. PAC-Bayesian analysis provides high-probability bounds on the deviation of weighted averages of empirical means of sets of independent random variables from their expectations, and it supplies generalization guarantees for many influential machine learning algorithms.

Shawe-Taylor and Williamson [1] established the first PAC-Bayesian learning theorems. They showed that if one can find a ball of sufficient volume in a parameterized concept space, then the center of that ball has a low error rate. The application of PAC-Bayesian analysis to non-i.i.d. data was partially addressed only recently by Ralaivola et al. [3] and Lever et al. [4]. For martingales, Seldin et al. [5] exploited the martingale structure to obtain a PAC-Bayes-Bernstein inequality and applied it to multiarmed bandits. Seldin et al. [6] presented a generalization of the PAC-Bayesian analysis that makes it possible to consider model order selection simultaneously with the exploration-exploitation trade-off.

In the present paper, we continue the study of PAC-Bayesian inequalities for certain sequences of random variables. We concentrate on conditionally symmetric random variables and on locally square integrable martingales. In the first case, we assume only the conditional symmetry of the sequence, with no other dependence or moment conditions. In the second case, the boundedness condition of Seldin et al. [6] is weakened. The paper is organized as follows: in Section 2 we state the main results and make some remarks; proofs of the main results are provided in Section 3.

2 Main results

In this section, we discuss the PAC-Bayesian inequalities for conditionally symmetric random variables and martingales. In order to present our main theorems, we give a few definitions. Let \(\mathbb{H}\) be an index (or a hypothesis) space, possibly uncountably infinite. Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\} \) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{n}\}\), where \(\mathcal{F}_{n}=\sigma\{ X_{k}(h): 1\le k\le n \text{ and } h\in\mathbb{H}\}\) is the σ-field generated by the observations up to time n (the history).

Seldin et al. [6] obtained a PAC-Bayes-Bernstein inequality for martingales with bounded jumps.

Theorem 2.1

(Seldin et al. [6])

Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\}\) be a set of martingale difference sequences adapted to an increasing sequence of σ-fields \(\mathcal{F}_{n}=\sigma\{X_{k}(h): 1\le k\le n \textit{ and } h\in\mathbb{H}\}\). Furthermore, let \(M_{n}(h)=\sum_{k=1}^{n}X_{k}(h)\) be martingales corresponding to the martingale difference sequences and let \(V_{n}(h)=\sum_{k=1}^{n}\mathbb{E}[X_{k}(h)^{2}|\mathcal{F}_{k-1}]\) be cumulative variances of the martingales. For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho(h)}[V_{n}(h)]\). Let \(\{C_{1}, C_{2}, \ldots\}\) be an increasing sequence set in advance, such that \(|X_{k}(h)|\le C_{k}\) for all h with probability 1. Let \(\{\mu_{1},\mu _{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Let \(\{\lambda_{1},\lambda_{2},\ldots\}\) be a sequence of positive numbers set in advance that satisfy \(\lambda_{k}\le C_{k}^{-1}\). Then for all possible distributions \(\rho_{n}\) over \(\mathbb {H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le(e-2) \lambda_{n}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel\mu _{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda_{n}}, $$

where \(\operatorname{KL}(\mu\parallel\nu)\) is the KL-divergence (relative entropy) between two distributions μ and ν.
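As a quick numerical illustration (not part of the theorem), the bound of Theorem 2.1 can be checked by Monte Carlo on a toy finite hypothesis space; the hypothesis-space size, horizon, and constants below are all arbitrary illustrative choices.

```python
import numpy as np

# Illustrative sanity check of the PAC-Bayes-Bernstein bound on a toy
# finite hypothesis space; the constants below are arbitrary choices.
rng = np.random.default_rng(0)

H, n, delta = 5, 200, 0.05
C = 1.0                     # uniform bound: |X_k(h)| <= C_k = C for all k
lam = 0.5 / C               # lambda_n <= C_n^{-1}

# Bounded martingale differences: independent uniforms on [-C, C].
X = rng.uniform(-C, C, size=(n, H))

M_n = X.sum(axis=0)                     # M_n(h)
V_n = np.full(H, n * C**2 / 3.0)        # E[X_k(h)^2 | F_{k-1}] = C^2 / 3

rho = np.full(H, 1.0 / H)               # take rho_n = mu_n (uniform), so
KL = 0.0                                # KL(rho_n || mu_n) = 0

lhs = abs(rho @ M_n)
rhs = (np.e - 2) * lam * (rho @ V_n) \
    + (KL + 2 * np.log(n + 1) + np.log(2 / delta)) / lam
```

On this toy instance the right-hand side is dominated by the \(\ln\frac{2}{\delta}\) and \(\ln(n+1)\) terms, so the bound holds with a large margin.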

Seldin et al. [6] (see Theorem 2.1) considered bounded martingale difference sequences, where the parameters \((\lambda_{k})\) depend on the bounds through \(\lambda_{k}\le C_{k}^{-1}\). Since \((C_{k})\) is an increasing sequence, \((\lambda_{k})\) must be decreasing. Furthermore, Seldin et al. [6] studied the deviation properties between a martingale and its conditional variance. Building on these works, in the present paper we consider conditionally symmetric random variables and locally square integrable martingales. For conditionally symmetric random variables, we establish a PAC-Bayesian inequality without any dependence assumptions (for example, independence or the martingale property) and without any moment assumptions on the sequences. For locally square integrable martingales, we remove the boundedness restriction.

2.1 Conditionally symmetric random variables

Assume that \(\{X_{k}(h): k\ge1, h\in\mathbb{H}\}\) are conditionally symmetric with respect to \((\mathcal{F}_{n})\) (i.e., \(\mathcal{L}(X_{i}(h)|\mathcal{F}_{i-1})=\mathcal {L}(-X_{i}(h)|\mathcal{F}_{i-1})\)). Let

$$M_{n}(h)=\sum_{k=1}^{n}X_{k}(h) \quad \text{and}\quad V_{n}(h)=\sum_{k=1}^{n} \bigl[X_{k}(h)^{2}\bigr]. $$

For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho (h)}[V_{n}(h)]\). We establish the PAC-Bayesian inequality between the partial sums and the total quadratic variation of the partial sums.

Theorem 2.2

Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel \mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$
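Since Theorem 2.2 requires only conditional symmetry, it applies even to heavy-tailed data with no finite moments. The following sketch (with arbitrary illustrative constants) checks the bound empirically for i.i.d. standard Cauchy variables; over repeated trials the empirical failure rate should not exceed δ.

```python
import numpy as np

# Empirical check of Theorem 2.2 with symmetric heavy-tailed (Cauchy)
# variables, which have no finite moments; constants are illustrative.
rng = np.random.default_rng(1)

H, n, delta, lam = 4, 100, 0.05, 0.2
trials, fails = 200, 0

for _ in range(trials):
    X = rng.standard_cauchy(size=(n, H))    # symmetric, hence conditionally symmetric
    M_n = X.sum(axis=0)
    V_n = (X ** 2).sum(axis=0)              # total quadratic variation
    rho = np.full(H, 1.0 / H)               # rho_n = mu_n, so KL = 0
    lhs = abs(rho @ M_n)
    rhs = 0.5 * lam * (rho @ V_n) \
        + (2 * np.log(n + 1) + np.log(2 / delta)) / lam
    fails += lhs > rhs

frac_fail = fails / trials
```

The self-normalizing term \(V_{n}(\rho_{n})\) grows together with any large fluctuation of \(M_{n}(\rho_{n})\), which is why the bound survives the absence of moments.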

The following theorem gives an inequality in the sense of a self-normalized sequence.

Theorem 2.3

Under the assumptions of Theorem 2.2, for all \(y>0\) and all n simultaneously, with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln(n+1)+\ln \frac{2}{\delta}. \end{aligned}$$
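A hypothetical single-sample check of the self-normalized bound of Theorem 2.3 along the same lines, again with standard Cauchy variables and \(\rho_{n}=\mu_{n}\) uniform (so the KL term vanishes); all constants are illustrative.

```python
import numpy as np

# Illustrative check of Theorem 2.3; the logarithmic term is nonpositive,
# so the left-hand side typically sits far below the right-hand side.
rng = np.random.default_rng(2)

H, n, delta, y = 4, 100, 0.05, 1.0
X = rng.standard_cauchy(size=(n, H))
M_n = X.sum(axis=0)
V_n = (X ** 2).sum(axis=0)
rho = np.full(H, 1.0 / H)                 # rho_n = mu_n, so KL = 0

inner = np.log(y / np.sqrt(V_n + y ** 2)) + M_n ** 2 / (2 * (V_n + y ** 2))
lhs = rho @ inner
rhs = 2 * np.log(n + 1) + np.log(2 / delta)
```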

2.2 Martingales

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to the filtration \((\mathcal{F}_{n})\) with \(M_{0}=0\). The predictable quadratic variation and the total quadratic variation of \((M_{n}(h))\) are, respectively, given by

$$\langle M \rangle_{n}(h)=\sum_{k=1}^{n} \mathbb{E}\bigl[\bigl(\Delta M_{k}(h)\bigr)^{2}|\mathcal {F}_{k-1}\bigr]\quad \text{and}\quad [M]_{n}(h)=\sum _{k=1}^{n}\bigl(\Delta M_{k}(h) \bigr)^{2}, $$

where \(\Delta M_{n}(h)=M_{n}(h)-M_{n-1}(h)\). For a distribution ρ over \(\mathbb{H}\), define

$$M_{n}(\rho) =\mathbb{E}_{\rho(h)}\bigl[M_{n}(h)\bigr]. $$
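For concreteness, the two quadratic variations can be computed as follows for a single hypothesis, taking \(\Delta M_{k}=a_{k}Z_{k}\) with an \(\mathcal{F}_{k-1}\)-measurable scale \(a_{k}\) and \(Z_{k}\) standard normal, so that \(\mathbb{E}[(\Delta M_{k})^{2}|\mathcal{F}_{k-1}]=a_{k}^{2}\); this particular model is a hypothetical construction for illustration only.

```python
import numpy as np

# Predictable vs. total quadratic variation along one sample path of a
# locally square integrable martingale; the model is an illustrative choice.
rng = np.random.default_rng(3)

n = 10
Z = rng.standard_normal(n)
a = np.empty(n)
M = np.zeros(n + 1)
for k in range(1, n + 1):
    a[k - 1] = 1.0 + 0.5 * abs(M[k - 1])   # predictable: uses the past only
    M[k] = M[k - 1] + a[k - 1] * Z[k - 1]

dM = np.diff(M)
total_qv = np.cumsum(dM ** 2)   # [M]_n, total quadratic variation
pred_qv = np.cumsum(a ** 2)     # <M>_n, since E[(a_k Z_k)^2 | F_{k-1}] = a_k^2
```

Note that \(\langle M\rangle_{n}\) is computable from the model (it is predictable), while \([M]_{n}\) is computed from the observed increments.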

Theorem 2.4

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to a filtration \(\mathbb{F}=(\mathcal{F}_{n})\). Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but it can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda}, \quad \lambda>0. $$
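A sketch of an empirical check of Theorem 2.4 for a Gaussian martingale with unit conditional variances (so \(\langle M\rangle_{n}=n\)); the hypothesis-space size and constants are illustrative choices.

```python
import numpy as np

# Illustrative check of Theorem 2.4 for martingales with standard normal
# increments; rho_n = mu_n uniform, so the KL term vanishes.
rng = np.random.default_rng(4)

H, n, delta, lam = 4, 100, 0.05, 0.2
Z = rng.standard_normal(size=(n, H))     # Delta M_k(h)

M_n = Z.sum(axis=0)
bracket = (Z ** 2).sum(axis=0)           # [M]_n(h)
angle = np.full(H, float(n))             # <M>_n(h) = n

rho = np.full(H, 1.0 / H)
mix = rho @ (bracket / 3 + 2 * angle / 3)

lhs = abs(rho @ M_n)
rhs = 0.5 * lam * mix + (2 * np.log(n + 1) + np.log(2 / delta)) / lam
```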

Theorem 2.5

Under the assumptions of Theorem 2.4, for all \(y>0\) and all n simultaneously, with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel \mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

3 The proofs of main results

Before giving the proofs of our results, we state the following basic lemmas.

Lemma 3.1

[7]

Let \(\{d_{i}\}\) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{i}\}\). Assume that the \(d_{i}\) are conditionally symmetric. Then

$$\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2}\sum_{i=1}^{n}d_{i}^{2} \Biggr),\quad n\geq1, $$

is a supermartingale with mean ≤1, for all \(\lambda\in\mathbb{R}\).

Lemma 3.2

Under the assumptions of Lemma  3.1, for any \(y>0\), we have

$$\mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr)\leq1. $$

Remark 3.1

Hitczenko [8] proved the above inequality for conditionally symmetric martingale difference sequences, and De la Peña [7] obtained the above inequality without the martingale difference assumption, hence without any integrability assumptions. Note that any sequence of real-valued random variables \(X_{i}\) can be ‘symmetrized’ to produce an exponential supermartingale by introducing random variables \(X_{i}'\) such that

$$\mathcal{L}\bigl(X_{n}'|X_{1},X_{1}', \ldots,X_{n-1},X_{n-1}',X_{n}\bigr) = \mathcal{L}\bigl(X_{n}|X_{1}, \ldots,X_{n-1}\bigr) $$

and we set \(d_{n}=X_{n}-X_{n}'\).

Proof

By using Fubini’s theorem, we have

$$\begin{aligned}& \mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \quad = \mathbb{E}\biggl[\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \qquad {}\times\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{ \sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda\biggr] \\& \quad = \int_{-\infty}^{\infty} \mathbb{E} \Biggl[ \frac{y}{\sqrt{2\pi}}\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2} \sum_{i=1}^{n}d_{i}^{2} \Biggr)\exp \biggl(-\frac{\lambda ^{2}y^{2}}{2} \biggr) \Biggr]\, d\lambda \\& \quad \leq \int_{-\infty}^{\infty} \biggl[\frac{y}{\sqrt{2\pi}} \exp \biggl(-\frac{\lambda^{2}y^{2}}{2} \biggr) \biggr]\, d\lambda=1, \end{aligned}$$

where we used the fact

$$\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda=1. $$

 □
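Because Lemma 3.2 holds for any conditionally symmetric sequence, it can be verified exactly for i.i.d. Rademacher \(d_{i}\) (where \(\sum_{i=1}^{n}d_{i}^{2}=n\)) by enumerating the binomial distribution of \(\sum_{i=1}^{n}d_{i}\); the parameters n and y below are illustrative.

```python
import math

# Exact evaluation of the expectation in Lemma 3.2 for i.i.d. Rademacher
# d_i: sum(d_i) = n - 2k with probability C(n, k) / 2^n, and sum(d_i^2) = n.
n, y = 20, 1.0

expectation = 0.0
for k in range(n + 1):
    S = n - 2 * k                       # value of sum(d_i)
    p = math.comb(n, k) / 2 ** n        # its probability
    expectation += p * (y / math.sqrt(n + y ** 2)) \
        * math.exp(S ** 2 / (2 * (n + y ** 2)))
```

The enumeration is exact (no Monte Carlo error), so the resulting value must not exceed 1.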

The following lemma is the change of measure inequality [9].

Lemma 3.3

For any measurable function \(\phi(h)\) on \(\mathbb{H}\) and any distributions \(\mu(h)\) and \(\rho(h)\) on \(\mathbb{H}\), we have

$$\mathbb{E}_{\rho(h)}\bigl(\phi(h)\bigr)\le \operatorname{KL}(\rho\parallel \mu)+\ln\mathbb {E}_{\mu(h)}\bigl(e^{\phi(h)}\bigr). $$
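On a finite hypothesis space the change of measure inequality can be checked directly; moreover, equality is attained by the Gibbs distribution \(\rho(h)\propto\mu(h)e^{\phi(h)}\). A sketch with randomly generated discrete distributions (the size K and the distributions are arbitrary):

```python
import numpy as np

# Direct check of Lemma 3.3 for discrete distributions; equality holds
# for the Gibbs posterior rho(h) proportional to mu(h) * exp(phi(h)).
rng = np.random.default_rng(5)

K = 10
phi = rng.standard_normal(K)
mu = rng.random(K); mu /= mu.sum()
rho = rng.random(K); rho /= rho.sum()

lhs = rho @ phi
kl = np.sum(rho * np.log(rho / mu))
rhs = kl + np.log(mu @ np.exp(phi))

# The Gibbs posterior attains equality.
gibbs = mu * np.exp(phi); gibbs /= gibbs.sum()
lhs_g = gibbs @ phi
rhs_g = np.sum(gibbs * np.log(gibbs / mu)) + np.log(mu @ np.exp(phi))
```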

Proof of Theorem 2.2

Taking

$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h), \quad \lambda>0, $$

then from Lemma 3.1 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2}V_{n}( \rho_{n}) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\lambda M_{n}(h)- \frac{\lambda^{2}}{2}V_{n}(h) \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$

where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to \(-M_{n}(h)\), we obtain the result that, with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho _{n}\parallel\mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$

 □

Proof of Theorem 2.3

For all \(y>0\), taking

$$\phi(h)=\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})}, $$

then from Lemma 3.2 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

 □

In order to prove Theorem 2.4, we need to introduce the following lemma.

Lemma 3.4

Let \((M_{n})\) be a locally square integrable martingale. Putting

$$\langle M \rangle_{n}=\sum_{k=1}^{n} \mathbb{E}\bigl[(\Delta M_{k})^{2}|\mathcal {F}_{k-1}\bigr]\quad \textit{and} \quad [M]_{n}=\sum _{k=1}^{n}(\Delta M_{k})^{2}, $$

for all \(t\in \mathbb{R}\) and \(n\geq0\), define

$$G_{n}(t)=\exp \biggl(tM_{n}-\frac{t^{2}}{6}[M]_{n}- \frac{t^{2}}{3} \langle M\rangle_{n} \biggr). $$

Then, for all \(t\in\mathbb{R}\), \((G_{n}(t))\) is a positive supermartingale with \(\mathbb{E}[G_{n}(t)]\leq1\).

Proof

Let X be a square integrable random variable with \(\mathbb{E}X=0\) and \(0<\sigma^{2}:=\mathbb{E}X^{2}<\infty\). Because of the basic inequality

$$\exp \biggl(x- \frac{x^{2}}{6} \biggr) \leq1 + x+ \frac{x^{2}}{3},\quad x \in\mathbb{R}, $$

we know

$$ \mathbb{E} \biggl[\exp \biggl(t X- \frac {t^{2}}{6} X^{2} \biggr) \biggr]\leq1+\frac{t^{2}}{3}\sigma^{2}. $$
(3.1)

Note that, for all \(t \in\mathbb{R}\) and \(n \geq1\), by the definition of \(G_{n}(t)\),

$$G_{n}(t)=G_{n-1}(t)\exp \biggl(t \Delta M_{n}- \frac{t^{2}}{6}\Delta[M]_{n}-\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr). $$

Hence, applying (3.1) conditionally on \(\mathcal{F}_{n-1}\) with \(X=\Delta M_{n}\), we deduce that, for all \(t \in\mathbb{R}\),

$$\begin{aligned} \mathbb{E}\bigl[G_{n}(t)|\mathcal{F}_{n-1}\bigr] \leq&G_{n-1}(t)\exp \biggl(-\frac{t^{2}}{3}\Delta\langle M \rangle_{n} \biggr)\cdot \biggl(1+\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr) \\ \leq& G_{n-1}(t), \end{aligned}$$

where the last step uses \((1+u)e^{-u}\leq1\) for \(u\geq0\).

As a result, for all \(t\in\mathbb{R}\), \(G_{n}(t)\) is a positive supermartingale, i.e., for all \(n\geq1\), \(\mathbb {E}[G_{n}(t)]\leq \mathbb{E}[G_{n-1}(t)]\), which implies that \(\mathbb{E}[G_{n}(t)]\leq \mathbb{E}[G_{0}(t)]=1\). □
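Both ingredients of the proof can be checked numerically: the scalar inequality on a dense grid, and \(\mathbb{E}[G_{n}(t)]\le1\) by Monte Carlo for a Gaussian martingale with unit increments (so \(\langle M\rangle_{n}=n\)). The parameters t, n, and the grid range are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# 1) The basic inequality exp(x - x^2/6) <= 1 + x + x^2/3 on a dense grid;
#    the two sides agree to third order at x = 0, so the gap is ~x^4/36 there.
x = np.linspace(-20.0, 20.0, 100001)
gap = (1 + x + x ** 2 / 3) - np.exp(x - x ** 2 / 6)

# 2) Monte Carlo check that E[G_n(t)] <= 1 for standard normal increments.
t, n, N = 0.5, 20, 50000
Z = rng.standard_normal((N, n))
M = Z.sum(axis=1)                    # M_n
bracket = (Z ** 2).sum(axis=1)       # [M]_n
angle = float(n)                     # <M>_n = n for unit conditional variances
G = np.exp(t * M - t ** 2 / 6 * bracket - t ** 2 / 3 * angle)
est = G.mean()
```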

Proof of Theorem 2.4

Taking

$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+ \frac {2\langle M \rangle_{n}(h)}{3} \biggr),\quad \lambda>0, $$

then from Lemma 3.3 and Lemma 3.4, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2} \biggl( \frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl[\lambda M_{n}(h)- \frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} \biggr) \biggr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl[\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$

where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to martingales \(-M_{n}(h)\), we obtain the result that, with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$

 □

Proof of Theorem 2.5

For all \(y>0\), taking

$$\phi(h)=\ln\frac{y}{\sqrt{ (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )}, $$

and writing \(V_{n}(h)=\frac{[M]_{n}(h)}{3} +\frac{2\langle M\rangle_{n}(h)}{3}\), from the analog of Lemma 3.2 (with Lemma 3.1 replaced by Lemma 3.4 in its proof) and from Lemma 3.3, we obtain, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac {\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 (V_{n}(h)+y^{2} )} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

 □