1 Introduction

In this paper, we establish several PAC-Bayesian inequalities. PAC is an abbreviation for the Probably Approximately Correct learning model; PAC-Bayesian analysis was introduced over a decade ago (Shawe-Taylor and Williamson [1]; Shawe-Taylor et al. [2]) and has contributed significantly to the analysis and development of supervised learning methods, the stochastic multiarmed bandit problem, and other areas. PAC-Bayesian analysis provides high-probability bounds on the deviation of weighted averages of empirical means of sets of independent random variables from their expectations, and it supplies generalization guarantees for many influential machine learning algorithms.

Shawe-Taylor and Williamson [1] established the first PAC-Bayesian learning theorems. They showed that if one can find a ball of sufficient volume in a parameterized concept space, then the center of that ball has a low error rate. The application of PAC-Bayesian analysis to non-i.i.d. data was partially addressed only recently by Ralaivola et al. [3] and Lever et al. [4]. For martingales, Seldin et al. [5] exploited the martingale structure to obtain a PAC-Bayes-Bernstein inequality and applied it to multiarmed bandits. Seldin et al. [6] presented a generalization of the PAC-Bayesian analysis that makes it possible to consider model order selection simultaneously with the exploration-exploitation trade-off.

In the present paper, we continue the study of PAC-Bayesian inequalities for certain sequences of random variables. We concentrate on conditionally symmetric random variables and on locally square integrable martingales. In the first case, we assume only the conditional symmetry of the sequence, with no other dependence or moment conditions. In the second case, the boundedness condition of Seldin et al. [6] is weakened. The paper is organized as follows: in Section 2 we state the main results and make some remarks; proofs of the main results are provided in Section 3.

2 Main results

In this section, we discuss the PAC-Bayesian inequalities for conditionally symmetric random variables and martingales. In order to present our main theorems, we give a few definitions. Let \(\mathbb{H}\) be an index (or a hypothesis) space, possibly uncountably infinite. Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\} \) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{n}\}\), where \(\mathcal{F}_{n}=\sigma\{ X_{k}(h): 1\le k\le n \text{ and } h\in\mathbb{H}\}\) is the σ-field generated by the observations up to time n (the history).

Seldin et al. [6] obtained a PAC-Bayes-Bernstein inequality for martingales with bounded jumps.

Theorem 2.1

(Seldin et al. [6])

Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\}\) be a set of martingale difference sequences adapted to an increasing sequence of σ-fields \(\mathcal{F}_{n}=\sigma\{X_{k}(h): 1\le k\le n \textit{ and } h\in\mathbb{H}\}\). Furthermore, let \(M_{n}(h)=\sum_{k=1}^{n}X_{k}(h)\) be martingales corresponding to the martingale difference sequences and let \(V_{n}(h)=\sum_{k=1}^{n}\mathbb{E}[X_{k}(h)^{2}|\mathcal{F}_{k-1}]\) be cumulative variances of the martingales. For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho(h)}[V_{n}(h)]\). Let \(\{C_{1}, C_{2}, \ldots\}\) be an increasing sequence set in advance, such that \(|X_{k}(h)|\le C_{k}\) for all h with probability 1. Let \(\{\mu_{1},\mu _{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Let \(\{\lambda_{1},\lambda_{2},\ldots\}\) be a sequence of positive numbers set in advance that satisfy \(\lambda_{k}\le C_{k}^{-1}\). Then for all possible distributions \(\rho_{n}\) over \(\mathbb {H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le(e-2) \lambda_{n}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel\mu _{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda_{n}}, $$

where \(\operatorname{KL}(\mu\parallel\nu)\) is the KL-divergence (relative entropy) between two distributions μ and ν.
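As a quick numerical illustration (not part of the theorem), the bound of Theorem 2.1 can be checked by Monte Carlo on a toy finite hypothesis space; the hypothesis-space size, horizon, and constants below are all arbitrary illustrative choices.

```python
import numpy as np

# Illustrative sanity check of the PAC-Bayes-Bernstein bound on a toy
# finite hypothesis space; the constants below are arbitrary choices.
rng = np.random.default_rng(0)

H, n, delta = 5, 200, 0.05
C = 1.0                     # uniform bound: |X_k(h)| <= C_k = C for all k
lam = 0.5 / C               # lambda_n <= C_n^{-1}

# Bounded martingale differences: independent uniforms on [-C, C].
X = rng.uniform(-C, C, size=(n, H))

M_n = X.sum(axis=0)                     # M_n(h)
V_n = np.full(H, n * C**2 / 3.0)        # E[X_k(h)^2 | F_{k-1}] = C^2 / 3

rho = np.full(H, 1.0 / H)               # take rho_n = mu_n (uniform), so
KL = 0.0                                # KL(rho_n || mu_n) = 0

lhs = abs(rho @ M_n)
rhs = (np.e - 2) * lam * (rho @ V_n) \
    + (KL + 2 * np.log(n + 1) + np.log(2 / delta)) / lam
```

On this toy instance the right-hand side is dominated by the \(\ln\frac{2}{\delta}\) and \(\ln(n+1)\) terms, so the bound holds with a large margin.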

Seldin et al. [6] (see Theorem 2.1) considered bounded martingale difference sequences, where the parameters \((\lambda_{k})\) depend on the bounds through \(\lambda_{k}\le C_{k}^{-1}\). Since \((C_{k})\) is an increasing sequence, \((\lambda_{k})\) must be decreasing. Furthermore, Seldin et al. [6] studied the deviation properties between a martingale and its conditional variance. Building on these works, in the present paper we consider conditionally symmetric random variables and locally square integrable martingales. For conditionally symmetric random variables, we establish a PAC-Bayesian inequality without any dependence assumptions (for example, independence or the martingale property) and without any moment assumptions on the sequences. For locally square integrable martingales, we remove the boundedness restriction.

2.1 Conditionally symmetric random variables

Assume that \(\{X_{k}(h): k\ge1, h\in\mathbb{H}\}\) are conditionally symmetric with respect to \((\mathcal{F}_{n})\) (i.e., \(\mathcal{L}(X_{i}(h)|\mathcal{F}_{i-1})=\mathcal {L}(-X_{i}(h)|\mathcal{F}_{i-1})\)). Let

$$M_{n}(h)=\sum_{k=1}^{n}X_{k}(h) \quad \text{and}\quad V_{n}(h)=\sum_{k=1}^{n} \bigl[X_{k}(h)^{2}\bigr]. $$

For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho (h)}[V_{n}(h)]\). We establish the PAC-Bayesian inequality between the partial sums and the total quadratic variation of the partial sums.

Theorem 2.2

Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel \mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$
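Since Theorem 2.2 requires only conditional symmetry, it applies even to heavy-tailed data with no finite moments. The following sketch (with arbitrary illustrative constants) checks the bound empirically for i.i.d. standard Cauchy variables; over repeated trials the empirical failure rate should not exceed δ.

```python
import numpy as np

# Empirical check of Theorem 2.2 with symmetric heavy-tailed (Cauchy)
# variables, which have no finite moments; constants are illustrative.
rng = np.random.default_rng(1)

H, n, delta, lam = 4, 100, 0.05, 0.2
trials, fails = 200, 0

for _ in range(trials):
    X = rng.standard_cauchy(size=(n, H))    # symmetric, hence conditionally symmetric
    M_n = X.sum(axis=0)
    V_n = (X ** 2).sum(axis=0)              # total quadratic variation
    rho = np.full(H, 1.0 / H)               # rho_n = mu_n, so KL = 0
    lhs = abs(rho @ M_n)
    rhs = 0.5 * lam * (rho @ V_n) \
        + (2 * np.log(n + 1) + np.log(2 / delta)) / lam
    fails += lhs > rhs

frac_fail = fails / trials
```

The self-normalizing term \(V_{n}(\rho_{n})\) grows together with any large fluctuation of \(M_{n}(\rho_{n})\), which is why the bound survives the absence of moments.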

The following theorem gives an inequality in the sense of a self-normalized sequence.

Theorem 2.3

Under the assumptions of Theorem 2.2, for all \(y>0\) and all n simultaneously, with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln(n+1)+\ln \frac{2}{\delta}. \end{aligned}$$
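A hypothetical single-sample check of the self-normalized bound of Theorem 2.3 along the same lines, again with standard Cauchy variables and \(\rho_{n}=\mu_{n}\) uniform (so the KL term vanishes); all constants are illustrative.

```python
import numpy as np

# Illustrative check of Theorem 2.3; the logarithmic term is nonpositive,
# so the left-hand side typically sits far below the right-hand side.
rng = np.random.default_rng(2)

H, n, delta, y = 4, 100, 0.05, 1.0
X = rng.standard_cauchy(size=(n, H))
M_n = X.sum(axis=0)
V_n = (X ** 2).sum(axis=0)
rho = np.full(H, 1.0 / H)                 # rho_n = mu_n, so KL = 0

inner = np.log(y / np.sqrt(V_n + y ** 2)) + M_n ** 2 / (2 * (V_n + y ** 2))
lhs = rho @ inner
rhs = 2 * np.log(n + 1) + np.log(2 / delta)
```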

2.2 Martingales

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to the filtration \((\mathcal{F}_{n})\) with \(M_{0}=0\). The predictable quadratic variation and the total quadratic variation of \((M_{n}(h))\) are, respectively, given by

$$\langle M \rangle_{n}(h)=\sum_{k=1}^{n} \mathbb{E}\bigl[\bigl(\Delta M_{k}(h)\bigr)^{2}|\mathcal {F}_{k-1}\bigr]\quad \text{and}\quad [M]_{n}(h)=\sum _{k=1}^{n}\bigl(\Delta M_{k}(h) \bigr)^{2}, $$

where \(\Delta M_{n}(h)=M_{n}(h)-M_{n-1}(h)\). For a distribution ρ over \(\mathbb{H}\), define

$$M_{n}(\rho) =\mathbb{E}_{\rho(h)}\bigl[M_{n}(h)\bigr]. $$
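For concreteness, the two quadratic variations can be computed as follows for a single hypothesis, taking \(\Delta M_{k}=a_{k}Z_{k}\) with an \(\mathcal{F}_{k-1}\)-measurable scale \(a_{k}\) and \(Z_{k}\) standard normal, so that \(\mathbb{E}[(\Delta M_{k})^{2}|\mathcal{F}_{k-1}]=a_{k}^{2}\); this particular model is a hypothetical construction for illustration only.

```python
import numpy as np

# Predictable vs. total quadratic variation along one sample path of a
# locally square integrable martingale; the model is an illustrative choice.
rng = np.random.default_rng(3)

n = 10
Z = rng.standard_normal(n)
a = np.empty(n)
M = np.zeros(n + 1)
for k in range(1, n + 1):
    a[k - 1] = 1.0 + 0.5 * abs(M[k - 1])   # predictable: uses the past only
    M[k] = M[k - 1] + a[k - 1] * Z[k - 1]

dM = np.diff(M)
total_qv = np.cumsum(dM ** 2)   # [M]_n, total quadratic variation
pred_qv = np.cumsum(a ** 2)     # <M>_n, since E[(a_k Z_k)^2 | F_{k-1}] = a_k^2
```

Note that \(\langle M\rangle_{n}\) is computable from the model (it is predictable), while \([M]_{n}\) is computed from the observed increments.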

Theorem 2.4

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to a filtration \(\mathbb{F}=(\mathcal{F}_{n})\). Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but it can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda}, \quad \lambda>0. $$
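A sketch of an empirical check of Theorem 2.4 for a Gaussian martingale with unit conditional variances (so \(\langle M\rangle_{n}=n\)); the hypothesis-space size and constants are illustrative choices.

```python
import numpy as np

# Illustrative check of Theorem 2.4 for martingales with standard normal
# increments; rho_n = mu_n uniform, so the KL term vanishes.
rng = np.random.default_rng(4)

H, n, delta, lam = 4, 100, 0.05, 0.2
Z = rng.standard_normal(size=(n, H))     # Delta M_k(h)

M_n = Z.sum(axis=0)
bracket = (Z ** 2).sum(axis=0)           # [M]_n(h)
angle = np.full(H, float(n))             # <M>_n(h) = n

rho = np.full(H, 1.0 / H)
mix = rho @ (bracket / 3 + 2 * angle / 3)

lhs = abs(rho @ M_n)
rhs = 0.5 * lam * mix + (2 * np.log(n + 1) + np.log(2 / delta)) / lam
```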

Theorem 2.5

Under the assumptions of Theorem 2.4, for all \(y>0\) and all n simultaneously, with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel \mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

3 The proofs of main results

Before giving the proofs of our results, we state the following basic lemmas.

Lemma 3.1

[7]

Let \(\{d_{i}\}\) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{i}\}\). Assume that the \(d_{i}\) are conditionally symmetric. Then

$$\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2}\sum_{i=1}^{n}d_{i}^{2} \Biggr),\quad n\geq1, $$

is a supermartingale with mean ≤1, for all \(\lambda\in\mathbb{R}\).

Lemma 3.2

Under the assumptions of Lemma  3.1, for any \(y>0\), we have

$$\mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr)\leq1. $$

Remark 3.1

Hitczenko [8] proved the above inequality for conditionally symmetric martingale difference sequences, and De la Peña [7] obtained the above inequality without the martingale difference assumption, hence without any integrability assumptions. Note that any sequence of real-valued random variables \(X_{i}\) can be ‘symmetrized’ to produce an exponential supermartingale by introducing random variables \(X_{i}'\) such that

$$\mathcal{L}\bigl(X_{n}'|X_{1},X_{1}', \ldots,X_{n-1},X_{n-1}',X_{n}\bigr) = \mathcal{L}\bigl(X_{n}|X_{1}, \ldots,X_{n-1}\bigr) $$

and we set \(d_{n}=X_{n}-X_{n}'\).

Proof

By using Fubini’s theorem, we have

$$\begin{aligned}& \mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \quad = \mathbb{E}\biggl[\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \qquad {}\times\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{ \sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda\biggr] \\& \quad = \int_{-\infty}^{\infty} \mathbb{E} \Biggl[ \frac{y}{\sqrt{2\pi}}\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2} \sum_{i=1}^{n}d_{i}^{2} \Biggr)\exp \biggl(-\frac{\lambda ^{2}y^{2}}{2} \biggr) \Biggr]\, d\lambda \\& \quad \leq \int_{-\infty}^{\infty} \biggl[\frac{y}{\sqrt{2\pi}} \exp \biggl(-\frac{\lambda^{2}y^{2}}{2} \biggr) \biggr]\, d\lambda=1, \end{aligned}$$

where we used the fact

$$\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda=1. $$

 □
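Because Lemma 3.2 holds for any conditionally symmetric sequence, it can be verified exactly for i.i.d. Rademacher \(d_{i}\) (where \(\sum_{i=1}^{n}d_{i}^{2}=n\)) by enumerating the binomial distribution of \(\sum_{i=1}^{n}d_{i}\); the parameters n and y below are illustrative.

```python
import math

# Exact evaluation of the expectation in Lemma 3.2 for i.i.d. Rademacher
# d_i: sum(d_i) = n - 2k with probability C(n, k) / 2^n, and sum(d_i^2) = n.
n, y = 20, 1.0

expectation = 0.0
for k in range(n + 1):
    S = n - 2 * k                       # value of sum(d_i)
    p = math.comb(n, k) / 2 ** n        # its probability
    expectation += p * (y / math.sqrt(n + y ** 2)) \
        * math.exp(S ** 2 / (2 * (n + y ** 2)))
```

The enumeration is exact (no Monte Carlo error), so the resulting value must not exceed 1.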

The following lemma is the change of measure inequality [9].

Lemma 3.3

For any measurable function \(\phi(h)\) on \(\mathbb{H}\) and any distributions \(\mu(h)\) and \(\rho(h)\) on \(\mathbb{H}\), we have

$$\mathbb{E}_{\rho(h)}\bigl(\phi(h)\bigr)\le \operatorname{KL}(\rho\parallel \mu)+\ln\mathbb {E}_{\mu(h)}\bigl(e^{\phi(h)}\bigr). $$
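On a finite hypothesis space the change of measure inequality can be checked directly; moreover, equality is attained by the Gibbs distribution \(\rho(h)\propto\mu(h)e^{\phi(h)}\). A sketch with randomly generated discrete distributions (the size K and the distributions are arbitrary):

```python
import numpy as np

# Direct check of Lemma 3.3 for discrete distributions; equality holds
# for the Gibbs posterior rho(h) proportional to mu(h) * exp(phi(h)).
rng = np.random.default_rng(5)

K = 10
phi = rng.standard_normal(K)
mu = rng.random(K); mu /= mu.sum()
rho = rng.random(K); rho /= rho.sum()

lhs = rho @ phi
kl = np.sum(rho * np.log(rho / mu))
rhs = kl + np.log(mu @ np.exp(phi))

# The Gibbs posterior attains equality.
gibbs = mu * np.exp(phi); gibbs /= gibbs.sum()
lhs_g = gibbs @ phi
rhs_g = np.sum(gibbs * np.log(gibbs / mu)) + np.log(mu @ np.exp(phi))
```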

Proof of Theorem 2.2

Taking

$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h), \quad \lambda>0, $$

then from Lemma 3.1 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2}V_{n}( \rho_{n}) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\lambda M_{n}(h)- \frac{\lambda^{2}}{2}V_{n}(h) \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$

where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to \(-M_{n}(h)\), we obtain the result that, with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho _{n}\parallel\mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$

 □

Proof of Theorem 2.3

For all \(y>0\), taking

$$\phi(h)=\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})}, $$

then from Lemma 3.2 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

 □

In order to prove Theorem 2.4, we need to introduce the following lemma.

Lemma 3.4

Let \((M_{n})\) be a locally square integrable martingale. Putting

$$\langle M \rangle_{n}=\sum_{k=1}^{n} \mathbb{E}\bigl[(\Delta M_{k})^{2}|\mathcal {F}_{k-1}\bigr]\quad \textit{and} \quad [M]_{n}=\sum _{k=1}^{n}(\Delta M_{k})^{2}, $$

for all \(t\in \mathbb{R}\) and \(n\geq0\), define

$$G_{n}(t)=\exp \biggl(tM_{n}-\frac{t^{2}}{6}[M]_{n}- \frac{t^{2}}{3} \langle M\rangle_{n} \biggr). $$

Then, for all \(t\in\mathbb{R}\), \((G_{n}(t))\) is a positive supermartingale with \(\mathbb{E}[G_{n}(t)]\leq1\).

Proof

Let X be a square integrable random variable with \(\mathbb{E}X=0\) and \(0<\sigma^{2}:=\mathbb{E}X^{2}<\infty\). Because of the basic inequality

$$\exp \biggl(x- \frac{x^{2}}{6} \biggr) \leq1 + x+ \frac{x^{2}}{3},\quad x \in\mathbb{R}, $$

we know

$$ \mathbb{E} \biggl[\exp \biggl(t X- \frac {t^{2}}{6} X^{2} \biggr) \biggr]\leq1+\frac{t^{2}}{3}\sigma^{2}. $$
(3.1)

Note that, for all \(t \in\mathbb{R}\) and \(n \geq1\), by the definition of \(G_{n}(t)\),

$$G_{n}(t)=G_{n-1}(t)\exp \biggl(t \Delta M_{n}- \frac{t^{2}}{6}\Delta[M]_{n}-\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr). $$

Hence, applying (3.1) conditionally on \(\mathcal{F}_{n-1}\) with \(X=\Delta M_{n}\), we deduce that, for all \(t \in\mathbb{R}\),

$$\begin{aligned} \mathbb{E}\bigl[G_{n}(t)|\mathcal{F}_{n-1}\bigr] \leq&G_{n-1}(t)\exp \biggl(-\frac{t^{2}}{3}\Delta\langle M \rangle_{n} \biggr)\cdot \biggl(1+\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr) \\ \leq& G_{n-1}(t), \end{aligned}$$

where the last step uses \((1+u)e^{-u}\leq1\) for \(u\geq0\).

As a result, for all \(t\in\mathbb{R}\), \(G_{n}(t)\) is a positive supermartingale, i.e., for all \(n\geq1\), \(\mathbb {E}[G_{n}(t)]\leq \mathbb{E}[G_{n-1}(t)]\), which implies that \(\mathbb{E}[G_{n}(t)]\leq \mathbb{E}[G_{0}(t)]=1\). □
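Both ingredients of the proof can be checked numerically: the scalar inequality on a dense grid, and \(\mathbb{E}[G_{n}(t)]\le1\) by Monte Carlo for a Gaussian martingale with unit increments (so \(\langle M\rangle_{n}=n\)). The parameters t, n, and the grid range are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# 1) The basic inequality exp(x - x^2/6) <= 1 + x + x^2/3 on a dense grid;
#    the two sides agree to third order at x = 0, so the gap is ~x^4/36 there.
x = np.linspace(-20.0, 20.0, 100001)
gap = (1 + x + x ** 2 / 3) - np.exp(x - x ** 2 / 6)

# 2) Monte Carlo check that E[G_n(t)] <= 1 for standard normal increments.
t, n, N = 0.5, 20, 50000
Z = rng.standard_normal((N, n))
M = Z.sum(axis=1)                    # M_n
bracket = (Z ** 2).sum(axis=1)       # [M]_n
angle = float(n)                     # <M>_n = n for unit conditional variances
G = np.exp(t * M - t ** 2 / 6 * bracket - t ** 2 / 3 * angle)
est = G.mean()
```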

Proof of Theorem 2.4

Taking

$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+ \frac {2\langle M \rangle_{n}(h)}{3} \biggr),\quad \lambda>0, $$

then from Lemma 3.3 and Lemma 3.4, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),

$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2} \biggl( \frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl[\lambda M_{n}(h)- \frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} \biggr) \biggr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl[\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$

where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to martingales \(-M_{n}(h)\), we obtain the result that, with probability greater than \(1-\delta\),

$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$

 □

Proof of Theorem 2.5

For all \(y>0\), taking

$$\phi(h)=\ln\frac{y}{\sqrt{ (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )}, $$

and writing \(V_{n}(h)=\frac{[M]_{n}(h)}{3} +\frac{2\langle M\rangle_{n}(h)}{3}\), from the analog of Lemma 3.2 (with Lemma 3.1 replaced by Lemma 3.4 in its proof) and from Lemma 3.3, we obtain, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac {\delta}{2}\),

$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 (V_{n}(h)+y^{2} )} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

 □