1 Introduction

1.1 Objective and background

Given an underlying filtered probability space \((\Omega ,\mathcal {F},(\mathcal {F}_t)_{t\in [0,T]},P)\), we consider the following Ornstein–Uhlenbeck (OU) regression model

$$\begin{aligned} Y_t = Y_0 +\int _0^t (\mu \cdot X_s-\lambda Y_s)ds + \sigma J_t, \qquad t\in [0,T], \end{aligned}$$
(1.1)

where J is the symmetric \(\beta \)-stable (càdlàg) Lévy process characterized by

$$\begin{aligned} E[e^{iu J_t}] = \exp (-t|u|^{\beta }),\qquad t\ge 0,~u\in \mathbb {R}, \end{aligned}$$

and is independent of the initial variable \(Y_0\), and where \(X=(X_{t})_{t\in [0,T]}\) is an \(\mathbb {R}^{q}\)-valued non-random càdlàg function such that

$$\begin{aligned} \lambda _{\min }\left( \int _0^T X_t^{\otimes 2}dt \right) >0, \end{aligned}$$
(1.2)

with \(\lambda _{\min }(A)\) denoting the minimum eigenvalue of a square matrix A. Throughout, the terminal sampling time \(T>0\) is a fixed constant. Let

$$\begin{aligned} \theta := (\lambda ,\mu ,\beta ,\sigma ) \in \Theta , \end{aligned}$$

where \(\Theta \subset \mathbb {R}^{p}\) (\(p:=q+3\)) is a bounded convex domain such that its closure \(\overline{\Theta }\subset \mathbb {R}\times \mathbb {R}^q \times (0,2)\times (0,\infty )\). The primary objective of this paper is the asymptotically efficient estimation of \(\theta \) when the available data are \((X_t)_{t\in [0, T]}\) and \((Y_{t_{j}})_{j=1}^n\), where \(t_j=t_j^n:=jh\) with \(h=h_n:=T/n\); later on, we will consider cases where we observe \((X_{t_{j}})_{j=1}^n\) instead of the full continuous-time record. We will denote the true value of \(\theta \) by \(\theta _0=(\lambda _0,\mu _0,\beta _0,\sigma _0)\in \Theta \).

Analysis of the (time-homogeneous) OU process driven by a stable Lévy process goes back to Doob (1942), who treated the model in a genuinely analytic manner without Itô’s formula, which had not yet been published at that time. Nowadays, OU models are used in a wide variety of applications, such as electricity consumption modeling (Perninge et al. 2011; Borovkova and Schmeck 2017; Verdejo et al. 2019), ecology (Jhwueng and Maroulas 2014), and protein dynamics modeling (Challis and Schmidler 2012), to mention just a few.

The model (1.1) may be seen as a continuous-time counterpart of the simple first-order ARX (autoregressive exogenous) model. Nevertheless, a proper efficient-estimation result has been missing from the literature, probably due to the lack of a background theory for estimating all the parameters involved under bounded-domain infill asymptotics. Let us note that, when J is a Wiener process (\(\beta =2\)), the drift parameters are consistently estimable only when the terminal sampling time tends to infinity, and the associated statistical experiments are known to possess essentially different properties according to the sign of \(\lambda \). That is to say, the model is: locally asymptotically normal for \(\lambda >0\) (ergodic case); locally asymptotically Brownian functional for \(\lambda =0\) (unit-root case); locally asymptotically mixed normal (LAMN) for \(\lambda <0\) (non-ergodic (explosive) case). Turning back to the stable-driven case, we should note that the least-squares type estimator would not work unless the terminal sampling time \(T_n\rightarrow \infty \), as is expected from Hu and Long (2009) and Zhang and Zhang (2013); there, the authors proved that (when \(\beta \) is known) the rate of convergence when \(\lambda >0\) equals \((T_n/\log n)^{1/\beta }\) and that the asymptotic distribution is given by a ratio of two independent stable distributions.

1.2 Contributions in brief

First, in Sect. 2, we will show that the model is locally asymptotically mixed normal (LAMN) at \(\theta _0\in \Theta \), and also that the likelihood equation has a root that is asymptotically efficient in the classical sense of Hájek–Le Cam–Jeganathan. The asymptotic results presented here are valid uniformly over any compact subset of the parameter space \(\Theta \), in a single unified manner. In particular, the sign of the autoregressive parameter \(\lambda _0\) does not matter, revealing that (i) the results can be described in a unified manner regardless of whether the model is ergodic or not, and also that (ii) the conventional unit-root problem (see Samarakoon and Knight (2009) and the references therein) is not relevant here at all; this is in sharp contrast to the case of ARX time-series models and the Gaussian OU models. Besides, in Sect. 3, we will show how to construct an asymptotically efficient estimator from a suboptimal, yet very simple, preliminary estimator, which enables us to bypass not only the computationally demanding numerical optimization of the likelihood function involving the \(\beta \)-stable density, but also the possible multiple-root problem (Lehmann 1999, Section 7.3).

2 Local likelihood asymptotics

2.1 Preliminaries and result

Let \(P_\theta \) denote the image measure of \((J,Y)\) associated with the value \(\theta \in \Theta \). We will show a non-trivial stochastic expansion of the log-likelihood ratio of \(P_{\theta +\varphi _{n}(\theta )v_n,n}^Y\) with respect to \(P_{\theta ,n}^Y\) for an appropriate norming matrix \(\varphi _n(\theta )\) introduced later and a bounded sequence \((v_n)\subset \mathbb {R}^p\), where \(P_{\theta ,n}^Y\) stands for the restriction of \(P_\theta \) to \(\sigma (Y_{t_j}:\,j\le n)\). The distribution \(\mathcal {L}(Y_0)\) may vary according to \(\theta \); we will assume that for any \(\epsilon >0,\) there exists an \(M>0\) such that \(\sup _{\theta \in \overline{\Theta }} P_\theta [|Y_0|\ge M]<\epsilon \).

Let \(\phi _\beta \) denote the \(\beta \)-stable density of \(J_1\): \(P_\theta [J_1\in dy]=\phi _\beta (y)dy\). It is known that \(\phi _{\beta }(y)>0\) for each \(y\in \mathbb {R}\), that \(\phi _{\beta }\) is smooth in \((y,\beta )\in \mathbb {R}\times (0,2)\), and that for each \(k,l\in \mathbb {Z}_{+}\),

$$\begin{aligned} \limsup _{|y|\rightarrow \infty } \frac{|y|^{\beta +1+k}}{\log ^{l}(1+|y|)} \big |\partial ^{k}\partial _{\beta }^{l}\phi _{\beta }(y)\big | < \infty . \end{aligned}$$
(2.1)

See DuMouchel (1973) for details. Here, we write \(\partial ^{k}\partial _{\beta }^{l}\phi _{\beta }(y):=(\partial ^{k}/\partial y^k)(\partial ^l/\partial \beta ^{l}) \phi _{\beta }(y)\); analogous notation for the partial derivatives will be used in the sequel.

To proceed, we need to introduce further notation. Any asymptotics will be taken as \(n\rightarrow \infty \) unless otherwise mentioned. We denote by \(\rightarrow _{u}\) the uniform convergence of non-random quantities with respect to \(\theta \) over \(\overline{\Theta }\). We write C for a positive universal constant which may vary at each appearance, and \(a_{n}\lesssim b_{n}\) when \(a_{n}\le C b_{n}\) for every n large enough. Given positive functions \(a_{n}(\theta )\) and \(b_{n}(\theta )\), we write \(b_n(\theta )=o_u(a_n)\) and \(b_n(\theta )=O_u(a_n)\) if \(a_n^{-1}b_n(\theta )\rightarrow _u 0\) and \(\sup _\theta |a_n^{-1}b_n(\theta )| =O(1)\), respectively. The symbol \(a_{n}(\theta ) \lesssim _{u} b_{n}(\theta )\) means that \(\sup _{\theta }|a_{n}(\theta )/b_{n}(\theta )| \lesssim 1\). We write \(\int _j\) instead of \(\int _{t_{j-1}}^{t_j}\).

By integration by parts applied to the process \(t\mapsto e^{\lambda t}Y_t\), we obtain the explicit càdlàg solution process: under \(P_\theta \),

$$\begin{aligned} Y_t = e^{-\lambda (t-s)}Y_s + \mu \cdot \int _s^t e^{-\lambda (t -u)} X_{u} du + \sigma \int _s^t e^{-\lambda (t -u)}dJ_u,\qquad t>s. \end{aligned}$$
(2.2)

For \(x,\lambda \in \mathbb {R}\), we write

$$\begin{aligned} \eta (x)=\frac{1}{x}(1-e^{-x}), \qquad \zeta _{j}(\lambda ) = \frac{1}{h}\int _{j} e^{-\lambda (t_{j}-s)} X_s ds. \end{aligned}$$

The basic property of the Lévy integral and the fact that \(\log E_\theta [e^{iu J_1}]=-|u|^\beta \) give

$$\begin{aligned} \log E_\theta \left[ \exp \left( iu\,\sigma \int _{j} e^{-\lambda (t_j -s)}dJ_s \right) \right]&= { \int _j \log E_\theta \left[ \exp \left( iu e^{-\lambda (t_j -s)} \sigma J_1\right) \right] ds } \nonumber \\&= - |\sigma u|^\beta \int _{j} e^{-\lambda \beta (t_j -s)}ds \nonumber \\&= - \big |\sigma h^{1/\beta } \eta (\lambda \beta h)^{1/\beta } u \big |^\beta . \nonumber \end{aligned}$$

Hence,

$$\begin{aligned} \epsilon _{j}(\theta ) := \frac{Y_{t_{j}} - e^{-\lambda h}Y_{t_{j-1}} - \mu \cdot \zeta _{j}(\lambda )h}{\sigma h^{1/\beta }\eta (\lambda \beta h)^{1/\beta }} ~{\mathop {\sim }\limits ^{P_{\theta }}}~\text {i.i.d.}~\mathcal {L}(J_1). \end{aligned}$$
(2.3)

Now, the exact log-likelihood function \(\ell _n(\theta )=\ell _n\left( \theta ;\,(X_t)_{t\in [0,T]},(Y_{t_j})_{j=0}^n\right) \) is given by

$$\begin{aligned} \ell _n(\theta )&=\sum _{j=1}^{n}\log \left( \frac{1}{\sigma h^{1/\beta }\eta (\lambda \beta h)^{1/\beta }}\phi _{\beta } \left( \epsilon _{j}(\theta ) \right) \right) \nonumber \\&=\sum _{j=1}^{n}\left( -\log \sigma +\frac{1}{\beta }\log (1/h) {- \frac{1}{\beta }\log \eta (\lambda \beta h)} + \log \phi _{\beta } \left( \epsilon _{j}(\theta ) \right) \right) . \end{aligned}$$
(2.4)
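For concreteness, here is a minimal numerical sketch of the residuals (2.3) and the log-likelihood (2.4); it is an illustration only, not part of the estimation procedure studied below. Two assumptions are made that are not taken from the paper: scipy's levy_stable with skewness 0 and unit scale is assumed to match the parametrization \(E[e^{iuJ_1}]=e^{-|u|^{\beta }}\), and \(\zeta _{j}(\lambda )\) is approximated by a Riemann sum over a fine grid of X, since its exact value requires the continuous record \((X_t)_{t\in [0,T]}\). All function and variable names here (exact_loglik, zeta, X_fine, and so on) are illustrative.

```python
# A minimal numerical sketch of the residuals (2.3) and the exact log-likelihood (2.4).
# Assumptions (not from the paper): scipy's levy_stable with skewness 0 and unit scale
# has characteristic function exp(-|u|^beta); zeta_j(lambda) is approximated by a
# Riemann sum over a fine grid of X.
import numpy as np
from scipy.stats import levy_stable


def eta(x):
    """eta(x) = (1 - e^{-x}) / x, with the limiting value 1 at x = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1e-12, 1.0, -np.expm1(-x) / np.where(x == 0.0, 1.0, x))


def zeta(lam, X_fine, t_fine, t_grid):
    """Riemann-sum approximation of zeta_j(lam) = h^{-1} int_j exp(-lam(t_j - s)) X_s ds."""
    h, dt = t_grid[1] - t_grid[0], t_fine[1] - t_fine[0]
    out = []
    for j in range(1, len(t_grid)):
        mask = (t_fine >= t_grid[j - 1]) & (t_fine < t_grid[j])
        w = np.exp(-lam * (t_grid[j] - t_fine[mask]))
        out.append((w[:, None] * X_fine[mask]).sum(axis=0) * dt / h)
    return np.array(out)                                   # shape (n, q)


def exact_loglik(theta, Y, X_fine, t_fine, t_grid):
    """Exact log-likelihood (2.4) at theta = (lam, mu_1, ..., mu_q, beta, sigma)."""
    theta = np.asarray(theta, dtype=float)
    lam, mu, beta, sigma = theta[0], theta[1:-2], theta[-2], theta[-1]
    h = t_grid[1] - t_grid[0]
    scale = sigma * h ** (1.0 / beta) * float(eta(lam * beta * h)) ** (1.0 / beta)
    # residuals (2.3), i.i.d. standard beta-stable under the true parameter
    eps = (Y[1:] - np.exp(-lam * h) * Y[:-1]
           - (zeta(lam, X_fine, t_fine, t_grid) @ mu) * h) / scale
    # alpha = beta (stability index), skewness = 0
    return np.sum(levy_stable.logpdf(eps, beta, 0.0)) - len(eps) * np.log(scale)
```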

We introduce the non-random \(p\times p\)-matrix

$$\begin{aligned} \varphi _n=\varphi _{n}(\theta ) := \textrm{diag}\left( \frac{1}{\sqrt{n}h^{1-1/\beta }}\,I_{1+q},~ \frac{1}{\sqrt{n}} \begin{pmatrix} \varphi _{11,n}(\theta ) &{} \varphi _{12,n}(\theta ) \\ \varphi _{21,n}(\theta ) &{} \varphi _{22,n}(\theta ) \\ \end{pmatrix} \right) , \end{aligned}$$
(2.5)

where the real entries \(\varphi _{kl,n}=\varphi _{kl,n}(\theta )\) are assumed to be continuously differentiable in \(\theta \in \Theta \) and to satisfy the following conditions for some finite values \(\overline{\varphi }_{kl}=\overline{\varphi }_{kl}(\theta )\):

$$\begin{aligned} \left\{ \begin{array}{l} \varphi _{11,n}(\theta ) \rightarrow _{u} \overline{\varphi }_{11}(\theta ), \\ \varphi _{12,n}(\theta ) \rightarrow _{u} \overline{\varphi }_{12}(\theta ), \\ s_{21,n}(\theta ):=\beta ^{-2}\log (1/h_n)\varphi _{11,n}(\theta ) + \sigma ^{-1}\varphi _{21,n}(\theta ) \rightarrow _{u} \overline{\varphi }_{21}(\theta ), \\ s_{22,n}(\theta ):=\beta ^{-2}\log (1/h_n)\varphi _{12,n}(\theta ) + \sigma ^{-1}\varphi _{22,n}(\theta ) \rightarrow _{u} \overline{\varphi }_{22}(\theta ), \\ \displaystyle {\inf _{\theta }|\overline{\varphi }_{11}(\theta )\overline{\varphi }_{22}(\theta ) - \overline{\varphi }_{12}(\theta )\overline{\varphi }_{21}(\theta )|>0,}\\ \displaystyle {\max _{(k,l)} \left| \partial _{\theta } \varphi _{kl,n}(\theta )\right| \lesssim _u \log ^2(1/h)}. \end{array} \right. \end{aligned}$$
(2.6)

The matrix \(\varphi _n(\theta )\) will turn out to be the right norming with which \(u \mapsto \ell _{n}\left( \theta +\varphi _{n}(\theta )u\right) - \ell _{n}\left( \theta \right) \) under \(P_\theta \) has an asymptotically quadratic structure in \(\mathbb {R}^{p}\); see Brouste and Masuda (2018) and Clément and Gloter (2020) for the related previous studies. Note that \(\sqrt{n}h_n^{1-1/\beta }\rightarrow _{u}\infty \) and \(|\varphi _{21,n}(\theta )|\vee |\varphi _{22,n}(\theta )| \lesssim \log (1/h)\). By the same reasoning as in Brouste and Masuda (2018, page 292), we have \(\inf _{\theta }|\varphi _{11,n}(\theta )\varphi _{22,n}(\theta ) - \varphi _{12,n}(\theta )\varphi _{21,n}(\theta )| \gtrsim 1\) and \(|\varphi _{n}(\theta )| \rightarrow _{u} 0\) under (2.6).
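For concreteness, one admissible choice satisfying (2.6) (given here only for illustration, and not claimed to be the choice made in the cited references) is

$$\begin{aligned} \varphi _{11,n}(\theta )\equiv 1, \quad \varphi _{12,n}(\theta )\equiv 0, \quad \varphi _{21,n}(\theta )=-\sigma \beta ^{-2}\log (1/h_n), \quad \varphi _{22,n}(\theta )=\sigma , \end{aligned}$$

for which \(s_{21,n}\equiv 0\) and \(s_{22,n}\equiv 1\), so that \(\overline{\varphi }_{11}=\overline{\varphi }_{22}=1\), \(\overline{\varphi }_{12}=\overline{\varphi }_{21}=0\), the determinant condition holds with the value 1, and \(\max _{(k,l)}|\partial _\theta \varphi _{kl,n}(\theta )|\lesssim _u \log (1/h)\).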

Let

$$\begin{aligned} f_\beta (y):= \frac{\partial _{\beta }\phi _{\beta }}{\phi _{\beta }}(y), \qquad g_\beta (y) := \frac{\partial \phi _{\beta }}{\phi _{\beta }}(y), \end{aligned}$$

and define the block-diagonal random matrix

$$\begin{aligned} \mathcal {I}(\theta )=\textrm{diag}\big (\mathcal {I}_{\lambda ,\mu }(\theta ),\mathcal {I}_{\beta ,\sigma }(\theta )\big ), \end{aligned}$$
(2.7)

where, for a random variable \(\epsilon {\mathop {\sim }\limits ^{P_{\theta }}} \phi _\beta (y)dy\) and by denoting by \(A^\top \) the transpose of matrix A,

$$\begin{aligned} \mathcal {I}_{\lambda ,\mu }(\theta )&:= \frac{1}{\sigma ^2}E_\theta \left[ g_\beta (\epsilon )^2 \right] \frac{1}{T}\int _0^T \begin{pmatrix} Y_t^2 &{} -Y_t X_t^{\top } \\ -Y_t X_t &{} X_t^{\otimes 2} \end{pmatrix} dt, \end{aligned}$$
(2.8)
$$\begin{aligned} \mathcal {I}_{\beta ,\sigma }(\theta )&:= \begin{pmatrix} \overline{\varphi }_{11} &{} \overline{\varphi }_{12} \\ -\overline{\varphi }_{21} &{} -\overline{\varphi }_{22} \end{pmatrix}^{\top }\!\!\! \begin{pmatrix} E_\theta \left[ f_\beta (\epsilon )^2\right] &{} E_\theta \left[ \epsilon f_\beta (\epsilon )g_\beta (\epsilon )\right] \\ E_\theta \left[ \epsilon f_\beta (\epsilon )g_\beta (\epsilon )\right] &{} E_\theta \left[ (1+\epsilon g_\beta (\epsilon ))^{2}\right] \end{pmatrix}\!\! \begin{pmatrix} \overline{\varphi }_{11} &{} \overline{\varphi }_{12} \\ -\overline{\varphi }_{21} &{} -\overline{\varphi }_{22} \end{pmatrix}. \end{aligned}$$
(2.9)

Note that \(\mathcal {I}(\theta )\) does depend on the choice of \(\overline{\varphi }(\theta )=\{\overline{\varphi }_{kl}(\theta )\}\); if \(\overline{\varphi }(\theta )\) is free from \((\lambda ,\mu )\), then so is \(\mathcal {I}(\theta )\).

Also, we note that \(\mathcal {I}(\theta ) >0\) (\(P_\theta \)-a.s., \(\theta \in \Theta \)) under (1.2). Indeed, it was verified in Brouste and Masuda (2018, Theorem 1) that \(\mathcal {I}_{\beta ,\sigma }(\theta )>0\) a.s. To deduce that \(\mathcal {I}_{\lambda ,\mu }(\theta )>0\) a.s., we note that \(\int _0^T Y^2_t dt>0\) a.s. and that, by Schwarz’s inequality,

$$\begin{aligned}&u^\top \left\{ \int _0^T X_t^{\otimes 2} dt - \left( \int _0^T Y_t X_tdt \right) \left( \int _0^T Y_t^2 dt \right) ^{-1} \left( \int _0^T Y_t X_tdt \right) ^\top \right\} u \nonumber \\&= \int _0^T (u\cdot X_t)^2 dt - \left( \int _0^T Y_t^2 dt \right) ^{-1} \left( \int _0^T Y_t (u\cdot X_t)dt \right) ^2 > 0 \nonumber \end{aligned}$$

for every nonzero \(u\in \mathbb {R}^q\), since for any constant real \(\xi \) we have \(Y\ne (u\cdot X)\xi \) a.s. as functions on [0, T]. Apply the identity \(\det \begin{pmatrix} A &{} B^\top \\ B &{} C\end{pmatrix}=\det (A)\det (C-BA^{-1}B^\top )\) to conclude the \(P_\theta \)-a.s. positive definiteness of \(\mathcal {I}(\theta )\).
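Spelled out, the identity applied with \(A=\int _0^T Y_t^2dt\), \(B=-\int _0^T Y_tX_tdt\) and \(C=\int _0^T X_t^{\otimes 2}dt\) reads

$$\begin{aligned} \det \begin{pmatrix} \int _0^T Y_t^2 dt &{} -\int _0^T Y_t X_t^{\top }dt \\ -\int _0^T Y_t X_tdt &{} \int _0^T X_t^{\otimes 2}dt \end{pmatrix} = \left( \int _0^T Y_t^2 dt\right) \det \left( \int _0^T X_t^{\otimes 2}dt - \left( \int _0^T Y_t^2 dt\right) ^{-1}\left( \int _0^T Y_t X_tdt\right) ^{\otimes 2}\right) , \end{aligned}$$

which is \(P_\theta \)-a.s. positive by the preceding display; since moreover \(\int _0^T Y_t^2dt>0\) a.s. and the Schur complement inside the determinant is a.s. positive definite, the whole matrix, and hence \(\mathcal {I}_{\lambda ,\mu }(\theta )\), is a.s. positive definite.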

The normalized score function \(\Delta _n(\theta _0)\) and the normalized observed information matrix \(\mathcal {I}_n(\theta _0)\) are given by

$$\begin{aligned} \Delta _{n}(\theta )&:= \varphi _{n}(\theta )^{\top }\partial _{\theta }\ell _{n}(\theta ), \nonumber \\ \mathcal {I}_n(\theta )&:= -\varphi _n(\theta )^{\top }\partial _{\theta }^{2}\ell _n(\theta )\varphi _n(\theta ), \nonumber \end{aligned}$$

respectively. Let \(MN_{p,\theta }(0,\mathcal {I}(\theta )^{-1})\) denote the covariance mixture of p-dimensional normal distribution, corresponding to the characteristic function \(u\mapsto E_\theta \big [\exp (-u^\top \mathcal {I}(\theta )^{-1}u/2)\big ]\). Finally, we write \(M[u]=\sum _i M_i u_i\) for a linear form \(M=\{M_i\}\) and similarly \(Q[u,u]=Q[u^{\otimes 2}]=\sum _{i,j}Q_{ij}u_i u_j\) for a quadratic form \(Q=\{Q_{ij}\}\). Now, we are ready to state the main claim of this section.

Theorem 2.1

The following statements hold for any \(\theta \in \Theta \).

  (1)

    For any bounded sequence \((v_n)\subset \mathbb {R}^p\), it holds that

    $$\begin{aligned} \ell _{n}\left( \theta +\varphi _{n}(\theta )v_n\right) - \ell _{n}\left( \theta \right) = \Delta _n(\theta )[v_n] - \frac{1}{2} \mathcal {I}_n(\theta ) [v_n,v_n] + o_{P_{\theta }}(1), \nonumber \end{aligned}$$

    where we have the convergence in distribution under \(P_\theta \): \(\mathcal {L}\left( \Delta _{n}(\theta ), \, \mathcal {I}_n(\theta ) |P_\theta \right) \Rightarrow \mathcal {L}\left( \mathcal {I}(\theta )^{1/2}Z,\, \mathcal {I}(\theta ) \right) \), where \(Z\sim N_{p}(0,I)\) is independent of \(\mathcal {I}(\theta )\), defined on an extended probability space.

  (2)

    There exists a local maximum point \(\hat{\theta }_{n}\) of \(\ell _{n}(\theta )\) with \(P_\theta \)-probability tending to 1 for which

    $$\begin{aligned} \varphi _{n}(\theta )^{-1}(\hat{\theta }_{n}-\theta ) = \mathcal {I}_{n}(\theta )^{-1}\Delta _n(\theta ) + o_{P_\theta }(1) \Rightarrow MN_{p,\theta }\left( 0,\, \mathcal {I}(\theta )^{-1} \right) . \nonumber \end{aligned}$$

It is worth mentioning that the particular non-diagonal form of \(\varphi _n(\theta )\) is, as in Brouste and Masuda (2018), unavoidable for deducing the asymptotically non-degenerate joint distribution of the maximum-likelihood estimator (MLE), that is, the good local maximum point \(\hat{\theta }_{n}\) in Theorem 2.1(2).

Remark 2.2

Here are some comments on the model timescale.

  (1)

    We fix the terminal sampling time T, so that the rate of convergence for \((\lambda ,\mu )\) is \(\sqrt{n}h^{1-1/\beta }=n^{1/\beta -1/2}T^{1-1/\beta }=O(n^{1/\beta -1/2})\). If \(\beta >1\) (resp. \(\beta <1\)), then a longer observation period would lead to a better (resp. worse) performance in estimating \((\lambda ,\mu )\). The Cauchy case \(\beta =1\), where the two rates of convergence coincide, is exceptional.

  (2)

    We can explicitly associate a change of the terminal sampling time T with those of the components of \(\theta \). Specifically, changing the model timescale from t to tT in (1.1), we see that the process

    $$\begin{aligned} Y^T=(Y^T_t)_{t\in [0,1]}:=(Y_{tT})_{t\in [0,1]} \nonumber \end{aligned}$$

    satisfies exactly the same integral equation as in (1.1), except that \(\theta = (\lambda ,\mu ,\beta ,\sigma )\) is replaced by

    $$\begin{aligned} \theta _T = \big ( \lambda _T,\mu _T,\beta _T,\sigma _T):= ( T\lambda ,T\mu ,\beta ,T^{1/\beta }\sigma \big ) \nonumber \end{aligned}$$

    (\(\beta \) is unchanged), \(X_t\) by \(X^T_t:=X_{tT}\), and \(J_t\) by \(J^T_t := T^{-1/\beta }J_{tT}\):

    $$\begin{aligned} Y^T_t = Y^T_0 +\int _0^t (\mu _T \cdot X^T_s-\lambda _T Y^T_s)ds + \sigma _T J^T_t, \qquad t\in [0,1]. \nonumber \end{aligned}$$

    Note that \((J^T_t)_{t\in [0,1]}\) defines the standard \(\beta \)-stable Lévy process. This indeed shows that we may set \(T\equiv 1\) in the virtual (model) world without loss of generality. This is impossible for diffusion-type models, where we cannot consistently estimate the drift coefficient unless we let the terminal sampling time T tend to infinity.
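For the reader's convenience, the substitution \(s=uT\) behind the last display reads

$$\begin{aligned} Y^T_t = Y_{tT} = Y_0 +\int _0^{tT} (\mu \cdot X_s-\lambda Y_s)ds + \sigma J_{tT} = Y^T_0 +\int _0^{t} \big ( T\mu \cdot X^T_u- T\lambda Y^T_u \big )du + T^{1/\beta }\sigma \, J^T_t, \end{aligned}$$

together with \(E[e^{iuJ^T_t}]=E[e^{iuT^{-1/\beta }J_{tT}}]=\exp (-t|u|^{\beta })\).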

Remark 2.3

The present framework allows us to conduct period-wise (for example, day-by-day) inference for both trend and scale structures, providing a sequence of period-wise estimates with theoretically valid approximate confidence sets. This, though informally, suggests an aspect of change-point analysis in high-frequency data: if we have a high-frequency sample over \([k-1,k]\) for \(k=1,\dots ,[T]\), then we can construct a sequence of estimators \(\{\hat{\theta }_{n}(k)\}_{k=1}^{[T]}\), and it would then be possible, in some way, to reject the constancy of \(\theta \) over [0, [T]] if \(k\mapsto \hat{\theta }_{n}(k)\) (\(k=1,\dots ,[T]\)) does not appear to stay constant.

Remark 2.4

It is formally straightforward to extend the model (1.1) to the following form:

$$\begin{aligned} Y_t = Y_0 +\int _0^t \left\{ a(X_s,\lambda ) Y_s + b(X_s,\mu ) \right\} ds + \int _0^t c(X_{s-},\sigma ) dJ_s, \qquad t\in [0,T]. \end{aligned}$$

Under mild regularity conditions on the functions \((a,b,c)\) as well as on the non-random process X, a solution process Y is explicitly given by (see Cheridito et al. 2003, Appendix)

$$\begin{aligned} Y_t = e^{\psi (s,t;\lambda )}Y_s + \int _s^t e^{\psi (u,t;\lambda )} (b(X_u,\mu )du + c(X_{u-},\sigma )dJ_u), \qquad 0\le s<t, \nonumber \end{aligned}$$

where \(\psi (s,t;\lambda ) := \int _s^t a(X_v,\lambda ) dv\). However, the corresponding likelihood asymptotics becomes much messier. It is worth mentioning that the optimal rate matrix can be diagonal if, for example, \(\partial _t\partial _\theta \log c(t,\sigma )\not \equiv 0\) with \(X_t=t\): for details, see the previous study Clément and Gloter (2020) that treated the general time-homogeneous Markovian case.

2.2 Proof of Theorem 2.1

In this proof, we make use of the general result of Sweeting (1980) about exact-likelihood asymptotics in a more or less analogous way to Brouste and Masuda (2018, Theorem 1): thanks to the uniform nature of the exact-likelihood asymptotics, we will deduce the joint convergence in distribution of the normalized score \(\Delta _n(\theta _0)\) and the normalized observed information \(\mathcal {I}_n(\theta _0)\) from the uniform convergence in probability of \(\mathcal {I}_n(\cdot )\) in an appropriate sense. Consequently, we will not need to derive the stable convergence in law of \(\Delta _n(\theta _0)\), which is often crucial when one is concerned with high-frequency sampling for a process with dependent increments.

We have \(\sup _{t\in [0,T]}|X_t|<\infty ,\) since \(X: [0,T]\rightarrow \mathbb {R}^q\) is assumed to be càdlàg. Through the localization procedure, we may and do suppose that the driving stable Lévy process does not have jumps of size greater than some fixed threshold (see Masuda 2019, Section 6.1 for a concise account). In that case, the Lévy measure of J is compactly supported; hence in particular,

$$\begin{aligned} \sup _{\theta \in \overline{\Theta }}E_\theta \left[ |J_1|^K\right] < \infty \end{aligned}$$
(2.10)

for any \(K>0\). Further, since the Lévy measure of J is symmetric, the removal of large-size jumps does not change the parametric form of the drift coefficient. We also localize the initial variable \(Y_0\) so that \(|Y_0|\) is essentially bounded uniformly in \(\theta \). It follows from (2.2) and (2.10) that \(\sup _{\theta \in \overline{\Theta }}\sup _{0\le t\le T}E_\theta \left[ |Y_t|^K\right] <\infty \) for \(t\in [0,T]\) as well.

To proceed, we introduce some further notation. Given continuous random functions \(\xi _{0}(\theta )\) and \(\xi _{n}(\theta )\), \(n\ge 1\), we write \(\xi _{n}(\theta )\xrightarrow {p}_u \xi _{0}(\theta )\) if the joint distribution of \(\xi _n\) and \(\xi _0\) is well defined under \(P_\theta \) and if \(P_{\theta }[ |\xi _{n}(\theta )-\xi _{0}(\theta )|>\epsilon ] \rightarrow _{u} 0\) for every \(\epsilon >0\) as \(n\rightarrow \infty \). Additionally, for a sequence \(a_n>0,\) we write \(\xi _n(\theta )=o_{u,p}(a_n)\) if \(a_n^{-1}\xi _{n}(\theta )\xrightarrow {p}_u 0\), and also \(\xi _n(\theta )=O_{u,p}(a_n)\) if for every \(\epsilon >0\) there exists a constant \(K>0\) for which \(\sup _\theta P_\theta [|a_n^{-1}\xi _n(\theta )| > K]<\epsilon \). Similarly, for any random functions \(\chi _{nj}(\theta )\) doubly indexed by n and \(j\le n\), we write \(\chi _{nj}(\theta )=O^*_p(a_n)\) if

$$\begin{aligned} \sup _n \max _{j\le n} \sup _{\theta } E_\theta \left[ |a_n^{-1}\chi _{nj}(\theta )|^K\right] < \infty \nonumber \end{aligned}$$

for any \(K>0\). Finally, let

$$\begin{aligned} \mathfrak {N}_{n}(c;\theta ) := \left\{ \theta ' \in \Theta :\, |\varphi _{n}(\theta )^{-1}(\theta '-\theta )|\le c \right\} . \nonumber \end{aligned}$$

We will complete the proof of Theorem 2.1 by verifying the three statements corresponding to the conditions (12), (13), and (14) in Brouste and Masuda (2018), which here read

$$\begin{aligned}&\mathcal {I}_n(\theta ) \xrightarrow {p}_u \mathcal {I}(\theta ), \end{aligned}$$
(2.11)
$$\begin{aligned}&\sup _{\theta ' \in \mathfrak {N}_{n}(c;\theta )} | \varphi _{n}(\theta ')^{-1}\varphi _{n}(\theta ) - I_{p} | \rightarrow _{u} 0, \end{aligned}$$
(2.12)
$$\begin{aligned}&\sup _{\theta ^1,\dots ,\theta ^{p}\in \mathfrak {N}_{n}(c;\theta )} \left| \varphi _{n}(\theta )^{\top }\{ \partial _{\theta }^{2}\ell _{n}(\theta ^{1},\dots ,\theta ^{p})-\partial _{\theta }^{2}\ell _{n}(\theta ) \} \varphi _{n}(\theta ) \right| {\xrightarrow {p}_{u}} 0, \end{aligned}$$
(2.13)

respectively, where (2.12) and (2.13) should hold for all \(c>0\) and where \(\partial _{\theta }^{2}\ell _{n}(\theta ^1,\dots ,\theta ^{p})\), \(\theta ^k\in \Theta \), denotes the \(p\times p\) Hessian matrix of \(\ell _n(\theta )\), whose (k, l)th element is given by \(\partial _{\theta _k}\partial _{\theta _l}\ell _{n}(\theta ^k)\), in which \(\theta =:(\theta _l)_{l=1}^{p}\). Having obtained (2.11), (2.12) and (2.13), Sweeting (1980, Theorems 1 and 2) immediately yields Theorem 2.1. We can verify (2.12) exactly as in Brouste and Masuda (2018), so we will focus on (2.11) and (2.13).

Proof of (2.11). Recall the expression (2.4). To look at the entries of \(\partial _\theta ^2\ell _n(\theta )\), we introduce several shorthands for notational convenience; they may look somewhat daring, but should not cause confusion. Let us omit the subscript \(\beta \) and the argument \(\epsilon _j\) of the aforementioned notation, such as \(\phi :=\phi _\beta (\epsilon _j)\), \(g:=g_\beta (\epsilon _j)\) and so on. For brevity, we also write

$$\begin{aligned} l'=\log (1/h), \quad {c=\eta (\lambda \beta h)^{-1/\beta }}, \quad \epsilon =\epsilon _{j}(\theta ), \nonumber \end{aligned}$$

so that (2.4) becomes

$$\begin{aligned} \ell _n(\theta ) = \sum _{j=1}^{n}\left( -\log \sigma +\frac{1}{\beta }l' + \log c + \log \phi \right) . \nonumber \end{aligned}$$

Further, partial differentiation with respect to a variable will be denoted by a parenthesized subscript, such as \(\epsilon _{(a)}:=\partial _a\epsilon _j(\theta )\) and \(\epsilon _{(a,b)}:=\partial _a\partial _b\epsilon _j(\theta )\). Then, direct computations give the first-order partial derivatives:

$$\begin{aligned} {\partial _\lambda }\ell _n(\theta )&= \sum _{j=1}^{n}\left( (\log c)_{(\lambda )} + \epsilon _{(\lambda )}\, g \right) , \nonumber \\ {\partial _\mu }\ell _n(\theta )&= \sum _{j=1}^{n}\epsilon _{(\mu )}\, g, \nonumber \\ \partial _\beta \ell _n(\theta )&= \sum _{j=1}^{n}\left( -\beta ^{-2}l' + (\log c)_{(\beta )} + \epsilon _{(\beta )}\, g + f \right) , \nonumber \\ \partial _\sigma \ell _n(\theta )&= \sum _{j=1}^{n}\left( -\sigma ^{-1} + \epsilon _{(\sigma )} \, g \right) , \nonumber \end{aligned}$$

followed by the second-order ones:

$$\begin{aligned} \partial _\lambda ^2\ell _n(\theta )&= \sum _{j=1}^{n}\left\{ (\epsilon _{(\lambda )})^2 (\partial g) + \epsilon _{(\lambda ,\lambda )}\, g + (\log c)_{(\lambda ,\lambda )} \right\} , \nonumber \\ \partial _\mu ^2\ell _n(\theta )&= \sum _{j=1}^{n}(\epsilon _{(\mu )})^2 (\partial g), \nonumber \\ \partial _\beta ^2\ell _n(\theta )&= \sum _{j=1}^{n}\big \{ 2{\beta ^{-3}} l' + (\log c)_{(\beta ,\beta )} + \epsilon _{(\beta ,\beta )}\, g + \epsilon _{(\beta )}\, g_{(\beta )} \\ {}&\qquad + (\epsilon _{(\beta )})^2 (\partial g) + f_{(\beta )} + \epsilon _{(\beta )}\, (\partial f) \Big \}, \nonumber \\ \partial _\sigma ^2\ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \sigma ^{-2} + (\epsilon _{(\sigma )})^2 (\partial g) + \epsilon _{(\sigma ,\sigma )} \, g \right\} , \nonumber \\ \partial _\lambda \partial _\mu \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \epsilon _{(\lambda )}\,\epsilon _{(\mu )} (\partial g) + \epsilon _{(\mu ,\lambda )}\, g \right\} , \nonumber \\ \partial _\lambda \partial _\beta \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ (\log c)_{(\lambda ,\beta )} + \epsilon _{(\lambda )}\, g_{(\beta )} + \epsilon _{(\beta )}\, \epsilon _{(\lambda )}\, (\partial g) + \epsilon _{(\lambda ,\beta )}\, g \right\} , \nonumber \\ \partial _\lambda \partial _\sigma \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \epsilon _{(\sigma )}\,\epsilon _{(\lambda )} (\partial g) + \epsilon _{(\lambda ,\sigma )}\, g \right\} , \nonumber \\ \partial _\mu \partial _\beta \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \epsilon _{(\mu )}\, g_{(\beta )} + \epsilon _{(\beta )}\,\epsilon _{(\mu )} (\partial g) + \epsilon _{(\beta ,\mu )}\, g \right\} , \nonumber \\ \partial _\mu \partial _\sigma \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \epsilon _{(\mu )}\,\epsilon _{(\sigma )} (\partial g) + \epsilon _{(\mu ,\sigma )} \, g \right\} , \nonumber \\ \partial _\beta \partial _\sigma \ell _n(\theta )&= \sum _{j=1}^{n}\left\{ \epsilon _{(\sigma )}\, g_{(\beta )} + \epsilon _{(\beta )}\,\epsilon _{(\sigma )} (\partial g) + \epsilon _{(\beta ,\sigma )}\, g \right\} . \nonumber \end{aligned}$$

It is straightforward to see which term is the leading one in each expression above. We do not list all the details here, but for later reference we mention a few points:

  • \(\partial ^k\log \eta (y) = O_u(1)\) for \(|y|\rightarrow 0\) whatever \(k\in \mathbb {Z}_{+}\) is;

  • \((\log c)_{(\lambda ,\dots ,\lambda )}=O_u(h^k)\) (k-times, \(k\in \mathbb {Z}_{+}\)), \((\log c)_{(\lambda ,\beta )}=O_u(h^2)\), \((\log c)_{(\beta )}=O_u(h^2)\), \((\log c)_{(\beta ,\beta )}=O_u(h^4)\), and so forth;

  • \(\max _{j\le n}|\partial _\lambda ^k\zeta _j(\lambda )|=O(h^k)\) for \(k\in \mathbb {Z}_{+}\);

  • recalling the definition (2.3) and because of the consequence (2.10) of the localization, concerning the partial derivatives of \(\epsilon _j(\theta )\) we obtain the asymptotic representations: \(\epsilon _{(\mu ,\sigma )} = (1+o_{u}(1)) \sigma ^{-2}h^{1-1/\beta }\), \(\epsilon _{(\mu ,\lambda )} = (1+o_{u}(1)) \sigma ^{-1}h^{2-1/\beta }/2\), \(\epsilon _{(\sigma ,\lambda )} = (1+o_{u}(1)) \{-\sigma ^{-2}h^{1-1/\beta }Y_{t_{j-1}} + O^*_p(h \vee h^{2-1/\beta })\}\), \(\epsilon _{(\lambda ,\lambda )} = O^*_p(h^{2-1/\beta })\), \(\epsilon _{(\beta ,\beta )} = O_p^*(h^2 (l')^2) + \epsilon \, O^*_p((l')^2)\), \(\epsilon _{(\lambda ,\beta )} = O^*_p(l' h^{1-1/\beta })\), and so on; the terms “\(o_u(1)\)” therein are all valid uniformly in \(j\le n\).
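To illustrate where these representations come from, differentiating (2.3) directly gives, for example,

$$\begin{aligned} \epsilon _{(\mu )} = -\frac{h^{1-1/\beta }}{\sigma \,\eta (\lambda \beta h)^{1/\beta }}\,\zeta _{j}(\lambda ) = -(1+o_{u}(1))\,\sigma ^{-1}h^{1-1/\beta }\zeta _{j}(\lambda ), \qquad \epsilon _{(\sigma )} = -\frac{1}{\sigma }\,\epsilon _{j}(\theta ), \end{aligned}$$

since \(\eta (\lambda \beta h)\rightarrow _u 1\); the second-order representations in the last bullet then follow by further differentiation.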

Now, we write

$$\begin{aligned} \mathcal {I}_n(\theta ) = \begin{pmatrix} \mathcal {I}_{11,n}(\theta ) &{} \mathcal {I}_{12,n}(\theta ) \\ \mathcal {I}_{12,n}(\theta )^\top &{} \mathcal {I}_{22,n}(\theta ) \end{pmatrix} \nonumber \end{aligned}$$

with \(\mathcal {I}_{11,n}(\theta ) \in \mathbb {R}^{1+q}\otimes \mathbb {R}^{1+q}\), \(\mathcal {I}_{22,n}(\theta ) \in \mathbb {R}^{2}\otimes \mathbb {R}^{2}\) and \(\mathcal {I}_{12,n}(\theta ) \in \mathbb {R}^{1+q}\otimes \mathbb {R}^{2}\). We can deduce \(\mathcal {I}_{22,n}(\theta ) \xrightarrow {p}_u \mathcal {I}_{\beta ,\sigma }(\theta )\) in exactly the same way as in the proof of Eq.(12) in Brouste and Masuda (2018). Below, we will show \(\mathcal {I}_{11,n}(\theta ) \xrightarrow {p}_u \mathcal {I}_{\lambda ,\mu }(\theta )\) and \(\mathcal {I}_{12,n}(\theta ) \xrightarrow {p}_u 0\).

The Burkholder inequality ensures that

$$\begin{aligned} \frac{1}{\sqrt{n}}\sum _{j=1}^{n}\pi (X_{t_{j-1}},Y_{t_{j-1}};\theta )U(\epsilon _j(\theta )) = O_{u,p}(1) \end{aligned}$$
(2.14)

for any continuous \(\pi (x,y;\theta )\) and for any \(U(\epsilon _j(\theta ))\) such that \(E_\theta [U(\epsilon _j(\theta ))]=0\) (\(\theta \in \Theta \)) and that the left-hand side of (2.14) is continuous over \(\theta \in \overline{\Theta }\). Also, note that the right continuity of \(t\mapsto X_{t}\) implies that (\(X_t^{\otimes 1}:=X_t\))

$$\begin{aligned} \lim _{n\rightarrow \infty }\max _{l=1,2}\max _{j\le n}\left| \frac{1}{h} \int _j (X_s^{\otimes l} - X_{t_{j-1}}^{\otimes l}) ds\right| = 0. \nonumber \end{aligned}$$

These basic facts will be used repeatedly below without further mention.

For convenience, we will write

$$\begin{aligned} r_n=r_n(\beta )=\sqrt{n}h^{1-1/\beta } \end{aligned}$$
(2.15)

and denote by \(\varvec{1}_{u,p}\) any random array \(\xi _{nj}(\theta )\) such that \(\max _{j\le n}|\xi _{nj}(\theta ) - 1| \xrightarrow {p}_u 0\). Direct computations give the following expressions for the components of \(\mathcal {I}_{11,n}(\theta )=- r_{n}^{-2}\partial _{(\lambda ,\mu )}^2\ell _{n}(\theta )\):

$$\begin{aligned} -\frac{1}{r_n^2}\partial _\mu ^2\ell _n(\theta )&= -\frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2}(\partial g)\,\zeta _j(\lambda )^{\otimes 2} + o_{u,p}(1) \nonumber \\&= \frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2} g^2 \,\zeta _j(\lambda )^{\otimes 2} + o_{u,p}(1), \nonumber \\ -\frac{1}{r_n^2}\partial _\lambda ^2\ell _n(\theta )&= -\frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2}(\partial g)\,Y_{t_{j-1}}^2\,\varvec{1}_{u,p} + o_{u,p}(1)+O_{u,p}(h^{1/\beta }) \nonumber \\&= \frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2} g^2\, Y_{t_{j-1}}^2 + o_{u,p}(1), \nonumber \\ -\frac{1}{r_n^2}\partial _\lambda \partial _\mu \ell _n(\theta )&= -\frac{1}{n}\sum _{j=1}^{n}\Big \{ (\partial g) \left( \varvec{1}_{u,p}\sigma ^{-1}Y_{t_{j-1}}\,\zeta _j(\lambda ) + O^*_{p}(h^{1/\beta }) \right) (-\sigma ^{-1}\varvec{1}_{u,p}) + O^*_{p}(h^{1/\beta }) \Big \} \nonumber \\&=\varvec{1}_{u,p}\left( \frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2}(\partial g)\,Y_{t_{j-1}}\,\zeta _j(\lambda ) + o_{u,p}(1) \right) + o_{u,p}(1) \nonumber \\&= -\frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2} g^2\, Y_{t_{j-1}}\,\zeta _j(\lambda ) + o_{u,p}(1). \nonumber \end{aligned}$$

We can deduce that \(\mathcal {I}_{11,n}(\theta ) \xrightarrow {p}_u \mathcal {I}_{\lambda ,\mu }(\theta )\) as follows.

  • First, noting that \(\epsilon _j=\epsilon _j(\theta ) {\mathop {\sim }\limits ^{P_{\theta }}} \text {i.i.d.}~\mathcal {L}(J_1)\), we make the compensation \(g^2= E_\theta [g^2]+(g^2 - E_\theta [g^2])\) in the summands in the rightmost sides of the last three displays and then pick up the leading part involving \(E_\theta [g^2]\); the other one becomes negligible by the Burkholder inequality.

  • Then, the a.s. Riemann integrability of \(t\mapsto (X_{t}(\omega ),Y_{t}(\omega ))\) allows us to conclude that, for \(k,l\in \{0,1,2\}\) and under \(P_\theta \) for each \(\theta \),

    $$\begin{aligned} D_{n}(k,l)&:= \left| \frac{1}{n} \sum _{j=1}^{n}Y_{t_{j-1}}^k\,X_{t_{j-1}}^{\otimes l} - \frac{1}{T} \int _0^T Y_{t}^k \, X_{t}^{\otimes l} dt \right| \nonumber \\&\lesssim \frac{1}{n} \sum _{j=1}^{n}\frac{1}{h} \int _j \Bigg (|Y_t- Y_{t_{j-1}}|(1+|Y_t|+|Y_{t_{j-1}}|)^C \nonumber \\&{}\qquad + |Y_t|^k \left| \left( \frac{1}{h}\int _j X_t dt + O(h)\right) ^{\otimes l} - X_t ^{\otimes l} \right| \Bigg )dt \nonumber \\&\lesssim \frac{1}{n} \sum _{j=1}^{n}\frac{1}{h} \int _j \Bigg (|Y_t- Y_{t_{j-1}}|(1+|Y_t|+|Y_{t_{j-1}}|)^C + |Y_t|^k o(1)\Bigg )dt \xrightarrow {p}0, \nonumber \end{aligned}$$

    where the order symbols in the estimates are valid uniformly in \(j\le n\). By (2.2), under the localization, we have \(\max _{j\le n}\sup _\theta E_\theta [|Y_t|^M]=O(1)\) and \(\max _{j\le n}\sup _\theta E_\theta [|Y_t- Y_{t_{j-1}}|^M]=o_u(1)\) for any \(M>0\), from which it follows that \(D_{n}(k,l) = o_{u,p}(1)\).

Specifically, for the case of \(-r_n^{-2}\partial _\mu ^2\ell _n(\theta )\), we have

$$\begin{aligned} -\frac{1}{r_n^2}\partial _\mu ^2\ell _n(\theta )&= \frac{1}{n}\sum _{j=1}^{n}\sigma ^{-2} E_\theta [g^2] \,\zeta _j(\lambda )^{\otimes 2} + o_{u,p}(1) \nonumber \\&= \sigma ^{-2} E_\theta [g_\beta (\epsilon _1(\theta ))^2] \, \frac{1}{T}\int _0^T X_{t}^{\otimes 2}dt + o_{u,p}(1) \xrightarrow {p}_u \mathcal {I}_{\lambda ,\mu ;22}(\theta ) \nonumber \end{aligned}$$

with \(\mathcal {I}_{\lambda ,\mu ;22}(\theta )\) denoting the lower-right \(q\times q\) block of \(\mathcal {I}_{\lambda ,\mu }(\theta )\). The others can be handled analogously.

Next, we turn to looking at \(\mathcal {I}_{12,n}(\theta )=\{\mathcal {I}_{12,n}^{kl}(\theta )\}_{k,l}\):

$$\begin{aligned} \mathcal {I}_{12,n}^{11}(\theta )&= -\frac{(h^{1-1/\beta })^{-1}}{n}\left( \varphi _{11,n}(\theta )\partial _\lambda \partial _\beta \ell _n(\theta ) + \varphi _{21,n}(\theta )\partial _\mu \partial _\beta \ell _n(\theta ) \right) , \nonumber \\ \mathcal {I}_{12,n}^{12}(\theta )&= -\frac{(h^{1-1/\beta })^{-1}}{n}\left( \varphi _{11,n}(\theta )\partial _\lambda \partial _\sigma \ell _n(\theta ) + \varphi _{21,n}(\theta )\partial _\mu \partial _\sigma \ell _n(\theta ) \right) , \nonumber \\ \mathcal {I}_{12,n}^{21}(\theta )&= -\frac{(h^{1-1/\beta })^{-1}}{n}\left( \varphi _{12,n}(\theta )\partial _\lambda \partial _\beta \ell _n(\theta ) + \varphi _{22,n}(\theta )\partial _\mu \partial _\beta \ell _n(\theta ) \right) , \nonumber \\ \mathcal {I}_{12,n}^{22}(\theta )&= -\frac{(h^{1-1/\beta })^{-1}}{n}\left( \varphi _{12,n}(\theta )\partial _\lambda \partial _\sigma \ell _n(\theta ) + \varphi _{22,n}(\theta )\partial _\mu \partial _\sigma \ell _n(\theta ) \right) . \nonumber \end{aligned}$$

We can deduce that \(\mathcal {I}_{12,n}(\theta ) \xrightarrow {p}_u 0\) just by inspecting the four components separately in a similar way to how we handled \(\mathcal {I}_{11,n}(\theta )\). Let us only mention the lower-left \(q\times 1\) component: recalling the property (2.1) and the bound \(|\varphi _{22,n}|\lesssim _u l'\), we see that

$$\begin{aligned} \mathcal {I}_{12,n}^{21}(\theta )&= -\frac{(h^{1-1/\beta })^{-1}}{n}\sum _{j=1}^{n}\bigg ( \varphi _{12,n}\,(\log c)_{(\lambda ,\beta )} + \varphi _{12,n}\,\epsilon _{(\lambda )}\,g_{(\beta )} + \varphi _{12,n}\,\epsilon _{(\lambda )}\,\epsilon _{(\beta )}\,(\partial g) \nonumber \\&{}\qquad + \varphi _{12,n}\, \epsilon _{(\lambda ,\beta )}\, g + \varphi _{22,n}\,\epsilon _{(\mu )}\, g_{(\beta )} + \varphi _{22,n}\,\epsilon _{(\beta )}\,\epsilon _{(\mu )}\,(\partial g) + \varphi _{22,n}\,\epsilon _{(\beta ,\mu )} \,g\bigg ) \nonumber \\&= O(h^{1+1/\beta }) + O_{u,p}(n^{-1/2}\vee h^{1/\beta }) + O_{u,p}\left( (n^{-1/2}\vee h^{1/\beta }) \,l'\right) \nonumber \\&{}\qquad +O_{u,p}(n^{-1/2}) + O_{u,p}\left( n^{-1/2}(l')^2\right) + O_{u,p}\left( n^{-1/2}(l')^2\right) \xrightarrow {p}_u 0. \nonumber \end{aligned}$$

Thus, the claim (2.11) follows.

Proof of (2.13). Note that

$$\begin{aligned} \sup _{\theta '\in \mathfrak {N}_{n}(c;\theta )} |\epsilon _{j}(\theta ')| \lesssim _{u} |\epsilon _{j}(\theta )| + \overline{s}_{nj}(\theta ;c), \nonumber \end{aligned}$$

where \(|\overline{s}_{nj}(\theta ;c)| \lesssim o_{u}(1)(1+|Y_{t_{j-1}}|)\). Also, for each \(k,l,m\in \mathbb {Z}_{+}\), we have \(P_\theta \)-a.s. the (rough) estimate:

$$\begin{aligned} \frac{1}{n}\big |\partial _{\beta }^{k}\partial _{\sigma }^{l}\partial _{(\lambda ,\mu )}^{m}\ell _{n}(\theta )\big | \lesssim _{u} (l')^{k}\, h^{(1-1/\beta )m}\, \frac{1}{n}\sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^m \left\{ 1 + \log \left( 1+|\epsilon _{j}(\theta )|^{2}\right) \right\} ^{k}. \nonumber \end{aligned}$$

Then, as in the proof of Eq.(14) in Brouste and Masuda (2018), for each \(c>0,\) we can find a constant \(R=R(c)>0\) such that (still rough, but sufficient)

$$\begin{aligned}&\sup _{\theta ^1,\dots ,\theta ^{p}\in \mathfrak {N}_{n}(c;\theta )} \left| \varphi _{n}(\theta )^{\top }\{ \partial _{\theta }^{2}\ell _{n}(\theta ^{1},\dots ,\theta ^{p})-\partial _{\theta }^{2}\ell _{n}(\theta ) \} \varphi _{n}(\theta ) \right| \nonumber \\&\lesssim _{u} \sup _{\theta ',\theta ^1,\dots ,\theta ^{p}\in \mathfrak {N}_{n}(c;\theta )} \left| \varphi _{n}(\theta )^{\top } \left\{ \partial _{\theta }^{3}\ell _{n}(\theta ^{1},\dots ,\theta ^{p}) [\theta ' - \theta ] \right\} \varphi _{n}(\theta ) \right| \nonumber \\&\lesssim _{u} \frac{(l')^{C}}{\sqrt{n}} \sup _{\beta ',\beta '' \in \overline{B}(\beta ;R/l^{\prime })} h^{(1/\beta '-1/\beta '')3} \sup _{\theta ' \in \mathfrak {N}_{n}(c;\theta )} \frac{1}{n}\sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^m \left\{ 1 + \log \left( 1+|\epsilon _{j}(\theta ')|\right) \right\} ^{3} \nonumber \\&\lesssim _{u} \frac{(l')^{C}}{\sqrt{n}} \frac{1}{n}\sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^m \left\{ 1 + \log \left( 1+|\epsilon _{j}(\theta )|\right) \right\} ^{3} \lesssim O_{u,p}\left( \frac{(l')^{C}}{\sqrt{n}}\right) \xrightarrow {p}_u 0, \nonumber \end{aligned}$$

where \(\overline{B}(\beta ;R/l^{\prime })\) denotes the closed ball with center \(\beta \) and radius \(R/l'\). This shows (2.13). The proof of Theorem 2.1 is complete.

3 Asymptotically efficient estimator

From now on, we fix a true value \(\theta _0\in \Theta \), and the stochastic symbols and convergences will be taken under \(P:=P_{\theta _0}\); accordingly, we write \(E:=E_{\theta _0}\). Having Theorem 2.1 in hand, we can proceed with the construction of an asymptotically efficient estimator. It is known that any asymptotically centering estimator \(\hat{\theta }_{n}^*\):

$$\begin{aligned} \varphi _{n}(\theta _0)^{-1}(\hat{\theta }_{n}^*-\theta _0) = \mathcal {I}_{n}(\theta _0)^{-1}\Delta _n(\theta _0) + o_{p}(1) \end{aligned}$$
(3.1)

is regular; by Theorem 2.1, the right-hand side converges in distribution to \(MN_{p,\theta _0}\left( 0,\, \mathcal {I}(\theta _0)^{-1} \right) \). This, together with the convolution theorem, in turn gives the asymptotic minimax theorem: for any measurable (loss) function \(\mathfrak {L}:\,\mathbb {R}^p \rightarrow \mathbb {R}_{+}\) such that \(\mathfrak {L}(u)=\tau (|u|)\) for some non-decreasing \(\tau :\,\mathbb {R}_{+}\rightarrow \mathbb {R}_{+}\) with \(\tau (0)=0\), we have

$$\begin{aligned} \liminf _{n\rightarrow \infty } E\left[ \mathfrak {L}\left( \varphi _{n}(\theta _0)^{-1}(\hat{\theta }_{n}^*- \theta _0) \right) \right] \ge E\big [\mathfrak {L}\big (\mathcal {I}(\theta _0)^{-1/2}Z\big )\big ]. \end{aligned}$$
(3.2)

Recalling that \(\mathcal {L}\left( \Delta _{n}(\theta ), \, \mathcal {I}_n(\theta ) |P_\theta \right) \Rightarrow \mathcal {L}\left( \mathcal {I}(\theta )^{1/2}Z,\, \mathcal {I}(\theta ) \right) \), where \(Z\sim N_{p}(0,I)\) (Theorem 2.1), and in view of the lower bound in (3.2), we may call any estimator \(\hat{\theta }_{n}^*\) satisfying (3.1) asymptotically efficient. Again by Theorem 2.1, the good local maximum point \(\hat{\theta }_{n}\) of \(\ell _n(\theta )\) is asymptotically efficient. We refer to Jeganathan (1982, Theorems 2 and 3, and Proposition 2) and also Jeganathan (1995, Theorem 8) for more information and details of the above arguments.

Theorem 2.1 is based on the classical Cramér-type argument. The well-known shortcoming is its local character: the result only guarantees the existence of an asymptotically well-behaved root of the likelihood equation, but gives no information about which local maximum is the right one when there are multiple local maxima, and equivalently multiple roots of the likelihood equation (Lehmann 1999, Section 7.3). Indeed, the log-likelihood function \(\ell _n\) of (2.4) is highly nonlinear and non-concave. In this section, we try to get rid of the locality by a Newton–Raphson type of improvement, which in our case will not only remedy the aforementioned inconvenience of the multiple-root problem, but also enable us to bypass the numerical optimization involving the stable density \(\phi _\beta \). In Brouste and Masuda (2018, Section 3), for the \(\beta \)-stable Lévy process (the special case of (1.1) with \(\lambda =0\) and \(X\equiv 1\)), we provided an initial estimator based on the sample median and the method of moments associated with logarithmic and/or lower-order fractional moments. However, it was essential in Brouste and Masuda (2018) that the model is a Lévy process, for which we could apply the median-adjusted central limit theorem for an i.i.d. sequence of random variables. In the present case, we need a different sort of argument.

In Theorem 2.1, the process \(X=(X_t)_{t\in [0,T]}\) was assumed to be observed continuously in [0, T]. In this section, we will instead deal with a discrete-time sample \((X_{t_j})_{j=0}^{n}\) under the additional condition:

$$\begin{aligned} \exists \kappa \in {(1/2,1]},\quad \max _{j\le n}\left| \frac{1}{h} \int _j (X_t - X_{t_{j-1}}) dt \right| \lesssim h^\kappa . \end{aligned}$$
(3.3)

We will explicitly construct an estimator \(\hat{\theta }_{n}^*\) which is asymptotically equivalent to the MLE \(\hat{\theta }_{n}\), by verifying the asymptotically centering property (3.1); for this much-thinned sample, we may and do keep calling such a \(\hat{\theta }_{n}^*\) asymptotically efficient.

3.1 Newton–Raphson procedure

To proceed with a discrete-time sample \(\{(X_{t_j},Y_{t_j})\}_{j=0}^{n}\), we introduce the approximate-likelihood function \(\mathbb {H}_n(\theta )\) by replacing \(\zeta _{j}(\lambda )\) by \(X_{t_{j-1}}\) in the definition (2.4) of the genuine log-likelihood function \(\ell _n(\theta )\) (recall the notation \(l':=\log (1/h)\)):

$$\begin{aligned} \mathbb {H}_n(\theta )&=\sum _{j=1}^{n}\left( -\log \sigma +\frac{1}{\beta }l' {- \frac{1}{\beta }\log \eta (\lambda \beta h)} + \log \phi _{\beta } \left( \epsilon _{j}'(\theta ) \right) \right) , \end{aligned}$$
(3.4)

where

$$\begin{aligned} \epsilon '_{j}(\theta ) := \frac{Y_{t_{j}} - e^{-\lambda h}Y_{t_{j-1}} - \mu \cdot X_{t_{j-1}} h}{\sigma h^{1/\beta }\eta (\lambda \beta h)^{1/\beta }}. \end{aligned}$$
(3.5)

Of course, this approximation does not come for free: to manage the resulting discretization error specified later on, we additionally impose that

$$\begin{aligned} \beta _0 > \frac{2}{1+2\kappa }. \end{aligned}$$
(3.6)

Then we have at least \(\beta _0 > 2/3\), so that small values of \(\beta _0\) are excluded; this is the price we have to pay for dealing with a discrete-time sample from X in an efficient way. Accordingly, in the sequel, we will reset the parameter space of \(\beta \) to be a domain \(\Theta _\beta \) such that \(\overline{\Theta _\beta } \subset (2/3,2)\).
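As an illustration only (under the same hedged scipy-parametrization assumption as in the earlier sketch), the approximate log-likelihood (3.4)–(3.5) can be evaluated from the discrete-time sample alone along the following lines; the name approx_loglik is ours.

```python
# A minimal sketch of the approximate log-likelihood (3.4)-(3.5), using only the
# discrete-time sample {(X_{t_j}, Y_{t_j})}_{j=0}^n. Assumption (not from the paper):
# scipy's levy_stable with skewness 0 and unit scale matches phi_beta.
import numpy as np
from scipy.stats import levy_stable


def approx_loglik(theta, Y, X, T):
    """H_n(theta) of (3.4) at theta = (lam, mu_1, ..., mu_q, beta, sigma)."""
    theta = np.asarray(theta, dtype=float)
    lam, mu, beta, sigma = theta[0], theta[1:-2], theta[-2], theta[-1]
    n = len(Y) - 1
    h = T / n
    x = lam * beta * h
    eta = -np.expm1(-x) / x if abs(x) > 1e-12 else 1.0       # eta(x) = (1 - e^{-x}) / x
    scale = sigma * h ** (1.0 / beta) * eta ** (1.0 / beta)
    # residuals (3.5): zeta_j(lambda) of (2.3) replaced by X_{t_{j-1}}
    eps = (Y[1:] - np.exp(-lam * h) * Y[:-1] - (X[:-1] @ mu) * h) / scale
    return np.sum(levy_stable.logpdf(eps, beta, 0.0)) - n * np.log(scale)
```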

Toward construction of an asymptotically efficient estimator \(\hat{\theta }_{n}^*\) satisfying (3.1), we will prove a basic result about a Newton–Raphson type of procedure. As in (2.15), we write \(r_n=r_n(\beta _0)=\sqrt{n}h^{1-1/\beta _0}\). Write \(n^{-1/2}\tilde{\varphi }_n(\theta )\) for the lower-right \(2\times 2\)-part of \(\varphi _n(\theta )\), so that the definition (2.5) with \(\theta =\theta _0\) becomes \(\varphi _n(\theta _0) = \textrm{diag}(r_{n}^{-1}I_{q+1},\, n^{-1/2}\tilde{\varphi }_n(\theta _0))\). We then introduce the diagonal matrix

$$\begin{aligned} \varphi _{0,n}=\varphi _{0,n}(\beta _0):= \textrm{diag}\left( r_{n}^{-1}I_{q+1},\, n^{-r/2} \begin{pmatrix} 1 &{} 0 \\ 0 &{} l' \end{pmatrix} \right) \end{aligned}$$
(3.7)

for a constant

$$\begin{aligned} {0<r\le 1.} \end{aligned}$$
(3.8)

The difference between \(\varphi _n\) and \(\varphi _{0,n}\) is only in the lower-right component for \((\beta ,\sigma )\), and note that the matrix \(\varphi _n^{-1}\varphi _{0,n}\) may diverge in norm. Then, suppose that we are given an initial estimator \(\hat{\theta }_{0,n}=(\hat{\lambda }_{0,n},\hat{\mu }_{0,n},\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) such that \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n} - \theta _0) = O_p(1)\), namely,

$$\begin{aligned} \left( r_n (\hat{\lambda }_{0,n}-\lambda _0),\, r_n(\hat{\mu }_{0,n}-\mu _0),\, n^{r/2}(\hat{\beta }_{0,n}-\beta _0),\,\frac{n^{r/2}}{l'}(\hat{\sigma }_{0,n}-\sigma _0) \right) = O_{p}(1). \nonumber \end{aligned}$$

Let us write \(a=(\lambda ,\mu )\) and \(b=(\beta ,\sigma )\). Based on the approximate-likelihood function (3.4) and \(\hat{\theta }_{0,n}\), we recursively define the k-step estimator \(\hat{\theta }_{k,n}\) (\(k\ge 1\)) by

$$\begin{aligned} \hat{\theta }_{k,n} = \hat{\theta }_{k-1,n} + \left\{ \textrm{diag}\left( -\partial _a^2\mathbb {H}_n(\hat{\theta }_{k-1,n}),\, -\partial _b^2\mathbb {H}_n(\hat{\theta }_{k-1,n}) \right) \right\} ^{-1} \partial _\theta \mathbb {H}_n(\hat{\theta }_{k-1,n}) \end{aligned}$$
(3.9)

on the event \(F_{k-1,n} := \{|\det (\partial _a^2\mathbb {H}_n(\hat{\theta }_{k-1,n}))| \wedge |\det (\partial _b^2\mathbb {H}_n(\hat{\theta }_{k-1,n}))| > 0\}\) and assign an arbitrary value to \(\hat{\theta }_{k,n}\) on the complement \(F_{k-1,n}^c\); below, it will be seen (as in the proof of Theorem 2.1) that \(P[F_{k-1,n}]\rightarrow 1\). Hence, this arbitrariness does not matter asymptotically, and we may and do suppose that \(P[F_{k-1,n}]=1\) for \(k\ge 1\). In our subsequent arguments, the inverse-matrix part in (3.9) must be block diagonal: see Remark 3.3 below.

In what follows, \(\hat{\theta }_{n}\) denotes the good local maximum point of the likelihood function \(\ell _n(\theta )\) when \((X_t)_{t\le T}\) is observable; by Theorem 2.1, we have \(P[\partial _\theta \ell _n(\hat{\theta }_{n})=0] \rightarrow 1\) and \(\varphi _{n}(\theta _0)^{-1}(\hat{\theta }_{n}-\theta _0) = \mathcal {I}_{n}(\theta _0)^{-1}\Delta _n(\theta _0) + o_{p}(1)\Rightarrow MN_{p,\theta _0}\left( 0,\, \mathcal {I}(\theta _0)^{-1} \right) \). Define the number

$$\begin{aligned} K := \min \{ k\in \mathbb {N}:\, 2^{k-1}r> 1/2\} = \min \{ k\in \mathbb {N}:\, k>\log _2(1/r)\}. \end{aligned}$$
(3.10)
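As a rough illustration of the recursion (3.9) with the choice (3.10) of K, the following sketch takes an initial estimate and a callable evaluating \(\mathbb {H}_n\) (for instance, the approx_loglik sketch given earlier); the score and the two Hessian blocks are obtained here by central finite differences purely for convenience, not by the closed-form derivatives of Sect. 2.2, and all names are illustrative.

```python
# A minimal sketch of the K-step estimator (3.9)-(3.10), with numerically approximated
# derivatives of H_n (an illustrative shortcut, not the analytic expressions).
import numpy as np
from scipy.optimize import approx_fprime


def num_hessian(f, x, eps=1e-5):
    """Central-difference Hessian of a scalar function f at x (illustrative only)."""
    p = x.size
    H = np.empty((p, p))
    for i in range(p):
        e = np.zeros(p)
        e[i] = eps
        H[:, i] = (approx_fprime(x + e, f, 1e-6) - approx_fprime(x - e, f, 1e-6)) / (2 * eps)
    return H


def k_step_estimator(H_n, theta0_hat, r):
    """Return the K-step estimator of (3.9), with K = min{k : 2^{k-1} r > 1/2} as in (3.10)."""
    K = int(np.floor(np.log2(1.0 / r))) + 1
    theta = np.asarray(theta0_hat, dtype=float)
    d_a = theta.size - 2                        # dimension of a = (lambda, mu)
    for _ in range(K):
        grad = approx_fprime(theta, H_n, 1e-6)
        hess = num_hessian(H_n, theta)
        # block-diagonal Hessian as in (3.9): the (a, b)-cross blocks are discarded
        step_a = np.linalg.solve(-hess[:d_a, :d_a], grad[:d_a])
        step_b = np.linalg.solve(-hess[d_a:, d_a:], grad[d_a:])
        theta = theta + np.concatenate([step_a, step_b])
    return theta
```

For instance, with a discrete sample (X, Y) and terminal time T as in the previous sketch, one would call k_step_estimator(lambda th: approx_loglik(th, Y, X, T), theta0_hat, r).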

We deduce the asymptotic equivalence of \(\hat{\theta }_{n}\) and \(\hat{\theta }_{K,n}\):

$$\begin{aligned} \varphi _{n}(\theta _0)^{-1}(\hat{\theta }_{K,n} -\hat{\theta }_{n}) = o_{p}(1), \end{aligned}$$
(3.11)

starting from the initial estimator \(\hat{\theta }_{0,n}\); (3.11) concludes (3.1) (hence, (3.2) as well) with \(\hat{\theta }_{n}^*= \hat{\theta }_{K,n}\).

We assume that \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n}-\theta _0)=O_p(1)\) (hence also \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n}-\hat{\theta }_{n})=O_p(1)\)). Then, to establish (3.11), we first look at the amount of improvement through (3.9) with \(k=1\). Write \(\varphi _n = \varphi _{n}(\theta _0)\) and \(\tilde{\varphi }_n = \tilde{\varphi }_{n}(\theta _0)\), and introduce

$$\begin{aligned} \hat{\mathcal {I}}_{0,n}&:= -\varphi _{n}^{\top } \textrm{diag}\left( \partial _a^2\mathbb {H}_n(\hat{\theta }_{0,n}),\, \partial _b^2\mathbb {H}_n(\hat{\theta }_{0,n}) \right) \varphi _{n} \nonumber \\&= \textrm{diag}\left( - r_n^{-2} \partial _a^2\mathbb {H}_n(\hat{\theta }_{0,n}),\, -\frac{1}{n} \tilde{\varphi }_n^{\top } \partial _b^2\mathbb {H}_n(\hat{\theta }_{0,n}) \tilde{\varphi }_n \right) =: \big ( \hat{\mathcal {I}}_{0,a,n},\, \hat{\mathcal {I}}_{0,b,n} \big ). \nonumber \end{aligned}$$

We apply Taylor’s expansion around \(\hat{\theta }_{n}\) to (3.9) with \(k=1\): for some random point \(\hat{\theta }'_{0,n}\) on the segment joining \(\hat{\theta }_{0,n}\) and \(\hat{\theta }_{n}\),

$$\begin{aligned} \hat{\theta }_{1,n} -\hat{\theta }_{n}&= \hat{\theta }_{0,n} -\hat{\theta }_{n}+ \varphi _n \hat{\mathcal {I}}_{0,n}^{-1} \varphi _n^\top \, \partial _\theta \mathbb {H}_n(\hat{\theta }_{0,n}) \nonumber \\&= \varphi _n \hat{\mathcal {I}}_{0,n}^{-1} \Big \{ \varphi _n^\top \partial _\theta \mathbb {H}_n(\hat{\theta }_{n}) \nonumber \\&{}\qquad + \varphi _n^\top \left( \textrm{diag}\left( -\partial _a^2\mathbb {H}_n(\hat{\theta }_{0,n}),\, -\partial _b^2\mathbb {H}_n(\hat{\theta }_{0,n}) \right) - \big ( - \partial _{\theta }^{2}\mathbb {H}_n(\hat{\theta }'_{0,n}) \big ) \right) [\hat{\theta }_{0,n}-\hat{\theta }_{n}] \Big \} \nonumber \\&=: \varphi _n \hat{\mathcal {I}}_{0,n}^{-1}\big ( R'_{0,n} + R''_{0,n} \big ). \end{aligned}$$
(3.12)

In what follows, we will derive the rate of convergence of \(\hat{\theta }_{1,n} -\hat{\theta }_{n}\) in several steps. Here again, we may and do work under the localization (see Sect. 2.2).

Step 1. First, we show that \(\hat{\mathcal {I}}_{0,n}^{-1}=O_p(1)\). We have

$$\begin{aligned} \hat{\mathcal {I}}_{0,a,n}&= - r_n^{-2} \partial _a^2\mathbb {H}_n(\theta _0) - r_n^{-2} \left( \partial _a^2\mathbb {H}_n(\hat{\theta }_{0,n}) - \partial _a^2\mathbb {H}_n(\theta _0)\right) , \end{aligned}$$
(3.13)
$$\begin{aligned} \hat{\mathcal {I}}_{0,b,n}&= - \frac{1}{n} \tilde{\varphi }_n^{\top }\partial _b^2\mathbb {H}_n(\theta _0) \tilde{\varphi }_n - \frac{1}{n} \tilde{\varphi }_n^{\top } \left( \partial _b^2\mathbb {H}_n(\hat{\theta }_{0,n}) - \partial _b^2\mathbb {H}_n(\theta _0) \right) \tilde{\varphi }_n. \end{aligned}$$
(3.14)

The first terms on the right-hand sides above tend to \(\mathcal {I}_{\lambda ,\mu }(\theta _0)\) and \(\mathcal {I}_{\beta ,\sigma }(\theta _0)\) in probability, respectively. The second terms equal \(o_p(1)\), by similar considerations to the verification of (2.13) in the proof of Theorem 2.1. Hence, \(\hat{\mathcal {I}}_{0,n} \xrightarrow {p}\mathcal {I}(\theta _0)\) and in particular \(\hat{\mathcal {I}}_{0,n}^{-1}=O_p(1)\) since \(\mathcal {I}(\theta _0)\) is a.s. positive definite.

Step 2. Next, we show that \(R'_{0,n} =\varphi _n^\top \partial _\theta \mathbb {H}_n(\hat{\theta }_{n})\) is \(o_p(1)\). Observe that

$$\begin{aligned} \varphi _n^\top \partial _\theta \mathbb {H}_n(\hat{\theta }_{n})&= \varphi _n^\top \partial _\theta \ell _n(\hat{\theta }_{n}) + \varphi _n^\top \left( \partial _\theta \mathbb {H}_n(\hat{\theta }_{n}) - \partial _\theta \ell _n(\hat{\theta }_{n})\right) . \nonumber \end{aligned}$$

For the first term, we have \(\varphi _n^\top \partial _\theta \ell _n(\hat{\theta }_{n}) = o_p(1),\) since \(P[|s_n \partial _\theta \ell _n(\hat{\theta }_{n})|>\epsilon ] \le P[|\partial _\theta \ell _n(\hat{\theta }_{n})|\ne 0]\rightarrow 0\) for every \(\epsilon >0\) and \(s_n\uparrow \infty \). To manage the second term, we need to estimate the gap between \(\mathbb {H}_n(\theta )\) and \(\ell _n(\theta )\) by taking the different convergence rates of their components into account. By the definitions (2.4) and (3.4),

$$\begin{aligned} \mathbb {H}_n(\theta ) - \ell _n(\theta )&= \sum _{j=1}^{n}\left( \log \phi _\beta (\epsilon '_j(\theta )) - \log \phi _\beta (\epsilon _j(\theta )) \right) \nonumber \\&= \sum _{j=1}^{n}\left( \int _0^1 g_\beta \left( \epsilon _j(\theta ) + s (\epsilon '_j(\theta )-\epsilon _j(\theta ))\right) ds\right) \left( \epsilon '_j(\theta )-\epsilon _j(\theta )\right) . \end{aligned}$$
(3.15)

From the expressions (2.3) and (3.5) and since \(\kappa \le 1\), a series of straightforward computations shows that the partial derivatives of

$$\begin{aligned} d_{\epsilon ,j}(\theta )&:=\epsilon '_j(\theta )-\epsilon _j(\theta ) \nonumber \\&= \frac{1}{\sigma h^{1/\beta }\eta (\lambda \beta h)^{1/\beta }} \left( \mu \cdot \int _j(e^{-\lambda (t_j -s)} -1)X_s ds + \mu \cdot \int _j (X_s - X_{t_{j-1}})ds\right) \nonumber \end{aligned}$$

satisfy the following bounds: \(|\partial _\mu d_{\epsilon ,j}(\theta )| \lesssim h^{1+\kappa -1/\beta }\), \(|\partial _\lambda d_{\epsilon ,j}(\theta )| \lesssim h^{2-1/\beta }\), \(|\partial _\beta d_{\epsilon ,j}(\theta )| \lesssim h^{1+\kappa -1/\beta } l'\), and \(|\partial _\sigma d_{\epsilon ,j}(\theta )| \lesssim h^{1+\kappa -1/\beta }\). Obviously, \(h^{1/\tilde{\beta }_n - 1/\beta _0} = 1 + o_p(1)\) for any \(\tilde{\beta }_n\) such that \(n^{v}(\tilde{\beta }_n-\beta _0)=O_p(1)\) for some \(v>0\); below, we will repeatedly make use of this fact without mention. Further, under (3.6), it holds that

$$\begin{aligned} \exists \delta _1>0,\quad \sqrt{n} \,h^{1+\kappa -1/\beta _0} = O(n^{-1/2-\kappa +1/\beta _0}) =O(n^{-\delta _1}). \end{aligned}$$
(3.16)
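For instance, the first of the above derivative bounds follows from (3.3) and \(|e^{-\lambda (t_j-s)}-1|\lesssim h\) on \([t_{j-1},t_j]\):

$$\begin{aligned} |\partial _\mu d_{\epsilon ,j}(\theta )| = \frac{1}{\sigma h^{1/\beta }\eta (\lambda \beta h)^{1/\beta }} \left| \int _j (e^{-\lambda (t_j -s)}-1)X_s ds + \int _j (X_s - X_{t_{j-1}})ds \right| \lesssim h^{-1/\beta }\big ( h^{2} + h^{1+\kappa }\big ) \lesssim h^{1+\kappa -1/\beta }, \end{aligned}$$

the last step using \(\kappa \le 1\).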

By piecing together these observations, the basic property (2.1), and the expression (3.15), under (3.3) we can obtain

$$\begin{aligned}&\left| \varphi _n^\top \left( \partial _\theta \mathbb {H}_n(\hat{\theta }_{n}) - \partial _\theta \ell _n(\hat{\theta }_{n})\right) \right| \nonumber \\&\lesssim \left| r_n^{-1} \partial _\mu \mathbb {H}_n(\hat{\theta }_{n}) \right| + \left| r_n^{-1} \partial _\lambda \mathbb {H}_n(\hat{\theta }_{n}) \right| + \left| \frac{l'}{\sqrt{n}} \partial _\beta \mathbb {H}_n(\hat{\theta }_{n}) \right| + \left| \frac{l'}{\sqrt{n}} \partial _\sigma \mathbb {H}_n(\hat{\theta }_{n}) \right| \nonumber \\&\lesssim O_p\left( \sqrt{n}\, h^{1+\kappa - 1/\beta _0} \vee \sqrt{n}\,h^\kappa \right) + O_p\left( \sqrt{n}\, h^{1+\kappa - 1/\beta _0} \vee \sqrt{n}\,h^\kappa \right) \nonumber \\&{}\qquad + O_p\left( \sqrt{n}\, h^{1+\kappa - 1/\beta _0} (l')^C\right) + O_p\left( \sqrt{n}\, h^{1+\kappa - 1/\beta _0} (l')^C\right) \nonumber \\&\lesssim O_p\big (n^{-\delta _1} \vee n^{1/2-\kappa }\big ) \xrightarrow {p}0. \nonumber \end{aligned}$$

This concludes that \(R'_{0,n}=o_p(1)\).

Step 3. Let \(R''_{0,n}=:(R''_{0,a,n},R''_{0,b,n}) \in \mathbb {R}^{q+1}\times \mathbb {R}^2\). The goal of this step is to show \(R''_{0,a,n} = o_p(1)\) and \(R''_{0,b,n} = O_p(n^{1/2-r} (l')^C)\); at this stage, the latter component may not be stochastically bounded if \(r\le 1/2\) (recall (3.8)). We have \(R''_{0,n} = A_{0,n} H_{0,n}\), where

$$\begin{aligned} A_{0,n}&:= \varphi _n^\top \begin{pmatrix} \partial _{a}^{2}\mathbb {H}_n(\hat{\theta }'_{0,n}) - \partial _{a}^{2}\mathbb {H}_n(\hat{\theta }_{0,n}) &{} \text {sym.} \\ \partial _{a}\partial _{b}\mathbb {H}_n(\hat{\theta }'_{0,n}) &{} \partial _{b}^{2}\mathbb {H}_n(\hat{\theta }'_{0,n}) - \partial _{b}^{2}\mathbb {H}_n(\hat{\theta }_{0,n}) \end{pmatrix}\varphi _n , \nonumber \\ H_{0,n}&:= \varphi _{n}^{-1}(\hat{\theta }_{0,n} - \hat{\theta }_{n}). \nonumber \end{aligned}$$

Under the assumption \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n}-\theta _0)=O_p(1)\), recalling the block-diagonal forms (2.5) and (3.7), we see that

$$\begin{aligned} H_{0,n} = \varphi _{n}^{-1}\varphi _{0,n} \, \varphi _{0,n}^{-1}(\hat{\theta }_{0,n} -\theta _0) - \varphi _{n}^{-1}(\hat{\theta }_{n}-\theta _0)= \begin{pmatrix} O_p(1) \\ O_p(n^{(1-r)/2} l') \end{pmatrix}, \end{aligned}$$
(3.17)

where the components \(O_p(1)\in \mathbb {R}^{q+1}\) and \(O_p(n^{(1-r)/2} l') \in \mathbb {R}^2\); here and in what follows, we use the stochastic-order symbols for random variables of different dimensions, which will not cause any confusion.

We will show that all the components of \(A_{0,n}\) are at most \(O_p\big ( n^{-r/2} (l')^C\big )\):

$$\begin{aligned} A_{0,n} = O_p\big ( n^{-r/2} (l')^C\big ). \end{aligned}$$
(3.18)

For the diagonal blocks of \(A_{0,n}\), by the same arguments as in the proofs of (3.13) and (3.14), combined with the assumption \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n}-\theta _0)=O_p(1)\), it holds that

$$\begin{aligned}&\left| r_n^{-2} \left( \partial _a^2\mathbb {H}_n(\hat{\theta }'_{0,n}) - \partial _a^2\mathbb {H}_n(\hat{\theta }_{0,n})\right) \right| + \left| \frac{1}{n} \tilde{\varphi }_n^{\top } \left( \partial _b^2\mathbb {H}_n(\hat{\theta }'_{0,n}) - \partial _b^2\mathbb {H}_n(\hat{\theta }_{0,n}) \right) \tilde{\varphi }_n\right| \\ {}&\quad = O_p\big ( n^{-r/2} (l')^C\big ). \nonumber \end{aligned}$$

Write \(\theta =(\theta _l)_{l=1}^{p}\) and so on, and regard \(\partial _{a}\partial _{b}\mathbb {H}_n(\hat{\theta }'_{0,n})\) as an \(\mathbb {R}^{2}\times \mathbb {R}^{q+1}\)-valued (that is, \(2\times (q+1)\)) matrix. Then, we expand the off-diagonal block of \(A_{0,n}\) as follows:

$$\begin{aligned}&\frac{1}{r_n \sqrt{n}}\tilde{\varphi }_n^\top \partial _{a}\partial _{b}\mathbb {H}_n(\hat{\theta }'_{0,n}) =: \frac{1}{r_n \sqrt{n}}\partial _{a}\partial _{b}\mathbb {H}_n(\theta _0) \\&\quad + \sum _{l=1}^{p} \left( \frac{(h^{1-1/\beta _0})^{-1}}{n}\partial _{\theta _l}\partial _{a}\partial _{b}\mathbb {H}_n(\hat{\theta }''_{0,n})\right) (\hat{\theta }'_{0,n,l} - \theta _{0,l}). \nonumber \end{aligned}$$

As in the previous diagonal case, the second term on the right-hand side equals \(O_p(n^{-r/2} (l')^C)\). As for the first term, we write

$$\begin{aligned} \frac{1}{r_n \sqrt{n}}\partial _{a}\partial _{b}\mathbb {H}_n(\theta _0) = \frac{1}{r_n \sqrt{n}}\partial _{a}\partial _{b}\ell _n(\theta _0) + \frac{1}{r_n \sqrt{n}}\partial _{a}\partial _{b}\left( \mathbb {H}_n(\theta _0) - \ell _n(\theta _0) \right) . \nonumber \end{aligned}$$

We have seen the explicit expressions of the components of \(\partial _\theta ^2\ell _n(\theta )\) in Sect. 2.2. Based on them, it can be seen that all the components of \(r_n^{-1}n^{-1/2}\partial _{a}\partial _{b}\ell _n(\theta _0)\) take the form:

$$\begin{aligned} \frac{1}{n} \sum _{j=1}^{n}\pi _{j-1}(\theta _0)\psi (\epsilon _j(\theta _0)) + O(h^2) \nonumber \end{aligned}$$

for some \(\mathcal {F}_{t_{j-1}}\)-measurable random variable \(\pi _{j-1}(\theta _0)\) such that \(|\pi _{j-1}(\theta _0)|\lesssim (1+|Y_{t_{j-1}}|)(l')^C\) and for some odd function \(\psi \) (hence, \(E[\psi (\epsilon _j(\theta _0))]=0\)); the last term “\(O(h^2)\)” only appears in \(\partial _\lambda \partial _\beta \ell _n(\theta )\). Burkholder’s inequality for the martingale difference arrays gives \(n^{-1}\sum _{j=1}^{n}\pi _{j-1}(\theta _0)\psi (\epsilon _j(\theta _0)) = O_p(n^{-1/2}(l')^{C})\). We conclude that \(r_n^{-1}n^{-1/2}\partial _{a}\partial _{b}\ell _n(\theta _0) = O_p(n^{-1/2}(l')^{C})\). Next, we write \(\mathbb {H}_n(\theta ) - \ell _n(\theta ) = \sum _{j=1}^{n}B_j(\theta )d_{\epsilon ,j}(\theta )\) for the expression (3.15). The following estimates hold: \(|d_{\epsilon ,j}(\theta )| \lesssim h^{1+\kappa -1/\beta }\), \(|\partial _a \partial _b d_{\epsilon ,j}(\theta )| \lesssim h^{1+\kappa -1/\beta }(1+l')\), \(|B_j(\theta )| \lesssim 1\), \(|\partial _a B_j(\theta )| \lesssim (1+|Y_{t_{j-1}}|)h^{1-1/\beta }\), \(|\partial _b B_j(\theta )| \lesssim 1+l'\), and \(|\partial _a\partial _b B_j(\theta )| \lesssim (1+l')(1+|Y_{t_{j-1}}|)h^{1-1/\beta }\). Therefore, by (3.16),

$$\begin{aligned} \left| \frac{1}{r_n \sqrt{n}}\partial _{a}\partial _{b}\left( \mathbb {H}_n(\theta _0) - \ell _n(\theta _0) \right) \right|&= \left| \frac{1}{r_n \sqrt{n}}\sum _{j=1}^{n}\left. \partial _{a}\partial _{b} \left( B_j(\theta )d_{\epsilon ,j}(\theta ) \right) \right| _{\theta =\theta _0} \right| \nonumber \\&\lesssim (1+l') h^{1+\kappa -1/\beta _0} \,\frac{1}{n} \sum _{j=1}^{n}(1+|Y_{t_{j-1}}|) \nonumber \\&= O_p\left( \frac{(l')^C}{\sqrt{n}}\right) \sqrt{n}\,h^{1+\kappa -1/\beta _0} =o_p\left( \frac{(l')^C}{\sqrt{n}}\right) . \nonumber \end{aligned}$$

Since \(r\le 1\), we have concluded (3.18).

The desired stochastic orders now follow from (3.17) and (3.18):

$$\begin{aligned} R''_{0,n} = A_{0,n} H_{0,n} = O_p\big ( n^{-r/2} (l')^C\big ) \begin{pmatrix} O_p(1) \\ O_p(n^{(1-r)/2} l') \end{pmatrix} = \begin{pmatrix} o_p(1) \\ O_p\left( n^{1/2-r} (l')^C\right) \end{pmatrix}. \end{aligned}$$
(3.19)

Step 4. We are now able to derive the convergence rate of \(\hat{\theta }_{1,n}-\hat{\theta }_{n}\). Recall the definition (3.10) of \(K\in \mathbb {N}\) and the initial rate of convergence (3.7).

  • First, we consider \(r>1/2\). Then, \(R''_{0,n}=o_p(1)\) from (3.19), so that we can take \(\varphi _{1,n}=\varphi _n\): by Steps 1 to 3 and (3.12), \(\varphi _n^{-1}(\hat{\theta }_{1,n}-\hat{\theta }_{n})=o_p(1)\). This means that a single iteration is enough if we can take \(r>1/2\) from the beginning.

  • Turning to \(r\in (0,1/2]\), we pick a constant \(\epsilon '\in (0,r/2)\) (hence \(r-\epsilon '>r/2\)), which is to be taken sufficiently small later. Define

    $$\begin{aligned} \varphi _{1,n}=\varphi _{1,n}(\epsilon ') := \textrm{diag}\left( r_{n}^{-1}I_{q+1},\, n^{-(r-\epsilon ')} \begin{pmatrix} 1 &{} 0 \\ 0 &{} l' \end{pmatrix} \right) . \nonumber \end{aligned}$$

    Again by Steps 1–3 and (3.12), \(\varphi _{1,n}^{-1}\varphi _n \hat{\mathcal {I}}_{0,n}^{-1}=\textrm{diag}(O_p(1),O_p\big ((l')^C n^{r-\epsilon '-1/2}\big ))\) and

    $$\begin{aligned} \varphi _{1,n}^{-1}(\hat{\theta }_{1,n}-\hat{\theta }_{n})&= \begin{pmatrix} O_p(1) &{} O \\ O &{} O_p\big ((l')^C n^{r-\epsilon '-1/2}\big ) \end{pmatrix} \left\{ o_p(1) + \begin{pmatrix} o_p(1) \\ O_p\left( n^{1/2-r} (l')^C\right) \end{pmatrix} \right\} \nonumber \\&= o_p(1) + \begin{pmatrix} o_p(1) \\ O_p\big (n^{-\epsilon '} (l')^C\big ) \end{pmatrix} =o_p(1). \nonumber \end{aligned}$$

    It follows that the rate of convergence for estimating \((\beta ,\sigma )\) improves from \(\textrm{diag}(n^{r/2}, n^{r/2}/ l')\) for \(\hat{\theta }_{0,n}\) to \(\textrm{diag}(n^{r-\epsilon '}, n^{r-\epsilon '}/ l')\) for \(\hat{\theta }_{1,n}\); this can be seen as a matrix-norming counterpart of the (near-)doubling phenomenon in one-step estimation (see, e.g., Zacks 1971, Section 5.5). To improve the rate further, we apply (3.9) to obtain \(\hat{\theta }_{2,n}\) from \(\hat{\theta }_{1,n}\), so that the rate of convergence for estimating \((\beta ,\sigma )\) improves from \(\textrm{diag}(n^{r-\epsilon '}, n^{r-\epsilon '}/ l')\) to \(\textrm{diag}(n^{2r-3\epsilon '}, n^{2r-3\epsilon '}/ l')\); here again, we can take the constant \(\epsilon '>0\) sufficiently small. This procedure is iterated \(K-1\) times, resulting in the rate \(\textrm{diag}(n^{2^{K-2}r-\epsilon '_0}, n^{2^{K-2}r-\epsilon '_0}/ l')\) with \(\epsilon '_0\) small enough to ensure that \(2(2^{K-2}r-\epsilon '_0)>1/2\). Then, the last (Kth-step) application of (3.9) is the same as in the case of \(r>1/2\) mentioned above; a schematic sketch of this iteration is given right after this step.

These observations conclude (3.11).
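
To fix ideas, the following minimal Python sketch mimics the block-diagonal K-step scheme (3.9) just described. It is only a schematic illustration under stated assumptions: the callables grad_H and hess_H_blocks, which should return \(\partial _\theta \mathbb {H}_n(\theta )\) and the two diagonal Hessian blocks \((\partial _a^2\mathbb {H}_n(\theta ),\partial _b^2\mathbb {H}_n(\theta ))\), are hypothetical placeholders to be supplied by the user, and no claim is made that this is the authors' implementation.

```python
import numpy as np

def k_step_estimator(theta_init, grad_H, hess_H_blocks, K, dim_a):
    """Schematic K-step Newton-Raphson update with block-diagonal Hessian.

    theta_init    : initial estimator (lambda, mu, beta, sigma), shape (p,)
    grad_H        : callable theta -> gradient of the quasi-log-likelihood H_n
    hess_H_blocks : callable theta -> (H_aa, H_bb), the two diagonal Hessian
                    blocks for a = (lambda, mu) and b = (beta, sigma)
    K             : number of iterations, chosen as in (3.10)
    dim_a         : q + 1, the dimension of the block a = (lambda, mu)
    """
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(K):
        g = grad_H(theta)
        H_aa, H_bb = hess_H_blocks(theta)
        step = np.concatenate([
            np.linalg.solve(-H_aa, g[:dim_a]),   # update for (lambda, mu)
            np.linalg.solve(-H_bb, g[dim_a:]),   # update for (beta, sigma)
        ])
        theta = theta + step
    return theta
```

Only the two diagonal Hessian blocks enter each update, in line with the block-diagonal form used in (3.9) (see also Remark 3.3).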

Thus, we have arrived at the following claim.

Theorem 3.1

Suppose that \(\hat{\theta }_{0,n}\) satisfies \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n} - \theta _0) = O_p(1)\) with (3.7) and (3.8), and define K as in (3.10). Then, the K-step estimator \(\hat{\theta }_{K,n}\) defined through (3.9) satisfies (3.11), and hence is asymptotically efficient (by Theorem 2.1):

$$\begin{aligned} \varphi _n(\theta _0)^{-1}(\hat{\theta }_{K,n} - \theta _0) = \mathcal {I}_{n}(\theta _0)^{-1}\Delta _n(\theta _0) + o_{p}(1) \xrightarrow {\mathcal {L}}MN_{p,\theta _0}\left( 0,\, \mathcal {I}(\theta _0)^{-1} \right) . \end{aligned}$$
(3.20)

Because \(\varphi _{0,n}\) is diagonal, Theorem 3.1 allows us to construct the initial estimator \(\hat{\theta }_{0,n}=(\hat{\lambda }_{0,n},\hat{\mu }_{0,n},\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) componentwise.

Having (3.20) in hand, we can construct consistent estimators \(\hat{\mathcal {I}}_{\lambda ,\mu ,n}\xrightarrow {p}\mathcal {I}_{\lambda ,\mu }(\theta _0)\) and \(\hat{\mathcal {I}}_{\beta ,\sigma ,n}\xrightarrow {p}\mathcal {I}_{\beta ,\sigma }(\theta _0)\), and then prove the Studentization:

$$\begin{aligned} \left( \hat{\mathcal {I}}_{\lambda ,\mu ,n}^{1/2}\sqrt{n} \,h^{1-1/\hat{\beta }_{K,n}} \left( {\begin{array}{c}\hat{\lambda }_{K,n}-\lambda _0\\ \hat{\mu }_{K,n}-\mu _0\end{array}}\right) , ~\hat{\mathcal {I}}_{\beta ,\sigma ,n}^{1/2}\sqrt{n}\,\tilde{\varphi }_n(\hat{\theta }_{K,n})^{-1}\left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right) \right) \xrightarrow {\mathcal {L}}N_{p}(0,I_{p}). \end{aligned}$$
(3.21)

Indeed, this follows by noting the following facts.

  • For construction of \(\hat{\mathcal {I}}_{\lambda ,\mu ,n}\) and \(\hat{\mathcal {I}}_{\beta ,\sigma ,n}\):

    • In the expressions (2.8) and (2.9), we can replace the (Riemann) dt-integrals by the corresponding sample quantities:

      $$\begin{aligned} \frac{1}{n} \sum _{j=1}^{n}\big ( Y_{t_{j-1}}^2, Y_{t_{j-1}}X_{t_{j-1}}\big ) \xrightarrow {p}\frac{1}{T} \int _0^T \big ( Y_{t}^2, Y_{t}X_{t}\big ) dt. \nonumber \end{aligned}$$
    • The elements of the form \(E_{\theta _0}[H(\epsilon ;\beta _0)] = \int H(\epsilon ;\beta _0)\phi _{\beta _0}(\epsilon )d\epsilon \) with \(H(\epsilon ;\beta )\) smooth in \(\beta \) can be evaluated through numerical integration involving the density \(\phi _\beta (\epsilon )\) and its partial derivatives with respect to \((\beta ,\epsilon )\), plugging in the estimate \(\hat{\beta }_{K,n}\) for \(\beta \) (the initial estimator \(\hat{\beta }_{0,n}\) is enough).

    • Again, note that \(n^{v}(\hat{\beta }_{K,n} - \beta _0,\, \hat{\sigma }_{K,n} - \sigma _0) = o_p(1)\) for any sufficiently small \(v \in (0,1/2)\), so that \(h^{1-1/\hat{\beta }_{K,n}} / h^{1-1/\beta _{0}} = (1/h)^{1/\hat{\beta }_{K,n}-1/\beta _{0}} \xrightarrow {p}1\). The values \(\overline{\varphi }_{lm}(\theta _0)\) contained in \(\mathcal {I}_{\beta ,\sigma }(\theta _0)\) are estimated by plugging in \(\hat{\theta }_{K,n}\) in (2.6):

      $$\begin{aligned} \left\{ \begin{array}{l} \hat{\beta }_{K,n}^{-2} l' \varphi _{11,n}(\hat{\theta }_{K,n}) + \hat{\sigma }_{K,n}^{-1}\varphi _{21,n}(\hat{\theta }_{K,n}) \xrightarrow {p}\overline{\varphi }_{21}(\theta _0), \nonumber \\ \hat{\beta }_{K,n}^{-2} l' \varphi _{12,n}(\hat{\theta }_{K,n}) + \hat{\sigma }_{K,n}^{-1}\varphi _{22,n}(\hat{\theta }_{K,n}) \xrightarrow {p}\overline{\varphi }_{22}(\theta _0), \nonumber \\ \varphi _{11,n}(\hat{\theta }_{K,n}) \xrightarrow {p}\overline{\varphi }_{11}(\theta _0), \nonumber \\ \varphi _{12,n}(\hat{\theta }_{K,n}) \xrightarrow {p}\overline{\varphi }_{12}(\theta _0). \nonumber \\ \end{array}\right. \nonumber \end{aligned}$$

      We can replace \((\hat{\beta }_{K,n},\hat{\sigma }_{K,n})\) by \((\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) all through the above.

  • Since \(\varphi _{n}^{-1}(\hat{\theta }_{K,n}-\theta _0)=O_p(1)\), it follows that

    $$\begin{aligned} \sqrt{n}\,\tilde{\varphi }_n(\hat{\theta }_{K,n})^{-1}\left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right)&= \sqrt{n}\, \left( \tilde{\varphi }_n(\theta _0)^{-1} + O_p\big ( (l')^C n^{-1/2}\big )\right) \left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right) \nonumber \\&= \sqrt{n}\, \tilde{\varphi }_n(\theta _0)^{-1} \left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right) + O_p\big ( (l')^C n^{-1/2}\big ) \nonumber \\&= \sqrt{n}\, \tilde{\varphi }_n(\theta _0)^{-1} \left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right) + o_p(1). \nonumber \end{aligned}$$

The property (3.21) entails

$$\begin{aligned}{} & {} \left| \hat{\mathcal {I}}_{\lambda ,\mu ,n}^{1/2}\sqrt{n} \,h^{1-1/\hat{\beta }_{K,n}} \left( {\begin{array}{c}\hat{\lambda }_{K,n}-\lambda _0\\ \hat{\mu }_{K,n}-\mu _0\end{array}}\right) \right| ^2 \\{} & {} \quad + \left| \hat{\mathcal {I}}_{\beta ,\sigma ,n}^{1/2}\sqrt{n}\,\tilde{\varphi }_n(\hat{\theta }_{K,n})^{-1}\left( {\begin{array}{c}\hat{\beta }_{K,n}-\beta _0\\ \hat{\sigma }_{K,n}-\sigma _0\end{array}}\right) \right| ^2 \xrightarrow {\mathcal {L}}\chi ^2(p)=\chi ^2(q+3), \nonumber \end{aligned}$$

which can be used for constructing an approximate confidence ellipsoid and for goodness-of-fit testing, in particular variable selection among the components of X.
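
As a small illustration of how the above \(\chi ^2(q+3)\) limit would be used for an approximate confidence ellipsoid or a goodness-of-fit test, here is a minimal Python sketch; the arrays stud_a and stud_b stand for the two Studentized vectors appearing in (3.21) and are assumed to have been computed beforehand (they are hypothetical inputs, not quantities defined in the paper).

```python
import numpy as np
from scipy.stats import chi2

def joint_chi2_statistic(stud_a, stud_b, level=0.95):
    """Chi-squared statistic built from the two Studentized blocks in (3.21).

    stud_a : Studentized vector for (lambda, mu), length q + 1 (hypothetical input)
    stud_b : Studentized vector for (beta, sigma), length 2    (hypothetical input)
    Returns the statistic, the chi2(q+3) critical value, and whether the
    hypothesized parameter lies inside the approximate confidence ellipsoid.
    """
    stud_a = np.asarray(stud_a, dtype=float)
    stud_b = np.asarray(stud_b, dtype=float)
    stat = float(stud_a @ stud_a + stud_b @ stud_b)
    df = stud_a.size + stud_b.size            # p = q + 3
    crit = float(chi2.ppf(level, df))
    return stat, crit, stat <= crit
```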

Remark 3.2

From the proof of Theorem 3.1, we see that it is possible to weaken (3.6) to \(\beta _0 > 2/3\) if the integrated-process sequence \((\int _j X_s ds)_{j=1}^n\) is observable. Moreover, it is possible to remove (3.6) if the model is the Markovian \(Y_t = Y_0 + \int _0^t (\mu -\lambda Y_s)ds + \sigma J_t\) with constant \(\mu \in \mathbb {R}\), by modifying the definition (3.5) as in the estimating function of Clément and Gloter (2020). However, we worked under (3.3) and (3.5) to deal with a possibly time-varying X.

Remark 3.3

The standard form of the one-step estimator is not (3.9), but

$$\begin{aligned} \hat{\theta }_{k,n} = \hat{\theta }_{k-1,n} + \left( -\partial _\theta ^2\mathbb {H}_n(\hat{\theta }_{k-1,n})\right) ^{-1} \partial _\theta \mathbb {H}_n(\hat{\theta }_{k-1,n}). \nonumber \end{aligned}$$

By inspecting the proof of Theorem 3.1, we found that the off-block-diagonal part \(-\partial _a \partial _b \mathbb {H}_n(\hat{\theta }_{k-1,n})\) would invalidate the claim therein. This happens because the rate of convergence for estimating the component \(b=(\beta ,\sigma )\) could be too slow. Still, because of the block-diagonality of the original form (2.7), it seems a natural and reasonable strategy to use the block-diagonal form from the outset, as in the definition (3.9).

Remark 3.4

The necessity of more than one iteration (\(K\ge 2\)) seems to be a technical one. If we could verify the tail-probability estimate \(\sup _n P[|r_n(\hat{\lambda }_{0,n}-\lambda _0,\hat{\mu }_{0,n}-\mu _0)| \ge s]\lesssim s^{-M}\) for a sufficiently large \(M>0\), then it would be possible to deduce the optimality of the one-step Newton–Raphson procedure even when the construction of \((\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) is not smooth in \((\hat{\lambda }_{0,n},\hat{\mu }_{0,n})\), as is the case for the function \(\hat{M}_n(a')\) in Section 3.2.2. However, the model under consideration is heavy-tailed, and it seems impossible to deduce such a bound since we cannot make use of the localization for that purpose.

3.2 Specific preliminary estimators

In this section, we consider a specific construction of \(\hat{\theta }_{0,n}=(\hat{\lambda }_{0,n},\hat{\mu }_{0,n},\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) satisfying \(\varphi _{0,n}^{-1}(\hat{\theta }_{0,n} - \theta _0) = O_p(1)\) with \(\varphi _{0,n}\) given by (3.7). We keep assuming that the available sample is \(\{(X_{t_j}, Y_{t_j})\}_{j=0}^{n}\) and the conditions (3.3) and (3.6) are in force. We will proceed in two steps.

  1. First, we will estimate the trend parameter \((\lambda ,\mu )\) by the least absolute deviation (LAD) estimator, which will turn out to be rate-optimal and asymptotically mixed-normally distributed; although the identification of the asymptotic distribution is not necessary here, it would be of independent interest (see Section 3.2.3).

  2. Next, by plugging in the LAD estimator, we construct a sequence of residuals for the noise term, based on which we will consider lower-order fractional moment matching.

Recall that we are working under the localization (2.10), which removes the large jumps of J.

3.2.1 LAD estimator

Let us recall the autoregressive structure (2.2), together with the approximation of the (non-random) integral:

$$\begin{aligned} Y_{t_j}&= e^{-\lambda _0 h}Y_{t_{j-1}} + \mu _0 \cdot \zeta _{j}(\lambda _0)h + \sigma _0 \int _j e^{-\lambda _0 (t_j -s)}dJ_s \nonumber \\&= Y_{t_{j-1}} - \lambda _0 h Y_{t_{j-1}} + \mu _0 \cdot X_{t_{j-1}} h + \sigma _0 \int _j e^{-\lambda _0 (t_j -s)}dJ_s + h^{1/\beta _0} \delta '_{j-1}, \end{aligned}$$
(3.22)

where

$$\begin{aligned} \delta '_{j-1}=\delta '_{j-1}(\theta _0) := h^{-1/\beta _0} \left( Y_{t_{j-1}} ( {e^{-\lambda _0 h}} - 1 + {\lambda _0} h) + {\mu _0} \cdot \left( \zeta _{j}({\lambda _0}) - X_{t_{j-1}}\right) h \right) \nonumber \end{aligned}$$

is an \(\mathcal {F}_{t_{j-1}}\)-measurable random variable such that

$$\begin{aligned} |\delta '_{j-1}| \lesssim (1+|Y_{t_{j-1}}|) h^{1+\kappa -1/\beta _0}. \end{aligned}$$
(3.23)

We define the LAD estimator \((\hat{\lambda }_{0,n},\hat{\mu }_{0,n}) \in \mathbb {R}^{q+1}\) by any element \((\hat{\lambda }_{0,n},\hat{\mu }_{0,n}) \in {\text {argmin}}_{(\lambda ,\mu )} M_n(\lambda ,\mu ),\) leaving \((\beta ,\sigma )\) unknown, where

$$\begin{aligned} M_n(\lambda ,\mu )&:= \sum _{j=1}^{n}\left| Y_{t_j}- Y_{t_{j-1}} - \left( - \lambda Y_{t_{j-1}} + \mu \cdot X_{t_{j-1}} \right) h \right| . \end{aligned}$$
(3.24)

This is a slight modification of the approximate LAD estimator previously studied in Masuda (2010) for the ergodic locally stable OU process.
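
For concreteness, here is a minimal numerical sketch of the LAD step (3.24). It simulates data from the Euler-type approximation underlying (3.22) (so it only approximates the model (1.1)) and then minimizes the \(\ell ^1\) objective via median (quantile-0.5) regression. The sketch assumes that scipy's levy_stable, with skewness parameter 0 and unit scale in its default parameterization, matches the normalization \(E[e^{iu J_1}]=e^{-|u|^{\beta }}\), and it uses statsmodels' QuantReg merely as an off-the-shelf LAD solver; the parameter values are illustrative only.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import levy_stable

rng = np.random.default_rng(1)

# --- illustrative model ingredients (q = 1) ---------------------------------
T, n = 1.0, 2000
h = T / n
beta0, sigma0, lam0, mu0 = 1.5, 1.0, 1.0, 2.0
t = np.arange(n + 1) * h
X = np.cos(2 * np.pi * t)                        # a non-random covariate path

# --- Euler-type simulation based on the approximation (3.22) ----------------
eps = levy_stable.rvs(beta0, 0.0, size=n, random_state=rng)
Y = np.empty(n + 1)
Y[0] = 0.0
for j in range(n):
    Y[j + 1] = (Y[j] + (mu0 * X[j] - lam0 * Y[j]) * h
                + sigma0 * h ** (1 / beta0) * eps[j])

# --- LAD estimation of (lambda, mu), cf. (3.24) ------------------------------
dY = np.diff(Y)                                   # increments Y_{t_j} - Y_{t_{j-1}}
Z = np.column_stack([-Y[:-1] * h, X[:-1] * h])    # regressors for (lambda, mu)
lad_fit = sm.QuantReg(dY, Z).fit(q=0.5)           # quantile 0.5 = LAD
lam_hat, mu_hat = lad_fit.params
print(lam_hat, mu_hat)
```

The residuals \(Y_{t_j}-Y_{t_{j-1}}+\hat{\lambda }_{0,n}Y_{t_{j-1}}h-\hat{\mu }_{0,n}\cdot X_{t_{j-1}}h\) obtained from such a fit are what enter \(\hat{M}_n(a')\) in Section 3.2.2 below.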

We introduce the following convex random function on \(\mathbb {R}\times \mathbb {R}^q\) (recall the notation (2.15)):

$$\begin{aligned} \Lambda _{n}(u,v) := \frac{1}{\sigma _0 \eta (\lambda _0\beta _0 h)^{1/\beta _0}\, h^{1/\beta _0}} \left\{ M_n\left( \lambda _0 + \frac{u}{r_n},\, \mu _0 + \frac{1}{r_n}v\right) - M_n(\lambda _0, \mu _0) \right\} . \nonumber \end{aligned}$$

The minimizer of \(\Lambda _n\) is \(\hat{w}_n:=(\hat{u}_n,\hat{v}_n),\) where \(\hat{u}_n := r_n(\hat{\lambda }_{0,n}-\lambda _0)\) and \(\hat{v}_n := r_n(\hat{\mu }_{0,n}-\mu _0)\). Further, letting \(z_{j-1}:=(-Y_{t_{j-1}}, X_{t_{j-1}})\), \(w:=(u,v)\), and

$$\begin{aligned} \epsilon '_j := \frac{1}{\eta (\lambda _0\beta _0 h)^{1/\beta _0}\, h^{1/\beta _0}} \int _j e^{-\lambda _0 (t_j -s)}dJ_s ~{\mathop {\sim }\limits ^{P_{\theta _0}}}~\text {i.i.d.}~\mathcal {L}(J_1), \nonumber \end{aligned}$$

we also introduce the quadratic random function

$$\begin{aligned} \Lambda _{n}^\sharp (w) := \Delta '_{n}[w] + \frac{1}{2}\Gamma _{0}[w,w], \nonumber \end{aligned}$$

where

$$\begin{aligned} \Delta '_{n}&:= {-} \sum _{j=1}^{n}\frac{1}{s_{0,n}\sqrt{n}} \, \textrm{sgn}\left( \epsilon '_j + \delta _{j-1}'\right) z_{j-1}, \nonumber \\ \nonumber \\ \Gamma _0&:= \frac{2\phi _{\beta _0}(0)}{\sigma _0^2}\frac{1}{T}\int _0^T \begin{pmatrix} Y_{t}^2 &{} -Y_{t}X_{t}^\top \\ -Y_{t}X_{t} &{} X_{t}^{\otimes 2} \end{pmatrix}dt, \nonumber \end{aligned}$$

where \(s_{0,n}:=\sigma _0 \eta (\lambda _0\beta _0 h)^{1/\beta _0} = (1+o(1)) \sigma _0\). The a.s. positive definiteness of \(\Gamma _{0}\) (see Sect. 2) implies that \({\text {argmin}} \Lambda _{n}^{\sharp }\) a.s. consists of the single point \(\hat{w}_{n}^{\sharp }:=-\Gamma _{0}^{-1}\Delta '_{n}\). Then, our objective is to prove that

$$\begin{aligned} \hat{w}_{n}=\hat{w}_{n}^{\sharp }+o_{p}(1). \end{aligned}$$
(3.25)

The proof is analogous to Masuda (2010, Proof of Theorem 2.1); hence, we omit some of the technical details, referring to the corresponding parts therein.

By (3.22) and (3.24), we have

$$\begin{aligned} \Lambda _{n}(w)&= \sum _{j=1}^{n}\left( \left| \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}} - \frac{1}{s_{0,n}\sqrt{n}} w\cdot z_{j-1}\right| - \left| \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right| \right) . \nonumber \end{aligned}$$

As in Masuda (2010, Eq.(4.6)), we can write \(\Lambda _n(w) = \Delta '_n[w] + Q_n(w), \) where

$$\begin{aligned} Q_{n}(w)&:= 2\sum _{j=1}^{n}\int _{0}^{w\cdot z_{j-1} / (s_{0,n}\sqrt{n})} \left\{ I\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}} \le s\right) - I\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}} \le 0\right) \right\} ds. \nonumber \end{aligned}$$

Let us suppose that

$$\begin{aligned} \Delta '_n&= O_p(1), \end{aligned}$$
(3.26)
$$\begin{aligned} Q_n(w)&= \frac{1}{2} \Gamma _0 [w,w] + o_p(1), \qquad w\in \mathbb {R}^{1+q}. \end{aligned}$$
(3.27)

Then, we can make use of the argument of Hjort and Pollard (2011) to conclude (3.25). To see this, we note the inequality due to Hjort and Pollard (2011, Lemma 2): for any \(\epsilon >0\),

$$\begin{aligned} P\left[ |\hat{w}_{n}-\hat{w}_{n}^{\sharp }|\ge \epsilon \right] \le P\left[ \sup _{w:\, |w-\hat{w}_{n}^{\sharp }|\le \epsilon }|\delta _{n}({w})| \ge \frac{1}{2} \left( \inf _{\begin{array}{c} (w,z):\, |z|=1, \\ w=\hat{w}_{n}^{\sharp }+\epsilon z \end{array}}\Lambda _{n}^{\sharp }({w}) - \Lambda _{n}^{\sharp }(\hat{w}_n^\sharp )\right) \right] , \nonumber \end{aligned}$$

where \(\delta _n(w) := \Lambda _n(w) - \Lambda _n^\sharp (w)\). Obviously, \(\Lambda _{n}^{\sharp }(\hat{w}_n^\sharp ) = {-} (1/2)\Delta '_n \cdot \Gamma _0^{-1}\Delta '_n\). By straightforward computations, we obtain

$$\begin{aligned} \inf _{\begin{array}{c} (w,z):\, |z|=1, \\ w=\hat{w}_{n}^{\sharp }+\epsilon z \end{array}}\Lambda _{n}^{\sharp }({w}) - \Lambda _{n}^{\sharp }(\hat{w}_n^\sharp ) \ge \frac{\epsilon ^2}{2} \lambda _{\min }(\Gamma _0). \nonumber \end{aligned}$$

Also, because of the convexity, we have the uniform convergence \(\sup _{w\in A}\left| \delta _n(w)\right| \xrightarrow {p}0\) for each compact \(A\subset \mathbb {R}^{1+q}\) (see Hjort and Pollard (2011, Lemma 1)). Note that \(\hat{w}_{n}^{\sharp } = O_p(1)\) by (3.26) and the a.s. positive definiteness of \(\Gamma _0\). Given any \(\epsilon ,\epsilon '>0\), we can find sufficiently large \(K>0\) and \(N\in \mathbb {N}\) for which the following three estimates hold simultaneously:

$$\begin{aligned}&\sup _n P[|\hat{w}_{n}^{\sharp }|> K]< \epsilon '/3, \nonumber \\&\sup _{n\ge N} P\left[ \sup _{w:\, |w|\le K + \epsilon }|\delta _{n}(w)| > \epsilon ' \right]< \epsilon '/3, \nonumber \\&P\left[ \epsilon ' \ge \frac{\epsilon ^2}{4} \lambda _{\min }(\Gamma _0)\right] < \epsilon '/3. \nonumber \end{aligned}$$

Piecing together the above arguments concludes that, for any \(\epsilon ,\epsilon '>0\), there exists an \(N\in \mathbb {N}\) such that \(\sup _{n\ge N} P\left[ |\hat{w}_{n}-\hat{w}_{n}^{\sharp }|\ge \epsilon \right] < \epsilon '\). This establishes (3.25), and it follows that

$$\begin{aligned} \hat{w}_{n} = -\Gamma _{0}^{-1}\Delta '_{n} + o_p(1) = O_p(1). \end{aligned}$$
(3.28)

It remains to prove (3.26) and (3.27). Below, we will write \(P^{j-1}\) and \(E^{j-1}\) for the conditional probability and expectation given \(\mathcal {F}_{t_{j-1}}\), respectively.

The proof of (3.26) follows upon showing \(\Delta _n(1)=O_p(1)\) and \(R_{1,n}=o_p(1)\), where

$$\begin{aligned} \Delta _{n}(t)&:= \sum _{j=1}^{[nt]} \frac{1}{\sigma _{0}\sqrt{n}} \, \left\{ \textrm{sgn}\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right) - E_{0}^{j-1}\left[ \textrm{sgn}\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right) \right] \right\} z_{j-1},\nonumber \\&\quad t\in [0,1], \nonumber \\ R_{1,n}&:= \sum _{j=1}^{n}\frac{1}{\sigma _{0}\sqrt{n}} \, E_{0}^{j-1}\left[ \textrm{sgn}\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right) z_{j-1}\right] . \end{aligned}$$
(3.29)

The (matrix-valued) predictable quadratic variation process of \(\{\Delta _{n}(\cdot )\}_{t\in [0,1]}\) is given by

$$\begin{aligned} \langle \Delta _n(\cdot )\rangle _t := \sigma _0^{-2} \frac{1}{n} \sum _{j=1}^{[nt]} E_{0}^{j-1}\left[ \left\{ \textrm{sgn}\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right) - E_{0}^{j-1}\left[ \textrm{sgn}\left( \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right) \right] \right\} ^2 \right] z_{j-1}^{\otimes 2}. \nonumber \end{aligned}$$

We apply the Lenglart inequality (Jacod and Shiryaev 2003, I.3.31) to the submartingale \(|\Delta _n(t)|^2\): for any \(K,L>0\),

$$\begin{aligned} \sup _{n} P\left[ \sup _{t\in [0,1]}|\Delta _{n}(t)| \ge K \right]&\lesssim \frac{L}{K} + \sup _{n} P\left[ \frac{1}{n}\sum _{j=1}^{n}|z_{j-1}|^2 \ge C \sigma _0^2 L\right] \nonumber \\&\lesssim \frac{L}{K} + \sup _{n} P\left[ \frac{1}{n}\sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^2 \ge C L\right] . \nonumber \end{aligned}$$

We have \(n^{-1}\sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^2=O_p(1)\). To conclude that \(\Delta _n:=\Delta _n(1)=O_p(1)\), we let L and then K be sufficiently large, in this order. To see \(R_{1,n}=o_p(1)\), we proceed in exactly the same way as in Masuda (2010, pp. 544–545): by partly using (3.6) and (3.23),

$$\begin{aligned} |R_{1,n}|&= \left| \frac{1}{n} \sum _{j=1}^{n}2\sqrt{n} \, z_{j-1} \int _0^{{\delta '_{j-1}/s_{0,n}}} \phi _{\beta _0}(y)dy \right| \nonumber \\&\lesssim \frac{1}{n} \sum _{j=1}^{n}\sqrt{n} |z_{j-1}| |\delta '_{j-1}| \lesssim \frac{1}{n} \sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^2 \sqrt{n} h^{1+\kappa -1/\beta _0} \nonumber \\&=O_p\left( \sqrt{n} h^{1+\kappa -1/\beta _0}\right) = O_p\left( h^{1/2+\kappa -1/\beta _0}\right) = o_p(1). \nonumber \end{aligned}$$

Thus, we have obtained (3.26). Since \(\Delta '_{n}\) equals \(-(\Delta _{n}+R_{1,n})\) up to asymptotically negligible terms, we can now replace \(-\Delta '_n\) by \(\Delta _n\) in (3.28):

$$\begin{aligned} \hat{w}_{n} = \Gamma _{0}^{-1}\Delta _{n} + o_p(1) = O_p(1). \end{aligned}$$
(3.30)

Proof of (3.27). We decompose \(Q_{n}(w) =: \sum _{j=1}^{n}\zeta _j(w)\) as \(Q_n(w)=Q_{1,n}(w) + Q_{2,n}(w)\), where \(Q_{1,n}(w) := \sum _{j=1}^{n}E^{j-1}[\zeta _j(w)]\) and \(Q_{2,n}(w) := \sum _{j=1}^{n}(\zeta _j(w) - E^{j-1}[\zeta _j(w)])\). Then, for each \(w\in \mathbb {R}^{1+q}\), we can readily mimic the arguments of Masuda (2010, pp.545–546) (for handling the term \(\mathbb {Q}_n(u)\) therein). Sketches are given below.

  • We have

    $$\begin{aligned} Q_{1,n}(w) = \frac{1}{2} \Gamma _n [w,w] + A_n(w), \nonumber \end{aligned}$$

    where

    $$\begin{aligned} \Gamma _n := \frac{2\phi _{\beta _0}(0)}{s_{0,n}^2}\frac{1}{n}\sum _{j=1}^{n} \begin{pmatrix} Y_{t_{j-1}}^2 &{} -Y_{t_{j-1}}X_{t_{j-1}}^\top \\ -Y_{t_{j-1}}X_{t_{j-1}} &{} X_{t_{j-1}}^{\otimes 2} \end{pmatrix} = \Gamma _0 + o_p(1), \nonumber \end{aligned}$$

    and where

    $$\begin{aligned} |A_n(w)|&\lesssim \left| \frac{1}{n}\sum _{j=1}^{n}(w\cdot z_{j-1})^{2}\left\{ \phi _{\beta _0}\left( -{\frac{\delta _{j-1}'}{s_{0,n}}}\right) - \phi _{\beta _0}(0)\right\} \right| \nonumber \\&{}\qquad + \left| \sum _{j=1}^{n}\int _{0}^{w\cdot z_{j-1}/(s_{0,n}\sqrt{n})}s^{2}\int _{0}^{1}(1-y) \partial \phi _{\beta _0}\left( sy-{\frac{\delta _{j-1}'}{s_{0,n}}}\right) dyds \right| \nonumber \\&\lesssim \frac{1}{n} \sum _{j=1}^{n}(1+|Y_{t_{j-1}}|)^4 (1+|w|)^4 \left( h^{2(1+\kappa -1/\beta _0)} \vee \frac{1}{n} \vee \frac{h^{1+\kappa -1/\beta _0}}{\sqrt{n}} \right) =o_p(1). \nonumber \end{aligned}$$
  • We have \(Q_{2,n}(w)=o_p(1)\): by the Burkholder–Davis–Gundy inequality,

    $$\begin{aligned}&E\left[ \left( \sum _{j=1}^{n}(\zeta _j(w) - E^{j-1}[\zeta _j(w)]) \right) ^2 \right] \nonumber \\&\lesssim \sum _{j=1}^{n}E\left[ \left( \int _{0}^{|w\cdot z_{j-1} / (s_{0,n}\sqrt{n})|} I\left( \left| \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right| \le s\right) ds \right) ^2 \right] \nonumber \\&\lesssim \sum _{j=1}^{n}\frac{|w|}{\sqrt{n}} E\left[ |z_{j-1}| \int _{0}^{|w\cdot z_{j-1} / (s_{0,n}\sqrt{n})|} P^{j-1}\left[ \left| \epsilon '_j + {\frac{\delta _{j-1}'}{s_{0,n}}}\right| \le s\right] ds \right] \nonumber \\&\lesssim \sum _{j=1}^{n}\frac{|w|}{\sqrt{n}}E\left[ |z_{j-1}| \int _{0}^{|w\cdot z_{j-1} / (s_{0,n}\sqrt{n})|} \left( s+\left| {\frac{\delta _{j-1}'}{s_{0,n}}}\right| \right) ds \right] \nonumber \\&\lesssim (1+|w|)^3 \frac{1}{n} \sum _{j=1}^{n}E\left[ (1+|Y_{t_{j-1}}|)^3\right] \left( \frac{1}{\sqrt{n}} \vee h^{1+\kappa -1/\beta _0} \right) \nonumber \\&=O\left( \frac{1}{\sqrt{n}} \vee h^{1+\kappa -1/\beta _0} \right) =o(1). \nonumber \end{aligned}$$

Summarizing the above yields (3.27).

The tightness (3.30) is sufficient for our purpose. As a matter of fact, the LAD estimator \((\hat{\lambda }_{0,n},\hat{\mu }_{0,n})\) is asymptotically mixed-normally distributed. We give the details in Sect. 3.2.3.

3.2.2 Rates of convergence of the moment matching for \((\beta ,\sigma )\)

The remaining task is to construct a specific estimator \((\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) such that

$$\begin{aligned} \left( n^{r/2}(\hat{\beta }_{0,n}-\beta _0),\,\frac{n^{r/2}}{l'}(\hat{\sigma }_{0,n}-\sigma _0) \right) = O_{p}(1). \end{aligned}$$
(3.31)

This can be achieved simply by fitting appropriate moments; for this purpose, the localization is not suitable, since the exact expressions of the moments of the original (non-localized) noise come into play. Here we consider, as in Brouste and Masuda (2018), the pair of absolute moments of orders r and 2r.

Let \(a'\in (0, \beta _0/2)\) and define

$$\begin{aligned} \hat{M}_n(a') := \frac{1}{n} \sum _{j=1}^{n}\left| Y_{t_{j}} - Y_{t_{j-1}} + \hat{\lambda }_{0,n} Y_{t_{j-1}}h - \hat{\mu }_{0,n} \cdot X_{t_{j-1}} h \right| ^{a'}. \nonumber \end{aligned}$$

Let

$$\begin{aligned} \epsilon ''_j := \frac{1}{h^{1/\beta _0}} \int _j e^{-\lambda _0 (t_j -s)}dJ_s =(1+o(1))\epsilon '_j, \nonumber \end{aligned}$$

which are approximately i.i.d. with common distribution \(\mathcal {L}(J_1)\), and also let

$$\begin{aligned} M_n(a') := \sigma _0^{a'} h^{a'/\beta _0}\frac{1}{n} \sum _{j=1}^{n}\left| \epsilon ''_j \right| ^{a'}. \nonumber \end{aligned}$$

We can apply the central limit theorem to ensure that \(\sqrt{n}\Big ( h^{-a'/\beta _0} \sigma _0^{-a'}M_n(a') - m (a';\beta _0) \Big ) = O_p(1)\) as soon as \(a'<\beta _0/2\), where

$$\begin{aligned} m(a';\beta _0) := E\big [|J_1|^{a'}\big ] = \frac{2^{a'}}{\sqrt{\pi }} \frac{\Gamma ((a'+1)/2) \Gamma (1-a'/\beta _0)}{\Gamma (1-a'/2)}. \nonumber \end{aligned}$$

Moreover, it follows from the discussions in Sect. 3.2.1 that

$$\begin{aligned} h^{-a'/\beta _0}\sigma _0^{-a'}\hat{M}_n(a')&= \frac{1}{n} \sum _{j=1}^{n}\left| \epsilon ''_j + \frac{1}{\sqrt{n}}\left( \sqrt{n}\delta '_{j-1} - \hat{w}_n \cdot z_{j-1} \right) \right| ^{a'}, \nonumber \end{aligned}$$

which in turn gives

$$\begin{aligned} n^{a'/2}\left| h^{-a'/\beta _0}\sigma _0^{-a'}\left( \hat{M}_n(a') - M_n(a')\right) \right|&\le \frac{1}{n} \sum _{j=1}^{n}\left( \sqrt{n}|\delta '_{j-1}| + |\hat{w}_n| |z_{j-1}| \right) ^{a'} \nonumber \\&\lesssim O_p\left( \sqrt{n} \, h^{1+\kappa -1/\beta _0}\right) + O_p(1) = O_p(1). \nonumber \end{aligned}$$

It follows that

$$\begin{aligned} n^{a'/2} \left( h^{-a'/\beta _0} \sigma _0^{-a'} \hat{M}_n(a') - m(a';\beta _0)\right) = O_p(1 \vee n^{(a'-1)/2}) = O_p(1). \nonumber \end{aligned}$$

Now we want to take \(a'=r,2r\), which necessitates that \(r\in (0, \beta _0/4)\) in the current argument. Then, we conclude that

$$\begin{aligned} n^{r/2} \left( h^{-r/\beta _0}\sigma _0^{-r} \hat{M}_n(r) - m(r;\beta _0), \, h^{-2r/\beta _0}\sigma _0^{-2r} \hat{M}_n(2r) - m(2r;\beta _0) \right) = O_p(1), \nonumber \end{aligned}$$

so that

$$\begin{aligned} n^{r/2} \left( \frac{\hat{M}_n(r)^2}{\hat{M}_n(2r)} - \frac{m(r;\beta _0)^2}{m(2r;\beta _0)} \right) = O_p(1). \nonumber \end{aligned}$$

There exists a bijection \(f_r\) such that \(f_r(m(r;\beta )^2/m(2r;\beta ))=\beta \); see Brouste and Masuda (2018, Section 3.2) and the references therein for the related details. Therefore, taking \(\hat{\beta }_{0,n} := f_r(\hat{M}_n(r)^2 /\hat{M}_n(2r))\) results in \(n^{r/2}(\hat{\beta }_{0,n} -\beta _0)=O_p(1)\), as was to be shown. The bisection method is sufficient for numerically finding \(\hat{\beta }_{0,n}\).

Turning to \(\hat{\sigma }_{0,n}\), we note that

$$\begin{aligned} n^{r/2} \left( \frac{h^{-r/\beta _0}\hat{M}_n(r)}{m(r;\beta _0)} - \sigma _0^r \right) = O_p(1). \end{aligned}$$
(3.32)

Let \(\hat{\sigma }_{0,n} := \left( \frac{h^{-r/\hat{\beta }_{0,n}}\hat{M}_n(r)}{m(r;\hat{\beta }_{0,n})}\right) ^{1/r}\): we claim that \(\frac{n^{r/2}}{l'}(\hat{\sigma }_{0,n}-\sigma _0) = O_{p}(1)\). Since \(\frac{m(r;\hat{\beta }_{0,n})}{m(r;\beta _0)} = O_p(1)\),

$$\begin{aligned} \left| h^{r(1/\hat{\beta }_{0,n} - 1/\beta _0)} \frac{m(r;\hat{\beta }_{0,n})}{m(r;\beta _0)} -1 \right| \le \left| h^{r(1/\hat{\beta }_{0,n} - 1/\beta _0)} - 1 \right| O_p(1) + \left| \frac{m(r;\hat{\beta }_{0,n})}{m(r;\beta _0)} -1 \right| . \nonumber \end{aligned}$$

Recall that \(n^{r/2}(\hat{\beta }_{0,n} -\beta _0)=O_p(1)\), hence the second term in the upper bound equals \(O_p(n^{-r/2})\). As for the first term, using that \((1/\hat{\beta }_{0,n} - 1/\beta _0) l' = O_p( l' /n^{r/2}) = o_p(1)\), we observe

$$\begin{aligned} h^{r(1/\hat{\beta }_{0,n} - 1/\beta _0)} - 1 = \exp \left( r(1/\hat{\beta }_{0,n} - 1/\beta _0) l' \right) - 1 = O_p\left( \frac{l'}{n^{r/2}} \right) =o_p(1). \nonumber \end{aligned}$$

These estimates combined with (3.32) conclude the claim: we have (3.31) for the \((\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) constructed above. Given an \(r\in (0,\beta _0/4)\), by Theorem 3.1, the K-step estimator with \(K> \log _2(1/r)\) is asymptotically efficient; if \(\beta _0> 1\) is assumed beforehand, then we can take \(r\in (1/4,\beta _0/4)\), for which \(K=2\) is enough.
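
The closed-form moment \(m(a';\beta )\), the inversion defining \(\hat{\beta }_{0,n}\), and the plug-in \(\hat{\sigma }_{0,n}\) translate directly into a few lines of code. The following Python sketch is illustrative only: resid stands for the residual sequence entering \(\hat{M}_n(a')\) (e.g., computed from a LAD fit as in Section 3.2.1) and is a hypothetical input, and the inversion of \(\beta \mapsto m(r;\beta )^2/m(2r;\beta )\) is carried out by a bracketing root search on a subinterval of \((2r,2)\); if the bracket does not contain a sign change, it should be refined.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gamma

def m_abs(a, beta):
    """Fractional absolute moment E|J_1|^a of the symmetric beta-stable law
    with E[exp(iu J_1)] = exp(-|u|^beta), valid for 0 < a < beta."""
    return (2.0 ** a / np.sqrt(np.pi)
            * gamma((a + 1) / 2) * gamma(1 - a / beta) / gamma(1 - a / 2))

def moment_matching(resid, h, r):
    """Preliminary (beta, sigma) estimates via |.|^r and |.|^(2r) moment fitting.

    resid : residuals Y_{t_j} - Y_{t_{j-1}} + hat_lambda*Y_{t_{j-1}}*h
            - hat_mu . X_{t_{j-1}}*h   (hypothetical input)
    h     : sampling mesh T/n
    r     : moment order, with 0 < r < beta_0/4
    """
    M_r = np.mean(np.abs(resid) ** r)
    M_2r = np.mean(np.abs(resid) ** (2 * r))
    target = M_r ** 2 / M_2r
    # invert beta |-> m(r;beta)^2 / m(2r;beta) on a bracket inside (2r, 2)
    beta_hat = brentq(lambda b: m_abs(r, b) ** 2 / m_abs(2 * r, b) - target,
                      2 * r + 1e-6, 2.0 - 1e-6)
    sigma_hat = (h ** (-r / beta_hat) * M_r / m_abs(r, beta_hat)) ** (1.0 / r)
    return beta_hat, sigma_hat
```

For instance, with the residuals from the sketch in Section 3.2.1 and, say, r = 0.2, moment_matching(resid, h, 0.2) would return a pair \((\hat{\beta }_{0,n},\hat{\sigma }_{0,n})\) that can then be fed, together with the LAD estimate, into the K-step scheme of Theorem 3.1.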

3.2.3 Asymptotic mixed normality of the LAD estimator

Recall (3.30): \(\hat{w}_{n} = \Gamma _{0}^{-1}\Delta _{n} + o_{p}(1)\). To deduce the asymptotic mixed normality, it suffices to identify the joint asymptotic distribution of \((\Delta _{n},\Gamma _{0})\), equivalently of \((\Delta _{n},\Gamma _{n})\).

First, we clarify the leading term of \(\Delta _n\) in a simpler form. We have \(E[\textrm{sgn}(\epsilon '_{j})]=0\) and \(E[\textrm{sgn}(\epsilon '_{j})^{2}]=1\). Observe that \(\Delta _n = \Delta _{0,n} + R_{1,n} + R_{2,n}\), where \(R_{1,n}\) is given in (3.29) and

$$\begin{aligned} \Delta _{0,n}&:= \sum _{j=1}^{n} \frac{1}{\sigma _{0}\sqrt{n}} \, \textrm{sgn}(\epsilon '_j) z_{j-1}, \nonumber \\ R_{2,n}&:= \sum _{j=1}^{n} \frac{1}{\sigma _{0}\sqrt{n}} \, \left( \textrm{sgn}\left( \epsilon '_j + \delta _{j-1}'\right) - \textrm{sgn}(\epsilon '_j) \right) z_{j-1}. \nonumber \end{aligned}$$

We have already seen that \(R_{1,n}=o_p(1)\). We claim that \(R_{2,n}=o_p(1)\). Write \(R_{2,n} = \sum _{j=1}^{n} \xi _{j}\). The claim follows on showing that both \(\sum _{j=1}^{n} E^{j-1}[\xi _{j}]=o_p(1)\) and \(|\sum _{j=1}^{n} E^{j-1}[\xi _{j}^{\otimes 2}]|=o_p(1)\), but the first one obviously follows from \(R_{1,n}=o_p(1)\). The second one can be shown as follows: first, we have

$$\begin{aligned} \sum _{j=1}^{n} E^{j-1}[\xi _{j}^{\otimes 2}]&= \sigma _0^{-2} \frac{1}{n} \sum _{j=1}^{n}E^{j-1}\left[ \left( \textrm{sgn}\left( \epsilon '_j + \delta _{j-1}'\right) - \textrm{sgn}(\epsilon '_j) \right) ^2\right] z_{j-1}^{\otimes 2} \nonumber \\&= 2\sigma _0^{-2} \frac{1}{n} \sum _{j=1}^{n}\left( 1 - E^{j-1}\left[ \textrm{sgn}\left( \epsilon '_j + \delta _{j-1}'\right) \textrm{sgn}(\epsilon '_j) \right] \right) z_{j-1}^{\otimes 2}. \nonumber \end{aligned}$$

Moreover,

$$\begin{aligned}&E^{j-1}\left[ \textrm{sgn}\left( \epsilon '_j + \delta _{j-1}'\right) \textrm{sgn}(\epsilon '_j) \right] \nonumber \\&= \left( \int _{0\vee (-\delta '_{j-1})}^\infty + \int _{-\infty }^{0\wedge (-\delta '_{j-1})} - \int _{0\wedge (-\delta '_{j-1})}^{0\vee (-\delta '_{j-1})}\right) \phi _{\beta _0}(y)dy = 1+ D_{j-1} \nonumber \end{aligned}$$

for some \(\mathcal {F}_{t_{j-1}}\)-measurable term \(D_{j-1}\) satisfying the estimate \(|D_{j-1}|\lesssim |\delta '_{j-1}|\lesssim (1+|Y_{t_{j-1}}|) h^{1+\kappa -1/\beta _0}\). These observations conclude that \(|\sum _{j=1}^{n} E^{j-1}[\xi _{j}^{\otimes 2}]|=o_p(1)\).

It remains to look at \(\Delta _{0,n}\). Mere (marginal) convergence in distribution of \(\Delta _{0,n}\) is not enough since the matrix \(\Gamma _0\) is random; we need the joint convergence. We will apply a weak limit theorem for stochastic integrals: we refer the reader to Jacod and Shiryaev (2003, VI.6) for a detailed account of the limit theorems as well as the standard notation used below.

We introduce the partial sum process

$$\begin{aligned} S^{n}_{t}:=\sum _{j=1}^{[nt]}\frac{1}{\sqrt{n}}\textrm{sgn}(\epsilon '_{j}),\qquad t\in [0,1]. \nonumber \end{aligned}$$

We apply Jacod (2007, Lemma 4.3) to derive \(S^{n} \xrightarrow {\mathcal {L}_{s}}w'\) in \(\mathcal {D}(\mathbb {R})\) (the Skorokhod space of \(\mathbb {R}\)-valued càdlàg functions, equipped with the Skorokhod topology), where \(w'=(w'_t)_{t\in [0,1]}\) denotes a standard Wiener process defined on an extended probability space and independent of \(\mathcal {F}\). Here, the symbol \(\xrightarrow {\mathcal {L}_{s}}\) denotes the (\(\mathcal {F}\)-)stable convergence in law, which is strictly stronger than mere weak convergence and in particular implies the joint weak convergence in \(\mathcal {D}(\mathbb {R}^{q+2})\):

$$\begin{aligned} (S^{n},H^{n})\xrightarrow {\mathcal {L}}(w',H^{\infty }) \end{aligned}$$
(3.33)

for any \(\mathbb {R}^{q+1}\)-valued \(\mathcal {F}\)-measurable càdlàg processes \(H^{n}\) and \(H^{\infty }\) such that \(H^{n} \xrightarrow {p}H^{\infty }\) in \(\mathcal {D}(\mathbb {R}^{q+1})\).

We note the following two points.

  • We have \(S^{n}\xrightarrow {\mathcal {L}}w'\) in \(\mathcal {D}(\mathbb {R})\), and for each \(n\in \mathbb {N}\) the process \((S^{n}_{t})_{t\in [0,1]}\) is an \((\mathcal {F}_{[nt]/n})\)-martingale such that \(\sup _{n,t}|\Delta S^{n}_{t}|\le 1\). These facts combined with Jacod and Shiryaev (2003, VI.6.29) imply that the sequence \((S^{n})\) is predictably uniformly tight.

  • Given any continuous function \(f:\mathbb {R}^{q+1}\rightarrow \mathbb {R}^{q'}\) (for some \(q'\in \mathbb {N}\)), we consider the process \(H^{n}=(H^{1,n},H^{2,n})\) with

    $$\begin{aligned} H^{1,n}_{t}&:= \left( -Y_{[nt]/n},\,X_{[nt]/n}\right) , \nonumber \\ H^{2,n}_{t}&:=\frac{1}{n}\sum _{j=1}^{[nt]}f(Y_{t_{j-1}},X_{t_{j-1}}). \nonumber \end{aligned}$$

    Then, we have \(H^{1,n}\xrightarrow {p}H^{1,\infty }:={(-Y,X)}\) in \(\mathcal {D}(\mathbb {R}^{q+1})\) and \(H^{2,n}\xrightarrow {p}H^{2,\infty }:=\int _{0}^{\cdot }f(Y_s,X_{s})ds\) in \(\mathcal {D}(\mathbb {R}^{q'})\), with which (3.33) concludes the joint weak convergence in \(\mathcal {D}(\mathbb {R}^{2+q+q'})\):

    $$\begin{aligned} (S^{n},H^{1,n},H^{2,n})\xrightarrow {\mathcal {L}}(w',H^{1,\infty },H^{2,\infty }). \nonumber \end{aligned}$$

With these observations, we can apply Jacod and Shiryaev (2003, VI.6.22) to derive the weak convergence of stochastic integrals:

$$\begin{aligned} ( H^{1,n}_{-}\cdot S^{n},H^{2,n} ) \xrightarrow {\mathcal {L}}( H^{1,\infty }_{-}\cdot w',H^{2,\infty } ), \nonumber \end{aligned}$$

which entails that, for any continuous function f,

$$\begin{aligned}&\left( \Delta _{0,n},\ \frac{1}{n}\sum _{j=1}^{n}f(Y_{t_{j-1}},X_{t_{j-1}})\right) \xrightarrow {\mathcal {L}}\left( \frac{\sigma _0^{-1}}{T} \int _{0}^{T}(-Y_{s},X_s) dw'_{s},\ \frac{1}{T} \int _{0}^{T} f(Y_s,X_{s})ds\right) \nonumber \\&\quad {\mathop {=}\limits ^{\mathcal {L}}} \left( \left\{ \frac{\sigma _0^{-2}}{T} \int _{0}^{T} \begin{pmatrix} Y_{t}^2 &{} -Y_{t}X_{t}^\top \\ -Y_{t}X_{t} &{} X_{t}^{\otimes 2} \end{pmatrix}dt\right\} ^{1/2} Z, \right. \\&\quad \left. ~\frac{1}{T} \int _{0}^{T} f(Y_s,X_{s})ds\right) , \nonumber \end{aligned}$$

where \(Z\sim N_{q+1}(0,I_{q+1})\) is independent of \(\mathcal {F}\). Now, by taking

$$\begin{aligned} f(y,x) = \frac{2\phi _{\beta _0}(0)}{\sigma _{0}^2} \begin{pmatrix} y^2 &{} - y x^\top \\ - y x &{} x^{\otimes 2} \end{pmatrix}, \qquad (y,x)\in \mathbb {R}\times \mathbb {R}^{q}, \nonumber \end{aligned}$$

we arrive at

$$\begin{aligned}{} & {} \left( \Delta _{0,n}, \Gamma _0\right) \xrightarrow {\mathcal {L}}\left( \left\{ \sigma _0^{-2}\frac{1}{T} \int _{0}^{T} \begin{pmatrix} Y_{t}^2 &{} -Y_{t}X_{t}^\top \\ -Y_{t}X_{t} &{} X_{t}^{\otimes 2} \end{pmatrix}dt\right\} ^{1/2} Z,\right. \\{} & {} \left. ~\frac{2\phi _{\beta _0}(0)}{\sigma _{0}^2}\frac{1}{T} \int _{0}^{T} \begin{pmatrix} Y_{t}^2 &{} -Y_{t}X_{t}^\top \\ -Y_{t}X_{t} &{} X_{t}^{\otimes 2} \end{pmatrix} dt\right) . \nonumber \end{aligned}$$

In sum, applying Slutsky’s theorem concludes that

$$\begin{aligned} \hat{w}_n = \Gamma _{0}^{-1}\Delta _{0,n} + o_{p}(1) \xrightarrow {\mathcal {L}}MN_{q+1,\theta _0}\left( 0,\, \frac{\sigma _{0}^2}{4\phi _{\beta _0}(0)^2} \left\{ \frac{1}{T} \int _{0}^{T} \begin{pmatrix} Y_{t}^2 &{} -Y_{t}X_{t}^\top \\ -Y_{t}X_{t} &{} X_{t}^{\otimes 2} \end{pmatrix}dt\right\} ^{-1}\right) . \nonumber \end{aligned}$$