
1 Introduction

For many complex diseases, it is essential to identify important risk factors that are associated with prognosis. In the omics era, profiling studies have been extensively conducted. It has been found that, beyond the main effects of genetic (G) and environmental (E) risk factors, gene-environment (G-E) interactions can also have important implications.

Denote T and C as the prognosis and censoring times, respectively. Denote \(X = (X_1, \ldots , X_q)^\top \) as the q environmental/clinical variables, and \(Z=(Z_1, \ldots , Z_p)^\top \) as the p genetic variables. The existing G-E interaction analysis methods mainly belong to two families. The first family conducts marginal analysis (Hunter, 2005; Shi et al., 2014; Thomas, 2010), under which one or a small number of genes are analyzed at a time. Despite its significant computational simplicity, marginal analysis contradicts the fact that the prognosis of complex diseases is attributable to the joint effects of multiple main effects and interactions. The second family of methods, which is biologically more sensible, conducts joint analysis (Liu et al., 2013; Wu et al., 2014; Zhu et al., 2014). Among the existing joint analyses, the regression-based approach is the most popular and proceeds as follows. Consider the model \(T \sim \phi ( \alpha _{0}+\sum _{j=1}^qX_j\alpha _{j} + \sum _{k =1}^pZ_k\beta _k + \sum _{j = 1}^q\sum _{k=1}^p X_jZ_k\gamma _{j,k})\), where \(\phi (\cdot )\) is known up to the regression coefficients \(\alpha _{0},\{\alpha _{j}\}_1^q, \{\beta _{k}\}_1^p\), and \(\{\gamma _{j,k}\}_1^{q,p}\). Conclusions on the importance of interactions are drawn based on \(\{\gamma _{j,k}\}_1^{q,p}\). With the high data dimensionality and the demand for selecting relevant effects, regularized estimation is usually needed.

In the vast majority of existing studies, estimation is based on the standard likelihood, which is nonrobust. In practice, data contamination is not uncommon and can arise for multiple reasons. Many diseases are heterogeneous, and different subtypes behave differently. When subtype information is accurately available, subtype-specific analysis can be conducted. However, when such information is unavailable or only partially available, which is often the case in practice (He et al., 2015), subjects belonging to small subtypes may be viewed as “contamination” to those of the leading subtype. Human errors can also happen. It has been well noted that survival information extracted from medical records is not always reliable (Bowman, 2015; Fall et al., 2008), creating contamination in prognosis distributions. In low-dimensional biomedical studies, it has been well established that even a single contaminated observation can lead to biased model estimation and hence false marker identification (Huber & Ronchetti, 2009). Our literature review suggests that, in the analysis of G-E interactions, robust methods that can effectively accommodate contamination in prognosis outcomes have been very rare. For marginal interaction analysis, a few robust methods, for example, multifactor dimensionality reduction (MDR), have been developed. However, they are not directly applicable to joint analysis because of both methodological and computational challenges. As discussed in Wu and Ma (2015), a handful of robustness studies have been conducted for joint analysis under high-dimensional settings. However, they mostly target main effects and are not directly applicable to interaction analysis because of the additional complexity caused by the “main effects, interactions” hierarchy. Most of them adopt the quantile regression technique. Studies under low-dimensional settings suggest that no single robust technique dominates, and it is thus desirable to examine alternative robust techniques under high-dimensional settings. In addition, for quite a few existing methods, statistical properties have not been well studied, casting doubt on their validity.

Consider data with a prognosis outcome and both G and E measurements. Our goal is to conduct joint analysis and identify important G-E interactions and main G and E effects. This study advances from the literature in multiple aspects. Specifically, we consider the scenario with possible contamination in the prognosis outcome, which is commonly encountered but little addressed. We adopt an exponential squared loss to achieve robustness. This loss function provides a useful alternative to the popular quantile regression and other robust approaches but has not been well investigated under high-dimensional settings, especially not for interaction analysis. This study also marks a novel extension of the exponential squared loss to accommodate censored survival data. For regularized estimation and selection of relevant effects, we adopt a penalization technique that respects the “main effects, interactions” hierarchy. Significantly advancing from most of the existing studies, consistency properties are rigorously established. Theoretical research for high-dimensional robust methods remains limited, and as such, this study may provide valuable insights. With both methodological and theoretical developments, this study adds value beyond the existing literature.

2 Methods

2.1 Data and Model Settings

For describing prognosis, we adopt the accelerated failure time (AFT) model, which has been the choice of multiple studies with high-dimensional genetic data (Liu et al., 2013; Shi et al., 2014). Compared to alternatives such as the Cox model, the AFT model offers intuitive interpretations and low computational cost, which are especially desirable with high-dimensional genetic data. With a slight abuse of notation, we still use T and C to denote the logarithms of the event and censoring times, and let \(\delta = I_{\{T\le C\}}\). The AFT model specifies that

$$T =\alpha _0 +\sum _{j=1}^qX_j \alpha _j+ \sum _{k=1}^pZ_k\beta _k + \sum _{j=1}^{q}\sum _{k=1}^p X_jZ_k\gamma _{j,k}+\varepsilon ,$$

where \(\varepsilon \) is the random error. Following Stute (1993, 1996), we assume that T and C are independent, and \(\delta \) is conditionally independent of \((X^\top , Z^\top )^\top \) given T. Let \(W_k= (Z_k, X_1Z_k, \ldots , X_q Z_k)^\top \) and \(b_k=(\beta _k, \gamma _{1,k}, \ldots , \gamma _{q,k})^\top \), which represent all main and interaction effects corresponding to the kth genetic variable.

With n independent subjects, use subscript “i” to denote the ith subject. For subject i, let \(y_i = \min \{T_i, C_i\}\) and \(\delta _i=I_{\{T_i\le C_i\}}\) be the observed time and event indicator, respectively. Then the ith observation consists of \((y_i, \delta _i, \textbf{x}_i, \textbf{z}_i)\), with \(\textbf{x}_i=(x_{i1},\ldots , x_{iq})^\top \), \(\textbf{z}_i=(z_{i1},\ldots , z_{ip})^\top \), and \(W_{k,i} = (z_{ik}, x_{i1}z_{ik}, \ldots , x_{iq}z_{ik})^\top \) denoting the ith realization of X, Z, and \(W_k\), respectively. Denote \(\textbf{u}_{i,}^\top =(1,\textbf{x}_i^\top ,W_{1,i}^\top , \ldots , W_{p,i}^\top )\), \(\textbf{U}= (\textbf{u}_{1,}, \cdots , \textbf{u}_{n,})^\top \), and \(\zeta =(\alpha _0,\ldots ,\alpha _q, b_1^\top ,\ldots ,b_p^\top )^\top \). Without loss of generality, assume that the \((y_i,\delta _i,\textbf{u}_{i,})\)’s have been sorted according to the \(y_i\)’s in ascending order.

2.2 Robust Estimation and Identification

Consider the scenario where the distribution of \(\varepsilon \) is not specified, which significantly differs from the existing parametric studies and makes the proposed method more flexible. To motivate the proposed estimation, first consider data without contamination. Stute (1993) developed a weighted least squares estimation approach. Under low-dimensional settings, Stute’s estimator is defined as the minimizer of the loss function

$$ \sum _{i=1}^n\omega _i(y_i-\textbf{u}_{i,}^\top {\zeta })^2. $$

Here the weights \(\mathbf {\omega }=(\omega _i)_{i=1}^n\) are computed based on the Kaplan-Meier estimation and defined as

$$\omega _{1}=\frac{\delta _{1}}{n}, \omega _{i}=\frac{\delta _{i}}{n-i+1}\prod _{j=1}^{i-1}\left( \frac{n-j}{n-j+1} \right) ^{\delta _{j}},i=2,\ldots ,n.$$

It is noted that Stute’s estimator is not necessarily the most efficient. However, under high-dimensional settings, its least squares form makes it perhaps the most computationally convenient. It can be seen that, if \(\omega _i \ne 0\), a single contaminated \(y_i\) can lead to severely biased model estimation.
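For concreteness, the weights above can be computed in a few lines. The following Python sketch is ours (the function name and interface are illustrative) and assumes the data have already been sorted by ascending observed time:

```python
import numpy as np

def stute_weights(delta):
    """Kaplan-Meier (Stute) weights for data sorted by ascending y.
    delta: 0/1 event indicators, in the same sorted order."""
    n = len(delta)
    w = np.zeros(n)
    surv = 1.0  # running product prod_{j=1}^{i-1} ((n-j)/(n-j+1))^{delta_j}
    for i in range(n):  # 0-based index; the paper's index is i + 1
        w[i] = delta[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** delta[i]
    return w
```

Note that censored observations receive zero weight, and the weights sum to at most one.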

Now consider the scenario with possible outliers in the prognosis data. We propose the objective function

$$\begin{aligned} Q_{\theta }(\zeta )= \sum _{i=1}^n\omega _i\exp (-(y_i - \textbf{u}_{i,}^\top \zeta )^2/\theta ). \end{aligned}$$
(1)

This function has been motivated by the following considerations. For low-dimensional regression analysis without censoring, Wang et al. (2013) adopted an exponential squared loss to achieve robustness. The intuition is as follows. For a contaminated subject with observed \(y_i\) deviating from \(\textbf{u}_{i,}^\top \zeta \) (the “predicted” value based on the model), \((y_i - \textbf{u}_{i,}^\top \zeta )^2\) has a large value. The exponential function down-weights such a contaminated observation. The degree of down-weighting is adjusted by \(\theta \): as \(\theta \) gets smaller, contaminated observations have smaller influence. While sharing common ground with Wang et al. (2013) and others, the present study has three main challenges/advancements. The first is the high dimensionality, which brings tremendous challenges to theoretical and computational developments. The second is the need to respect the “main effects, interactions” hierarchy (more details below). The third is censoring, which we accommodate by introducing the weights \(\omega _i\) motivated by Stute’s approach. As the weights are data-dependent, they bring challenges to the establishment of theoretical properties.
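To make the down-weighting concrete, a minimal sketch of (1) follows (ours, reusing numpy from the sketch above). For instance, with \(\theta =1\), a residual of 1 retains a factor \(\exp (-1)\approx 0.37\), while a residual of 5 retains \(\exp (-25)\approx 10^{-11}\):

```python
def exp_squared_objective(y, U, zeta, w, theta):
    """Robust objective Q_theta(zeta) in (1). A residual much larger than
    sqrt(theta) contributes essentially zero, so a contaminated y_i cannot
    dominate the fit the way it does under the squared error loss."""
    r = y - U @ zeta
    return np.sum(w * np.exp(-r**2 / theta))
```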

When \(p\gg n\), regularized estimation is needed. In addition, out of a large number of profiled G factors and G-E interactions, only a few are expected to be associated with prognosis. We adopt penalization for regularized estimation and identification, which has been the choice of a large number of genetic studies, especially recent interaction analyses (Bien et al., 2013; Liu et al., 2013; Shi et al., 2014). Specifically, consider the penalized robust objective function

$$\begin{aligned} L_{\lambda _1, \lambda _2, \theta }(\zeta )=Q_{\theta }(\zeta )-\sum _{k=1}^{p}\rho (\Vert b_k\Vert ; \lambda _1, s)-\sum _{k=1}^{p}\sum _{j=2}^{q+1}\rho (|b_{kj}|; \lambda _2, s), \end{aligned}$$
(2)

where \(\Vert \cdot \Vert \) is the \(\ell _2\) norm, \(\rho (t; \lambda , s) = \lambda \int _0^{|t|} \left( 1-\frac{x}{\lambda s} \right) _+ dx\) is the MCP (minimax concave penalty; Zhang, 2010), and \(b_{kj}\) is the jth element of \(b_{k}\). \(\lambda _1\) and \(\lambda _2\) are data-dependent tuning parameters, and s is the regularization parameter, following the terminology of Zhang (2010). The robust estimator is defined as the maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\). An interaction term (or main effect) is concluded as important if its estimate is nonzero.
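Both \(\rho \) and its derivative \(\dot{\rho }\) (used in Section 2.3) have simple closed forms; the following sketch is our own illustration:

```python
def mcp(t, lam, s):
    """MCP rho(t; lam, s): integrates lam*(1 - x/(lam*s))_+ from 0 to |t|,
    giving lam*|t| - t^2/(2s) for |t| <= lam*s and s*lam^2/2 beyond."""
    a = np.abs(t)
    return np.where(a <= lam * s, lam * a - a**2 / (2.0 * s), 0.5 * s * lam**2)

def mcp_deriv(t, lam, s):
    """Derivative sgn(t)*(lam - |t|/s)_+; it vanishes for |t| >= lam*s,
    which is what frees large coefficients from penalization bias."""
    return np.sign(t) * np.maximum(lam - np.abs(t) / s, 0.0)
```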

In recent genetic interaction analysis, it has been stressed that the “main effects, interactions” hierarchy should be respected. That is, if an interaction term is identified as important, its corresponding main effect(s) should be automatically identified. G-E interaction analysis has its uniqueness: the E variables usually have a low dimensionality and are manually chosen, and so selection is usually not conducted on them (if desirable, this can be easily achieved). Thus, for G-E interaction analysis, the hierarchy postulates that if a G-E interaction is identified as important, its corresponding main G effect is automatically identified. In the adopted sparse group penalty, the first penalty, a group MCP, determines which groups are selected. Here one group corresponds to one genetic variable and its interactions. As the group MCP does not induce within-group sparsity, the second penalty is imposed, which penalizes the interaction terms and determines which of them are nonzero. With the special design that the second penalty is only imposed on interactions, important interactions correspond to important groups, automatically making the estimates of the corresponding main G effects nonzero. As such, the combination of the two penalties guarantees the hierarchy. We note that although sparse group penalization has been studied in the literature (Liu et al., 2013), it has very rarely been coupled with robust loss functions. It is also noted that the MCP can potentially be replaced by other penalties.

2.3 Computation

In this section, we develop an efficient algorithm to compute the maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\). The basic strategy is to iteratively approximate the objective function by its quadratic minorization. Then a coordinate-wise updating procedure is used to find the maximizer of each approximated objective function. The maximizer then serves as the starting point for the next minorization. Overall, this is a coordinate-descent (CD) algorithm nested in a Minorize-Maximization (MM) algorithm.

Let \(\textbf{W}(\zeta )\) be a diagonal matrix with the ith diagonal element \(\textbf{W}_{i,i}=2\omega _i\exp (-(y_i-\textbf{u}_{i,}^\top \zeta )^2/\theta )/\theta \). Also let \(\textbf{v}(\zeta ) = (v_1, \cdots , v_n)^\top \) with \(v_i=y_i-\textbf{u}^\top _{i,}\zeta \). Define \(\textbf{U}_{,-j}\) as the sub-matrix of \(\textbf{U}\) with the jth column excluded. Define \(\textbf{u}_{,j}\) as the jth column of matrix \(\textbf{U}\), and \(u_{i,j}\) as the jth component of vector \(\textbf{u}_{i,}\). Similarly, define \(\mathbf {\zeta }_{-j}\) as the sub-vector of \(\zeta \) with the jth element excluded. For the exponential squared objective function in (1), its first- and second-order derivatives with respect to \(\zeta \) are

$$\begin{aligned} \frac{\partial {Q_\theta (\zeta )}}{\partial \zeta _j}&= 2\sum _{i=1}^n\omega _i\exp (-(y_i - \textbf{u}_{i,}^\top \zeta )^2 / \theta )u_{i,j}(y_i - \textbf{u}_{i,}^\top \zeta )/\theta =\textbf{u}_{,j}^\top \textbf{W}(\zeta )\textbf{v}(\zeta ),\\ \frac{\partial ^2{Q_\theta (\zeta )}}{\partial \zeta _{j}\partial \zeta _k}&= 2\sum _{i=1}^n\omega _i\exp (-(y_i - \textbf{u}_{i,}^\top \zeta )^2 / \theta )u_{i,j}u_{i,k}[2(y_i - \textbf{u}_{i,}^\top \zeta ) ^ 2/\theta -1]/\theta . \end{aligned}$$

Note that the factor \(2(y_i-\textbf{u}^\top _{i,}\zeta )^2/\theta -1\) is nonnegative when \((y_i-\textbf{u}^\top _{i,}\zeta )^2 /\theta \ge 0.5\) and negative otherwise, so the Hessian of \(Q_\theta (\zeta )\) is in general indefinite. Hence, in searching for the maximizer of \(Q_\theta (\zeta )\), the simple Newton-Raphson approach may diverge if the starting value is too far from the true value. To tackle this problem, a minorization of \(Q_\theta (\zeta )\) is used to approximate \(Q_\theta (\zeta )\). Since \(2(y_i - \textbf{u}_{i,}^\top \zeta )^2/\theta -1\ge -1\) for each i, we have, in the positive semidefinite sense, \(\frac{\partial ^2{Q_\theta (\zeta )}}{\partial \zeta \partial \zeta ^\top }\succeq -\frac{2}{\theta }\sum _{i=1}^n\omega _i\exp (-(y_i - \textbf{u}_{i,}^\top \zeta )^2 / \theta )\textbf{u}_{i,}\textbf{u}_{i,}^\top = -\textbf{U}^\top \textbf{W}(\zeta )\textbf{U}\). Hence a minorized approximation to \(Q_\theta (\zeta )\) at \(\zeta ^{m}\) is

$$ Q_\theta (\zeta ^m) + \textbf{v}^\top (\zeta ^m)\textbf{W}(\zeta ^m)\textbf{U}(\zeta -\zeta ^m) -\frac{1}{2}(\zeta -\zeta ^m)^\top \textbf{U}^\top \textbf{W}(\zeta ^m)\textbf{U}(\zeta -\zeta ^m). $$

Note that \(\zeta ^m=(\alpha _0^m,\ldots ,\alpha _q^m, {b_1^m}^\top ,\ldots ,{b_p^m}^\top )^\top \) with \(b_k^m=(\beta _k^m, \gamma _{1,k}^m, \ldots , \gamma _{q,k}^m)^\top \). For the penalty, we apply a local linear approximation at \(\zeta ^{m}\), which is given by

$$\begin{aligned} -\sum _{k=1}^p\dot{\rho }(\Vert b_k^m\Vert ;\lambda _1, s)\frac{|\beta _k^m|}{\Vert b_k^m\Vert }|\beta _k| - \sum _{k=1}^p\sum _{j=1}^q \left\{ \dot{\rho }(\Vert b_k^m\Vert ;\lambda _1, s)\frac{|\gamma _{j,k}^m|}{\Vert b_k^m\Vert }+ \dot{\rho }(|\gamma _{j,k}^m|;\lambda _2, s)\right\} |\gamma _{j,k}| \end{aligned}$$

where \(\dot{\rho }(t;\lambda , s) = sgn(t) \left( \lambda - \frac{|t|}{s}\right) _{+}\) and terms that do not depend on \(\zeta \) have been dropped. Replacing \(Q_\theta (\zeta )\) in (2) with its minorized approximation and plugging in the approximation of the penalty, the penalized objective function takes the form

$$\begin{aligned} L_{\lambda _1, \lambda _2,\theta }(\zeta |\zeta ^m) =\,&Q_\theta (\zeta ^m) + \textbf{v}^\top (\zeta ^m)\textbf{W}(\zeta ^m)\textbf{U} (\zeta -\zeta ^m) -\frac{1}{2}(\zeta -\zeta ^m)^\top \textbf{U}^\top \textbf{W}(\zeta ^m)\textbf{U}(\zeta -\zeta ^m)\\&-\sum _{k=1}^p\dot{\rho }(\Vert b_k^m\Vert ;\lambda _1, s)\frac{|\beta _k^m|}{\Vert b_k^m\Vert }|\beta _k| - \sum _{k=1}^p\sum _{j=1}^q \left\{ \dot{\rho }(\Vert b_k^m\Vert ;\lambda _1, s)\frac{|\gamma _{j,k}^m|}{\Vert b_k^m\Vert }+ \dot{\rho }(|\gamma _{j,k}^m|;\lambda _2, s)\right\} |\gamma _{j,k}|. \end{aligned}$$
(3)

This function has a “weighted quadratic + penalty” form and can be optimized using the coordinate-descent approach.

The algorithm starts with \(m=0\) and \(\zeta ^m = \textbf{0}\), where m is the index of the MM iteration. At iteration m, the objective function is approximated by its minorization \(L_{\lambda _1, \lambda _2, \theta }(\zeta |\zeta ^m)\) given in (3). The penalized weighted quadratic function is then maximized using the coordinate-descent algorithm. Denote \(\bar{\zeta }^{old}\) as the estimate of \(\zeta \) before updating. We update each element of the estimate and denote the new estimate as \(\bar{\zeta }^{new}\). This is repeated until the distance between \(\bar{\zeta }^{old}\) and \(\bar{\zeta }^{new}\) is smaller than a prespecified constant. Then \(\zeta ^{m+1} = \bar{\zeta }^{new}\) serves as the new expansion point for the next minorization. The overall procedure is repeated until convergence. Convergence properties of the MM and CD techniques have been well studied in the literature. With our problem, the objective function increases at each step and is bounded above, which leads to convergence. In numerical studies, we conclude convergence if the difference between the estimates from two consecutive MM steps is small enough. We observe convergence in all numerical examples after a small to moderate number of MM iterations.
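The following Python sketch (ours, not the authors' implementation) assembles the full procedure under the variable ordering of Section 2.1: intercept, q E effects, then p blocks of length \(q+1\). For clarity it forms the curvature matrix \(\textbf{U}^\top \textbf{W}\textbf{U}\) explicitly, which a large-p implementation would avoid by computing the needed inner products on the fly:

```python
def mm_cd_fit(y, U, w, theta, lam1, lam2, s, q, p,
              max_mm=50, max_cd=100, tol=1e-4):
    """Nested MM/CD maximization of (2) via the minorization (3). Sketch."""
    d = U.shape[1]
    zeta = np.zeros(d)
    for _ in range(max_mm):
        zeta_m = zeta.copy()
        r = y - U @ zeta_m
        Wdiag = 2.0 * w * np.exp(-r**2 / theta) / theta   # diagonal of W(zeta^m)
        g = U.T @ (Wdiag * r)                             # U' W(zeta^m) v(zeta^m)
        H = U.T @ (Wdiag[:, None] * U)                    # curvature U' W U
        # per-coordinate L1 weights from the local linear approximation;
        # intercept and E effects (first 1 + q entries) stay unpenalized
        c = np.zeros(d)
        for k in range(p):
            blk = np.arange(1 + q + k * (q + 1), 1 + q + (k + 1) * (q + 1))
            bnorm = np.linalg.norm(zeta_m[blk])
            if bnorm > 0:
                c[blk] = max(lam1 - bnorm / s, 0.0) * np.abs(zeta_m[blk]) / bnorm
            else:
                c[blk] = lam1  # convention for the 0/0 limit at a zero group
            # extra weight on interactions only (second penalty in (2))
            c[blk[1:]] += np.maximum(lam2 - np.abs(zeta_m[blk[1:]]) / s, 0.0)
        # coordinate descent on the "weighted quadratic + weighted L1" surrogate
        for _ in range(max_cd):
            zeta_old = zeta.copy()
            Hd = H @ (zeta - zeta_m)
            for j in range(d):
                z = g[j] - Hd[j] + H[j, j] * zeta[j]
                new = np.sign(z) * max(abs(z) - c[j], 0.0) / (H[j, j] + 1e-10)
                if new != zeta[j]:
                    Hd += H[:, j] * (new - zeta[j])  # keep Hd consistent
                    zeta[j] = new
            if np.linalg.norm(zeta - zeta_old) < tol:
                break
        if np.linalg.norm(zeta - zeta_m) < tol:
            break
    return zeta
```

Each coordinate update is the soft-thresholding solution of a one-dimensional "quadratic + weighted L1" problem, which is why a full sweep involves only simple calculations.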

The proposed method involves tuning parameters. For s in the MCP, we follow Zhang (2010) and other published studies, which suggest examining a small number of values or fixing it. In our numerical study, we fix \(s=6\), which has been adopted in published studies (Shi et al., 2014; Xu et al., 2018). We have also examined s values near 6 and observed similar performance (details omitted). In practice, for settings significantly different from ours, other s values may need to be considered. Under low-dimensional settings, Wang et al. (2013) proposed an iterative approach to select the robust tuning parameter \(\theta \). However, their approach is computationally infeasible for high-dimensional data. Under the present setting, we compute the solution for each combination of \((\lambda _1, \lambda _2, \theta )\). This way, we obtain a solution surface over a three-dimensional tuning parameter grid, which is feasible as the proposed computational algorithm only involves simple updates and incurs low cost. The tuning parameters can then be selected using a prediction-based method which proceeds as follows: (a) compute the cross-validated sum of prediction errors for each \((\lambda _1, \lambda _2,\theta )\) combination; (b) for each fixed \(\theta \), average the sum of prediction errors over \((\lambda _1, \lambda _2)\), and select the \(\theta \) with the smallest average; (c) with the selected \(\theta \), select the \((\lambda _1, \lambda _2)\) with the smallest sum of prediction errors. This procedure first groups all \((\lambda _1, \lambda _2)\) values together and selects the best \(\theta \) value; then, with the optimal \(\theta \), the optimal \((\lambda _1, \lambda _2)\) values are selected. Our numerical experiments suggest that this procedure generates more stable estimates than directly searching over the three-dimensional \((\lambda _1, \lambda _2, \theta )\) grid.
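Given a precomputed array of cross-validated prediction errors, steps (a)-(c) reduce to a few lines; in the sketch below (our layout), `cv_err[t, l]` stores the error for the t-th \(\theta \) value and the l-th \((\lambda _1, \lambda _2)\) pair:

```python
def select_tuning(cv_err, thetas, lam_pairs):
    """Two-stage selection: theta by the error averaged over all (lam1, lam2)
    pairs, then (lam1, lam2) at the chosen theta."""
    t_best = int(np.argmin(cv_err.mean(axis=1)))   # step (b)
    l_best = int(np.argmin(cv_err[t_best]))        # step (c)
    return thetas[t_best], lam_pairs[l_best]
```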

With a complex robust goodness-of-fit measure and a penalty that respects the hierarchy, the proposed method is inevitably computationally more expensive than some simpler alternatives. However, as the proposed algorithm is composed of relatively simple calculations, the overall computational cost is affordable. With fixed tunings, the analysis of one simulated dataset (described in detail below) takes about nine minutes on a regular laptop. Tuning parameter selection can be conducted in a highly parallel manner to save computing time.

2.4 Consistency Properties

In this section, we rigorously prove that the proposed method can consistently identify the important interactions (and main effects) under ultrahigh-dimensional settings. In the literature, theoretical development for robust methods under high-dimensional settings has been limited, and it is especially rare for methods other than the quantile-based ones. With the consistency properties, the proposed method can be preferred over alternatives whose statistical properties have not been well established. Our theoretical development not only provides a solid ground for the proposed method but may also shed light on other robust methods under high-dimensional settings.

For any two subsets \(S_1\) and \(S_2\) of \(\{1, \cdots , p+q+ pq+ 1\}\) and a matrix H, we denote by \(H_{S_1 S_2}\) the sub-matrix of H with rows and columns indexed by \(S_1\) and \(S_2\), respectively. Let \(\zeta ^*=(\alpha _0^*,\ldots ,\alpha _q^*, {b_1^*}^\top ,\ldots ,{b_p^*}^\top )^\top \), with \(b_k^*=(\beta _k^*, \gamma _{1,k}^*, \ldots , \gamma _{q,k}^*)^\top \), be the true value of \(\zeta \). Here we make the sparsity assumption, under which only a subset of the components of \(\zeta ^*\) is nonzero. Define the following three groups of parameters:

$$\begin{aligned} A_1&= \{\alpha _0^*, \ldots , \alpha _q^*\}, \qquad A_2 = \{\gamma _{j,k}^*: \gamma _{j,k}^*\ne 0, j=1,\ldots , q; k=1, \ldots , p\}, \\ A_3&= \{\beta _k^*: \beta _k^*\ne 0~\text{or there exists some}~1\le j\le q~\text{such that}~\gamma _{j,k}^*\ne 0, k=1, \ldots , p \}. \end{aligned}$$

Denote \(\mathscr {A}\) as the set of indices of \(A_1\cup A_2 \cup A_3\) in the vector \({\zeta }^*\). Let \(\mathscr {A}^c\) and \(|\mathscr {A}|\) denote the complement and cardinality of set \(\mathscr {A}\), respectively. We then divide \(\mathscr {A}^c\) into three sets of indices \(\mathscr {B}_1, \mathscr {B}_2\), and \(\mathscr {B}_3\), which correspond to the following three sets

$$\begin{aligned} B_1&= \{\beta _k^*: \beta _k^*= 0, k=1, \ldots , p\}, \\ B_2&= \{\gamma _{j,k}^*: \gamma _{j,k}^*= 0~\text{but}~\beta _k^*\ne 0, j=1,\ldots , q; k=1, \ldots , p \}, \\ B_3&= \{\gamma _{j,k}^*: \gamma _{j,k}^*= 0~\text{and}~\beta _k^*= 0, j=1,\ldots , q; k=1, \ldots , p\}, \end{aligned}$$

respectively. Define

$$D_{n}({\zeta }) = \sum _{i=1}^n \omega _i \exp (-(y_i - \textbf{u}_{i,}^\top {\zeta })^2 / \theta )\frac{2(y_i - \textbf{u}_{i,}^\top {\zeta })}{\theta }\textbf{u}_{i,}$$

and

$$I_{n}({\zeta }) = \frac{2}{\theta }\sum _{i=1}^n \omega _i \exp (-(y_i - \textbf{u}_{i,}^\top {\zeta })^2 / \theta )\left( \frac{2(y_i - \textbf{u}_{i,}^\top {\zeta })^2}{\theta }-1\right) \textbf{u}_{i,} \textbf{u}_{i,}^\top .$$
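Numerically, \(D_{n}\) and \(I_{n}\) are straightforward to evaluate, for example, to inspect condition C5 below on an estimated active set; the following sketch of ours mirrors the displayed formulas:

```python
def score_and_information(y, U, zeta, w, theta):
    """Empirical D_n(zeta) and I_n(zeta) as displayed above."""
    r = y - U @ zeta
    e = w * np.exp(-r**2 / theta)
    D = U.T @ (e * 2.0 * r / theta)
    I = (2.0 / theta) * (U * (e * (2.0 * r**2 / theta - 1.0))[:, None]).T @ U
    return D, I
```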

The following conditions are needed to establish the consistency properties.

C1. T and C are independent, and \(P(T \le C|T,X,Z) = P(T \le C|T)\).

C2. The support of T is dominated by that of C: for example, \(\tau _T<\tau _C\), or \(\tau _T= \tau _C=\infty \), where \(\tau _T\) and \(\tau _C\) are the right end points of the supports of T and C, respectively.

C3. \(E[D_{n}({\zeta }^*)] = 0\).

C4. The distributions of the \(D_{n,j}({\zeta }^*)\)’s are subgaussian, that is, \(\Pr (|D_{n,j}({\zeta }^*)| >t )\le 2 \exp \left( - n t^2/\sigma ^2\right) \). Moreover, the \(I_{n, jk}({\zeta })- I_{jk}({\zeta })\)’s are subgaussian for all \(\zeta \in \varTheta =\{\zeta : \Vert \zeta -\zeta ^*\Vert _2 <\delta \}\), where \(\delta \) is a positive constant, \(I({\zeta }) = E[I_{n}({\zeta })]\), and \(I_{jk}({\zeta })\) is the (j, k)th component of the matrix \(I({\zeta })\). In addition, there exists a bounded constant \(\kappa \) such that \({ \mathbf {\nu }}^\top [I({\zeta }^1)-I({\zeta }^2)]{ \mathbf {\nu }} \le \kappa \Vert {\zeta }^1-{\zeta }^2\Vert _2\) for any \({\zeta }^1, {\zeta }^2\in \varTheta \) and \(\Vert { \mathbf {\nu }}\Vert _2 = 1\).

C5. \(I_{\mathscr {A}\mathscr {A}}({\zeta }^*)\) is an \(|\mathscr {A}|\times |\mathscr {A}|\) negative-definite matrix, whose eigenvalues are bounded away from zero and infinity in absolute value.

C6. \(\min _{j,k} \{|\gamma _{j,k}^*|: \gamma _{j,k}^*\ne 0\}\gg \lambda _1\vee \lambda _2\), and \(\lambda _1\wedge \lambda _2\gg \sqrt{|\mathscr {A}|/n}\).

C1 and C2 have been commonly assumed in the literature; see, for example, Stute (1993, 1996) and Huang et al. (2007). We note that the independent censoring assumption usually holds in practice, although from a theoretical perspective, quite a few studies have made the weaker conditional independence assumption. We have explored relaxing this assumption and found that alternative and less intuitive assumptions would have to be made. The zero expectation in C3 and the negative-definiteness in C5 ensure the consistency of estimation. C4 is required for Theorem 1, and a similar assumption has been made in Ma and Du (2012). C6 requires that the smallest signal does not decay too fast, which is a common assumption in studies of high-dimensional inference. The following theorem establishes consistency of the proposed estimator \(\widehat{\zeta }\).

Theorem 1

Suppose that conditions C1-C6 hold. Let \(\varpi _n= (\lambda _1\wedge \lambda _2)/\max (\varPhi _1, \varPhi _2, \varPhi _3)\), where \(\varPhi _t= \Vert I_{\mathscr {B}_t \mathscr {A}}({\zeta }^*)I_{ {\mathscr {A}\mathscr {A}}}({{\zeta }}^*)^{-1}\Vert _\infty \) for \(t=1,2,3\). If \(|\mathscr {A}| =o(n)\), \(\lambda _1\vee \lambda _2 \rightarrow 0\), \(n \varpi _n^2 \rightarrow \infty \), and \(\log p = o ( n\varpi _n^2)\), then with probability tending to one, we have

$$(a)~~~\Vert \widehat{\zeta }_\mathscr {A} - {\zeta }_\mathscr {A}^*\Vert _2= O_p(\sqrt{|\mathscr {A}|/n});~~~~ (b)~~~ \widehat{\zeta }_{\mathscr {A}^c}= {\textbf {0}}. $$

Proof

For the proof, see Appendix.    \(\square \)

This theorem establishes that the proposed method can accommodate p with \(\log p = o (n\varpi _n^2)\). The penalized robust estimator enjoys the same asymptotic properties as the oracle estimator with probability approaching one. This property holds under high dimensions without restrictive conditions on the errors. To the best of our knowledge, properties of the robust exponential squared loss, even without censoring, have not been studied under high-dimensional settings. Thus our theoretical investigation can have independent value. The proof of the theorem is presented in Appendix.

3 Simulations

In simulation, we set \(n = 300\), \(q = 5\), and \(p = 1000\). The underlying true model contains a total of 35 nonzero effects, including 5 main E effects, 10 main G effects, and 20 interactions. The positions of the nonzero main G effects are randomly chosen. The nonzero interactions are generated to respect the “main effects, interactions” hierarchy. The nonzero regression coefficients are randomly generated from uniform (0.7, 1.3). We consider both continuous and categorical distributions to mimic, for example, gene expression and SNP data. Specifically, under the continuous scenario, the E and G factors are generated from multivariate normal distributions with marginal means zero, marginal variances one, and the following correlation structures: Independent, AR(0.3), AR(0.8), Band(0.3), Band(0.6), and CS(0.2). Under the Independent structure, all factors have zero correlations. Under the AR\((\rho )\) structure, the ith and jth factors have \(corr = \rho ^{|i-j|}\). Under the Band\((\rho )\) structure, the ith and jth factors have \(corr =\rho \cdot I(|i - j| = 2) + 0.3\cdot I(|i-j| =1)+I(|i-j|=0)\). Under the CS\((\rho )\) structure, the ith and jth factors have \(corr = \rho ^{I(i\ne j)}\). Under the categorical scenario, we first apply the same data generating approach as described above to obtain \(\textbf{U}\). Then for each \(u_{i,j}\), the categorical measurement is generated as \(I(u_{i,j}>-0.7)\). The threshold value \(-0.7\) is chosen such that the proportion of 1’s for each factor is roughly 75%. Under each of the above settings, we consider the random error distribution \((1-\xi )N(0,1)+\xi \,\text{Cauchy}\), with the contamination probability \(\xi =0\), 0.1, and 0.3. When \(\xi =0\), the error distribution has no contamination and favors the nonrobust approaches, while the latter two values lead to different levels of contamination. The log event times are generated from the AFT model. The censoring times are generated independently from Weibull distributions, with the censoring parameters adjusted so that the censoring rates are about 25%. Beyond the above scenarios, we also consider a set of parallel scenarios under which there are 10 main E effects, 20 main G effects, and 40 interactions (that is, the number of important effects is doubled), and the nonzero coefficients are generated from uniform (0.4, 0.6) (that is, the signal levels are reduced by about 50%). Other settings remain the same.
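To convey the flavor of this design, a sketch of the continuous-G, AR\((\rho )\) data generation is given below. The intercept, the placement of two interactions per active gene, and the Weibull censoring scale are illustrative choices of ours; in the actual simulation the censoring parameters are tuned to the target rate:

```python
def simulate_dataset(n=300, q=5, p=1000, rho=0.3, xi=0.1, seed=0):
    """One simulated dataset: AR(rho) factors, (1-xi)N(0,1)+xi*Cauchy errors,
    AFT log event times, independent Weibull censoring. Sketch."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p + q)
    L = np.linalg.cholesky(rho ** np.abs(idx[:, None] - idx[None, :]))
    F = rng.standard_normal((n, p + q)) @ L.T
    X, Z = F[:, :q], F[:, q:]
    alpha = rng.uniform(0.7, 1.3, q)                  # 5 main E effects
    g_idx = rng.choice(p, 10, replace=False)          # 10 main G effects
    beta = np.zeros(p); beta[g_idx] = rng.uniform(0.7, 1.3, 10)
    gamma = np.zeros((q, p))
    for k in g_idx:                                   # 20 interactions, hierarchical
        jj = rng.choice(q, 2, replace=False)
        gamma[jj, k] = rng.uniform(0.7, 1.3, 2)
    eps = np.where(rng.random(n) < xi, rng.standard_cauchy(n),
                   rng.standard_normal(n))
    T = 0.5 + X @ alpha + Z @ beta + np.einsum('ij,ik,jk->i', X, Z, gamma) + eps
    C = np.log(25.0 * rng.weibull(2.0, n))            # scale is illustrative
    y, delta = np.minimum(T, C), (T <= C).astype(int)
    o = np.argsort(y)                                 # sort by observed time
    return X[o], Z[o], y[o], delta[o]
```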

The simulated data are analyzed using the proposed method. We also consider two alternatives: (a) the nonrobust method that adopts the weighted least squares loss and the same penalty as the proposed method, and (b) the quantile regression-based method that adopts an \(L_1\) robust loss and the same penalty as the proposed method. We note that multiple other methods are potentially applicable. Comparing with the nonrobust method can directly establish the merit of being robust, and the quantile regression-based approach is the most popular robust approach for high-dimensional data (Wu & Ma, 2015). Thus these two alternatives are the most sensible to compare with.

All three methods involve tuning parameters. To eliminate the (possibly different) effects of tuning parameter selection on identification accuracy, we consider a sequence of tuning parameter values, evaluate identification accuracy at each value, and calculate the AUC (area under the ROC curve) as the overall measure. This approach has been adopted extensively in published studies (Zhu et al., 2014).
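A minimal rendering of this evaluation (ours; array shapes and names are assumptions) traces selection true/false positive rates along the tuning sequence and integrates the resulting ROC curve:

```python
def identification_auc(select_path, truth):
    """select_path: boolean array (n_tuning, n_effects) of selection
    indicators along a tuning sequence; truth: boolean (n_effects,)."""
    tpr = [(s & truth).sum() / truth.sum() for s in select_path]
    fpr = [(s & ~truth).sum() / (~truth).sum() for s in select_path]
    f, t = zip(*sorted(zip([0.0, *fpr, 1.0], [0.0, *tpr, 1.0])))
    f, t = np.asarray(f), np.asarray(t)
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))  # trapezoid rule
```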

Summary statistics are computed based on 500 replicates. The AUC results for interactions and main effects combined are presented in Tables 1 and 2 for the scenarios with 35 and 70 important effects, respectively. To be thorough, we have also evaluated identification accuracy for interactions and main effects separately and present the AUC results in Tables 4, 5, 6 and 7 in Appendix. For all three methods, the AUC value decreases as the contamination proportion increases, as expected. In Table 1, the proposed method outperforms the two alternatives under all except one scenario. In Table 2, it dominates the alternatives. Under some scenarios, the proposed method leads to a significant improvement in identification accuracy. For example, in Table 1, with the continuous G distribution, 30% contamination, and Band(0.3) correlation, the proposed method has a mean AUC of 0.901, while the alternatives have mean AUCs of 0.761 and 0.789. Compared to the nonrobust alternative, the proposed method also has smaller standard errors (Tables 1 and 2).

Table 1 Simulation: identification of both G-E interactions and main G effects. In each cell, mean AUC (se). There are a total of 35 nonzero effects, with coefficients \(\sim \)uniform (0.7, 1.3)
Table 2 Simulation: identification of both G-E interactions and main G effects. In each cell, mean AUC (se). There are a total of 70 nonzero effects, with coefficients \(\sim \)uniform (0.4, 0.6)

We have also experimented with a few other scenarios and made similar observations. In particular, we have examined the scenarios where the event and censoring times have weak to moderate correlations and observed similar satisfactory performance (details omitted). The proposed method and two alternatives respect the hierarchy. We have also looked into simpler alternatives, including MCP and Lasso, which may violate the hierarchy, and observed inferior performance.

4 Analysis of the TCGA Lung Adenocarcinoma Data

Adenocarcinoma of the lung is a leading cause of cancer death worldwide. Profiling studies have been extensively conducted searching for its prognostic factors. Here we analyze the TCGA (The Cancer Genome Atlas Research Network, 2014) data on the prognosis of lung adenocarcinoma. The TCGA data were recently collected and published by the NCI and are of high quality. The prognosis outcome of interest is overall survival. The dataset contains measurements on 43 clinical/environmental variables and 18,897 gene expressions. There are a total of 468 patients, among whom 117 died during follow-up; the median follow-up time is 8 months. We select four E factors for downstream analysis, namely age, gender, smoking pack years, and smoking history. These factors have relatively low missing rates in the TCGA dataset and have been previously suggested as potentially related to lung cancer prognosis. There are a total of 436 samples with both E and G measurements available. Among them, 110 died during follow-up, and the median follow-up time is 23 months; for the 326 censored subjects, the median follow-up time is 6 months. In principle, the proposed method can directly analyze all of the available gene expressions. To improve stability and reduce the computational cost, we conduct marginal prescreening, as sketched below. Specifically, genes are screened based on their univariate regression significance (p-value less than or equal to 0.1) and interquartile range (above the median of all interquartile ranges). Similar prescreenings have been adopted in the literature. A total of 819 gene expressions are included in the downstream model fitting. Note that with the main G effects as well as interactions, the number of unknown parameters is much larger than the sample size.
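Since the text does not pin down the exact univariate test, the sketch below uses an ordinary least-squares slope test on the observed log times as a stand-in, and the interface is ours:

```python
from scipy import stats

def prescreen(y, Z, p_cut=0.1):
    """Keep genes with univariate p-value <= p_cut AND interquartile range
    above the median IQR across genes. Stand-in implementation."""
    iqr = np.percentile(Z, 75, axis=0) - np.percentile(Z, 25, axis=0)
    med_iqr = np.median(iqr)
    keep = []
    for k in range(Z.shape[1]):
        if iqr[k] > med_iqr and stats.linregress(Z[:, k], y).pvalue <= p_cut:
            keep.append(k)
    return np.array(keep)
```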

Detailed estimation results are presented in Table 3 for the proposed method and Tables 8 and 9 in Appendix for the two alternatives. It is observed that the three methods lead to quite different findings. Specifically, the proposed and quantile methods share four common main G effects and four interactions. Otherwise, there is no overlap in identification. The “signals” in practical data can be weaker than those in simulated data, leading to the significant differences across methods.

Table 3 Analysis of the TCGA lung adenocarcinoma data using the proposed method. The identified interactions are denoted as “gene * environmental variable”. For the interactions, values in “()” are the stability results

With the proposed method, sixteen genes are identified as having interactions with age or smoking-related variables. As with many other cancer types, age has been identified as a critical factor in lung cancer prognosis. Smoking has been confirmed as the most important E factor for lung cancer risk and prognosis. In the literature, G-E interaction analysis for lung cancer prognosis is still very limited. However, there have been many studies on the functionalities of genes, and searching such studies can provide partial support for the validity of our analysis results. Among the identified genes, many have been implicated in cancer in the literature. Specifically, the AGPAT family, which includes AGPAT6 as a member, has been found to play a role in multiple cancer types. For example, AGPAT2 and AGPAT11 have been found to be upregulated in ovarian, breast, cervical, and colorectal cancers (Agarwal & Garg, 2010). Another gene worth attention is ATF6, which acts both as a sensor and a transcription factor during endoplasmic reticulum stress. ATF6\(\alpha \) has been found to promote hepatocarcinogenesis and cancer cell proliferation through activating its downstream target gene BIP, and its efficiency of stress recognition and signaling has been found to decrease with age (Naidoo, 2009). We find that gene COLCA2 (colorectal cancer associated 2) interacts with smoking pack years. Studies have shown that COLCA2 may have critical functions in suppressing tumor formation in epithelial cells (Peltekova et al., 2014). We also identify an interaction between NOS1AP and age. It has been found that the protein complex of SCRIB, NOS1AP, and VANGL1 regulates cell polarity and migration, and this complex can be associated with cancer progression (Anastas et al., 2012). An interaction between PPP1R15B and smoking pack years has also been identified. It has been suggested that PPP1R15B is likely to be regulated by Nrf2, which has a protective response to smoking-induced oxidative stress in the lung (Taylor et al., 2008). Also, PPP1R15B may promote cancer cell proliferation.

Table 4 Simulation: identification of main G effects. In each cell, mean AUC (se). There are a total of 10 nonzero main effects, with coefficients \(\sim \)uniform (0.7, 1.3)
Table 5 Simulation: identification of G-E interactions. In each cell, mean AUC (se). There are a total of 20 nonzero interactions, with coefficients \(\sim \)uniform (0.7, 1.3)
Table 6 Simulation: identification of main G effects. In each cell, mean AUC (se). There are a total of 20 nonzero main effects, with coefficients \(\sim \)uniform (0.4, 0.6)
Table 7 Simulation: identification of G-E interactions. In each cell, mean AUC (se). There are a total of 40 nonzero interactions, with coefficients \(\sim \)uniform (0.4, 0.6)
Table 8 Analysis of the TCGA lung adenocarcinoma data using the nonrobust method. The identified interactions are denoted as “gene * environmental variable”. For the interactions, values in “()” are the stability results
Table 9 Analysis of the TCGA lung adenocarcinoma data using the quantile method. The identified interactions are denoted as “gene * environmental variable”. For the interactions, values in “()” are the stability results

To complement the identification and estimation analysis, we also evaluate stability, as sketched below. Specifically, we randomly select 3/4 of the subjects and apply the proposed method and the alternatives. This procedure is repeated 200 times, and we compute the probability that each interaction is identified. Similar procedures have been extensively adopted in published studies. The stability results are also provided in Tables 3, 8, and 9. Most of the identified interactions are relatively stable, with many having probabilities of being identified close to one.
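The resampling loop is straightforward; in our sketch below, `fit_select` is an assumed wrapper that runs the full fitting pipeline on a subsample and returns a boolean selection indicator per effect:

```python
def stability(fit_select, X, Z, y, delta, B=200, frac=0.75, seed=0):
    """Selection frequency over B random 3/4 subsamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    freq = None
    for _ in range(B):
        idx = np.sort(rng.choice(n, int(frac * n), replace=False))  # keep y-order
        sel = fit_select(X[idx], Z[idx], y[idx], delta[idx]).astype(float)
        freq = sel if freq is None else freq + sel
    return freq / B
```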

5 Discussions

To understand the prognosis of complex diseases, it is essential to study G-E interactions. In “classic” low-dimensional biomedical studies, data contamination has been found to be not rare, and it has been suggested that robust methods are needed to accommodate it. This study has developed a robust method for high-dimensional G-E interaction analysis, which is still limited in the literature. The proposed method consists of a novel robust loss function and a penalized identification strategy that respects the “main effects, interactions” hierarchy, both of which involve novel advancements. Also significantly advancing from the literature, we have rigorously established the consistency properties. The theoretical results may seem “familiar”, which is “comforting” in that the consistency properties are not sacrificed for the additional robustness, high dimensionality, and interactions. It is worth noting that the consistency results do not demand excessive assumptions on the error distribution, which are usually needed in the existing literature. In simulation, the proposed method outperforms the nonrobust alternative; interestingly, it has superior performance even when there is no contamination. Another important finding is that it also outperforms the quantile-based robust method. Most of the existing high-dimensional robust studies have adopted the quantile regression technique, and our simulation suggests that it is prudent to develop alternative robust methods. In the analysis of the TCGA lung cancer data, the proposed method generates results with some overlap with the quantile regression method, but none with the nonrobust method. The identified genes have important implications, and the identified interactions are stable.

The proposed study can be potentially extended in multiple directions. In survival analysis, there are many other models beyond the AFT, and it can be of interest to develop robust methods based on them. We have studied G-E interactions; it can also be of interest to extend to G-G interactions. In theoretical analysis, one open problem is the breakdown point. Because of its extremely high complexity, this problem has been left uninvestigated in many other robust studies as well. In our simulation, we have experimented with contamination rates as high as 30%, which is much higher than in many of the existing studies, and the superiority of the proposed method over the quantile regression method is still observed. The relative efficiency of different robust methods, although of interest, is postponed to future studies. In data analysis, the proposed method identifies a different set of main effects and interactions. Mining the literature and the stability evaluation can support the validity of the findings to a certain extent; more validation needs to be pursued in the future.