Abstract
For complex diseases, beyond the main effects of genetic (G) and environmental (E) factors, gene-environment (G-E) interactions also play an important role. Many of the existing G-E interaction methods conduct marginal analysis, which may not appropriately describe disease biology. Joint analysis methods have been developed, with most of the existing loss functions constructed based on likelihood. In practice, data contamination is not uncommon. Development of robust methods for interaction analysis that can accommodate data contamination is very limited. In this study, we consider censored survival data and adopt an accelerated failure time (AFT) model. An exponential squared loss is adopted to achieve robustness. A sparse group penalization approach, which respects the “main effects, interactions” hierarchy, is adopted for estimation and identification. Consistency properties are rigorously established. Simulation shows that the proposed method outperforms direct competitors. In data analysis, the proposed method makes biologically sensible findings.
1 Introduction
For many complex diseases, it is essential to identify important risk factors that are associated with prognosis. In the omics era, profiling studies have been extensively conducted. It has been found that, beyond the main effects of genetic (G) and environmental (E) risk factors, gene-environment (G-E) interactions can also have important implications.
Denote T and C as the prognosis and censoring times, respectively. Denote \(X = (X_1, \ldots , X_q)^\top \) as the q environmental/clinical variables, and \(Z=(Z_1, \ldots , Z_p)^\top \) as the p genetic variables. The existing G-E interaction analysis methods mainly belong to two families. The first family conducts marginal analysis (Hunter, 2005; Shi et al., 2014; Thomas, 2010), under which one or a small number of genes are analyzed at a time. Despite its significant computational simplicity, marginal analysis contradicts the fact that the prognosis of complex diseases is attributable to the joint effects of multiple main effects and interactions. The second family of methods, which is biologically more sensible, conducts joint analysis (Liu et al., 2013; Wu et al., 2014; Zhu et al., 2014). Among the existing joint analyses, the regression-based is the most popular and proceeds as follows. Consider the model \(T \sim \phi ( \alpha _{0}+\sum _{j=1}^qX_j\alpha _{j} + \sum _{k =1}^pZ_k\beta _k + \sum _{j = 1}^q\sum _{k=1}^p X_jZ_k\gamma _{j,k})\), where model \(\phi (\cdot )\) is known up to the regression coefficients \(\alpha _{0},\{\alpha _{j}\}_1^q, \{\beta _{k}\}_1^p\), and \(\{\gamma _{j,k}\}_1^{q,p}\). Conclusions on the importance of interactions are drawn based on \(\{\gamma _{j,k}\}_1^{q,p}\). With the high data dimensionality and demand for the selection of relevant effects, regularized estimation is usually needed.
In the dominating majority of the existing studies, estimation is based on the standard likelihood, which is nonrobust. In practice, data contamination is not uncommon and can arise for multiple reasons. Many diseases are heterogeneous, and different subtypes behave differently. When the subtype information is accurately available, subtype-specific analysis can be conducted. However, when such information is unavailable or only partially available, which is often the case in practice (He et al., 2015), subjects belonging to small subtypes may be viewed as “contamination” to those of the leading subtype. Human errors can also happen. It has been well noted that survival information extracted from medical records is not always reliable (Bowman, 2015; Fall et al., 2008), creating contamination in prognosis distributions. In low-dimensional biomedical studies, it has been well established that even a single contaminated observation can lead to biased model estimation and hence false marker identification (Huber & Ronchetti, 2009). Our literature review suggests that, in the analysis of G-E interactions, robust methods that can effectively accommodate contamination in prognosis outcomes have been very rare. For marginal interaction analysis, a few robust methods, for example, multifactor dimensionality reduction (MDR), have been developed. However, they are not directly applicable to joint analysis because of both methodological and computational challenges. As discussed in Wu and Ma (2015), a handful of robustness studies have been conducted under high-dimensional settings for joint analysis. However, they mostly target main effects and are not directly applicable to interaction analysis because of the additional complexity caused by the “main effects, interactions” hierarchy. Most of them adopt the quantile regression technique. Studies under low-dimensional settings suggest that no single robust technique dominates.
It is thus desirable to examine alternative robust techniques under high-dimensional settings. In addition, for quite a few existing methods, statistical properties have not been well studied, casting doubts on their validity.
Consider data with a prognosis outcome and both G and E measurements. Our goal is to conduct joint analysis and identify important G-E interactions and main G and E effects. This study advances beyond the literature in multiple aspects. Specifically, we consider the scenario with possible contamination in the prognosis outcome, which is commonly encountered but little addressed. We adopt an exponential squared loss to achieve robustness. This loss function provides a useful alternative to the popular quantile regression and other robust approaches but has not been well investigated under high-dimensional settings, especially not for interaction analysis. This study also marks a novel extension of the exponential squared loss to censored survival data. For regularized estimation and selection of relevant effects, we propose adopting a penalization technique that respects the “main effects, interactions” hierarchy. Significantly advancing from most of the existing studies, consistency properties are rigorously established. Theoretical research for high-dimensional robust methods remains limited, and as such, this study may provide valuable insights. With both methodological and theoretical developments, this study adds value beyond the existing literature.
2 Methods
2.1 Data and Model Settings
For describing prognosis, we adopt the AFT model, which has been the choice of multiple studies with high-dimensional genetic data (Liu et al., 2013; Shi et al., 2014). Compared to alternatives including the Cox model, advantages of the AFT model include intuitive interpretations and low computational cost, which are especially desirable with high-dimensional genetic data. With a slight abuse of notation, still use T and C to denote the logarithms of the event and censoring times, and \(\delta = I_{\{T\le C\}}\). The AFT model specifies that
\(T= \alpha _{0}+\sum _{j=1}^qX_j\alpha _{j} + \sum _{k =1}^pZ_k\beta _k + \sum _{j = 1}^q\sum _{k=1}^p X_jZ_k\gamma _{j,k}+\varepsilon ,\)
where \(\varepsilon \) is the random error. Following Stute (1993, 1996), we assume that T and C are independent, and \(\delta \) is conditionally independent of \((X^\top , Z^\top )^\top \) given T. Let \(W_k= (Z_k, X_1Z_k, \ldots , X_q Z_k)^\top \) and \(b_k=(\beta _k, \gamma _{1,k}, \ldots , \gamma _{q,k})^\top \), which represent all main and interaction effects corresponding to the kth genetic variable.
With n independent subjects, use subscript “i” to denote the ith subject. For subject i, let \(y_i = \min \{T_i, C_i\}\) and \(\delta _i=I_{\{T_i\le C_i\}}\) be the observed time and event indicator, respectively. Then the ith observation consists of \((y_i, \delta _i, \textbf{x}_i, \textbf{z}_i)\), with \(\textbf{x}_i=(x_{i1},\ldots , x_{iq})^\top \), \(\textbf{z}_i=(z_{i1},\ldots , z_{ip})^\top \), and \(W_{k,i} = (z_{ik}, x_{i1}z_{ik}, \ldots , x_{iq}z_{ik})^\top \) denoting the ith realization of X, Z, and \(W_k\), respectively. Denote \(\textbf{u}_{i,}^\top =(1,\textbf{x}_i^\top ,W_{1,i}^\top , \ldots , W_{p,i}^\top )\), \(\textbf{U}= (\textbf{u}_{1,}, \cdots , \textbf{u}_{n,})^\top \), and \(\zeta =(\alpha _0,\ldots ,\alpha _q, b_1^\top ,\ldots ,b_p^\top )^\top \). Without loss of generality, assume that \((y_i,\delta _i,\textbf{u}_{i,})\)’s have been sorted according to \(y_i\)’s in an ascending manner.
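As a concrete illustration of this setup, the design matrix \(\textbf{U}\) can be assembled from the raw E and G measurements as follows (function and variable names are ours, not from the paper):

```python
import numpy as np

def build_design(x, z):
    """Build u_i = (1, x_i, W_{1,i}, ..., W_{p,i}) for each subject, where
    W_{k,i} = (z_{ik}, x_{i1} z_{ik}, ..., x_{iq} z_{ik}).

    x : (n, q) environmental/clinical measurements
    z : (n, p) genetic measurements
    Returns U of shape (n, 1 + q + p*(q+1)).
    """
    n, q = x.shape
    p = z.shape[1]
    cols = [np.ones((n, 1)), x]        # intercept and main E effects
    for k in range(p):
        zk = z[:, k:k + 1]
        cols.append(zk)                # main G effect z_k
        cols.append(x * zk)            # interactions x_j * z_k, j = 1..q
    return np.hstack(cols)
```

The column ordering mirrors \(\textbf{u}_{i,}^\top =(1,\textbf{x}_i^\top ,W_{1,i}^\top , \ldots , W_{p,i}^\top )\), so the coefficient vector aligns with \(\zeta =(\alpha _0,\ldots ,\alpha _q, b_1^\top ,\ldots ,b_p^\top )^\top \).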
2.2 Robust Estimation and Identification
Consider the scenario where the distribution of \(\varepsilon \) is not specified, which significantly differs from the existing parametric studies and makes the proposed method more flexible. To motivate the proposed estimation, first consider data without contamination. Stute (1993) developed a weighted least squares estimation approach. Under low-dimensional settings, Stute’s estimator is defined as the minimizer of the loss function
\(\frac{1}{2}\sum _{i=1}^n \omega _i\left( y_i-\textbf{u}_{i,}^\top \zeta \right) ^2.\)
Here the weights \(\mathbf {\omega }=(\omega _i)_{i=1}^n\) are computed based on the Kaplan-Meier estimation and defined as
\(\omega _1=\frac{\delta _1}{n}, \qquad \omega _i=\frac{\delta _i}{n-i+1}\prod _{j=1}^{i-1}\left( \frac{n-j}{n-j+1}\right) ^{\delta _j},\quad i=2,\ldots ,n.\)
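Assuming the standard Stute (1993) weights for data sorted in ascending order of \(y_i\), namely \(\omega _1=\delta _1/n\) and \(\omega _i=\frac{\delta _i}{n-i+1}\prod _{j=1}^{i-1}\big (\frac{n-j}{n-j+1}\big )^{\delta _j}\) for \(i\ge 2\), they can be computed in one pass (a sketch; names are ours):

```python
import numpy as np

def stute_weights(delta):
    """Kaplan-Meier (Stute) weights; observations must be sorted by ascending y.

    delta : (n,) event indicators (1 = event, 0 = censored), in the same order.
    Censored observations receive weight zero; with no censoring, every
    observation gets the ordinary least-squares weight 1/n.
    """
    n = len(delta)
    w = np.zeros(n)
    surv = 1.0                         # running product over j < i
    for i in range(n):                 # 0-based i corresponds to 1-based i+1
        w[i] = delta[i] * surv / (n - i)
        surv *= ((n - i - 1) / (n - i)) ** delta[i]
    return w
```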
It is noted that Stute’s estimator is not necessarily the most efficient. However, under high-dimensional settings, its least squares loss makes it perhaps the most computationally convenient. It can also be seen that, if \(\omega _i \ne 0\), a single contaminated \(y_i\) can lead to severely biased model estimation.
Now consider the scenario with possible outliers in the prognosis data. We propose the objective function
\(Q_\theta (\zeta )=\sum _{i=1}^n\omega _i\exp \left( -\frac{(y_i-\textbf{u}_{i,}^\top \zeta )^2}{\theta }\right) . \qquad (1)\)
This function has been motivated by the following considerations. Under low-dimensional regression analysis without censoring, Wang et al. (2013) adopted an exponential squared loss to achieve robustness. The intuition is as follows. For a contaminated subject with the observed \(y_i\) deviating from \(\textbf{u}_{i,}^\top \zeta \) (the “predicted” value based on the model), \((y_i - \textbf{u}_{i,}^\top \zeta )^2\) has a large value. The exponential function down-weights such a contaminated observation. The degree of down-weighting is adjusted by \(\theta \): when \(\theta \) gets smaller, the contaminated observations have smaller influence. While sharing some common ground with Wang et al. (2013) and others, the present study has three main challenges/advancements. The first is the high dimensionality, which brings tremendous challenges to theoretical and computational developments. The second is the need to respect the “main effects, interactions” hierarchy (more details below). The third is censoring, to accommodate which we introduce the weight function \(\omega _i\) motivated by Stute’s approach. As the weights are data-dependent, they bring challenges to the establishment of theoretical properties.
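The down-weighting can be seen directly from the loss: observation i contributes \(\omega _i\exp (-(y_i-\textbf{u}_{i,}^\top \zeta )^2/\theta )\), which decays rapidly as the residual grows. A small numerical sketch (names are ours):

```python
import numpy as np

def exp_sq_objective(resid, omega, theta):
    """Robust goodness-of-fit: sum_i omega_i * exp(-resid_i**2 / theta).

    A grossly contaminated observation has a large residual and hence a
    contribution near zero, so it cannot dominate the fit. Smaller theta
    down-weights outliers more aggressively.
    """
    return np.sum(omega * np.exp(-resid**2 / theta))
```

For a residual of 8 and \(\theta =1\), the contribution is \(\exp (-64)\), which is numerically negligible; increasing \(\theta \) restores influence to moderate residuals.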
When \(p\gg n\), regularized estimation is needed. In addition, out of a large number of profiled G factors and G-E interactions, only a few are expected to be associated with prognosis. We adopt penalization for regularized estimation and identification, which has been the choice of a large number of genetic studies, especially recent interaction analyses (Bien et al., 2013; Liu et al., 2013; Shi et al., 2014). Specifically, consider the penalized robust objective function
\(L_{\lambda _1, \lambda _2, \theta }(\zeta ) = Q_\theta (\zeta )-\sum _{k=1}^p\rho \left( \Vert b_k\Vert ; \sqrt{q+1}\lambda _1, s\right) -\sum _{k=1}^p\sum _{j=2}^{q+1}\rho \left( b_{kj};\lambda _2, s\right) , \qquad (2)\)
where \(\Vert \cdot \Vert \) is the \(\ell _2\) norm, \(\rho (t; \lambda , s) = \lambda \int _0^{|t|} \left( 1-\frac{x}{\lambda s} \right) _+ dx\) is the MCP (minimax concave penalty; Zhang, 2010), and \(b_{kj}\) is the jth element of \(b_{k}\). \(\lambda _1\) and \(\lambda _2\) are data-dependent tuning parameters, and s is the regularization parameter, following the terminology of Zhang (2010). The robust estimator is defined as the maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\). An interaction term (or main effect) is concluded as important if its estimate is nonzero.
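For implementation, the MCP integral has a simple closed form: \(\rho (t;\lambda ,s)=\lambda |t|-t^2/(2s)\) for \(|t|\le \lambda s\) and \(s\lambda ^2/2\) otherwise. A sketch (helper names are ours):

```python
import numpy as np

def mcp(t, lam, s):
    """MCP rho(t; lam, s) = lam * integral_0^{|t|} (1 - x/(lam*s))_+ dx,
    evaluated in closed form; constant (= s*lam**2/2) beyond |t| = lam*s."""
    a = np.abs(t)
    return np.where(a <= lam * s, lam * a - a**2 / (2 * s), s * lam**2 / 2)

def mcp_grad(t, lam, s):
    """Derivative sgn(t) * (lam - |t|/s)_+ ; it vanishes beyond lam*s, so
    large coefficients incur no shrinkage bias (unlike the Lasso)."""
    return np.sign(t) * np.maximum(lam - np.abs(t) / s, 0.0)
```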
In recent genetic interaction analysis, it has been stressed that the “main effects, interactions” hierarchy should be respected. That is, if an interaction term is identified as important, its corresponding main effect(s) should be automatically identified. G-E interaction analysis has its uniqueness: the E variables usually have a low dimensionality and are manually chosen, and so selection is usually not conducted on the E variables (if desirable, this can be easily achieved). Thus for G-E interaction analysis, the hierarchy postulates that if a G-E interaction is identified as important, its corresponding main G effect is automatically identified. In the adopted sparse group penalty, the first penalty, which is a group MCP, determines which groups are selected. Here one group corresponds to one genetic variable and its interactions. As the group MCP does not have within-group sparsity, the second penalty is imposed, under which we penalize the interaction terms and determine which are nonzero. With the special design that the second penalty is only imposed on interactions, important interactions correspond to important groups, which automatically makes the estimates of the corresponding main G effects nonzero. As such, the combination of the two penalties guarantees the hierarchy. We note that although sparse group penalization has been studied in the literature (Liu et al., 2013), it has very rarely been coupled with robust loss functions. It is also noted that MCP can potentially be replaced by other penalties.
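The hierarchy mechanism can be made concrete for a single gene's block \(b_k=(\beta _k,\gamma _{1,k},\ldots ,\gamma _{q,k})^\top \): the group MCP acts on \(\Vert b_k\Vert \) and the second MCP acts on the interaction coordinates only, so \(\beta _k\) is never penalized on its own. A sketch, assuming the common \(\sqrt{q+1}\) rescaling of the group tuning parameter (helper names are ours):

```python
import numpy as np

def mcp(t, lam, s=6.0):
    """Closed-form MCP (see the penalty definition in the text)."""
    a = np.abs(t)
    return np.where(a <= lam * s, lam * a - a**2 / (2 * s), s * lam**2 / 2)

def sparse_group_penalty(b, lam1, lam2, s=6.0):
    """Penalty for one gene's block b = (beta_k, gamma_1k, ..., gamma_qk).

    The group MCP on ||b|| decides whether the whole group enters the model;
    the second MCP, applied to b[1:] (interactions only), gives within-group
    sparsity. Because beta_k = b[0] carries no individual penalty, a selected
    interaction keeps its main G effect in the model: the hierarchy.
    """
    q1 = len(b)                                    # q + 1
    group = mcp(np.linalg.norm(b), np.sqrt(q1) * lam1, s)
    within = mcp(b[1:], lam2, s).sum()             # b[0] is untouched
    return float(group + within)
```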
2.3 Computation
In this section, we develop an efficient algorithm to compute the maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\). The basic strategy is to iteratively approximate the objective function by its quadratic minorization. Then a coordinate-wise updating procedure is used to find the maximizer of each approximated objective function. The maximizer then serves as the starting point for the next minorization. Overall, this is a coordinate-descent (CD) algorithm nested in a Minorize-Maximization (MM) algorithm.
Let \(\textbf{W}(\zeta )\) be a diagonal matrix with the ith diagonal element \(\textbf{W}_{i,i}=2\omega _i\exp (-(y_i-\textbf{u}_{i,}^\top \zeta )^2/\theta )/\theta \). Also let \(\textbf{v}(\zeta ) = (v_1, \cdots , v_n)^\top \) with \(v_i=y_i-\textbf{u}^\top _{i,}\zeta \). Define \(\textbf{U}_{,-j}\) as the sub-matrix of \(\textbf{U}\) with the jth column excluded. Define \(\textbf{u}_{,j}\) as the jth column of matrix \(\textbf{U}\), and \(u_{i,j}\) as the jth component of vector \(\textbf{u}_{i,}\). Similarly, define \(\mathbf {\zeta }_{-j}\) as the sub-vector of \(\zeta \) with the jth element excluded. For the exponential squared objective function in (1), the first- and second-order derivatives with respect to \(\zeta \) are
\(\frac{\partial Q_\theta (\zeta )}{\partial \zeta }=\textbf{U}^\top \textbf{W}(\zeta )\textbf{v}(\zeta )\) and \(\frac{\partial ^2 Q_\theta (\zeta )}{\partial \zeta _j\partial \zeta _k}=\sum _{i=1}^n \textbf{W}_{i,i}\left( \frac{2 v_i^2}{\theta }-1\right) u_{i,j}u_{i,k}.\)
If \((y_i-\textbf{u}^\top _{i,}\zeta ) ^ 2 /\theta >0.5\), \(\frac{\partial ^2{Q_\theta (\zeta )}}{\partial \zeta _{j}\partial \zeta _k}\ge 0\). On the other hand, if \((y_i-\textbf{u}^\top _{i,}\zeta )^2 /\theta \le 0.5\), \(\frac{\partial ^2{Q_\theta (\zeta )}}{\partial \zeta _{j}\partial \zeta _k}\le 0\). Hence, when searching for the maximizer of \(Q_\theta (\zeta )\), the simple Newton-Raphson approach may diverge if the starting value is too far from the true value. To tackle this problem, we approximate \(Q_\theta (\zeta )\) by a quadratic minorization. Note that \(\frac{\partial ^2{Q_\theta (\zeta )}}{\partial \zeta _{j}\partial \zeta _k}\ge - 2\sum _{i=1}^n\omega _i\exp (-(y_i - \textbf{u}_{i,}^\top \zeta )^2 / \theta )u_{i,j}u_{i,k}/\theta \). Hence a minorized approximation to \(Q_\theta (\zeta )\) at \(\zeta ^{m}\) is
Note that \(\zeta ^m=(\alpha _0^m,\ldots ,\alpha _q^m, {b_1^m}^\top ,\ldots ,{b_p^m}^\top )^\top \) with \(b_k^m=(\beta _k^m, \gamma _{1,k}^m, \ldots , \gamma _{q,k}^m)^\top \). For the penalty, we apply a local linear approximation at \(\zeta ^{m}\), which is given by
after ignoring terms that do not depend on \(\zeta \), where \(\dot{\rho }(t;\lambda , s) = \mathrm{sgn}(t) \left( \lambda - \frac{|t|}{s}\right) _{+}\). If we replace \(Q_\theta (\zeta )\) in (2) with its minorized approximation and plug in the approximation of the penalty, the penalized objective function then takes the form
This function has a “weighted quadratic + penalty” form and can be optimized using the coordinate-descent approach.
The algorithm starts with \(m=0\) and \(\zeta ^m = \textbf{0}\), where m is the index of the MM iteration. At iteration m, the objective function is approximated by its minorization \(L_{\lambda _1, \lambda _2, \theta }(\zeta |\zeta ^m)\) given in (3). Then the penalized weighted quadratic function is maximized using the coordinate-descent algorithm. Denote \(\bar{\zeta }^{old}\) as the estimate of \(\zeta \) before updating. We update each element of the estimate and denote the new estimate as \(\bar{\zeta }^{new}\). This is repeated until the distance between \(\bar{\zeta }^{old}\) and \(\bar{\zeta }^{new}\) is smaller than a prefixed constant. Then \(\zeta ^{m+1} = \bar{\zeta }^{new}\) serves as the new expansion base point for the next minorization. The overall procedure is repeated until convergence. Convergence properties of the MM and CD techniques have been well studied in the literature. With our problem, the objective function increases at each step and is bounded above, which leads to convergence. In numerical studies, we conclude convergence when the difference between the estimates from two consecutive MM steps is small enough. We observe convergence in all numerical examples after a small to moderate number of MM iterations.
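To see the structure of the MM update, note that without the penalty each minorization step reduces to a weighted least squares problem with weights \(\omega _i\exp (-(y_i-\textbf{u}_{i,}^\top \zeta ^m)^2/\theta )\), i.e., an IRLS-type update. The following simplified sketch implements only this unpenalized skeleton; it omits the penalty and the coordinate-descent inner loop, so it illustrates the mechanism rather than the full proposed algorithm (names are ours):

```python
import numpy as np

def irls_exp_sq(U, y, omega, theta, n_iter=50, tol=1e-8):
    """Maximize sum_i omega_i * exp(-(y_i - u_i' zeta)^2 / theta) by MM.

    Each minorization step solves a weighted least squares problem whose
    weights shrink toward zero for large residuals, so contaminated
    outcomes are progressively ignored.
    """
    d = U.shape[1]
    zeta = np.zeros(d)
    for _ in range(n_iter):
        r = y - U @ zeta
        w = omega * np.exp(-r**2 / theta)      # MM / IRLS weights
        WU = U * w[:, None]
        # small ridge term only for numerical stability of the solve
        zeta_new = np.linalg.solve(U.T @ WU + 1e-10 * np.eye(d), WU.T @ y)
        if np.max(np.abs(zeta_new - zeta)) < tol:
            return zeta_new
        zeta = zeta_new
    return zeta
```

With one grossly contaminated outcome, this update recovers the regression coefficients far more accurately than ordinary least squares, since the outlier's weight collapses to essentially zero.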
The proposed method involves tuning parameters. For s in MCP, we follow Zhang (2010) and other published studies, which suggest examining a small number of values or fixing it. In our numerical study, we fix \(s=6\), which has been adopted in published studies (Shi et al., 2014; Xu et al., 2018). We have also examined s values near 6 and observed similar performance (details omitted). In practice, for settings significantly different from ours, other s values may need to be considered. Under low-dimensional settings, Wang et al. (2013) proposed an iterative approach to select the robust tuning parameter \(\theta \). However, their approach is computationally infeasible for high-dimensional data. Under the present setting, we compute the solution for each combination of \((\lambda _1, \lambda _2, \theta )\). This way, we obtain a solution surface over a three-dimensional tuning parameter grid, which is feasible as the proposed computational algorithm only involves simple updates and incurs low cost. The tuning parameters are then selected using a prediction-based method which proceeds as follows: (a) compute the cross-validated sum of prediction errors for each \((\lambda _1, \lambda _2,\theta )\) combination; (b) for each fixed \(\theta \), average the sum of prediction errors over \(\lambda _1, \lambda _2\), and select the \(\theta \) with the smallest average; (c) with the selected \(\theta \), select the \((\lambda _1, \lambda _2)\) with the smallest sum of prediction errors. This procedure first groups all \((\lambda _1, \lambda _2)\) values together and selects the best \(\theta \) value; with the optimal \(\theta \), the optimal \((\lambda _1, \lambda _2)\) values are then selected. Our numerical experiments suggest that this procedure generates more stable estimates than directly searching over the three-dimensional \((\lambda _1, \lambda _2, \theta )\) grid.
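Steps (a)-(c) amount to collapsing a three-dimensional grid of cross-validated prediction errors in two stages. A minimal sketch, assuming the errors are stored in an array indexed by \((\theta , \lambda _1, \lambda _2)\) (array layout and names are ours):

```python
import numpy as np

def select_tuning(cv_err, thetas, lam1s, lam2s):
    """Two-stage tuning selection over a (theta, lam1, lam2) grid.

    cv_err : array of shape (len(thetas), len(lam1s), len(lam2s)) holding
    cross-validated sums of prediction errors. Step (b): pick the theta
    whose errors, averaged over all (lam1, lam2), are smallest. Step (c):
    at that theta, pick the (lam1, lam2) with the smallest error.
    """
    t = int(np.argmin(cv_err.mean(axis=(1, 2))))                     # (b)
    i, j = np.unravel_index(np.argmin(cv_err[t]), cv_err[t].shape)   # (c)
    return thetas[t], lam1s[i], lam2s[j]
```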
With a complex robust goodness-of-fit measure and a penalty that respects the hierarchy, the proposed method is inevitably computationally more expensive than some simpler alternatives. However, as the proposed computational algorithm is composed of relatively simple calculations, the overall computational cost is affordable. With fixed tunings, the analysis of one simulated dataset (described in detail below) takes about nine minutes on a regular laptop. Tuning parameter selection can be conducted in a highly parallel manner to save computing time.
2.4 Consistency Properties
In this section, we rigorously prove that the proposed method can consistently identify the important interactions (and main effects) under ultrahigh-dimensional settings. In the literature, theoretical development for robust methods under high-dimensional settings has been limited. It is especially rare for methods other than the quantile based. With the consistency properties, the proposed method can be preferred over the alternatives whose statistical properties have not been well established. Our theoretical development not only provides a solid ground for the proposed method but also sheds insights for other robust methods under high-dimensional settings.
For any two subsets \(S_1\) and \(S_2\) of \(\{1, \cdots , p+q+ pq+ 1\}\) and a matrix H, we denote by \(H_{S_1 S_2}\) the sub-matrix of H with rows and columns indexed by \(S_1\) and \(S_2\), respectively. Let \(\zeta ^*=(\alpha _0^*,\ldots ,\alpha _q^*, {b_1^*}^\top ,\ldots ,{b_p^*}^\top )^\top \), where \(b_k^*=(\beta _k^*, \gamma _{1,k}^*, \ldots , \gamma _{q,k}^*)^\top \) is the true value of \(\zeta \). Here we make the sparsity assumption, under which only a subset of the components of \(\zeta ^*\) is nonzero. Define the three groups of parameters:
Denote \(\mathscr {A}\) as the set of indices of \(A_1\cup A_2 \cup A_3\) in the vector \({\zeta }^*\). Let \(\mathscr {A}^c\) and \(|\mathscr {A}|\) denote the complement and cardinality of set \(\mathscr {A}\), respectively. We then divide \(\mathscr {A}^c\) into three sets of indices \(\mathscr {B}_1, \mathscr {B}_2\), and \(\mathscr {B}_3\), which correspond to the following three sets
respectively. Define \(D_{n}({\zeta }) = \partial Q_\theta (\zeta )/\partial \zeta \) and \(I_{n}({\zeta }) = \partial ^2 Q_\theta (\zeta )/\partial \zeta \partial \zeta ^\top \).
The following conditions are needed to establish the consistency properties.
- C1. T and C are independent, and \(P(T \le C|T,X,Z) = P(T \le C|T)\).
- C2. The support of T is dominated by that of C. For example, \(\tau _T<\tau _C\) or \(\tau _T= \tau _C=\infty \), where \(\tau _T\) and \(\tau _C\) are the right end points of the support of T and C, respectively.
- C3. \(E[D_{n}({\zeta }^*)] = 0\).
- C4. The distributions of \(D_{n,j}({\zeta }^*)\)’s are subgaussian, that is, \(\Pr (|D_{n,j}({\zeta }^*)| >t )\le 2 \exp \left( - n t^2/\sigma ^2\right) \). Moreover, \(I_{n, jk}({\zeta })- I_{jk}({\zeta })\)’s are subgaussian for all \(\zeta \in \varTheta =\{\zeta : \Vert \zeta -\zeta ^*\Vert _2 <\delta \}\), where \(\delta \) is a positive constant, \(I({\zeta }) = E[I_{n}({\zeta })]\), and \(I_{jk}({\zeta })\) is the (j, k)th component of matrix \(I({\zeta })\). Moreover, there exists a bounded constant \(\kappa \) such that \({ \mathbf {\nu }}^\top [I({\zeta }^1)-I({\zeta }^2)]{ \mathbf {\nu }} \le \kappa \Vert {\zeta }^1-{\zeta }^2\Vert _2\) for any \({\zeta }^1, {\zeta }^2\in \varTheta \) and \(\Vert { \mathbf {\nu }}\Vert _2 = 1\).
- C5. \(I_{\mathscr {A}\mathscr {A}}({\zeta }^*)\) is a \(|\mathscr {A}|\times |\mathscr {A}|\) negative-definite matrix. The eigenvalues of \(I_{\mathscr {A}\mathscr {A}}({\zeta }^*)\) are bounded away from zero and infinity.
- C6. \(\min _{j,k} \{|\gamma _{j,k}^*|: \gamma _{j,k}^*\ne 0\}\gg \lambda _1\vee \lambda _2\). \(\lambda _1\wedge \lambda _2\gg \sqrt{|\mathscr {A}|/n}\).
C1 and C2 have been commonly assumed in the literature; see, for example, Stute (1993, 1996) and Huang et al. (2007). We note that the independent censoring assumption usually holds in practice, although, from a theoretical perspective, quite a few studies have made the weaker conditional independence assumption. We have explored relaxing this assumption and found that alternative and less intuitive assumptions would have to be made. The zero expectation in C3 and the negative-definiteness in C5 ensure the consistency of estimation. C4 is required for Theorem 1, and a similar assumption has been made in Ma and Du (2012). C6 requires that the smallest signal does not decay too fast, which is common in studies on high-dimensional inference. The following theorem establishes the consistency of the proposed estimator \(\widehat{\zeta }\).
Theorem 1
Suppose that conditions C1-C6 hold. Let \(\varpi _n= (\lambda _1\wedge \lambda _2)/\{\max (\varPhi _1, \varPhi _2, \varPhi _3)\}\), where \(\varPhi _t= \Vert I_{\mathscr {B}_t \mathscr {A}}({\zeta }^*)I_{ {\mathscr {A}\mathscr {A}}}({{\zeta }}^*)^{-1}\Vert _\infty \), \(t=1,2,3\). If \(|\mathscr {A}| =o(n)\), \(\lambda _1\vee \lambda _2 \rightarrow 0\), \(n \varpi _n^2 \rightarrow \infty \), and \(\log p = o ( n\varpi _n^2)\), then with probability tending to one, we have
Proof
For the proof, see Appendix. \(\square \)
This theorem establishes that the proposed method is able to accommodate p with \(\log p = o (n\varpi _n^2)\). The penalized robust estimator enjoys the same asymptotic properties as the oracle estimator with probability approaching one. This property holds under high dimensions without restrictive conditions on the errors. To the best of our knowledge, properties of the robust exponential loss, even without censoring, have not been studied under high-dimensional settings. Thus our theoretical investigation can have independent value. Proof of the theorem is presented in Appendix.
3 Simulations
In simulation, we set \(n = 300, q = 5\), and \(p = 1000\). The underlying true model contains a total of 35 nonzero effects, including 5 main E effects, 10 main G effects, and 20 interactions. The positions of the nonzero main G effects are randomly selected. The nonzero interactions are generated to respect the “main effects, interactions” hierarchy. The nonzero regression coefficients are randomly generated from Uniform(0.7, 1.3). We consider both continuous and categorical distributions to mimic, for example, gene expression and SNP data. Specifically, under the continuous scenario, the E and G factors are generated from multivariate normal distributions with marginal means zero, marginal variances one, and the following covariance structures: Independent, AR(0.3), AR(0.8), Band(0.3), Band(0.6), and CS(0.2). Under the Independent structure, all factors have zero correlations. Under the AR\((\rho )\) structure, the ith and jth factors have correlation \(corr = \rho ^{|i-j|}\). Under the Band\((\rho )\) structure, the ith and jth factors have correlation \(corr =\rho \cdot I(|i - j| = 2) + 0.3\cdot I(|i-j| =1)+I(|i-j|=0)\). Under the CS\((\rho )\) structure, the ith and jth factors have correlation \(corr = \rho ^{I(i\ne j)}\). Under the categorical scenario, we first apply the same data generating approach as described above to obtain \(\textbf{U}\). Then for each \(u_{i,j}\), the categorical measurement is generated as \(I(u_{i,j}>-0.7)\). The threshold value \(-0.7\) is chosen such that the proportion of 1’s for each factor is roughly 75%. Under each of the above simulation settings, we consider the random error distribution \((1-\xi )N(0,1)+\xi \,\mathrm{Cauchy}\), with the contamination probability \(\xi =0\), 0.1, and 0.3. When \(\xi =0\), the error distribution has no contamination and favors the nonrobust approaches, while the latter two values lead to different levels of contamination. The log event times are generated from the AFT model.
The censoring times are generated independently from Weibull distributions. The censoring parameters are adjusted so that the censoring rates are about 25%. Beyond the above scenarios, we also consider a set of parallel scenarios, under which there are 10 main E effects, 20 main G effects, and 40 interactions (that is, the number of important effects is doubled), and the nonzero coefficients are generated from uniform (0.4, 0.6) (that is, the signal levels are reduced by about 50%). Other settings remain the same.
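The covariance structures and contaminated errors described above can be generated as follows (a sketch; function names are ours, and the Cauchy component is taken to be standard Cauchy, which the text does not specify):

```python
import numpy as np

def cov_matrix(p, kind, rho):
    """Covariance structures used in the simulation: AR, Band, CS, Independent."""
    idx = np.arange(p)
    d = np.abs(idx[:, None] - idx[None, :])       # |i - j|
    if kind == "AR":
        return rho ** d
    if kind == "Band":
        return 1.0 * (d == 0) + 0.3 * (d == 1) + rho * (d == 2)
    if kind == "CS":
        return np.where(d == 0, 1.0, rho)
    return np.eye(p)                              # Independent

def contaminated_errors(n, xi, rng):
    """Mixture error (1 - xi) N(0,1) + xi Cauchy."""
    is_cauchy = rng.random(n) < xi
    return np.where(is_cauchy, rng.standard_cauchy(n), rng.normal(size=n))
```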
The simulated data are analyzed using the proposed method. In addition, we consider two alternatives: (a) the nonrobust method that adopts the weighted least squares loss and the same penalty as the proposed method, and (b) the quantile regression-based method that adopts an \(L_1\) robust loss and the same penalty as the proposed method. We note that multiple other methods are potentially applicable. Comparing with the nonrobust method can directly establish the merit of being robust. The quantile regression-based approach is the most popular robust approach for high-dimensional data (Wu & Ma, 2015). Thus these two alternatives are the most sensible to compare with.
All three methods involve tuning parameters. To eliminate the (possibly different) effects of tuning parameter selection on identification accuracy, we consider a sequence of tuning parameter values, evaluate identification accuracy at each value, and calculate the AUC (area under the ROC curve) as the overall measure. This approach has been adopted extensively in published studies (Zhu et al., 2014).
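The identification AUC can be computed with the Mann-Whitney form of the AUC, scoring each candidate effect by how strongly it is selected across the tuning sequence (the particular scoring convention below is ours, for illustration):

```python
import numpy as np

def identification_auc(scores, truth):
    """AUC for variable identification.

    scores : per-effect selection strength across the tuning sequence, e.g.
    the largest |estimate| over the path or the fraction of tuning values
    at which the effect is nonzero.
    truth : 1 for truly nonzero effects, 0 otherwise.
    Mann-Whitney formulation: P(score of a true effect > score of a null
    effect), with ties counted as 1/2.
    """
    pos = scores[truth == 1]
    neg = scores[truth == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```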
Summary statistics are computed based on 500 replicates. The AUC results for interactions and main effects combined are presented in Tables 1 and 2, respectively, for the scenarios with 35 and 70 important effects. To be thorough, we have also evaluated identification accuracy for interactions and main effects separately and present the AUC results in Tables 4, 5, 6 and 7 in Appendix. For all three methods, the AUC value decreases as the contamination proportion increases, as expected. In Table 1, the proposed method outperforms the two alternatives under all except one scenario. In Table 2, it dominates the alternatives. Under some scenarios, the proposed method leads to a significant improvement in identification accuracy. For example in Table 1, with the continuous G distribution, 30% contamination, and Band(0.3) correlation, the proposed method has a mean AUC of 0.901, while the alternatives have mean AUCs of 0.761 and 0.789. Compared to the nonrobust alternative, the proposed method also has smaller standard errors (Table 3).
We have also experimented with a few other scenarios and made similar observations. In particular, we have examined the scenarios where the event and censoring times have weak to moderate correlations and observed similar satisfactory performance (details omitted). The proposed method and two alternatives respect the hierarchy. We have also looked into simpler alternatives, including MCP and Lasso, which may violate the hierarchy, and observed inferior performance.
4 Analysis of the TCGA Lung Adenocarcinoma Data
Adenocarcinoma of the lung is the leading cause of cancer death worldwide. Profiling studies have been extensively conducted searching for its prognostic factors. Here we analyze the TCGA (The Cancer Genome Atlas Research Network, 2014) data on the prognosis of lung adenocarcinoma. The TCGA data were recently collected and published by NCI and are of high quality. The prognosis outcome of interest is overall survival. The dataset contains measurements on 43 clinical/environmental variables and 18,897 gene expressions. There are a total of 468 patients, among whom 117 died during follow-up. The median follow-up time is 8 months. We select four E factors for downstream analysis, namely, age, gender, smoking pack years, and smoking history. These factors have a relatively low missing rate in the TCGA dataset and have been previously suggested as potentially related to lung cancer prognosis. There are a total of 436 samples with both E and G measurements available. Among them, 110 died during follow-up, and the median follow-up time is 23 months. For the 326 censored subjects, the median follow-up time is 6 months. In principle, the proposed method can directly analyze all of the available gene expressions. To improve stability and reduce the computational cost, we conduct marginal prescreening. Specifically, genes are screened based on their univariate regression significance (p-value less than or equal to 0.1) and interquartile range (above the median of all interquartile ranges). Similar prescreenings have been adopted in the literature. A total of 819 gene expressions are included in the downstream model fitting. Note that with the main G effects as well as interactions, the number of unknown parameters is much larger than the sample size.
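The two-filter prescreening described above is straightforward to express in code; a sketch, assuming the univariate p-values and interquartile ranges have already been computed (names are ours):

```python
import numpy as np

def prescreen(pvals, iqrs, p_cut=0.1):
    """Marginal prescreening: keep genes with a univariate regression
    p-value <= p_cut AND an interquartile range above the median IQR
    (filtering out both weak marginal signals and near-constant genes)."""
    keep = (pvals <= p_cut) & (iqrs > np.median(iqrs))
    return np.flatnonzero(keep)
```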
Detailed estimation results are presented in Table 3 for the proposed method and Tables 8 and 9 in Appendix for the two alternatives. It is observed that the three methods lead to quite different findings. Specifically, the proposed and quantile methods share four common main G effects and four interactions. Otherwise, there is no overlap in identification. The “signals” in practical data can be weaker than those in simulated data, leading to the significant differences across methods.
With the proposed method, sixteen genes are identified to have interactions with either age or smoking status. As with many other cancer types, age has been identified as a critical factor in lung cancer prognosis. Smoking has been confirmed as the most important E factor for lung cancer risk and prognosis. In the literature, G-E interaction analysis for lung cancer prognosis is still very limited. However, there have been many studies on the functionalities of genes, and searching such studies can provide partial support for the validity of our analysis results. Among the identified genes, many have been implicated in cancer in the literature. Specifically, the AGPAT family, which includes AGPAT6 as a member, has been found to play a role in multiple cancer types. For example, AGPAT2 and AGPAT11 have been found to be upregulated in ovarian, breast, cervical, and colorectal cancers (Agarwal and Garg, 2010). Another gene worth attention is ATF6, which acts both as a sensor and as a transcription factor during endoplasmic reticulum stress. ATF6\(\alpha \) has been found to promote hepatocarcinogenesis and cancer cell proliferation through activating the downstream target gene BIP, and its efficiency of stress recognition and signaling has been found to decrease with age (Naidoo, 2009). We find that gene COLCA2 (colorectal cancer associated 2) interacts with smoking pack years. Studies have shown that COLCA2 may have critical functions in suppressing tumor formation in epithelial cells (Peltekova et al., 2014). We also identify an interaction between NOS1AP and age. It has been found that the protein complex of SCRIB, NOS1AP, and VANGL1 regulates cell polarity and migration, and this complex can be associated with cancer progression (Anastas et al., 2012). An interaction between PPP1R15B and smoking pack years has also been identified.
It has been suggested that PPP1R15B is likely to be regulated by Nrf2, which mounts a protective response to smoking-induced oxidative stress in the lung (Taylor et al., 2008). Also, PPP1R15B may promote cancer cell proliferation.
To complement the identification and estimation analysis, we also evaluate stability. Specifically, we randomly select 3/4 of the subjects and apply the proposed method and the alternatives; this procedure is repeated 200 times, and we compute the probability that each interaction is identified. Similar procedures have been extensively adopted in published studies. The stability results are also provided in Tables 3, 8, and 9. Most of the identified interactions are relatively stable, with many having identification probabilities close to one.
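The stability evaluation can be sketched as follows. The selector below is a simple marginal-correlation threshold standing in for the proposed penalized robust estimator, which in the actual analysis is refit on each subsample; all names and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: only the first two features carry signal.
n, p = 120, 50
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def select(X, y, thresh=0.4):
    # Stand-in selector: keep features whose absolute marginal
    # correlation with y exceeds a threshold.  The paper instead
    # refits its penalized robust AFT estimator on each subsample.
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                  for j in range(X.shape[1])])
    return r > thresh

B = 200          # number of random subsamples (as in the paper)
m = 3 * n // 4   # subsample size: 3/4 of the subjects
freq = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)
    freq += select(X[idx], y[idx])
freq /= B        # estimated selection probability of each feature

print(freq[:2])  # signal features should be selected in most subsamples
```

An effect whose selection frequency is close to one across the 200 subsamples is deemed stable, which is the criterion applied to the interactions reported in Tables 3, 8, and 9.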
5 Discussions
To understand the prognosis of complex diseases, it is essential to study G-E interactions. In “classic” low-dimensional biomedical studies, data contamination has been found to be not rare, and it has been suggested that robust methods are needed to accommodate it. This study has developed a robust method for high-dimensional genetic interaction analysis, which is still limited in the literature. The proposed method consists of a novel robust loss function and a penalized identification strategy that respects the “main effects, interactions” hierarchy, both of which involve novel advancements. Also significantly advancing from the literature, we have rigorously established the consistency properties. The theoretical results may seem “familiar”, which is “comforting” in that consistency is not sacrificed with the additional robustness, high dimensionality, and interactions. It is worth noting that the consistency results do not demand excessive assumptions on the error distribution, which are usually needed in the existing literature. In simulation, the proposed method outperforms the nonrobust alternative, and interestingly has superior performance even when there is no contamination. Another important finding is that it also outperforms the quantile-based robust method. Most of the existing high-dimensional robust studies have adopted the quantile regression technique; our simulation suggests that it is prudent to develop alternative robust methods. In the analysis of TCGA lung cancer data, the proposed method generates results with some overlaps with the quantile regression method but none with the nonrobust method. The identified genes have important implications, and the identified interactions are stable.
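The robustness mechanism can be illustrated with the exponential squared loss of Wang et al. (2013), \(1-\exp (-r^2/\theta )\), which the abstract indicates is the loss adopted here: it is bounded above by 1, so a gross outlier contributes a bounded amount, while for small residuals it behaves like \(r^2/\theta \). A minimal numeric sketch (the function name and residual values are illustrative):

```python
import numpy as np

def exp_squared_loss(r, theta=1.0):
    # Exponential squared loss (Wang et al., 2013): bounded above by 1,
    # and approximately r^2/theta for small residuals.
    return 1.0 - np.exp(-r**2 / theta)

residuals = np.array([0.1, 0.5, 1.0, 100.0])  # last entry: gross outlier
print(exp_squared_loss(residuals))  # the outlier's loss saturates near 1
print(residuals**2)                 # squared loss: the outlier dominates
```

Under the squared loss the outlier contributes 10,000 to the objective and can drag the fit arbitrarily far; under the bounded loss its contribution is capped at 1, which is the source of the robustness to contamination.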
The proposed study can be extended in multiple directions. In survival analysis, there are many models beyond the AFT, and it can be of interest to develop robust methods based on them. We have studied G-E interactions; an extension to G-G interactions can also be of interest. In theoretical analysis, one open problem is the breakdown point; because of the extremely high complexity, this problem has been left uninvestigated in many other robust studies as well. In our simulation, we have experimented with contamination rates as high as 30%, much higher than in many existing studies, and the superiority of the proposed method over the quantile regression method is still observed. The relative efficiency of different robust methods, although of interest, is postponed to future studies. In data analysis, the proposed method identifies a different set of main effects and interactions. Mining the literature and the stability evaluation can support the validity of the findings to a certain extent; more validations need to be pursued in the future.
References
Agarwal, A. K., & Garg, A. (2010). Enzymatic activity of the human 1-acylglycerol-3-phosphate-O-acyltransferase isoform 11: upregulated in breast and cervical cancers. Journal of Lipid Research, 51, 2143–2152.
Anastas, J., Biechele, T., Robitaille, M., Muster, J., Allison, K., Angers, S., & Moon, R. (2012). A protein complex of SCRIB, NOS1AP and VANGL1 regulates cell polarity and migration, and is associated with breast cancer progression. Oncogene, 31, 3696.
Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. Annals of Statistics, 41, 1111–1141.
Bowman, L. (2011). Doctors, researchers worry about accuracy of social security “death file”. http://projects.scrippsnews.com/story/doctors-researchers-worry/. Accessed 30 Apr. 2015
The Cancer Genome Atlas Research Network. (2014). Comprehensive molecular profiling of lung adenocarcinoma. Nature, 511, 543–550.
Fall, K., Stromberg, F., Rosell, J., Andren, O., & Varenhorst, E. (2008). Reliability of death certificates in prostate cancer patients. Scandinavian Journal of Urology, 42, 352–357.
He, S., Chen, H., Zhu, Z., Ward, D., Cooper, H., Viant, M., Heath, J., & Yao, X. (2015). Robust twin boosting for feature selection from high-dimensional omics data with label noise. Information Sciences, 291, 1–18.
Huang, J., Ma, S., & Xie, H. (2007). Least absolute deviations estimation for the accelerated failure time model. Statistica Sinica, 17, 1533–1548.
Huber, P., & Ronchetti, E. (2009). Robust statistics (2nd ed.). Hoboken, NJ: Wiley.
Hunter, D. J. (2005). Gene-environment interactions in human diseases. Nature Reviews Genetics, 6, 287–298.
Liu, J., Huang, J., Zhang, Y., Lan, Q., Rothman, N., Zheng, T., & Ma, S. (2013). Identification of gene-environment interactions in cancer studies using penalization. Genomics, 102, 189–194.
Ma, S., & Du, P. (2012). Variable selection in partly linear regression model with diverging dimensions for right censored data. Statistica Sinica, 22, 1003–1020.
Naidoo, N. (2009). ER and aging-protein folding and the ER stress response. Ageing Research Reviews, 8, 150–159.
Peltekova, V., Lemire, M., Qazi, A., Zaidi, S., Trinh, Q., Bielecki, R., Rogers, M., Hodgson, L., Wang, M., D’souza, D., et al. (2014). Identification of genes expressed by immune cells of the colon that are regulated by colorectal cancer-associated variants. International Journal of Cancer, 134, 2330–2341.
Shi, X., Liu, J., Huang, J., Zhou, Y., Xie, Y., & Ma, S. (2014). A penalized robust method for identifying gene-environment interactions. Genetic Epidemiology, 38, 220–230.
Stute, W. (1993). Consistent estimation under random censorship when covariables are present. The Journal of Multivariate Analysis, 45, 89–103.
Stute, W. (1996). Distributional convergence under random censorship when covariables are present. Scandinavian Journal of Statistics, 23, 461–471.
Taylor, R., Acquaah-Mensah, G., Singhal, M., Malhotra, D., & Biswal, S. (2008). Network inference algorithms elucidate Nrf2 regulation of mouse lung oxidative stress. PLOS Computational Biology, 4, e1000166.
Thomas, D. (2010). Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies. Annual Review of Public Health, 31, 21–36.
Wang, X., Jiang, Y., Huang, M., & Zhang, H. (2013). Robust variable selection with exponential squared loss. Journal of the American Statistical Association, 108, 632–643.
Wu, C., Cui, Y., & Ma, S. (2014). Integrative analysis of gene-environment interactions under a multi-response partially linear varying coefficient model. Statistics in Medicine, 33, 4988–4998.
Wu, C., & Ma, S. (2015). A selective review of robust variable selection with applications in bioinformatics. Briefings in Bioinformatics, 16, 873–883.
Xu, Y., Wu, M., Ma, S., & Ejaz Ahmed, S. (2018). Robust gene-environment interaction analysis using penalized trimmed regression. Journal of Statistical Computation and Simulation, 88, 3502–3528.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.
Zhu, R., Zhao, H., & Ma, S. (2014). Identifying gene-environment and gene-gene interactions using a progressive penalization approach. Genetic Epidemiology, 38, 353–368.
Appendix
Proof of Theorem 1
Proof
Define the oracle estimator \(\widehat{\zeta }\) with \(\widehat{\zeta }_{\mathscr {A}^c}=0\) and
Recall that the proposed objective function is
In what follows, we first establish the estimation consistency of \(\widehat{\zeta }\) in Step 1, and then show that \(\widehat{\zeta }\) is a local maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\) in Step 2.
\(\underline{{\textbf {{ Step 1}}}}\). Define the objective function
Then \(\widehat{\zeta }_\mathscr {A} = {\arg \max }R_n({\zeta }_\mathscr {A})\). Let \(r_n = \sqrt{|\mathscr {A}|/n}\). To prove \(\Vert \widehat{\zeta }_\mathscr {A} - {\zeta }_\mathscr {A}^*\Vert _2 = O_p(r_n)\), it suffices to show that for any given \(\eta >0\), there exists a sufficiently large constant \(C>0\),
where \(\mathscr {I}=\left\{ {\zeta }_\mathscr {A}: \Vert {\zeta }_\mathscr {A} - {\zeta }_\mathscr {A}^*\Vert _2 = Cr_n\right\} \). This implies that \(R_n({\zeta }_\mathscr {A})\) has a local maximizer \(\widehat{\zeta }_\mathscr {A}\) that satisfies \(\Vert \widehat{\zeta }_\mathscr {A} - {\zeta }_\mathscr {A}^*\Vert _2 = O_p(r_n)\).
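The display omitted in the sentence above, labelled (6) in the original and invoked again at the end of Step 1, is presumably the standard local-maximizer inequality from this style of argument; a hedged reconstruction:

```latex
\Pr\left( \sup_{{\zeta }_\mathscr {A}\in \mathscr {I}}
  R_n({\zeta }_\mathscr {A}) < R_n({\zeta }_\mathscr {A}^*) \right)
  \ge 1-\eta .
```

That is, with probability at least \(1-\eta \), the objective on the sphere of radius \(Cr_n\) around \({\zeta }_\mathscr {A}^*\) is everywhere below its value at the center, so a local maximizer must lie inside the ball.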
Recall the definitions of \(D_{n}({\zeta })\) and \(I_{n}({\zeta })\). By Taylor’s expansion, we have
where \(\bar{\zeta }\) lies between \({\zeta }^*\) and \({\zeta }\). By C3 and C4, we have that for all \(j\in \{1, \cdots , p+q+pq+1\}\) and any given t, \(\Pr (|D_{n,j}({\zeta }^*)| >t )\le 2 \exp \left( - n t^2/\sigma ^2\right) \). Then \(E(|\sqrt{n}D_{n,j}({\zeta }^*)|)<K<\infty \) for all j. With Markov’s inequality,
By the Cauchy-Schwarz inequality, \(Q_1\le C\Vert D_{n, \mathscr {A}}({\zeta }^*)\Vert _2 r_n\). Let \(t = C\rho _* r_n/3\), where \(\rho _*\) is the smallest eigenvalue of \(-I_{\mathscr {A}\mathscr {A}}({\zeta }_{\mathscr {A}}^*)\). From C5, we have that \(\rho _*\) is bounded away from zero and infinity. Then we have
For \(Q_2\), we have
Since \(\lambda _{\max }(I_{{\mathscr {A}\mathscr {A}}}(\zeta ^*))\le -\rho _*\) by C5, we have
Under C4, we have
The second inequality holds since \(\bar{\zeta }\) lies between \({\zeta }^*\) and \({\zeta }\), which yields \(\Vert \bar{{\zeta }}-{\zeta }^*\Vert _2<C r_n\). When n is sufficiently large, the last inequality holds. With C4 and Bonferroni’s inequality,
where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm. By the inequality \(\lambda _{\max }(I_{n,{\mathscr {A}\mathscr {A}}}(\bar{{\zeta }}) -I_{{\mathscr {A}\mathscr {A}}}(\bar{{\zeta }}))\le \Vert I_{n,{\mathscr {A}\mathscr {A}}}(\zeta ^*)-I_{{\mathscr {A}\mathscr {A}}} ({\zeta }^*)\Vert _F\), we have
Combining (9), (10), (11), and (12), we have
With (7), (8), and (13), we have
with probability at least
Note that \(\rho _*\) is bounded away from zero and infinity in C5. As \(n\rightarrow \infty \), the above probability is bigger than \(1- \frac{16K}{C^2\rho ^2_*}\). Let \(C=4\rho _*^{-1} \sqrt{K/\eta }\), then we can conclude (6).
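One detail elided in Step 1 is why \(E(|\sqrt{n}D_{n,j}({\zeta }^*)|)<K<\infty \): a possible derivation, assuming only the sub-Gaussian tail bound from C3 and C4 stated above:

```latex
E\left(|\sqrt{n}D_{n,j}({\zeta }^*)|\right)
 = \int_0^\infty \Pr\left(\sqrt{n}|D_{n,j}({\zeta }^*)| > t\right)dt
 \le \int_0^\infty 2\exp\left(-t^2/\sigma ^2\right)dt
 = \sigma \sqrt{\pi } < \infty ,
```

so one may take \(K=\sigma \sqrt{\pi }\), uniformly in \(j\).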
\(\underline{{\textbf {{Step 2}}}}\). Next we show that the oracle estimator \(\widehat{\zeta }\) studied in Step 1 satisfies the Karush-Kuhn-Tucker (KKT) condition, and hence \(\widehat{\zeta }\) is a local maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\). Based on the results in Step 1 and C6, we only need to check that the following conditions
hold with asymptotic probability one, where \(\Vert \nu \Vert _\infty = \max _i |\nu _i|\) for any vector \(\nu = (\nu _1, \cdots , \nu _{|\mathscr {A}^c|})\). Applying Taylor’s expansion,
where \(\widetilde{\zeta }\) lies between \({\zeta }^*\) and \(\widehat{{\zeta }}\). From (4) and the proof of Theorem 1(a), we have
where \(\bar{\zeta }\) lies between \({\zeta }^*\) and \(\widehat{{\zeta }}\), which is defined in Step 1. By substituting (17) into (16),
Here we define
Inspired by the deduction of \(Q_2\) in Step 1, we can establish that
That is, we only need to focus on \(\Vert \varDelta _{n, \mathscr {B}_1}^*\Vert _\infty \) in order to evaluate the probability of \(\{\Vert D_{n, \mathscr {B}_1}(\widehat{\zeta })\Vert _\infty <\lambda _1\}\) in (15). Note that,
Recall that \(\varPhi _1= \Vert I_{\mathscr {B}_1 \mathscr {A}}({\zeta }^*)I_{ {\mathscr {A}\mathscr {A}}}({{\zeta }}^*)^{-1}\Vert _\infty \). If
along with (19), we have \(\Vert \varDelta _{n, \mathscr {B}_1}^*\Vert _\infty <\lambda _1\). Similarly, we also need
to satisfy the other two conditions in (15), where \(\varPhi _2= \Vert I_{\mathscr {B}_2 \mathscr {A}}({\zeta }^*)I_{ {\mathscr {A}\mathscr {A}}}({{\zeta }}^*)^{-1}\Vert _\infty \) and \(\varPhi _3= \Vert I_{\mathscr {B}_3 \mathscr {A}}({\zeta }^*)I_{ {\mathscr {A}\mathscr {A}}}({{\zeta }}^*)^{-1}\Vert _\infty \). Based on the above discussions, we have
We now derive the probability bound for the above event. By Bonferroni’s inequality and C4, we can obtain
Combining the results in Steps 1 and 2, we conclude that \(\widehat{\zeta }\) is a local maximizer of \(L_{\lambda _1, \lambda _2, \theta }(\zeta )\) with probability at least
and satisfies \(\Vert \widehat{\zeta }_{\mathscr {A}} - {\zeta }^*_{\mathscr {A}}\Vert _2=O_p(\sqrt{|\mathscr {A}|/n}),~ \widehat{\zeta }_{\mathscr {A}^c}=0\). With C6, \(\log p = O ( n \varpi _n^2)\), and \(\varpi _n= (\lambda _1\wedge \lambda _2)/\{\max (\varPhi _1, \varPhi _2, \varPhi _3)\}\), this tail probability is exponentially small. The theorem is thus proved.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Zhang, Q., Chai, H., Liang, W., Ma, S. (2023). Robust Identification of Gene-Environment Interactions Under High-Dimensional Accelerated Failure Time Models. In: Zheng, Z. (eds) Proceedings of the Second International Forum on Financial Mathematics and Financial Technology. IFFMFT 2021. Financial Mathematics and Fintech. Springer, Singapore. https://doi.org/10.1007/978-981-99-2366-3_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2365-6
Online ISBN: 978-981-99-2366-3
eBook Packages: Economics and Finance (R0)