Abstract
Evaluating multiple binary outcomes is common in genetic studies of complex diseases. These outcomes are often correlated because they are collected from the same individual and they may share common marker effects. In this paper, we propose a procedure to test for the effect of a single nucleotide polymorphism (SNP) set on multiple, possibly correlated, binary responses. We develop a score-based test using a non-parametric modeling framework that jointly models the global effect of the marker set. We account for non-linear effects and potentially complicated interactions between markers using reproducing kernels. Our testing procedure only requires estimation under the null hypothesis, and we use multivariate generalized estimating equations to estimate the model components and to account for the correlation among the outcomes. We evaluate the finite-sample performance of our test via a simulation study and demonstrate our methods using the Clinical Antipsychotic Trials of Intervention Effectiveness antibody study data and the CoLaus study data.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12561-017-9189-9/MediaObjects/12561_2017_9189_Fig1_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12561-017-9189-9/MediaObjects/12561_2017_9189_Fig2_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12561-017-9189-9/MediaObjects/12561_2017_9189_Fig3_HTML.gif)
References
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Arsenault BJ, Rana JS, Stroes ESG, Després J-P, Shah PK, Kastelein JJP, Wareham NJ, Boekholdt SM, Khaw K-T (2010) Beyond low-density lipoprotein cholesterol: respective contributions of non-high-density lipoprotein cholesterol levels, triglycerides, and the total cholesterol/high-density lipoprotein cholesterol ratio to coronary heart disease risk in apparently healthy men and women. J Am Coll Cardiol 55:35–41
Austin MA, Hokanson JE, Edwards KL (1998) Hypertriglyceridemia as a cardiovascular risk factor. Am J Cardiol 81:7B–12B
Bauer CR, Shankaran S, Bada HS, Lester B, Wright LL, Krause-Steinrauf H, Smeriglio VL, Finnegan LP, Maza PL, Verter J (2002) The maternal lifestyle study: drug exposure during pregnancy and short-term maternal outcomes. Am J Obstet Gynecol 186:487–495
Buhmann MD (2003) Radial basis functions: theory and implementations. Cambridge University Press, Cambridge
Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167
Chen J, Chen W, Zhao N, Wu MC, Schaid DJ (2016) Small-sample kernel association tests for human genetic and microbiome association studies. Genet Epidemiol 40:5–19
Das A, Poole WK, Bada HS (2004) A repeated measures approach for simultaneous modeling of multiple neurobehavioral outcomes in newborns exposed to cocaine in utero. Am J Epidemiol 159:891–899
Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (2001) Executive summary of the third report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III). JAMA 285:2486–2497
Firmann M, Mayor V, Vidal PM, Bochud M, Pecoud A, Hayoz D, Paccaud F, Preisig M, Song KS, Yuan X et al (2008) The CoLaus study: a population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc Disord 8:6
Freytag S, Bickeböller H, Amos CI, Kneib T, Schlather M (2012) A novel kernel for correcting size bias in the logistic kernel machine test with an application to rheumatoid arthritis. Hum Hered 74:97–108
Girault EM, Foppen E, Ackermans MT, Fliers E, Kalsbeek A (2013) Central administration of an orexin receptor 1 antagonist prevents the stimulatory effect of Olanzapine on endogenous glucose production. Brain Res 1527:238–245
Grundy SM, Cleeman JI, Daniels SR, Donato KA, Eckel RH, Franklin BA, Gordon DJ, Krauss RM, Savage PJ, Smith SC Jr et al (2005) Diagnosis and management of the metabolic syndrome: an American Heart Association/National Heart, Lung, and Blood Institute scientific statement. Circulation 112:2735–2752
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36:1171–1220
Kralisch S, Klein J, Lossner U, Bluher M, Paschke R, Stumvoll M, Fasshauer M (2005) Isoproterenol, TNFalpha, and insulin downregulate adipose triglyceride lipase in 3T3-L1 adipocytes. Mol Cell Endocrinol 240:43–49
Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP (2008) A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 82:386–397
Lanckriet GRG, Cristianini N, Bartlett P, El Ghaoui L, Jordan M (2004) Learning the kernel matrix with semidefinite programming. J Mach Learn Res 5:27–72
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34:816–834
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
Lieberman JA, Stroup TS, McEvoy JP, Swartz MS, Rosenheck RA, Perkins DO, Keefe RSE, Davis SM, Davis CE, Lebowitz BD et al (2005) Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. N Engl J Med 353:1209–1223
Lin X (1997) Variance component testing in generalised linear models with random effects. Biometrika 84:309–326
Lipsitz SR, Fitzmaurice GM, Ibrahim JG, Sinha D, Parzen M, Lipshultz S (2009) Joint generalized estimating equations for multivariate longitudinal binary outcomes with missing data: an application to acquired immune deficiency syndrome data. J R Stat Soc 172:3–20
Liu D, Lin X, Ghosh D (2007) Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics 63:1077–1088
Liu D, Ghosh D, Lin X (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform 9:292
Maity A, Sullivan PF, Tzeng JY (2012) Multivariate phenotype association analysis by marker-set kernel machine regressions. Genet Epidemiol 36:686–695
McCartan C, Mason R, Jayasinghe SR, Griffiths LR (2012) Cardiomyopathy classification: ongoing debate in the genomics era. Biochem Res Int 2012:796926
Miller M, Stone NJ, Ballantyne C, Bittner V, Criqui MH, Ginsberg HN, Goldberg AC, Howard WJ, Jacobson MS, Kris-Etherton PM et al (2011) Triglycerides and cardiovascular disease: a scientific statement from the American Heart Association. Circulation 123:2292–2333
Nam D, Kim SY (2008) Gene-set approach for expression pattern analysis. Brief Bioinform 9:189–197
Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337:100–104
Pan KH, Lih CJ, Cohen SN (2005) Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci USA 102:8961–8965
Shen Y, Zhao Y, Zheng D, Chang X, Ju S, Guo L (2013) Effects of orexin A on GLUT4 expression and lipid content via MAPK signaling in 3T3-L1 adipocytes. J Steroid Biochem Mol Biol 138:376–383
Sikder D, Kodadek T (2007) The neurohormone orexin stimulates hypoxia-inducible factor-1 activity. Genes Dev 21:2995–3005
Sullivan PF, Lin D, Tzeng JY, van den Oord E, Perkins D, Stroup TS, Wagner M, Lee S, Wright FA, Zou F et al (2008) Genomewide association for schizophrenia in the CATIE study: results of Stage 1. Mol Psychiatry 13:570–584
Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore
Szafranski M, Grandvalet Y, Rakotomamonjy A (2010) Composite kernel learning. Mach Learn 79:73–103
Tsuneki H, Wada T, Sasaoka T (2012) Role of orexin in the central regulation of glucose and energy homeostasis. Endocr J 59:365–374
Vapnik VN (1998) Statistical learning theory. Wiley, New York
Wang X, Lee S, Zhu X, Redline S, Lin X (2013) GEE-based SNP set association test for continuous and discrete traits in family based association studies. Genet Epidemiol 37:778–786
Wortley KE, Chang GQ, Davydova Z, Leibowitz SF (2003) Peptides that regulate food intake: orexin gene expression is increased during states of hypertriglyceridemia. Am J Physiol 284:R1454–R1465
Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X (2010) Powerful SNP set analysis for case-control genome-wide association studies. Am J Hum Genet 86:929–942
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X (2011) Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am J Hum Genet 89:82–93
Wu M, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel S, Molldrem JJ, Armistead PM (2013) Kernel machine SNP-set testing under multiple candidate kernels. Genet Epidemiol 37:267–275
Yan Q, Tiwari HK, Yi N, Gao G, Zhang K, Lin W, Lou XY, Cui X, Liu N (2015) A sequence kernel association test for dichotomous traits in family samples under a generalized linear mixed model. Hum Hered 79:60–68
Yolken RH, Torrey EF, Lieberman JA, Yang S, Dickerson FB (2011) Serological evidence of exposure to herpes simplex virus Type 1 is associated with cognitive deficits in the CATIE schizophrenia sample. Schizophr Res 128:61–65
Zhang D, Lin X (2003) Hypothesis testing in semiparametric additive mixed models. Biostatistics 4:57–74
Zhang Y, Xu Z, Shen X, Pan W, Alzheimer’s Disease Neuroimaging Initiative (2014) Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. Neuroimage 96:309–325
Zhao Y, Chen F, Zhai R, Lin X, Diao N (2012) Association test based on SNP set: logistic kernel machine based test vs. principal component analysis. PLoS ONE 7:e44978
Acknowledgements
The authors thank Dr. Robert Yolken at Johns Hopkins University for providing the antibody data. The authors also thank Drs. Peter Vollenweider and Gerard Waeber, PIs of the CoLaus study, and Drs. Meg Ehm and Matthew Nelson, collaborators at GlaxoSmithKline, for providing the CoLaus phenotype and sequence data. This work was supported by National Institutes of Health Grants R00 ES017744 (to A.M.), R01 MH084022 (to J.Y.T. and P.F.S.), and P01 CA142538 (to J.Y.T.).
Ethics declarations
Conflicts of interest
The authors have no conflicts of interest to declare.
Appendix
From Sect. 2.1, the parameters \({\varvec{\beta }}\) and \(h\) in (1) can be estimated by maximizing the penalized log-likelihood using a Fisher scoring or a Newton–Raphson algorithm. Liu et al. [24] show that, by treating \({\varvec{\beta }}\) as a vector of fixed effects and \(\mathbf{h } = (h_{1},\ldots , h_{n})^\mathrm{T}\) as a vector of random effects, the logistic KM estimator is the same as the penalized quasi-likelihood estimator from a logistic mixed model \(\text {logit}(p_{i}) = \mathbf{x }_{i}^\mathrm{T} {\varvec{\beta }} + h_{i},\) where \(\mathbf{h } \sim N(\mathbf 0 ,\, \tau \mathbf{K }),\, \tau = 1/ \lambda ,\,\lambda \) is the penalty parameter from the penalized likelihood, and \(\mathbf{K }\) is the \(n \times n\) matrix whose \((i,\,j)\)th element is given by (2). The normal equations given in (5) of [24] coincide with iteratively fitting a working linear mixed model \(\widetilde{\mathbf{y }} = \mathbf{X } {\varvec{\beta }} + \mathbf{h } + \varvec{\varepsilon }\) until convergence, where \({\varvec{\beta }}\) and \(\mathbf{h }\) are estimated using the best linear unbiased estimator (BLUE) and the best linear unbiased predictor (BLUP), respectively, and \(\varvec{\varepsilon } \sim \text {N}(\mathbf 0 ,\, \mathbf{D }^{-1})\) with \(\mathbf{D } = \text {diag}\{p_{i}(1-p_{i})\}.\) The regularization parameter \(\tau \) can be estimated by treating it as a variance component and maximizing the REML criterion
where \(\mathbf{V }_{u} = \mathbf{D }^{-1} + \tau \mathbf{K }\) and \(\widetilde{\mathbf{y }} = \mathbf{X } {\varvec{\beta }} + \mathbf{h } + \mathbf{D }^{-1}(\mathbf{y } - \mathbf p ).\) We refer to [24] for full details.
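For orientation, the REML criterion of a working linear mixed model of this kind has a standard form, sketched here in the notation of this appendix. The symbol \(\ell _{R}\) and the arrangement of terms are assumptions rather than a transcription of (8); [24] remains the authoritative statement.

\[
\ell _{R}(\tau ) = -\tfrac{1}{2}\log |\mathbf{V }_{u}| - \tfrac{1}{2}\log |\mathbf{X }^\mathrm{T}\mathbf{V }_{u}^{-1}\mathbf{X }| - \tfrac{1}{2}(\widetilde{\mathbf{y }} - \mathbf{X }\widehat{{\varvec{\beta }}})^\mathrm{T}\mathbf{V }_{u}^{-1}(\widetilde{\mathbf{y }} - \mathbf{X }\widehat{{\varvec{\beta }}})
\]

Under this form, \(\mathbf{V }_{u}^{-1} = \mathbf{D }\) at \(\tau = 0,\) so differentiating with respect to \(\tau \) and evaluating at \(\tau = 0\) leaves a quadratic form in the working residuals minus a trace term:

\[
\left. \frac{\partial \ell _{R}}{\partial \tau }\right| _{\tau = 0} = \tfrac{1}{2}\left\{ (\widetilde{\mathbf{y }} - \mathbf{X }\widehat{{\varvec{\beta }}}_{0})^\mathrm{T}\mathbf{D }\mathbf{K }\mathbf{D }(\widetilde{\mathbf{y }} - \mathbf{X }\widehat{{\varvec{\beta }}}_{0}) - \text {tr}(\mathbf{P }_{0}\mathbf{K })\right\} ,
\]

with \(\mathbf{P }_{0}\) the projection-type matrix defined alongside the score statistic in the next paragraph.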
Testing the overall genetic effect \(H_{0}{\text {:}}\,h(\mathbf{z }) = 0\) for univariate (UV) responses is equivalent to testing \(H_{0}{\text {:}}\, \tau = 0.\) Liu et al. [24] propose the following score test statistic based on the derivative of (8) with respect to \(\tau \)
where \(Q(\widehat{{\varvec{\beta }}}_{0}) = (\widetilde{\mathbf{y }} - \mathbf{X } \widehat{{\varvec{\beta }}}_{0})^\mathrm{T} \mathbf{D } \mathbf{K } \mathbf{D } (\widetilde{\mathbf{y }} - \mathbf{X } \widehat{{\varvec{\beta }}}_{0}) = (\mathbf{y } - \widehat{\mathbf{p }}_{0})^\mathrm{T} \mathbf{K } (\mathbf{y } - \widehat{\mathbf{p }}_{0}),\, \text {logit} (\widehat{\mathbf{p }}_{0}) = \mathbf{X } \widehat{{\varvec{\beta }}}_{0},\, \widehat{{\varvec{\beta }}}_{0}\) is the MLE of \({\varvec{\beta }}\) under the null logistic model, \(p_{Q} = \text {tr}\{\mathbf{P }_{0} \mathbf{K } \},\, \sigma _{Q} = 2 \text {tr}\{\mathbf{P }_{0} \mathbf{K } \mathbf{P }_{0} \mathbf{K } \},\) and \(\mathbf{P }_{0} = \mathbf{D }_{0} - \mathbf{D }_{0} \mathbf{X } (\mathbf{X }^\mathrm{T}\mathbf{D }_{0} \mathbf{X })^{-1} \mathbf{X }^\mathrm{T} \mathbf{D }_{0}\) where \(\mathbf{D }_{0} = \text {diag}\{\hat{p}_{i0}(1-\hat{p}_{i0}) \}.\)
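The quantities defined above can be computed directly once the null logistic model has been fitted. The following is a minimal sketch, not the authors' implementation: the function names, the toy data, and the linear kernel \(\mathbf{K } = \mathbf{Z }\mathbf{Z }^\mathrm{T}\) are illustrative assumptions (the kernel in (2) may differ), and the final standardization of the statistic is deliberately left out.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=50, tol=1e-10):
    """Fisher scoring for the null model logit(p) = X beta (X includes the intercept)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        d = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def km_score_components(X, y, K):
    """Return Q, p_Q, sigma_Q and P_0 as defined in the text, at the null MLE."""
    beta0 = fit_null_logistic(X, y)
    p0 = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    d0 = p0 * (1.0 - p0)                      # diagonal of D_0
    resid = y - p0
    Q = resid @ K @ resid                     # (y - p0)^T K (y - p0)
    DX = d0[:, None] * X                      # D_0 X
    P0 = np.diag(d0) - DX @ np.linalg.solve(X.T @ DX, DX.T)
    PK = P0 @ K
    p_Q = np.trace(PK)
    sigma_Q = 2.0 * np.trace(PK @ PK)
    return Q, p_Q, sigma_Q, P0

# Toy data generated under H0 (hypothetical sizes and allele frequency).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.binomial(2, 0.3, size=(n, 10)).astype(float)   # SNP genotypes coded 0/1/2
K = Z @ Z.T                                            # linear kernel (assumption)
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 0.8]))))
y = rng.binomial(1, prob).astype(float)

Q, p_Q, sigma_Q, P0 = km_score_components(X, y, K)
```

Because \(\mathbf{P }_{0}\mathbf{X } = \mathbf 0 ,\) a quick sanity check on any implementation is that `P0 @ X` vanishes up to rounding error.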
From Sect. 2.2.1, we modify the working model in (5). Define \(\mathbf{D }\) as the block diagonal matrix with blocks \(\mathbf{D }_{1}, \ldots , \mathbf{D }_{t}.\) Then the variance–covariance matrix of \(\varvec{\varepsilon }\) is \(\mathbf{D }^{-1} \mathbf{D }^{1/2} \mathbf{S } \mathbf{D }^{1/2} \mathbf{D }^{-1} = \mathbf{D }^{-1/2} \mathbf{S } \mathbf{D }^{-1/2}\) and the modified working model will have the same form as (5) but with \(\varvec{\varepsilon } \sim \text {MVN}(\mathbf 0 ,\, \mathbf{D }^{-1/2} \mathbf{S } \mathbf{D }^{-1/2}).\) The parameters \({\varvec{\beta }}_{j}\) and \(\mathbf{h }_{j}\) can now be estimated using BLUE and BLUP, respectively, and the variance components \(\tau _{j}\) can be estimated by maximizing the restricted quasi-likelihood criterion
where \(\mathbf{V }_{m} = \mathbf{D }^{-1/2} \mathbf{S } \mathbf{D }^{-1/2} + \mathbf{V }_{h}.\)
The main goal is to test for genetic pathway effects \(H_{0}{\text {:}}\, \mathbf{h }(\cdot ) = \mathbf 0 \) which is equivalent to testing \(H_{0}{\text {:}}\, \tau _{1} = \cdots = \tau _{t} = 0.\) To do this, we propose a score-type test statistic based on the derivative of the quasi-likelihood like that in (10). Taking the derivative of the criterion in (10) with respect to \(\tau _{j}\) for \(j=1,\ldots , t\) and then setting \(\tau _{j} = 0,\) the score function for \(\tau _{j}\) is \(S_{j} = Q_{j}({\varvec{\beta }},\,\varvec{\theta }) - p_{jQ},\) where
and \(p_{jQ} = \text {tr}\{ \mathbf{P } \mathbf{K }_{j}^{*} \}.\) Because the \(\tau _{j}\)’s are variance components and therefore non-negative, testing \(H_{0}{\text {:}}\,\tau _{1} = \cdots = \tau _{t} = 0\) is equivalent to testing \(H_{0}{\text {:}}\,\tau _{1} + \cdots + \tau _{t} = 0,\) and we adopt a technique similar to that of [25].
From Sect. 2.2.2, in order to evaluate Q in (6), we estimate \({\varvec{\beta }}\) and \(\varvec{\theta }\) under the null using GEEs. We posit the GEEs under \(H_{0}\)
where \(\widetilde{\mathbf{X }}\) is an \(nt \times nt\) block diagonal matrix with elements \(\mathbf{X }\) and \(\mathbf{V }_{m0} = \mathbf{D }^{-1/2} \mathbf{S } \mathbf{D }^{-1/2}.\) If there is no genetic pathway effect (\(H_{0}{\text {:}}\,\mathbf{h }(\cdot ) = \mathbf 0 \) is true), then \(\mathbf {V} _{m0}\) is the working variance–covariance matrix of \(\mathbf{y }.\) To solve (11), Liang and Zeger [19] suggest using a modified Fisher scoring algorithm to find \({\varvec{\beta }}\) and a method of moments estimation for \(\varvec{\theta }.\) The updating equation is
The initial estimates in the first iteration come from fitting a GLM assuming independence.
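The alternation described above (Fisher scoring for the regression coefficients, a moment estimator for the correlation, and a start from independent GLM fits) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses an unstructured working correlation `R` among the \(t\) outcomes estimated from Pearson residuals and rescaled to unit diagonal, fixes the dispersion at 1, and the names `gee_null_fit` and `logistic_irls` are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_irls(X, y, n_iter=50, tol=1e-10):
    """Ordinary logistic regression, used only for the starting values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        d = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def gee_null_fit(X, Y, n_iter=100, tol=1e-8):
    """X: n x p covariates shared by all outcomes; Y: n x t binary outcomes.
    Returns B (p x t, column j = beta_j) and the t x t working correlation R."""
    n, p = X.shape
    t = Y.shape[1]
    B = np.column_stack([logistic_irls(X, Y[:, j]) for j in range(t)])
    R = np.eye(t)
    for _ in range(n_iter):
        Mu = expit(X @ B)
        A = Mu * (1.0 - Mu)                       # marginal variances
        E = (Y - Mu) / np.sqrt(A)                 # Pearson residuals
        C = E.T @ E / n                           # method-of-moments estimate
        s = np.sqrt(np.diag(C))
        R = C / np.outer(s, s)                    # rescale to unit diagonal
        R_inv = np.linalg.inv(R)
        score = np.zeros(p * t)
        info = np.zeros((p * t, p * t))
        for i in range(n):
            a = np.sqrt(A[i])
            W = (a[:, None] * R_inv) / a[None, :]  # A^{1/2} R^{-1} A^{-1/2}
            G = (a[:, None] * R_inv) * a[None, :]  # A^{1/2} R^{-1} A^{1/2}
            score += np.kron(W @ (Y[i] - Mu[i]), X[i])
            info += np.kron(G, np.outer(X[i], X[i]))
        step = np.linalg.solve(info, score)       # Fisher scoring update
        B = B + step.reshape(t, p).T
        if np.max(np.abs(step)) < tol:
            break
    return B, R

# Toy example: two correlated binary outcomes with logistic margins.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([[-0.5, 0.3], [0.8, -0.4]])     # column j = beta_j (hypothetical)
shared = rng.uniform(size=(n, 1))
own = rng.uniform(size=(n, 2))
use_shared = rng.uniform(size=(n, 1)) < 0.5        # induces positive correlation
U = np.where(use_shared, shared, own)
Y = (U < expit(X @ B_true)).astype(float)

B_hat, R_hat = gee_null_fit(X, Y)
```

The returned `R` is the estimate used in the final scoring step; a production fit would typically rely on an established GEE implementation rather than this loop.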
We then use a simulation-based technique to get the p-values of Q. It can be shown that, under \(H_0,\) var\((\mathbf{y }_{j} - \widehat{\mathbf{p }}_{j})\) can be approximated as \(\widehat{\mathbf{D }}_{j} - \widehat{\mathbf{D }}_{j} \mathbf{X } (\mathbf{X }^\mathrm{T} \widehat{\mathbf{D }}_{j} \mathbf{X })^{-1} \mathbf{X }^\mathrm{T} \widehat{\mathbf{D }}_{j} = \widehat{\mathbf{P }}_{j}.\) This follows from the fact that under \(H_0,\,\mathbf{y }_{j} - \widehat{\mathbf{p }}_{j} = \mathbf{D }_{j}( \widetilde{\mathbf{y }}_{j} - \mathbf{X } \widehat{{\varvec{\beta }}}_{j}),\) and that \(\text {Cov}(\widetilde{\mathbf{y }}_{j} - \mathbf{X } \widehat{{\varvec{\beta }}}_{j}) = \mathbf{D }_{j}^{-1} - \mathbf{X } (\mathbf{X }^\mathrm{T} {\mathbf{D }}_{j} \mathbf{X })^{-1} \mathbf{X }^\mathrm{T}\) from weighted linear model theory; pre- and post-multiplying by \(\mathbf{D }_{j}\) then gives \(\mathbf{P }_{j}.\) If the multiple outcomes were independent, then the variance–covariance matrix of \(\mathbf{y } - \widehat{\mathbf{p }}\) would be \(\widehat{\mathbf{P }},\) a block diagonal matrix with diagonal blocks \(\widehat{\mathbf{P }}_{j},\) but since the outcomes are correlated, the variance–covariance matrix of \(\mathbf{y } - \widehat{\mathbf{p }}\) is \(\widehat{\mathbf{P }}^{1/2} \widehat{\mathbf{S }} \widehat{\mathbf{P }}^{1/2},\) which is no longer block diagonal.
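The variance approximation for a single outcome can be checked numerically from first principles. The sketch below (toy design matrix and coefficients are assumptions) builds the covariance of the weighted least-squares residual of the working response, whose covariance is \(\mathbf{D }_{j}^{-1},\) maps it to the scale of \(\mathbf{y }_{j} - \widehat{\mathbf{p }}_{j},\) and compares the result with \(\mathbf{P }_{j}.\)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.2, -0.7]))))   # hypothetical coefficients
d = p * (1.0 - p)
D = np.diag(d)
D_inv = np.diag(1.0 / d)

# Weighted least-squares hat matrix for the working response y_tilde,
# which has covariance D^{-1} under the null working model.
H = X @ np.linalg.solve(X.T @ D @ X, X.T @ D)
I = np.eye(n)
cov_working_resid = (I - H) @ D_inv @ (I - H).T

# y - p_hat = D (y_tilde - X beta_hat), so its covariance is D Cov D.
cov_resid = D @ cov_working_resid @ D

# Closed form P_j from the text.
P = D - D @ X @ np.linalg.solve(X.T @ D @ X, X.T @ D)
max_abs_diff = np.max(np.abs(cov_resid - P))
```

Here `max_abs_diff` should be at the level of floating-point rounding error.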
By defining \(\widehat{\mathbf{M }} = \widehat{\mathbf{P }}^{1/2} \widehat{\mathbf{S }} \widehat{\mathbf{P }}^{1/2},\) (6) can be rewritten as \(Q = \{ \widehat{\mathbf{M }}^{-1/2}(\mathbf{y }- \widehat{\mathbf{p }})\}^\mathrm{T} \widehat{\mathbf{B }} \{ \widehat{\mathbf{M }} ^{-1/2} (\mathbf{y }-\widehat{\mathbf{p }}) \}\) where \(\widehat{\mathbf{B }} = \widehat{\mathbf{M }}^{1/2} ( \widehat{\mathbf{S }} \widehat{\mathbf{D }}^{1/2} )^{-1} \widehat{\mathbf{D }}^{1/2} \mathbf{K } \widehat{\mathbf{D }}^{1/2} ( \widehat{\mathbf{D }}^{1/2} \widehat{\mathbf{S }})^{-1} \widehat{\mathbf{M }}^{1/2}.\) Using the eigenvalue decomposition, we can write \(\widehat{\mathbf{B }} = \mathbf{U } {\varvec{\Lambda }} \mathbf{U }^\mathrm{T}\), where \(\mathbf{U }\) is the orthogonal matrix of eigenvectors of \(\widehat{\mathbf{B }}\) and \({\varvec{\Lambda }}\) is a diagonal matrix whose elements are the corresponding eigenvalues. Thus \(Q = \widehat{\mathbf{R }}^\mathrm{T} {\varvec{\Lambda }} \widehat{\mathbf{R }},\) where \(\widehat{\mathbf{R }} = \mathbf{U }^\mathrm{T} \widehat{\mathbf{M }}^{-1/2} (\mathbf{y } - \widehat{\mathbf{p }}).\)
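One way to turn this decomposition into a simulation-based p-value is sketched below. It rests on assumptions that are not stated in the text shown here: that \(\widehat{\mathbf{R }}\) is approximately standard normal under \(H_0,\) so that \(Q\) behaves like a weighted sum of independent \(\chi ^{2}_{1}\) variables with weights given by the eigenvalues of \(\widehat{\mathbf{B }},\) and that \(\widehat{\mathbf{S }}\) has the Kronecker form `np.kron(R, I_n)` for a \(t \times t\) correlation matrix `R`. All function names and toy inputs are hypothetical.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric square root of a positive semi-definite matrix."""
    w, U = np.linalg.eigh((A + A.T) / 2.0)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def null_eigenvalues(d, S, P, K):
    """Eigenvalues of B = M^{1/2} C M^{1/2}, where Q = (y - p)^T C (y - p)."""
    d_half = np.sqrt(d)
    SD = S * d_half[None, :]                        # S D^{1/2}
    DKD = d_half[:, None] * K * d_half[None, :]     # D^{1/2} K D^{1/2}
    T1 = np.linalg.solve(SD, DKD)                   # (S D^{1/2})^{-1} D^{1/2} K D^{1/2}
    C = np.linalg.solve(SD, T1.T).T                 # right-multiply by (D^{1/2} S)^{-1}
    P_half = sym_sqrt(P)
    M = P_half @ S @ P_half
    M_half = sym_sqrt(M)
    B = M_half @ C @ M_half
    lam = np.linalg.eigvalsh((B + B.T) / 2.0)
    return lam, C, M

def simulated_pvalue(Q_obs, lam, n_sim=20000, seed=0):
    """Monte Carlo tail probability of sum_k lam_k * chi2_1 at Q_obs."""
    rng = np.random.default_rng(seed)
    lam = lam[np.abs(lam) > 1e-10 * np.max(np.abs(lam))]   # drop numerical zeros
    Q_sim = rng.chisquare(1, size=(n_sim, lam.size)) @ lam
    return float(np.mean(Q_sim >= Q_obs))

# Toy inputs standing in for fitted null quantities.
rng = np.random.default_rng(3)
n, t = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.binomial(2, 0.3, size=(n, 5)).astype(float)
p_hat = rng.uniform(0.2, 0.8, size=n * t)           # stand-in fitted probabilities
d = p_hat * (1.0 - p_hat)
P = np.zeros((n * t, n * t))
K = np.zeros_like(P)
for j in range(t):
    sl = slice(j * n, (j + 1) * n)
    Dj = np.diag(d[sl])
    P[sl, sl] = Dj - Dj @ X @ np.linalg.solve(X.T @ Dj @ X, X.T @ Dj)
    K[sl, sl] = Z @ Z.T                              # linear kernel (assumption)
R = np.array([[1.0, 0.4], [0.4, 1.0]])
S = np.kron(R, np.eye(n))                            # assumed structure of S
resid = rng.binomial(1, p_hat).astype(float) - p_hat

lam, C, M = null_eigenvalues(d, S, P, K)
Q_obs = resid @ C @ resid
pval = simulated_pvalue(Q_obs, lam)
```

The sketch only needs the eigenvalues of \(\widehat{\mathbf{B }},\) so it never forms \(\widehat{\mathbf{M }}^{-1/2}\); this matters because each \(\widehat{\mathbf{P }}_{j}\) annihilates \(\mathbf{X }\) and is therefore singular. When \(\widehat{\mathbf{S }}\) is the identity, the eigenvalues sum to \(\text {tr}(\widehat{\mathbf{P }}\mathbf{K }),\) matching the centring term \(p_{Q}\) of the univariate test.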
Cite this article
Davenport, C.A., Maity, A., Sullivan, P.F. et al. A Powerful Test for SNP Effects on Multivariate Binary Outcomes Using Kernel Machine Regression. Stat Biosci 10, 117–138 (2018). https://doi.org/10.1007/s12561-017-9189-9
DOI: https://doi.org/10.1007/s12561-017-9189-9