Genome-Wide Complex Trait Analysis (GCTA): Methods, Data Analyses, and Interpretations

Yang, Jian; Lee, Sang Hong; Goddard, Michael E.; Visscher, Peter M.

doi:10.1007/978-1-62703-447-0_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 1019)

Abstract

Estimating genetic variance is traditionally performed using pedigree analysis. Using high-throughput DNA marker data measured across the entire genome, it is now possible to estimate and partition genetic variation from population samples. In this chapter, we introduce methods and a software tool called Genome-wide Complex Trait Analysis (GCTA) to estimate genomic relationships between pairs of conventionally unrelated individuals using genome-wide single nucleotide polymorphism (SNP) data, to estimate the variance explained by all SNPs simultaneously on genomic or chromosomal segments or over the whole genome, and to perform a joint and conditional multiple-SNP association analysis using summary statistics from a meta-analysis of genome-wide association studies and linkage disequilibrium between SNPs estimated from a reference sample.

References

1. Hindorff LA, Sethupathy P, Junkins HA et al (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106(23):9362–9367
2. Maher B (2008) Personal genomes: the case of the missing heritability. Nature 456(7218):18–21
3. Yang J, Benyamin B, McEvoy BP et al (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42(7):565–569
4. Yang J, Manolio TA, Pasquale LR et al (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet 43(6):519–525
5. Davies G, Tenesa A, Payton A et al (2011) Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Mol Psychiatry 16(10):996–1005
6. Deary IJ, Yang J, Davies G et al (2012) Genetic contributions to stability and change in intelligence from childhood to old age. Nature 482(7384):212–215
7. Lee SH, Decandia TR, Ripke S et al (2012) Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat Genet 44(3):247–250
8. Gibson G (2010) Hints of hidden heritability in GWAS. Nat Genet 42(7):558–560
9. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of GWAS discovery. Am J Hum Genet 90(1):7–24
10. Teslovich TM, Musunuru K, Smith AV et al (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466(7307):707–713
11. Heid IM, Jackson AU, Randall JC et al (2010) Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat Genet 42(11):949–960
12. Lango Allen H, Estrada K, Lettre G et al (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467(7317):832–838
13. Speliotes EK, Willer CJ, Berndt SI et al (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42(11):937–948
14. Ripke S, Sanders AR, Kendler KS et al (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Genet 43(10):969–976
15. Yang J, Ferreira T, Morris AP et al (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44(4):369–375
16. Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88(1):76–82
17. Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res 91(1):47–60
18. Strandén I, Garrick DJ (2009) Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci 92(6):2971–2975
19. VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423
20. Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are unequal. Biometrika 58(3):545–554
21. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575
22. Lee SH, van der Werf JH (2006) An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet Sel Evol 38(1):25–43
23. Jorjani H, Klei L, Emanuelson U (2003) A simple method for weighted bending of genetic (co)variance matrices. J Dairy Sci 86(2):677–679
24. Hill WG, Thompson R (1978) Probabilities of non-positive definite between-group or genetic covariance matrices. Biometrics 34:429–439
25. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2:2–19
26. Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA
27. Falconer DS (1965) The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann Hum Genet 29:51–71
28. Dempster ER, Lerner IM (1950) Heritability of threshold characters. Genetics 35(2):212–236
29. Lee SH, Wray NR, Goddard ME, Visscher PM (2011) Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet 88(3):294–305
30. Price AL, Weale ME, Patterson N et al (2008) Long-range LD can confound genome scans in admixed populations. Am J Hum Genet 83(1):132–135
31. Gilmour AR, Thompson R, Cullis BR (1995) Average information REML: an efficient algorithm for variance parameters estimation in linear mixed models. Biometrics 51:1440–1450

Author information

Authors and Affiliations

University of Queensland Diamantina Institute, Princess Alexandra Hospital, University of Queensland, Brisbane, QLD, Australia
Jian Yang & Peter M. Visscher
The Queensland Brain Institute, The University of Queensland, Brisbane, QLD, Australia
Sang Hong Lee & Peter M. Visscher
Department of Food and Agricultural Systems, University of Melbourne, Melbourne, VIC, Australia
Michael E. Goddard
Biosciences Research Division, Department of Primary Industries, Bundoora, VIC, Australia
Michael E. Goddard

Editor information

Editors and Affiliations

Ctr. Genetic Analysis and Applications, University of New England, Homestead Building 18, Armidale, 2351, New South Wales, Australia
Cedric Gondro
School of Environmental and Rural Science, Div. Animal Science, University of New England, Homestead Building 36, Armidale, 2351, New South Wales, Australia
Julius van der Werf
Biosciences Research Division, Department of Primary Industries, Ring Road 5, Bundoora, 3083, Victoria, Australia
Ben Hayes

Appendix A

In Eq. 4, we have

$$ \mathbf{y}=\mathbf{Xb}+\sum_{i=1}^{r}\mathbf{g}_i+\mathbf{e} \quad \mathrm{and} \quad \operatorname{var}(\mathbf{y})=\mathbf{V}=\sum_{i=1}^{r}\mathbf{A}_i\sigma_i^2+\mathbf{I}\sigma_{\mathrm{e}}^2, $$

of which Eq. 2 is a special case with r = 1. By default, GCTA uses the average information (AI) REML algorithm [31] to obtain estimates of the variance components $ \sigma_i^2 $ and $ \sigma_{\mathrm{e}}^2 $ through iteration. In the tth iteration, $ \mathbf{q}^{(t)}=\mathbf{q}^{(t-1)}+(\mathbf{H}^{(t-1)})^{-1}\left.\frac{\partial L}{\partial \mathbf{q}}\right|_{\mathbf{q}^{(t-1)}} $, where $ \mathbf{q} $ is a vector of the estimates of the variance components ($ \hat{\sigma}_1^2 $, …, $ \hat{\sigma}_r^2 $ and $ \hat{\sigma}_{\mathrm{e}}^2 $); L is the log likelihood function of the mixed linear model (ignoring the constant), $ L=-\frac{1}{2}\left(\log|\hat{\mathbf{V}}|+\log|\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1}\mathbf{X}|+\mathbf{y}^{\prime}\mathbf{P}\mathbf{y}\right) $ with $ \hat{\mathbf{V}}=\sum_{i=1}^{r}\mathbf{A}_i\hat{\sigma}_i^{2(t-1)}+\mathbf{I}\hat{\sigma}_{\mathrm{e}}^{2(t-1)} $ and $ \mathbf{P}=\hat{\mathbf{V}}^{-1}-\hat{\mathbf{V}}^{-1}\mathbf{X}(\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\prime}\hat{\mathbf{V}}^{-1} $; H is the average of the observed and expected information matrices [22],

$$ \mathbf{H}=\frac{1}{2}\left[ \begin{array}{cccc} \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{P}\mathbf{y} \\ \vdots & \vdots & \vdots & \vdots \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{P}\mathbf{y} \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} & \cdots & \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} & \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{P}\mathbf{y} \\ \end{array} \right]; $$

and $ \frac{{\partial L}}{{\partial \mathbf{q}}} $ is a vector of first derivatives of the log likelihood function with respect to each variance component,

$$ \frac{\partial L}{\partial \mathbf{q}}=-\frac{1}{2}\left[ \begin{array}{c} \mathrm{tr}(\mathbf{P}\mathbf{A}_1)-\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} \\ \vdots \\ \mathrm{tr}(\mathbf{P}\mathbf{A}_r)-\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} \\ \mathrm{tr}(\mathbf{P})-\mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{y} \\ \end{array} \right] $$
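
The AI-REML update above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not GCTA's implementation: it uses dense matrix inverses, and the function names (`projection_matrix`, `ai_terms`, `ai_reml_step`) are hypothetical. The residual component is handled by appending an identity matrix to the list of A matrices.

```python
import numpy as np

def projection_matrix(V, X):
    """P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1 for a symmetric V."""
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    return Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)

def ai_terms(y, X, A_list, q):
    """Return (H, dL/dq) at q = (sigma_1^2, ..., sigma_r^2, sigma_e^2)."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # the identity carries sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))
    P = projection_matrix(V, X)
    Py = P @ y
    MPy = [M @ Py for M in mats]               # A_i P y (and P y for the residual)
    # Average information: H_ij = 1/2 * y' P A_i P A_j P y
    H = 0.5 * np.array([[a @ P @ b for b in MPy] for a in MPy])
    # Gradient: dL/dsigma_i^2 = -1/2 * (tr(P A_i) - y' P A_i P y)
    grad = -0.5 * np.array([np.trace(P @ M) - Py @ m
                            for M, m in zip(mats, MPy)])
    return H, grad

def ai_reml_step(y, X, A_list, q):
    """One update q(t) = q(t-1) + H^-1 dL/dq, evaluated at q(t-1)."""
    H, grad = ai_terms(y, X, A_list, q)
    return np.asarray(q, dtype=float) + np.linalg.solve(H, grad)
```

Calling `ai_reml_step(y, X, [A], q)` with a single relationship matrix `A` returns the updated pair of genetic and residual variance estimates. The dense inverse is chosen for readability only and would not scale to large samples.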

We also provide in GCTA two optional algorithms to estimate the variance components, which we call the direct REML and EM-REML. For the direct REML algorithm, the variance components in the tth iteration are estimated as

$$ \mathbf{q}^{(t)}=\left[ \begin{array}{cccc} \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{A}_1\mathbf{P}) \\ \vdots & \vdots & \vdots & \vdots \\ \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{A}_r\mathbf{P}) \\ \mathrm{tr}(\mathbf{P}\mathbf{P}\mathbf{A}_1) & \cdots & \mathrm{tr}(\mathbf{P}\mathbf{P}\mathbf{A}_r) & \mathrm{tr}(\mathbf{P}\mathbf{P}) \\ \end{array} \right]^{-1}\left[ \begin{array}{c} \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_1\mathbf{P}\mathbf{y} \\ \vdots \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_r\mathbf{P}\mathbf{y} \\ \mathbf{y}^{\prime}\mathbf{P}\mathbf{P}\mathbf{y} \\ \end{array} \right] $$
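
As a rough illustration of the direct REML update, the sketch below builds the trace matrix and the vector of quadratic forms and solves the linear system. It assumes that P is evaluated at the estimates from the previous iteration, as in the AI-REML case; the names `direct_reml_terms` and `direct_reml_step` are hypothetical, and the code is not taken from GCTA.

```python
import numpy as np

def projection_matrix(V, X):
    """P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1 for a symmetric V."""
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    return Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)

def direct_reml_terms(y, X, A_list, q):
    """Return the trace matrix T and the right-hand side at the current q."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # the identity carries sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))
    P = projection_matrix(V, X)
    PM = [P @ M for M in mats]                 # P A_1, ..., P A_r, P
    Py = P @ y
    # T_ij = tr(P A_i P A_j), using tr(BC) = sum of B * C' elementwise
    T = np.array([[np.sum(a * b.T) for b in PM] for a in PM])
    # rhs_i = y' P A_i P y
    rhs = np.array([Py @ M @ Py for M in mats])
    return T, rhs

def direct_reml_step(y, X, A_list, q):
    """One direct REML update q(t) = T^-1 rhs."""
    T, rhs = direct_reml_terms(y, X, A_list, q)
    return np.linalg.solve(T, rhs)
```

Because PVP = P, multiplying the trace matrix by the current estimates gives the vector of tr(PA_i), so the update can also be read as the current estimates plus a correction proportional to the gradient of L.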

The direct REML algorithm is generally more robust but computationally less efficient than AI-REML. For the EM-REML algorithm, each variance component is estimated as

$$ \sigma_i^{2(t)}=\left[\sigma_i^{4(t-1)}\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_i\mathbf{P}\mathbf{y}+\mathrm{tr}\left(\sigma_i^{2(t-1)}\mathbf{I}-\sigma_i^{4(t-1)}\mathbf{P}\mathbf{A}_i\right)\right]/n $$
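
The EM-REML update can be written compactly by noting that the trace term equals n times the current estimate minus the squared estimate times tr(PA_i). The sketch below is a minimal illustration under that reading; it treats the residual component as the case A_i = I, and the name `em_reml_step` is hypothetical rather than part of GCTA.

```python
import numpy as np

def projection_matrix(V, X):
    """P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1 for a symmetric V."""
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    return Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)

def em_reml_step(y, X, A_list, q):
    """One EM-REML update of q = (sigma_1^2, ..., sigma_r^2, sigma_e^2)."""
    n = len(y)
    mats = list(A_list) + [np.eye(n)]          # the identity carries sigma_e^2
    V = sum(s * M for s, M in zip(q, mats))
    P = projection_matrix(V, X)
    Py = P @ y
    new_q = []
    for s, M in zip(q, mats):
        quad = Py @ M @ Py                     # y' P A_i P y
        # tr(s * I - s^2 * P A_i) = n * s - s^2 * tr(P A_i)
        new_q.append((s**2 * quad + n * s - s**2 * np.trace(P @ M)) / n)
    return np.array(new_q)
```

Each component moves by (sigma_i^4 / n) times (y'PA_iPy - tr(PA_i)), which is a small multiple of the gradient entry for that component. This is consistent with the slow convergence described in the text.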

The EM-REML algorithm is robust, in that it guarantees an increased likelihood after each iteration, but it is extremely slow to converge. We therefore do not recommend choosing EM-REML in GCTA unless the starting values are known to be very close to the estimates. The GCTA option for choosing among the REML algorithms is --reml-alg, with the input value 0 for AI-REML (default), 1 for the direct REML algorithm, and 2 for EM-REML.

At the beginning of the iteration process, all the variance components are initialized to an arbitrary value, i.e., $ \sigma_i^{2(0)}=\sigma_{\mathrm{P}}^2/(r+1) $, where $ \sigma_{\mathrm{P}}^2 $ is the phenotypic variance. This value is subsequently updated by the EM-REML algorithm, $ \sigma_i^{2(1)}=\left[\sigma_i^{4(0)}\mathbf{y}^{\prime}\mathbf{P}\mathbf{A}_i\mathbf{P}\mathbf{y}+\mathrm{tr}\left(\sigma_i^{2(0)}\mathbf{I}-\sigma_i^{4(0)}\mathbf{P}\mathbf{A}_i\right)\right]/n $. The EM-REML algorithm is used as an initial step to determine the direction of the iteration updates because it is robust to poor starting values. We also provide options (--reml-priors and --reml-priors-var) in GCTA for users to specify starting values. After one EM-REML iteration, GCTA switches to the AI-REML algorithm (or to one of the other two algorithms, if selected) for the remaining iterations, until the iteration converges with the criterion $ L^{(t)}-L^{(t-1)}<10^{-4} $, where $ L^{(t)} $ is the log likelihood of the tth iteration.

By default, any variance component that escapes from the parameter space (i.e., its estimate is negative) will be set to $ 10^{-6}\times \sigma_{\mathrm{P}}^2 $. If a component keeps escaping from the parameter space, it will be constrained at $ 10^{-6}\times \sigma_{\mathrm{P}}^2 $. There is an option in GCTA (--reml-no-constrain) that allows the estimates of variance components to be negative. This is justified because, if a parameter is zero, an unbiased estimate of this parameter has a 50% chance of being negative. In practice, however, a negative variance component is usually difficult to interpret. We also provide an option (--reml-maxit) for users to specify the maximum number of iterations, after which the iteration process stops even if it has not converged.
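
The iteration scheme described above (common starting values, one EM-REML step, then AI-REML until the log likelihood stabilizes, with negative estimates reset to a small positive value) can be sketched as follows. Everything here is an assumption made for illustration and does not reproduce GCTA's code: the function names are hypothetical, the phenotypic variance is taken as the sample variance of y, the convergence check uses the absolute change in L, and a negative estimate is simply reset at every iteration rather than being permanently constrained after repeated escapes. The `maxit` and `constrain` arguments loosely mirror --reml-maxit and --reml-no-constrain.

```python
import numpy as np

def reml_pieces(y, X, mats, q):
    """Return P and the REML log likelihood (constant dropped) at q."""
    V = sum(s * M for s, M in zip(q, mats))
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    XtVinvX = X.T @ VinvX
    P = Vinv - VinvX @ np.linalg.solve(XtVinvX, VinvX.T)
    logL = -0.5 * (np.linalg.slogdet(V)[1]
                   + np.linalg.slogdet(XtVinvX)[1] + y @ P @ y)
    return P, logL

def em_update(y, mats, P, q):
    """EM-REML: [s^2 y'P A_i P y + n s - s^2 tr(P A_i)] / n per component."""
    n, Py = len(y), P @ y
    return np.array([(s**2 * (Py @ M @ Py) + n * s - s**2 * np.trace(P @ M)) / n
                     for s, M in zip(q, mats)])

def ai_update(y, mats, P, q):
    """AI-REML: q + H^-1 dL/dq with the average information matrix H."""
    Py = P @ y
    MPy = [M @ Py for M in mats]
    H = 0.5 * np.array([[a @ P @ b for b in MPy] for a in MPy])
    grad = -0.5 * np.array([np.trace(P @ M) - Py @ m
                            for M, m in zip(mats, MPy)])
    return q + np.linalg.solve(H, grad)

def reml_fit(y, X, A_list, tol=1e-4, maxit=100, constrain=True):
    """Hypothetical driver returning (estimates, log likelihood, converged)."""
    n, r = len(y), len(A_list)
    mats = list(A_list) + [np.eye(n)]
    var_p = np.var(y, ddof=1)                  # assumption: sample variance of y
    q = np.full(r + 1, var_p / (r + 1))        # sigma_i^2(0) = sigma_P^2 / (r + 1)
    P, logL = reml_pieces(y, X, mats, q)
    for t in range(1, maxit + 1):
        # First iteration uses EM-REML, later iterations use AI-REML
        q = em_update(y, mats, P, q) if t == 1 else ai_update(y, mats, P, q)
        if constrain:
            # Reset components that left the parameter space
            q = np.where(q < 0, 1e-6 * var_p, q)
        P, new_logL = reml_pieces(y, X, mats, q)
        converged = abs(new_logL - logL) < tol  # assumption: absolute change
        logL = new_logL
        if converged:
            return q, logL, True
    return q, logL, False
```

For a single relationship matrix `A`, `reml_fit(y, X, [A])` returns the genetic and residual variance estimates in that order, the final log likelihood, and a flag indicating whether the criterion was met before `maxit` iterations.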

Cite this protocol

Yang, J., Lee, S.H., Goddard, M.E., Visscher, P.M. (2013). Genome-Wide Complex Trait Analysis (GCTA): Methods, Data Analyses, and Interpretations. In: Gondro, C., van der Werf, J., Hayes, B. (eds) Genome-Wide Association Studies and Genomic Prediction. Methods in Molecular Biology, vol 1019. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-447-0_9

DOI: https://doi.org/10.1007/978-1-62703-447-0_9
Published: 11 May 2013
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-446-3
Online ISBN: 978-1-62703-447-0
eBook Packages: Springer Protocols
