Adjusting heterogeneous ascertainment bias for genetic association analysis with extended families

Park, Suyeon; Lee, Sungyoung; Lee, Young; Herold, Christine; Hooli, Basavaraj; Mullin, Kristina; Park, Taesung; Park, Changsoon; Bertram, Lars; Lange, Christoph; Tanzi, Rudolph; Won, Sungho

doi:10.1186/s12881-015-0198-6


Research article
Open access
Published: 19 August 2015

Volume 16, article number 62, (2015)

BMC Medical Genetics


Suyeon Park^1,2,3,
Sungyoung Lee⁴,
Young Lee^1,2,
Christine Herold^5,6,
Basavaraj Hooli⁷,
Kristina Mullin⁷,
Taesung Park⁸,
Changsoon Park¹,
Lars Bertram^7,9,10,
Christoph Lange^6,11,12,13,
Rudolph Tanzi⁷ &
…
Sungho Won^14,15,16


Abstract

Background

In family-based association analysis, each family is typically ascertained through a single proband, which renders the effects of ascertainment bias heterogeneous among family members. This is contrary to case–control studies, and may introduce sample or ascertainment bias. Statistical efficiency is affected by ascertainment bias, and careful adjustment can lead to substantial improvements in statistical power. However, genetic association analyses with family-based designs have often been conducted without addressing the fact that the proband in a family strongly influences the probability that each family member is affected.

Method

We propose a powerful and efficient statistic for genetic association analysis that considers the heterogeneity of ascertainment bias among family members, under the assumption that both the prevalence and the heritability of the disease are available. With extensive simulation studies, we showed that the proposed method performed better than the existing methods, particularly for diseases with large heritability.

Results

We applied the proposed method to a genome-wide association analysis of Alzheimer’s disease and found four significant associations.

Conclusion

Our significant findings illustrated the practical importance of this new analysis method.


Background

Genome-wide association studies (GWASs) have been used to identify many genes involved in human diseases, and during the last decade, many disease-susceptibility variants have been identified. However, despite these successes, variants discovered from GWASs often explain only a small proportion of the heritability of diseases [1, 2]. For example, SNPs significantly associated with human height explain only about 5 % of phenotypic variance, despite studies of tens of thousands of people [3]. This so-called “missing heritability” has been attributed to many causes, such as rare causal variants and gene-gene interactions. However, the low power induced by the multiple-testing problem is still an intractable issue in GWASs, and further investigations of the most efficient strategies for genetic association analysis are necessary.

Careful selection of samples based on phenotypes can lead to improved power for the discovery of risk variants [4–11]. One such example is the extreme discordant sib-pair design in linkage analysis, which may result in a substantial increase in statistical power when compared to other sib-pair designs [11, 12]. Similarly, ascertaining the extremes of quantitative phenotypes from large population cohorts has also been shown to increase the power to identify associated variants [13–15]. In such a design, the effects of ascertainment conditions are homogeneous between individuals, and existing methods, such as the Cochran-Armitage (CA) trend test [16], can be an efficient choice. However, in association analysis using extended families, the effects of ascertainment bias are often heterogeneous among family members, and different magnitudes of ascertainment bias may be generated depending on their relationships with the probands. In particular, if the heritability is small, the probability of an individual being affected when his or her relatives are affected is similar to the prevalence, which indicates that the heterogeneous effect of the ascertainment bias depends on the magnitude of heritability. However, the heterogeneous effects of ascertainment conditions and the influence of heritability on them have not yet been investigated, and should therefore be taken into account in association analysis.

Recently, the CA trend test was extended for association analysis of dichotomous phenotypes with family-based samples [17, 18]. These statistics compare the genotype frequencies between affected and unaffected individuals, and the genetic association with family-based samples is tested by building a genotype correlation matrix with either kinship coefficients or an empirical correlation matrix estimated from large-scale genetic data. This approach has been extended to include family members with known phenotypes and missing genotypes or vice versa. By their nature, these statistics perform well for ascertained family-based samples and can be an efficient choice, even for a case–control design, if the relatives’ phenotype information is available. However, their statistical efficiency is affected by the heterogeneous effect of the ascertainment bias on family members, and for extended families, this effect on statistical efficiency can be substantial.

In this report, we consider the heterogeneous effects of the ascertainment bias on family members for dichotomous phenotypes. By the nature of the proposed methods, individuals with missing genotypes and non-missing phenotypes can be utilized, and incorporation of the estimated kinship matrix into the proposed statistic provides robustness against population substructure. The proposed method consists of two steps: the probability for each family member to be affected is calculated using a latent continuous liability [19], and this probability is then incorporated into a quasi-likelihood score test. With an extensive simulation, we showed that the proposed method performed better than the existing methods, particularly for a disease with large heritability. Application of our method to Alzheimer’s disease (AD) demonstrated its practical use in the detection of genetic associations in ascertained family-based samples.

Methods

Notations and statistic

We assumed that there were n families and n_i members in family i. We considered the situation where the family of size n_i was ascertained because it contained a particular set of p_i members, and we let q_i = n_i – p_i. We called the p_i members of this set “probands”, and the remaining q_i individuals “non-probands”. To motivate this concept, we randomly selected two families, families 1 and 2, from our AD data (see Fig. 1). In family 1 (Fig. 1-(a)), individual 9 was diagnosed with AD and individuals 3–8 were selected as her relatives for genetic analysis. In family 2 (Fig. 1-(b)), individual 3 was diagnosed with AD, and individuals 4–6 were selected. Therefore p₁ = p₂ = 1, q₁ = 6 and q₂ = 3 in this example. In real data analysis, p_i is often 1 and q_i = n_i – 1. We assumed that N individuals were available in total, and thus N = ∑_i n_i. The genotypes were coded as 0, 1, or 2, according to the number of disease alleles. x^P_ij and x^N_i'j' were defined as the genotypes of proband j in family i and non-proband j' in family i', respectively. Phenotypes were coded as 0 for an unaffected individual and 1 for an affected individual. If we let the prevalence of the disease be p, a missing phenotype was coded as p. We denoted the phenotypes of a proband and a non-proband by y^P_ij and y^N_i'j', respectively, and the vectors of genotypes and phenotypes in family i were defined by

$$ {\mathbf{X}}_i^P=\left(\begin{array}{c}{x}_{i1}^P\\ {x}_{i2}^P\\ \vdots \\ {x}_{i{p}_i}^P\end{array}\right),\quad {\mathbf{X}}_i^N=\left(\begin{array}{c}{x}_{i1}^N\\ {x}_{i2}^N\\ \vdots \\ {x}_{i{q}_i}^N\end{array}\right),\quad {\mathbf{X}}_i=\left(\begin{array}{c}{\mathbf{X}}_i^P\\ {\mathbf{X}}_i^N\end{array}\right),\quad {\mathbf{Y}}_i^P=\left(\begin{array}{c}{y}_{i1}^P\\ {y}_{i2}^P\\ \vdots \\ {y}_{i{p}_i}^P\end{array}\right),\quad {\mathbf{Y}}_i^N=\left(\begin{array}{c}{y}_{i1}^N\\ {y}_{i2}^N\\ \vdots \\ {y}_{i{q}_i}^N\end{array}\right),\quad \mathrm{and}\quad {\mathbf{Y}}_i=\left(\begin{array}{c}{\mathbf{Y}}_i^P\\ {\mathbf{Y}}_i^N\end{array}\right) $$

We also denoted the w × w identity matrix by I_w, and the w × 1 column vector in which all elements were 1 by 1_w. Let π^P_ijj' and π^N_ijj' be the kinship coefficients between probands j and j' in family i, and between non-probands j and j' in family i, respectively. In addition, we let π^PN_ijj' be the kinship coefficient between proband j and non-proband j' in family i, and let d^P_ij and d^N_ij' be the inbreeding coefficients for proband j and non-proband j' in family i, respectively. The inbreeding coefficient quantifies the departure from Hardy-Weinberg equilibrium (HWE) and ranges from 0 to 1. Several approaches [20, 21] for estimating d_ij have been proposed. We let

$$ {\mathbf{R}}_i^P=\left(\begin{array}{ccc}1+{d}_{i1}^P & \cdots & 2{\pi}_{i1{p}_i}^P\\ \vdots & \ddots & \vdots \\ 2{\pi}_{i{p}_i1}^P & \cdots & 1+{d}_{i{p}_i}^P\end{array}\right),\quad {\mathbf{R}}_i^N=\left(\begin{array}{ccc}1+{d}_{i1}^N & \cdots & 2{\pi}_{i1{q}_i}^N\\ \vdots & \ddots & \vdots \\ 2{\pi}_{i{q}_i1}^N & \cdots & 1+{d}_{i{q}_i}^N\end{array}\right),\quad {\mathbf{R}}_i^{PN}=\left(\begin{array}{ccc}2{\pi}_{i11}^{PN} & \cdots & 2{\pi}_{i1{q}_i}^{PN}\\ \vdots & \ddots & \vdots \\ 2{\pi}_{i{p}_i1}^{PN} & \cdots & 2{\pi}_{i{p}_i{q}_i}^{PN}\end{array}\right), $$

and R_i is defined by

$$ {\mathbf{R}}_i=\left(\begin{array}{cc}{\mathbf{R}}_i^P & {\mathbf{R}}_i^{PN}\\ {\left({\mathbf{R}}_i^{PN}\right)}^t & {\mathbf{R}}_i^N\end{array}\right) $$

If we let q_A be the disease allele frequency, E(X_i) is $ 2{q}_A{\mathbf{1}}_{n_i} $, and q_A is estimated with the best linear unbiased estimator (BLUE). var(X_i) is expressed as σ²R_i, and σ² is equal to 2q_A(1 – q_A) under HWE.
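As a minimal sketch of these definitions, the block below builds R_i for a hypothetical outbred father-mother-child trio (kinship 0 between the parents, 1/4 between parent and child, no inbreeding) and computes the BLUE of the disease allele frequency as (1ᵗR⁻¹X) / (2 · 1ᵗR⁻¹1). The helper names, the pedigree, and the genotype vector are illustrative assumptions and are not taken from the study.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def relationship_matrix(kinship, inbreeding):
    """R has 1 + d_j on the diagonal and 2 * pi_jj' off the diagonal."""
    n = len(inbreeding)
    return [[1.0 + inbreeding[j] if j == k else 2.0 * kinship[j][k]
             for k in range(n)] for j in range(n)]


def blue_allele_frequency(r, x):
    """BLUE of q_A when E(X) = 2 q_A 1 and var(X) is proportional to R."""
    w = solve(r, [1.0] * len(x))  # w = R^{-1} 1
    return sum(wi * xi for wi, xi in zip(w, x)) / (2.0 * sum(w))


# Hypothetical trio, ordered father, mother, child.
kinship = [[0.50, 0.00, 0.25],
           [0.00, 0.50, 0.25],
           [0.25, 0.25, 0.50]]
R = relationship_matrix(kinship, inbreeding=[0.0, 0.0, 0.0])
x = [1, 0, 1]  # hypothetical genotypes (number of disease alleles)
q_hat = blue_allele_frequency(R, x)
```

For this trio R⁻¹1 = (1, 1, 0)ᵗ, so the estimate is the allele frequency among the two founders, 1/4; once both parents are genotyped, the child adds no further weight to the estimate of q_A.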

When we analyze the distribution of genotypes as in the FBAT approach, the statistical efficiency of the test statistic can be improved by adjusting the phenotype with the so-called offset [22]. If we let μ^P_ij and μ^N_i'j' be the offsets for proband j in family i and non-proband j' in family i', respectively, the offset vectors for family i are defined as

$$ {\upmu}_i^P=\left(\begin{array}{c}{\mu}_{i1}^P\\ {\mu}_{i2}^P\\ \vdots \\ {\mu}_{i{p}_i}^P\end{array}\right),\quad {\upmu}_i^N=\left(\begin{array}{c}{\mu}_{i1}^N\\ {\mu}_{i2}^N\\ \vdots \\ {\mu}_{i{q}_i}^N\end{array}\right),\quad \mathrm{and}\quad {\upmu}_i=\left(\begin{array}{c}{\upmu}_i^P\\ {\upmu}_i^N\end{array}\right) $$

Setting T_i = Y_i–μ_i, we can define

$$ \mathbf{X}=\left(\begin{array}{c}{\mathbf{X}}_1\\ {\mathbf{X}}_2\\ \vdots \end{array}\right),\quad \mathbf{Y}=\left(\begin{array}{c}{\mathbf{Y}}_1\\ {\mathbf{Y}}_2\\ \vdots \end{array}\right),\quad \mathbf{T}=\left(\begin{array}{c}{\mathbf{T}}_1\\ {\mathbf{T}}_2\\ \vdots \end{array}\right),\quad \mathrm{and}\quad \mathbf{R}=\left(\begin{array}{ccc}{\mathbf{R}}_1 & 0 & \cdots \\ 0 & {\mathbf{R}}_2 & \cdots \\ \vdots & \vdots & \ddots \end{array}\right). $$

We denoted the minor allele frequency (MAF) of a variant in unaffected individuals by q. We assumed [18] that for a constant γ,

$$ E\left(\mathbf{X}\Big|\mathbf{T}\right)=2q{\mathbf{1}}_N+\gamma \mathbf{T}, $$

where 0 < 2q + γ < 1. Then, the score for a variant [18, 23] can be defined by

$$ S={\mathbf{T}}^t\left(\mathbf{X}-\widehat{E}\left(\mathbf{X}\right)\right)\kern0.5em \mathrm{and}\kern0.5em \widehat{E}\left(\mathbf{X}\right)={\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t{\mathbf{R}}^{-1}\mathbf{X}. $$

The variance of S is

$$ var(S)={\sigma}^2{\mathbf{T}}^t\left(\mathbf{R}-{\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t\right)\mathbf{T}, $$

and we considered the following statistic [17, 18]:

$$ \frac{{\mathbf{T}}^t\left({\mathbf{I}}_N-{\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t{\mathbf{R}}^{-1}\right)\mathbf{X}}{\sqrt{\sigma^2{\mathbf{T}}^t\left(\mathbf{R}-{\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t\right)\mathbf{T}}}\sim N\left(0,1\right)\kern0.5em \mathrm{if}\kern0.5em \gamma =0. $$

This statistic will be denoted by WL in the remainder of this report.
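The following is a minimal sketch of how WL could be evaluated numerically. It computes the score S = Tᵗ(X − Ê(X)) and divides by the square root of σ²Tᵗ(R − 1(1ᵗR⁻¹1)⁻¹1ᵗ)T, which simplifies to σ²(TᵗRT − (1ᵗT)² / 1ᵗR⁻¹1). As assumptions made only for illustration, σ² is obtained by plugging the BLUE of q_A into 2q_A(1 – q_A), the offset is a common hypothetical value of 0.2 for everyone, and the data are two hypothetical sib pairs. The function names are not from the original work.

```python
import math


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def wl_statistic(t, x, r):
    """Standardized score for offset-adjusted phenotypes t and genotypes x."""
    n = len(x)
    w = solve(r, [1.0] * n)                     # R^{-1} 1
    one_rinv_one = sum(w)                       # 1' R^{-1} 1
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / one_rinv_one  # BLUE of 2 q_A
    q_hat = mean_x / 2.0
    sigma2 = 2.0 * q_hat * (1.0 - q_hat)        # HWE variance, plug-in estimate
    score = sum(ti * (xi - mean_x) for ti, xi in zip(t, x))
    t_r_t = sum(t[j] * r[j][k] * t[k] for j in range(n) for k in range(n))
    variance = sigma2 * (t_r_t - sum(t) ** 2 / one_rinv_one)
    return score / math.sqrt(variance)


# Two hypothetical sib pairs (2 * kinship = 0.5 within a pair, 0 between pairs).
R = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
x = [2, 1, 0, 1]                  # hypothetical genotypes
y = [1, 1, 0, 0]                  # hypothetical phenotypes
t = [yi - 0.2 for yi in y]        # T = Y - offset, with a common offset of 0.2
z = wl_statistic(t, x, R)
```

With these toy inputs the score is 1 and the variance is 0.75, so the statistic equals 2/√3.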

Adjusting the heterogeneous ascertainment bias

Families are often selected through some probands, and the probability for family members to be affected depends on their relationship with the probands. Additional file 1 shows that incorporating the conditional probability of each individual being affected into WL as an offset leads to an asymptotically smaller variance; therefore, adjustment for the heterogeneous ascertainment bias is required to improve the statistical power of WL. This probability can be estimated with the liability model if the heritability, h², and the prevalence, p, are available. We let l^P_ij and l^N_i'j' be the liabilities of proband j in family i and non-proband j' in family i', respectively, and let $ {\mathbf{L}}_i^P=\left({l}_{i1}^P,\dots ,{l}_{i{p}_i}^P\right) $ and $ {\mathbf{L}}_i^N=\left({l}_{i1}^N,\dots ,{l}_{i{q}_i}^N\right) $. We assumed that each liability followed the standard normal distribution, and that their joint distribution was

$$ \left(\begin{array}{c}{\mathbf{L}}_i^P\\ {\mathbf{L}}_i^N\end{array}\right)\sim MVN\left(0,\ {h}^2\left(\begin{array}{cc}{\mathbf{R}}_i^P & {\mathbf{R}}_i^{PN}\\ {\left({\mathbf{R}}_i^{PN}\right)}^t & {\mathbf{R}}_i^N\end{array}\right)+\left(1-{h}^2\right){\mathbf{I}}_{n_i}\right). $$

Benchek and Morris [24] reported that significant asymptotic biases are likely to arise when the multivariate normal (MVN) liability assumption is not met, and in such a case, different assumptions should be considered. We let M^P*_i and V^P*_i be the expectation and variance of L_i^P conditional on the disease statuses of the probands. If all probands are affected, they become

$$ {{\mathbf{M}}_i}^{P^{\ast }}\equiv E\left({L_i}^P\Big|{l_{i1}}^P>c,\dots, {l_{i{p}_i}}^P>c\right) $$

and

$$ {{\mathbf{V}}_i}^{P^{\ast }}\equiv var\left({L_i}^P\Big|{l_{i1}}^P>c,\dots, {l_{i{p}_i}}^P>c\right). $$

They can be calculated with numerical algorithms [25]. If p_i is 1, both can be calculated in closed form. We denote the cumulative distribution function and the probability density function of the standard normal distribution by Ф(·) and ϕ(·), respectively. If we let c be the (1–p)th quantile of the standard normal distribution, M^P*_i and V^P*_i become

$$ {\mathbf{M}}_i^{P\ast}=\left\{\begin{array}{ll}\phi (c)/\left[1-\varPhi (c)\right] & \mathrm{if}\ {y}_{i1}^P=1\\ -\phi (c)/\varPhi (c) & \mathrm{if}\ {y}_{i1}^P=0\end{array}\right.,\quad \mathrm{and}\quad {\mathbf{V}}_i^{P\ast}=1-{\left({\mathbf{M}}_i^{P\ast}\right)}^2+{\mathbf{M}}_i^{P\ast }c. $$
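For the single-proband case, these are the mean and variance of a standard normal liability truncated at the threshold c. The sketch below evaluates them with Python's `statistics.NormalDist`; the function name and the prevalence of 0.2 are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def proband_moments(prevalence, affected):
    """Mean and variance of a proband's liability given disease status."""
    c = STD_NORMAL.inv_cdf(1.0 - prevalence)   # (1 - p)th quantile
    if affected:                               # liability above the threshold
        m = STD_NORMAL.pdf(c) / (1.0 - STD_NORMAL.cdf(c))
    else:                                      # liability below the threshold
        m = -STD_NORMAL.pdf(c) / STD_NORMAL.cdf(c)
    v = 1.0 - m * m + m * c
    return m, v


m_aff, v_aff = proband_moments(0.2, affected=True)
m_un, v_un = proband_moments(0.2, affected=False)
```

A quick sanity check is that the two conditional means average back to the unconditional mean of 0 when weighted by p and 1 – p, and the corresponding second moments average back to 1.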

With the Pearson-Aitken formula [26, 27], we could obtain the conditional mean and variance-covariance matrix of L_i^N given $ {\mathbf{L}}_i^P>c\cdot {\mathbf{1}}_{p_i} $ as follows:

$$ {\mathbf{M}}_i^{N\ast }=0+{\left({\mathbf{R}}_i^{PN}\right)}^t{\left({\mathbf{V}}_i^P\right)}^{-1}\left({\mathbf{M}}_i^{P\ast }-0\right) $$

and

$$ {\mathbf{V}}_i^{N\ast }={\mathbf{V}}_i^N-{\left({\mathbf{R}}_i^{PN}\right)}^t\left({\left({\mathbf{V}}_i^P\right)}^{-1}-{\left({\mathbf{V}}_i^P\right)}^{-1}{\mathbf{V}}_i^{P\ast }{\left({\mathbf{V}}_i^P\right)}^{-1}\right){\mathbf{R}}_i^{PN}. $$

We denoted the jth element of M^N*_i by m^N*_j and the jth diagonal element of V^N*_i by v^N*_j. Because an individual is affected when his or her liability exceeds c, the probability of being affected for a non-proband under multivariate normality of the liabilities could be calculated as

$$ 1-\varPhi \left(\frac{c-{m}_j^{N\ast }}{\sqrt{v_j^{N\ast }}}\right), $$

and this will be incorporated into the proposed statistic as the offset. Thus far, we have assumed that there was a well-defined set of p_i individuals who were “probands”; for this situation, we calculated the statistic as indicated and denoted it by FQLS₁. However, in practice, different ascertainment conditions such as a sequential sampling frame [28] are often utilized, and the set of p_i individuals will not be well defined. For this situation, we calculated the probability for each individual to be affected under the assumption that all the other family members were “probands”, and thus p_i = n_i – 1 and q_i = 1. The statistic calculated in this way was denoted by FQLS₂.
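To illustrate how such an offset behaves, the sketch below computes the probability that a non-proband is affected when the family has a single proband and no inbreeding. It rests on several assumptions that are ours. The liability covariance between proband and relative is taken as h² · 2π (the off-diagonal of the joint liability covariance). The proband's marginal liability variance is 1. The conditional liability of the relative is approximated by a normal distribution with the Pearson-Aitken mean and variance, so the probability is P(l > c) = 1 − Ф((c − m)/√v). The function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def affected_probability(prevalence, h2, kinship, proband_affected=True):
    """Approximate P(relative affected | proband status), single proband."""
    c = STD_NORMAL.inv_cdf(1.0 - prevalence)
    # Truncated-normal moments of the proband's liability.
    if proband_affected:
        m_p = STD_NORMAL.pdf(c) / (1.0 - STD_NORMAL.cdf(c))
    else:
        m_p = -STD_NORMAL.pdf(c) / STD_NORMAL.cdf(c)
    v_p = 1.0 - m_p * m_p + m_p * c
    # Assumed liability covariance between proband and relative.
    cov = h2 * 2.0 * kinship
    # Pearson-Aitken update; the proband's marginal variance is 1.
    m_n = cov * m_p
    v_n = 1.0 - cov * cov * (1.0 - v_p)
    return 1.0 - STD_NORMAL.cdf((c - m_n) / math.sqrt(v_n))


# Sibling of an affected proband (kinship 1/4), prevalence 0.2.
for h2 in (0.2, 0.5, 0.8):
    print(h2, round(affected_probability(0.2, h2, kinship=0.25), 4))
```

Under these assumptions the probability equals the prevalence when h² = 0 and moves further from it as h² grows, which matches the statement that the heterogeneity of the ascertainment effect depends on the heritability.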

Results

The simulation model

In our simulation studies, we considered two types of family structures: nuclear families with five offspring and extended families that consisted of 13 individuals across 3 generations (see Fig. 2). The latter will be called extended families in the remainder of this report. The disease allele frequency, q_A, was assumed to be 0.2. The genotype frequencies for AA, Aa, and aa were q_A², 2q_A(1 – q_A), and (1 – q_A)² under HWE, respectively, and founders’ genotypes were generated from the corresponding multinomial distribution. The genotypes of non-founders were generated by random Mendelian transmission. The disease status was generated with the liability threshold model. Once continuous liabilities that consisted of polygenic effects and random errors were generated, individuals were considered affected if their liabilities were larger than the threshold, and unaffected otherwise. The threshold was chosen to preserve the prevalence, and the prevalence was assumed to be 0.2. The continuous liability was determined by combining the phenotypic mean, polygenic effect, main genetic effect, and random error. The main genetic effect for each individual was the product of β and the number of disease alleles. If we denote the relative proportion of the phenotypic variance attributable to the main disease gene by h_a², and the heritability of the continuous liability by h², β was calculated by

$$ \beta =\sqrt{\frac{h_a^2}{2{q}_A\left(1-{q}_A\right)\left(1-{h}^2\right)}}. $$

For the evaluation of type-1 errors and power, h_a² was assumed to be 0 and 0.005, respectively. Phenotypic correlations between family members were explained by the polygenic effects. Parental polygenic effects were generated from N(0, h²), and h² was assumed to be 0.2, 0.5, or 0.8. For non-founders, the polygenic effect of each offspring was the average of the maternal and paternal polygenic effects plus a value independently sampled from N(0, 0.5 h²). Random errors were generated from N(0, σ_e² = 1–h²). For each replicate, sampling was repeated until a given number of ascertained families was generated. Type-1 error estimates were calculated with 5000 replicates, and empirical power estimates were calculated with 1000 replicates.
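The generating model described above can be sketched for one nuclear family as follows. This is only an illustration under stated assumptions: the phenotypic mean is set to 0, and the threshold is placed at the (1 – prevalence) quantile of a normal distribution with the liability's mean 2q_Aβ and variance 1 + 2q_A(1 – q_A)β², which preserves the prevalence approximately rather than exactly. The function and argument names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_nuclear_family(rng, n_offspring=5, q_a=0.2, h2=0.5,
                            h2_a=0.005, prevalence=0.2):
    """Return (genotypes, phenotypes) for two parents followed by offspring."""
    beta = math.sqrt(h2_a / (2.0 * q_a * (1.0 - q_a) * (1.0 - h2)))
    # Assumed threshold: normal approximation to the liability distribution.
    mean = 2.0 * q_a * beta
    sd = math.sqrt(1.0 + 2.0 * q_a * (1.0 - q_a) * beta ** 2)
    threshold = mean + sd * NormalDist().inv_cdf(1.0 - prevalence)

    # Founders: two independent alleles each (HWE), polygenic effect N(0, h2).
    parents = []
    for _ in range(2):
        alleles = (int(rng.random() < q_a), int(rng.random() < q_a))
        parents.append((alleles, rng.gauss(0.0, math.sqrt(h2))))

    # Offspring: one randomly transmitted allele per parent; polygenic effect
    # is the parental average plus an independent N(0, 0.5 * h2) deviation.
    members = list(parents)
    for _ in range(n_offspring):
        alleles = (rng.choice(parents[0][0]), rng.choice(parents[1][0]))
        polygenic = (0.5 * (parents[0][1] + parents[1][1])
                     + rng.gauss(0.0, math.sqrt(0.5 * h2)))
        members.append((alleles, polygenic))

    genotypes, phenotypes = [], []
    for alleles, polygenic in members:
        x = sum(alleles)                                   # disease alleles
        error = rng.gauss(0.0, math.sqrt(1.0 - h2))        # random error
        liability = polygenic + beta * x + error
        genotypes.append(x)
        phenotypes.append(int(liability > threshold))
    return genotypes, phenotypes


rng = random.Random(1)
genotypes, phenotypes = simulate_nuclear_family(rng)
```

Setting `h2_a=0` gives β = 0 and reproduces the null model used for the type-1 error evaluation.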

Evaluation of the proposed methods with simulated data

The empirical type-1 errors for FQLS₁ and FQLS₂ were evaluated from 5000 replicates under the situation of no association (h_a² = 0), and 900 nuclear families with five offspring in Fig. 2 were generated for each replicate. Fig. 3 shows the quantile-quantile (QQ) plots from 5000 replicates, and the nominal significance levels for both methods were preserved for various significance levels. We also estimated the empirical type-1 error rates at the 0.01 and 0.05 significance levels; the empirical type-1 error estimates of FQLS₁ and FQLS₂ preserved these nominal significance levels (Table 1). These results verified that the use of the approximation to the standard normal distribution resulted in an accurate assessment of significance for the proposed methods.
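As a rough guide to how precisely 5000 replicates pin down a type-1 error rate, the sketch below computes an empirical rejection rate with a normal-approximation (Wald) 95 % interval. The choice of interval is our assumption, since the text does not state which interval was used, and the rejection count of 250 is a hypothetical input, not a result from Table 1.

```python
import math


def type1_error_interval(n_rejections, n_replicates, z=1.96):
    """Empirical rejection rate with a Wald confidence interval."""
    rate = n_rejections / n_replicates
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_replicates)
    return rate, rate - half_width, rate + half_width


# Hypothetical: 250 rejections out of 5000 null replicates at the 0.05 level.
rate, lower, upper = type1_error_interval(250, 5000)
```

With a true rate of 0.05 and 5000 replicates the half-width is about 0.006, so estimates within roughly 0.044 to 0.056 are compatible with the nominal level.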

Table 1 Empirical type-1 error estimates. The empirical type-1 error rates and their 95 % confidence intervals were estimated with 5000 replicates at the 0.01 and 0.05 significance levels for h² = 0.2, 0.5, and 0.8. The number of families was assumed to be 900, and the disease allele frequency was 0.2


Empirical power was estimated from 1000 replicates at the 0.01 and 0.001 significance levels. The relative proportion, h_a², of phenotypic variance attributable to the main disease gene, 2q_A(1 – q_A)β², was assumed to be 0.005, and the nuclear and extended families in Fig. 2 were considered for the power comparison. In the first simulation setting, the numbers of nuclear families were assumed to be 100, 300, 600, 900, 1200, and 1400. Half of the families were ascertained if the number of affected family members was larger than or equal to n_proband, and the other half were ascertained if the number of unaffected family members was larger than or equal to n_proband. For example, if 100 nuclear families were generated, half of them had at least n_proband affected family members, and the other half had at least n_proband unaffected family members. We assumed that the heritabilities were 0.2, 0.5, and 0.8, and the results are shown in Tables 2, 3, and 4, respectively. In the second simulation setting, the numbers of extended families were assumed to be 100, 300, 600, and 900, and every family was ascertained if the number of affected family members was larger than or equal to n_proband. Empirical power estimates for scenario 2 were calculated for h² = 0.2, 0.5, and 0.8, and are shown in Tables 5, 6, and 7, respectively.

Our results showed that either FQLS₁ or FQLS₂ was usually the most efficient statistic, and that WL was the least efficient. In particular, the power gap between the proposed methods and WL was largest when h² was 0.8, which indicates that the power improvement may be proportional to the heritability. When h² was 0.2, the proposed methods were only slightly better than WL. While all methods in our power comparison focused on the distribution of genotypes to calculate statistics, only the proposed methods considered the heterogeneous effects of ascertainment bias among family members, which were proportional to the magnitude of heritability; this explains the power improvement of the proposed methods. Furthermore, the differences in empirical power between WL and the proposed methods were larger in Tables 5, 6, and 7 than in Tables 2, 3, and 4, which indicates that the heterogeneity of the ascertainment condition may be positively related to family size and that the proposed methods become more efficient for large families.

Lastly, our simulation results showed that FQLS₂ was slightly better than FQLS₁, which may be induced by the uncertainty of probands in our simulation studies. Therefore, we concluded that the incorporation of the sampling scheme into the offset could make a substantial difference, and that the test statistic should be carefully selected depending on the type of sampling scheme.
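The two ascertainment rules used in the power simulations can be sketched as a rejection-sampling loop. In scenario 1, the first half of the retained families must contain at least n_proband affected members and the second half at least n_proband unaffected members; in scenario 2, every retained family must contain at least n_proband affected members. The function names are hypothetical, and `toy_family` is only a stand-in that draws independent phenotypes with prevalence 0.2 instead of using the liability threshold model.

```python
import random


def is_ascertained(phenotypes, n_proband, through_affected=True):
    """Check whether a family meets the ascertainment condition."""
    n_affected = sum(phenotypes)
    count = n_affected if through_affected else len(phenotypes) - n_affected
    return count >= n_proband


def sample_ascertained(draw_family, n_families, n_proband, scenario):
    """Draw families until n_families of them satisfy the ascertainment rule."""
    families = []
    while len(families) < n_families:
        # Scenario 1 switches to unaffected-based ascertainment for the
        # second half; scenario 2 always ascertains through affected members.
        through_affected = scenario == 2 or len(families) < n_families // 2
        phenotypes = draw_family()
        if is_ascertained(phenotypes, n_proband, through_affected):
            families.append(phenotypes)
    return families


rng = random.Random(7)


def toy_family():
    # Stand-in generator: 7 independent phenotypes with prevalence 0.2.
    return [int(rng.random() < 0.2) for _ in range(7)]


scenario_1 = sample_ascertained(toy_family, n_families=10, n_proband=2, scenario=1)
scenario_2 = sample_ascertained(toy_family, n_families=10, n_proband=2, scenario=2)
```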

Table 2 Empirical power estimates for scenario 1 when h² is 0.2. The empirical power estimates for scenario 1 were calculated with 1000 replicates at both the 0.01 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to be 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005
