Introduction

To investigate linkage between a quantitative trait and a marker locus, Haseman and Elston proposed the seminal “Haseman-Elston regression” (HE-reg) method (Haseman and Elston 1972). Since its inception, HE-reg has been a rich garden, which fertilizes many other methods and leads to various improvements (Elston et al. 2000). It has been adapted as a major tool for linkage scans (Cardon and Fulker 1994; Fulker and Cardon 1994; Xu et al. 2000; Sham and Purcell 2001), or for estimating heritability at a genome-wide scale using sib pairs (Visscher et al. 2006). Since the rise of genome-wide association studies, HE-reg has been reinvigorated for addressing the problem of “missing heritability” (Yang et al. 2010). Through the linkage kernel (identical-by-descent, IBD) of HE-reg being replaced by IBS (identity-in-state, IIS, also known as IBS), HE-reg can estimate SNP-heritability with good accuracy (Chen 2014; Golan et al. 2014; Bulik-Sullivan 2015; Wu and Sankararaman 2018). Recently, because of the possibility of accumulating new pedigree data (Kong et al. 2018; Kaplanis et al. 2018), HE-reg has been further modified to deal with linkage analysis at an unprecedented scale (Young et al. 2018; Zajac et al. 2023).

Due to technical difficulties, there has been some debate on the properties of HE-reg (DeFries 2010). Nevertheless, there is a consensus that HE-reg is a model-free framework to detect an additive genetic variance component. We here report a long overlooked ambiguity that HE-reg captures if there is any genotype-environment (GxE) variance component besides the additive genetic variance. Following the original approach of HE-reg, we derive the analytical result for the regression coefficient of HE-reg when GxE is present. Due to its historical importance, this reinterpretation for HE-reg will shed light on a forking path in the garden of HE-reg, in which both linkage and association studies have deep roots.

GxE plays a role in sha** complex traits, and, if ignored, it can affect the marginal effect estimation of a locus (Zhu et al. 2024). This study focuses on an unexplored GxE mechanism, which can yield different heritability in linkage and in association studies, respectively. This paper is organized around two kinds of heritability: \({h}^{2}\), which is consistent with the conventional definition of heritability, say \(2pq{\mathcal{a}}^{2}\), and \({\mathcalligra{h\hskip .48pc}}^{2}\), which additionally allows for GxE in the model. We briefly review HE-reg and introduce GxE for a labile trait that is subject to the norm of reaction. GxE is then found to be naturally integrated into HE-reg without the need to include any additional parameters, and this leads to a new interpretation of HE-reg. As HE-reg has been widely used for both linkage and association studies, either in a single-locus or whole-genome context (Sham and Purcell 2001; Visscher et al. 2006; Chen 2014), we here pursue symmetric and asymmetric forms between linkage and association studies under the new interpretation of HE-reg. For linkage and association, two pairs of symmetries are developed: Symmetry I for single-locus estimation of heritability, and Symmetry II for whole-genome estimation of heritability. These two symmetry groups are represented by their respective test statistics—t-tests here. On the basis on the good analytical properties of HE-reg, we envisage linkage and association in a complementary manner to each other and empower the investigation of genetic architecture. When there is GxE, these two symmetries should not hold, but a large sample size—especially if there are only sib pairs—is needed to detect possible asymmetry between linkage and association studies.

Methods

We adapt the original sib pair design of HE-reg, and have a linear model for the additive effect of a casual variant

$${Y}_{ij}=\mu +{\beta }_{l}{\pi }_{ij}+{e}_{ij}$$
(1)

in which \({Y}_{ij}={\left({y}_{i}-{y}_{j}\right)}^{2}\) for a pair of individuals \(i\) and \(j\) who are a sib pair, \(\mu\) the mean of the model, \(\beta_l\), the regression coefficient, \({\pi }_{ij}\) the identity-by-descent (IBD) of the sib pair, and \({e}_{ij}\) the residual. We consider \(K\) pairs of sibs from \(K\) independent families, so \({Y}_{ij}\) is a \(K\times 1\) vector. Now we take a close look at its signature result, the interpretation of the regression coefficient \({\beta }_{l}\), which is written as

$${\beta }_{l}=-2{\left(1-2c\right)}^{2}{h}^{2}$$
(2)

in which \(c\) is the recombination fraction between the \({l}^{th}\) marker and a causal variant, and \({h}^{2}\) the variance explained by the causal variant. Equation 2 is the commonly accepted standard interpretation for HE-reg (Lynch and Walsh 1998). Here we have not distinguished \({h}^{2}\) from \({\sigma }_{a}^{2}\) (additive genetic variance of a locus), which can be reconciled by scaling. In the context below, we choose the appropriate notation either \({h}^{2}\) or \({\sigma }_{a}^{2}\).

Norm of reaction for a labile trait

Sibling differences such as one drinks but the other not makes a form of personalized “environment”, but we only focus on environmental factors that can apply to the whole family, such as highland habitants vs lowland habitants. A typical non-removable GxE is present if both genotype effects “crossover” (Wang et al. 2010), as the norm of reaction defined below for a causal variant linked the \({l}^{th}\) locus:

Scheme I (highland habitants): genotypes \(BB\), \(Bb\) and \(bb\) with their corresponding effects \({G}_{BB}=a\), \({G}_{Bb}=d\) and \({G}_{bb}=-a\);

Scheme II (lowland habitants): genotypes \(BB\), \(Bb\) and \(bb\) with their corresponding effects \({G}_{BB}=-a\), \({G}_{Bb}=d\) and \({G}_{bb}=a\).

When high- or low-land habitants are clearly defined, it is easy to include this as a covariate in a conventional GxE GWAS model \(y=\mu +b\times SNP+c\times Habitant+d\times \left(SNP\times Habitant\right)+e\), and \(d\) captures the linear by linear proportion of GxE. However, environmental factors can be, if not infinite, many, each of which can lead to norm of reaction for each locus; for the \({l}^{th}\) locus, the acting environmental factor could be habitation, but could be water quality for the \({(l+1)}^{th}\) locus. In other words, if not completely impossible, it is both logistically and computationally impractical to explore such models exhaustively. Here, we propose an alternative route via HE-reg, which does not need to articulate the environmental interaction, but is able to capture its effect.

We assume \(K={K}_{1}+{K}_{2}\) sib pairs, and \({K}_{1}\) sib pairs are highland habitants and \({K}_{2}\) sib pairs lowland. Now, we have Table 1, which is a modification of the original HE paper (please refer to “Table I Conditional Distribution of \({Y}_{j}\)” in Haseman and Elston’s original paper). According to Table I, under Scheme I, we have

$$\mathbf{S}\mathbf{c}\mathbf{h}\mathbf{e}\mathbf{m}\mathbf{e}\mathbf{I}:\left\{\begin{array}{c}E\left({Y}_{j}|{\pi }_{j}=1\right)=E\left({e}_{j}^{2}\right)={\sigma }_{e}^{2}\\ E\left({Y}_{j}|{\pi }_{j}=\frac{1}{2}\right)={\sigma }_{e}^{2}+{\sigma }_{a}^{2}+2{\sigma }_{d}^{2}\\ E\left({Y}_{j}|{\pi }_{j}=0\right)={\sigma }_{e}^{2}+2{\sigma }_{a}^{2}+2{\sigma }_{d}^{2}\end{array}\right.$$
(3)

in which \({\sigma }_{a}^{2}=2pq{[a-\left(p-q\right)d]}^{2}\) and \({\sigma }_{d}^{2}=4{p}^{2}{q}^{2}{d}^{2}\).

Table 1 A rework of the original Table I for HE-reg when the target locus harbors genotype-environment interaction

Now turn to Scheme II,

$$\mathbf{S}\mathbf{c}\mathbf{h}\mathbf{e}\mathbf{m}\mathbf{e}\mathbf{I}\mathbf{I}:\left\{\begin{array}{c}E\left({Y}_{j}|{\pi }_{j}=1\right)=E\left({e}_{j}^{2}\right)={\sigma }_{e}^{2}\\ E\left({Y}_{j}|{\pi }_{j}=\frac{1}{2}\right)={\sigma }_{e}^{2}+{\sigma }_{a}^{2}+2{\sigma }_{d}^{2}\\ E\left({Y}_{j}|{\pi }_{j}=0\right)={\sigma }_{e}^{2}+2{\sigma }_{a}^{2}+2{\sigma }_{d}^{2}\end{array}\right.$$
(4)

Here we have \({\sigma }_{a}^{2}=2pq{[a-\left(p-q\right)(-d)]}^{2},\) which leads to the identical result like Eq. 3. As in the original HE paper, we drop off \({\sigma }_{d}^{2}=4{p}^{2}{q}^{2}{d}^{2}\) because of ignorable dominance genetic variance, and Eq. 4 is identical to the Eq. 3, indicating that HE-reg does not distinguish GxE, and consequently does not distinguish GxE from a pure additive model. Of the \({l}^{th}\) locus, we arrive at a generalized expression for the HE-reg regression coefficient:

$${\beta }_{l}=-2{\left(1-2c\right)}^{2}\sum_{f=1}^{\mathcal{E}}{\omega }_{f}2{p}_{f}{q}_{f}{a}_{f}^{2}$$
(5)

in which \(\mathcal{E}\) is the number of different environments. Equation 5 actually slices the original HE-reg into much thinner layers, each of which covers a fraction \({\omega }_{f}\) of sib pairs, who are sampled from environment \(f\). For the \({l}^{th}\) locus, we assume \({a}_{l}\) follows \({a}_{l}\sim N({{a}}_{l},{E}_{l}^{2})\), in which \({{a}}_{l}\) the inter-family additive effect (marginal effect) and \({E}_{l}^{2}\) is the variation due to familial or environmental origin—an analogue of GxE. \(E\left({a}_{l}^{2}\right)={{a}}_{l}^{2}+{E}_{l}^{2}\), and Eq. 5 can be written in an even more general form as

$${\beta }_{l}=-2{\left(1-2c\right)}^{2}\left[\sum_{f=1}^{\mathcal{E}}{\omega }_{l.f}2{p}_{l.f}{q}_{l.f}\left({{a}}_{l}^{2}+{E}_{l}^{2}\right)\right]=-2{\left(1-2c\right)}^{2}{\mathcalligra{h}}_{l}^{2}$$
(6)

Even though the model is identical to the original HE-reg, Eq. 6 indicates that \({\mathcalligra{h\hskip .48pc}}^{2}\) has an additive variance component, which can be detected via an additive effect model, together with a GxE component. As discovered by recent large-sample GWAS, causal variants are very saturated (Yengo 2022). We now omit the factor \({\left(1-2c\right)}^{2}\) in Eq. 6 in the text below. The complexity of the original HE-reg paper arises largely to deal with \({\left(1-2c\right)}^{2}\).

In particular, for HE-reg we have its null and alternative hypotheses: \({H}_{0}:\beta =0;{H}_{1}:\beta \ne 0\), so for Eq. 6 we construct a t-test \({t}_{LS}\) for single-locus linkage analysis:

$${t}_{LS}=\frac{-2{\mathcalligra{h\hskip .48pc}}^{2}}{\sqrt{\frac{1}{\left(n-2\right)}}\sqrt{\frac{var({Y}_{ij}^{2})}{var(x)}-4{\mathcalligra{h\hskip .48pc}}^{2}}}=\frac{-\sqrt{n-2}}{\sqrt{[16{\left(1-\theta \right)}^{2}-1]+16\frac{\left(1-{\mathcalligra{h\hskip .48pc}}^{2}\right)}{{\mathcalligra{h\hskip .48pc}}^{2}}}}=\frac{-\sqrt{n-2}}{\sqrt{3+16\frac{\left(1-{\mathcalligra{h\hskip .48pc}}^{2}\right)}{{\mathcalligra{h\hskip .48pc}}^{2}}}}\approx -\frac{{\mathcalligra{h\hskip .48pc}}^{2}}{4}\sqrt{n}$$
(7)

For sib pairs, \(\theta =0.5\) in Eq. 7, and the approximation holds when \({\mathcalligra{h\hskip .48pc}}^{2}\) is small. The detailed elements for constructing Eq. 7 are given in Table 2.

Table 2 The test statistics for various Haseman-Elston regressions

Nevertheless, for a single-locus GWAS linear model \({\varvec{y}}=\mu +{{a}}_{l}{{\varvec{z}}}_{l}+e\), in which \({\varvec{z}}\) is the standardized genotype for the \({l}^{th}\) locus, the t-test is \({t}_{GWAS}=\sqrt{(n-2)\frac{{h}^{2}}{(1-{h}^{2})}}\), in which \({h}^{2}=2{p}_{l}{q}_{l}{{a}}_{l}^{2}\) does not include a GxE component. This casts a discrepancy between linkage and association studies.

Replacing IBD with IBS, for \(n\) unrelated pairs HE-reg is \({\left({y}_{i}-{y}_{j}\right)}^{2}=a+\mathcal{b}{s}_{ij}+{e}_{ij}\). \({y}_{i}\) and \({y}_{j}\) are the standardized phenotypes for unrelated individuals \(i\) and \(j\), \({s}_{ij}\) the genetic relatedness score, \({e}_{ij}\) the residual and \(\mathcalligra{b}\) is the regression coefficient. Given an unrelated GWAS sample, the analytical solution for SNP heritability is

$$E(\mathcalligra{b})=\frac{1}{m}\frac{-2{\sum }_{k}^{m}{\sum }_{{l}_{1}}^{\mathcal{Q}}{\sum }_{{l}_{2}}^{\mathcal{Q}}{\rho }_{k{l}_{1}}{\rho }_{k{l}_{2}}{\beta }_{{l}_{1}}{\beta }_{{l}_{2}}}{{\overline{\rho }}_{mm}^{2}}$$
(8)

in which \(\mathcal{Q}\) is the number of causal variants, often unknown, and \({\overline{\rho }}_{mm}^{2}=\frac{1}{{m}^{2}}{\sum }_{{l}_{1}}^{m}{\sum }_{{l}_{2}}^{m}{\rho }_{{l}_{1}{l}_{2}}^{2}\). The sampling variance of \({\sigma }_{\mathcalligra{b}}^{2}\approx \frac{16}{n\left(n-1\right)}\frac{1}{{\overline{\rho }}_{mm}^{2}}\) for HE-reg (Chen 2014).

There are various unexplored paths for possible variation in Eq. 8, but we here direct our attention to a collapsed form

$$E\left(\mathcalligra{b}\right)\approx {-2h}^{2}\frac{{\overline{\rho }}_{m\mathcal{Q}}^{2}}{{\overline{\rho }}_{mm}^{2}}\approx -2{\widetilde{h}}^{2},$$
(9)

in which \({\overline{\rho }}_{m\mathcal{Q}}^{2}=\frac{1}{m\mathcal{Q}}{\sum }_{{l}_{1}}^{m}{\sum }_{{l}_{2}}^{\mathcal{Q}}{\rho }_{{l}_{1}{l}_{2}}^{2}\) and \({\widetilde{h}}^{2}=\sum_{k=1}^{\mathcal{Q}}{h}_{k}^{2}\) because the summation of the covariances of all pair of variables is zeroed out. A pair of assumptions are needed for Eq. 9: I) the first approximation occurs under the general polygenic assumption that causal variants are randomly allocated along the genome, and II) the second approximation holds when the markers are saturated enough, indicating that nearly every SNP is possibly causal (Yengo 2022). So, the approximation should be retained for Eq. 8 in view of these two assumptions. Of note, albeit pathologically, counter-examples can be found in which these two assumptions do not hold: see discussion in other studies (Chen 2014). The corresponding t-test \({t}_{AW}\) for whole-genome association analysis is (Table 2)

$${t}_{AW}=-\frac{\sqrt{\frac{n\left(n-1\right)}{2}-2}}{\sqrt{\frac{2}{{\widetilde{h}}^{4}}\frac{1}{{\overline{\rho }}_{mm}^{2}}-1}}\approx -\frac{n}{2}{\widetilde{h}}^{2}\sqrt{{\overline{\rho }}_{mm}^{2}}$$
(10)

However, when there is only one marker \({\overline{\rho }}_{mm}^{2}=1\), and Eq. 10 leads to a single-locus test for association t-test, \({t}_{AS}\), which is about (Table 2)

$${t}_{AS}\approx -\frac{n}{2}{\widetilde{h}}^{2}$$
(11)

If we replace the \(\pi\) in Eq. 1 with aggregated genome-wide IBD \(\widetilde{\uppi }=\frac{1}{m}\sum_{k=1}^{m}{\pi }_{k}\), it leads to whole-genome estimation of heritability for a linkage design (Visscher et al. 2006):

$${Y}_{ij}=\mu +\mathcal{B}{\widetilde{\pi }}_{ij}+{e}_{ij}$$
(12)

Using the same two assumptions for Eq. 8, the expectation of \(\mathcal{B}\approx -2{\widetilde{\mathcalligra{h}}}^{2}=\sum_{k=1}^{\mathcal{Q}}{\mathcalligra{h}}_{k}^{2}\), and its sampling variance is about \(\sqrt{\frac{1}{\left(n-2\right)}}\sqrt{16\mathcal{L}var({Y}_{ij}^{2})-4{\widetilde{\mathcalligra{h}}}^{4}}\), in which \(\mathcal{L}\approx 32\) Morgan is the genetic length of the 22 human autosomes. However, the number of markers in the linkage study determines the sampling variance of \(\mathcal{B}\). Consequently, the t-test \({t}_{LW}\) for whole-genome linkage analysis is (Table 2)

$${t}_{LW}=\frac{-\sqrt{n-2}}{\sqrt{16\mathcal{L}\left[2{\left(1-\theta \right)}^{2}+2\frac{\left(1-{\widetilde{\mathcalligra{h}}}^{2}\right)}{{\widetilde{\mathcalligra{h}}}^{4}}\right]-1}}$$
(13)

The statistics for the forking path of HE-reg

As derived above and summarized in Table 2, each t-test statistic can be approximately converted to \({\chi }_{1}^{2}\) with non-centrality parameter (NCP) of the squared t-test statistic as shown below:

$$NCP\left\{\begin{array}{c}\text{Symmetry I}:\text{ single locus}\left\{\begin{array}{cc}\text{Linkage}& \frac{n}{16}{\mathcalligra{h}}^{4}\\ \text{Association}& \frac{{n}^{2}}{4}{h}^{4}\end{array}\right.\\ \text{Symmetry II}:\text{ whole genome}\left\{\begin{array}{cc}\text{Linkage}& \frac{n-2}{8\mathcal{L}+32\mathcal{L}\frac{(1-{\widetilde{\mathcalligra{h}}}^{2})}{{\widetilde{\mathcalligra{h}}}^{4}}-1}\\ \text{Association}& \frac{{n}^{2}}{4}{\widetilde{h}}^{4}{\overline{\rho }}_{mm}^{2}\end{array}\right.\end{array}\right.$$
(14)

For each paired test statistic, Symmetry I or Symmetry II, under the pure additive model, the test derived from linkage often has less power than that from association; otherwise GxE is available and captured by HE-reg.

Simulation results

Validation for HE-reg in the presence of GxE

We want to validate the accuracy of the t-test statistics in Eq. 7 (\({t}_{LS}=\frac{-\sqrt{n-2}}{\sqrt{3+16\frac{\left(1-{\widetilde{h}}^{2}\right)}{{\widetilde{h}}^{4}}}}\)) in the presence of GxE. We simulated \(K\) nuclear families, each of which consisted of a pair of unrelated parents and a sib pair. The genotypic effects were set as 1, 0, and ‒1 for \(BB\), \(Bb\), and \(bb\) (Scheme I) for the first \(\frac{K}{2}\) families, and as ‒1, 0, and 1 for \(bb\), \(Bb\), and \(BB\) (Scheme II) for the second \(\frac{K}{2}\) families. The reference allele frequency was uniformly sampled from 0.01 ~ 0.5; \({h}^{2}=0.1, 0.25\), and 0.5, respectively; \(K=200\) and 500 respectively. For each simulated sib pair, \(\pi\) (1, 0.5, and 0, and \(var\left(\pi \right)=\frac{1}{8}\)) was directly known for the simulation data, and the marker itself was causal so that \({\left(1-2c\right)}^{2}=1\). We resampled each scenario 200 times. As a contrast, for \(K\) families, we also pooled their \(2K\) unrelated parents together, which made a sample for GWAS.

As illustrated in Fig. 1, since Scheme I and Scheme II were similar scenarios—but opposite effect size—they had very consistent results within HE-reg. When \(K\) increased from 200 to 500, the consistency remained. As expected, HE-reg was not as powerful as GWAS in catching a single Scheme-I or Scheme-II locus. However, when the Scheme I and II samples were pooled together and an unremovable GxE was present, HE-reg was more powerful to detect GxE, not just because of the doubled sample size; in contrast, GWAS had no power at all to detect the GxE effect because the effects cancel out under a GWAS model.

Fig. 1
figure 1

Validation for test statistics for HE-reg and GWAS. 200 simulations for each scenario, and sib pairs were analysed by HE-reg (first row) and their parents were pooled together and analysed by GWAS (second row), such as \({\varvec{y}}=a+b{\varvec{z}}+e\). In each cluster, the three coloured bars were for scheme I, scheme II, and scheme I + II samples. Of each pair of the same colour bars, the first one was the expected t-test statistic calculated by Eq. 7 (\({t}_{HE}=|\frac{-\sqrt{n-2}}{\sqrt{3+16\frac{\left(1-{\widetilde{h}}^{2}\right)}{{\widetilde{h}}^{4}}}}|\)) for HE-reg (first row) and \({t}_{GWAS}=\sqrt{\left(\frac{{h}^{2}}{1-{h}^{2}}\right)\left(n-2\right)}\) for GWAS (second row), and the second bar was the mean of the test statistic after 200 simulations and its standard error of mean was represented by the black lines atop. We took absolute values of Eqs. 7 and 8, and ignored the positive or negative signs

One criticism might be: why not include GxE in a GWAS model? A GWAS model could be more powerful for detecting a GxE locus, if we knew what the environment was. However, the number of environments is infinite in practice.

Power calculation for GxE using single-locus HE-reg for a linkage design

It should be noticed that HE-reg has much lower power in linkage than in the comparable association analysis, so it is trivial to compare the statistical power for linkage and association. We considered the statistical power for HE-reg to detect a single locus, which could be in the context of either \({h}^{2}\) or \({\mathcalligra{h\hskip .48pc}}^{2}\). In the simulations, we set the Type I Error rate \(\alpha =\frac{0.05}{\text{1,000,000}}\) after 1 million genotyped SNPs, and we set \({\mathcalligra{h\hskip .48pc}}^{2}=0.005\), 0.0075, 0.01, 0.015 and 0.02, respectively; the total number of sib pairs ranged from 500,000 to 10,000,000, with an increment of 500,000. Given type I error rate \(\alpha =\frac{0.05}{\text{1,000,000}}\) and if the statistical power 0.85 is considered acceptable in practice, for \({\mathcalligra{h\hskip .48pc}}^{2}=0.015\), \(n\approx \text{3,000,000}\) sib pairs are required, and for \({\mathcalligra{h\hskip .48pc}}^{2}=0.01\), \(n\approx \text{7,000,000}\) sib pairs are required (Fig. 2).

Fig. 2
figure 2

Statistical power to detect a single locus using HE-reg. y-axis represents the statistical power, and x-axis represents the number of sib pairs for HE-reg linkage. Different colour points represent different heritability (\({\mathcalligra{h\hskip .48pc}}^{2}=0.005\) in black, \({\mathcalligra{h\hskip .48pc}}^{2}=0.0075\) in red, \({\mathcalligra{h\hskip .48pc}}^{2}=0.01\) in green, \({\mathcalligra{h\hskip .48pc}}^{2}=0.015\) in blue, and \({\mathcalligra{h\hskip .48pc}}^{2}=0.02\) in cyan). The type I error rate \(\alpha =\frac{0.05}{1,\text{000,000}}\), given the number of genotyped SNPs is 1,000,000

Statistical test for symmetry I

Now we considered Symmetry I for a pair of t-tests \({t}_{LS}\) against \({t}_{AS}\) for single-locus tests in linkage and association designs, respectively. The elements for \({t}_{LS}\) are \({\widehat{\beta }}_{LS}\), \({\widehat{\sigma }}_{{\mathcalligra{h}}_{LS}^{2}}^{2}\) and sample size \({n}_{LS}\) sib pairs; and for \({t}_{AS}\) they are \({\widehat{b}}_{AS}\), \({\widehat{\sigma }}_{{h}_{AS}^{2}}^{2}\), and sample size \({n}_{AS}\) (Table 2). Using Welch’s modified two-sample t-test we have

$${t}_{I}=\frac{{\widehat{\beta }}_{LS}-{\widehat{b}}_{AS}}{\sqrt{{\widehat{\sigma }}_{{\mathcalligra{h}}_{LS}^{2}}^{2}+{\widehat{\sigma }}_{{h}_{AS}^{2}}^{2}}}$$
(15)

After further approximation \({t}_{I}^{2}\) follows \({\chi }_{1}^{2}\) with NCP \({t}_{I}^{2}\approx \frac{{\left(2{\widehat{h}}_{LS}^{2}-2{\widehat{h}}_{AS}^{2}\right)}^{2}}{\left\{\frac{1}{{n}_{LS}}\left[64\left(1-{h}^{2}\right)+12{\mathcalligra{h}}^{4}\right]\right\}+\left\{\frac{2}{{n}_{AS}^{2}}\left(8-4{h}^{4}\right)\right\}}\). We evaluated statistical power for Eq. 15 by setting a locus with GxE (\({\mathcalligra{h}}_{\text{LS}}^{2}\) from 0.005 to 0.05) but no additive effect, \({h}_{\text{AS}}^{2}=0\), as illustrated in Fig. 3. The number of SNPs was 1,000,000, and the Type I error rate was \(\alpha =\frac{0.05}{\text{1,000,000}}\). The sib pairs for \({n}_{LS}\)=50,000, 150,000, and 300,000, respectively, and \({n}_{AS}=\) 10,000 and 100,000, respectively. It was obvious that the statistical power benefited from a larger \({n}_{LS}\) (Fig. 3). Of note, Symmetry I could be implemented purely on a pair of summary statistics from single-locus linkage and single-locus GWAS.

Fig. 3
figure 3

Statistical power for Symmetry I single-locus test for GE. The NCP of the test statistic is \(\frac{{\left(2{\widehat{h}}_{LS}^{2}-2{\widehat{h}}_{AS}^{2}\right)}^{2}}{\left\{\frac{1}{{n}_{LS}}\left[64\left(1-{\mathcalligra{h\hskip .48pc}}^{2}\right)+12{\mathcalligra{h}}^{4}\right]\right\}+\left\{\frac{2}{{n}_{AS}^{2}}\left(8-4{h}^{4}\right)\right\}}\) for the transfomred pair of t-tests in Eq. 15. The y asix is statstical power given \({\mathcalligra{h}}_{LS}^{2}\) ranged from 0.005 to 0.05 (x-axis), evenly broken into ten values, and \({h}^{2}=0\) in association. The two horizontal lines were for statistical power of 0.25, 0.5, and 0.85, respectively. The sample sizes for linkage (\({n}_{LS}\)) and for association (\({n}_{AS}\)) are as shown each subtitle

Statistical test for symmetry II

Analogously, we could test for Symmetry II, and its t-test \({t}_{II}\) is

$${t}_{II}=\frac{{\widehat{\beta }}_{LW}-{\widehat{b}}_{AW}}{\sqrt{{\widehat{\sigma }}_{{\mathcalligra{h}}_{LW}^{2}}^{2}+{\widehat{\sigma }}_{{h}_{AW}^{2}}^{2}}}$$
(16)

which has NCP \({t}_{II}^{2}\approx \frac{{\left(2{\widehat{h}}_{LW}^{2}-2{\widehat{h}}_{AW}^{2}\right)}^{2}}{\left\{\frac{1}{{n}_{LW}}\left[16\mathcal{L}\left[2{\mathcalligra{h}}^{4}+8\left(1-{\mathcalligra{h\hskip .48pc}}^{2}\right)\right]-4{\mathcalligra{h}}^{4}\right]\right\}+\left\{\frac{2}{{n}_{AW}^{2}}\frac{8-4{\widetilde{h}}^{4}{\overline{\rho }}_{mm}^{2}}{{\overline{\rho }}_{mm}^{2}}\right\}}\). We conducted simulation for Symmetry II, and key parameters were set as \(\mathcal{L}=32\) Morgan and \({\overline{\rho }}_{mm}^{2}=\frac{1}{\text{100,000}}\). Two traits were considered, the first trait had \({h}^{2}=0.25\) and the second \({h}^{2}=0.45\), which could be detected by whole-genome association HE-reg; and in contrast, with the inclusion of GxE (from 0 to 0.2, with increments of 0.02), its \({\mathcalligra{h\hskip .48pc}}^{2}={h}^{2}+GxE\), which could be detected by whole-genome linkage HE-reg. The sample size for association was \({n}_{AW}=\) 10,000, and 100,000, respectively, and for linkage it was \({n}_{LW}=\text{50,000}\), 150,000, and 300,000, respectively. The sample size should be larger than 50,000 for linkage, otherwise hardly a sensible difference could be statistically detected.

Discussion

Both marginal environmental effect and GxE interactions can shape genetic architecture (Zhu et al. 2024). As revealed in this study, the original HE-reg actually captures not only an additive effect but also GxE. However, a GWAS model is very likely to miss such a GxE location if the environmental variable is unknown. So, given the intriguing behaviour as investigated between HE-reg and GWAS, it naturally leads to a discrepancy between linkage and association studies as long as genes harbor familial variation. The real advantage of our reinterpretation for HE is that a gene desert in GWAS may lead to rich results in single-locus HE-reg linkage, which can detect such a locus without specifying the environment (Fig. 1). If there are unexplored hotspots for GxE across the genome, it is likely to observe a GxE landscape for various traits and enrich our understanding of genetic architecture. According to our power calculation (Fig. 2) more sib pairs may enhance the statistical power of HE-reg linkage for detecting GxE. Symmetry I and II tests can be implemented on summary statistics (Figs. 3 and 4), so if we have a large sibling data resource, it is possible to construct summary statistic tests for GxE. So far, the publicly available sib pairs may be drawn from UK Biobank (Bycroft et al. 2018), but unfortunately there are far less, say about 20,000 sib pairs, than a sample size that leads to practical statistical power. As shown in the simulation results, a linkage study would have lower power to detect a signal than a GWAS, given the same sample size, which could be about 280,000 unrelated UKB samples, and consequently we anticipate finding only few signals in a linkage study. However, one technical difficulty may be how to distinguish between an independent signal and the synthetic signal of a LD block.

Fig. 4
figure 4

Statistical power for Symmetry II whole-genome test for GE. Numerical evaluation for statistical power of Eq. 16, and its converted NCP is \({t}_{II}^{2}\approx \frac{{\left(2{\widehat{h}}_{LW}^{2}-2{\widehat{h}}_{AW}^{2}\right)}^{2}}{\left\{\frac{1}{{n}_{LW}}\left[16\mathcal{L}\left[2{\mathcalligra{h}}^{4}+8\left(1-{\mathcalligra{h\hskip .48pc}}^{2}\right)\right]-4{\mathcalligra{h}}^{4}\right]\right\}+\left\{\frac{2}{{n}_{AW}^{2}}\frac{8-4{\widetilde{h}}^{4}{\overline{\rho }}_{mm}^{2}}{{\overline{\rho }}_{mm}^{2}}\right\}}\). \(\mathcal{L}=32\) Morgan, and \({\overline{\rho }}_{mm}^{2}=\frac{1}{\text{100,000}}\). The grey bar is for \({h}^{2}=0.25\) and \({\mathcalligra{h\hskip .48pc}}^{2}=0.25+\text{GE}\), and the cyan bar is for \({h}^{2}=0.6\) and \({\mathcalligra{h\hskip .48pc}}^{2}=0.6+\text{GE}\). GE is always taken value from 0 to 0.2 with increment of 0.02 for each step. The horizontal reference lines are of0.2, 0.5 and 0.85, respectively in each plot. The sample sizes for linkage (\({n}_{LW}\)) and for association (\({n}_{AW}\)) are as shown each subplot

One of the agenda items in addressing the “missing heritability” question is to narrow down the heritability gap between the family-based studies (linkage) and unrelated GWAS samples (association). In the absence of GxE, increasing the number of variants gives a possible way to solve this issue. The trait most studied is height, the heritability of which is estimated in European descendants using various markers. The benchmark source of family-based heritability estimate of \({\widehat{h}}_{FAM}^{2}=0.8\pm 0.1\) was based on 950 quasi-independent full-sib pairs, and their whole-genome IBD was estimated using 791 autosome microsatellite markers (Visscher et al. 2006); the compared SNP-heritability was \({\widehat{h}}_{SNP}^{2}=0.45\pm 0.083\), which was estimated from 3,925 unrelated samples and 294,831 SNP markers (Yang et al. 2010). The \({\widehat{h}}_{SNP}^{2}\) so far has approached \(0.68\pm 0.10\) using 25,465 unrelated European descendants on the whole-genome sequenced TopMed data with \(33.7\times {10}^{6}\) variants (Wainschtein et al. 2022). Given such a small sample size for the compared family study, it was not likely to reach statistical significance. Symmetry II test indicates another way for interpreting the heritability gap. The gap is not necessarily about inclusion of more variants but rather about using a family-based design instead.