Background

Asthma is a chronic disease with a strong genetic component in its etiology, having heritability estimates ranging from 25 to 73 % [1, 2]. However, the specific genetic factors that underlie this heritability remain unclear, as most factors identified thus far tend to have weak effect sizes and are poorly replicated across studies [3, 4]. Gene-gene interaction, the phenomenon where the phenotypic effect of a variant at one genetic locus depends on variants at other loci [5, 6], is one possible explanation for this “missing heritability” and the lack of replication of genetic effects for many complex diseases [79]. Thus, the identification of gene-gene interactions could enable a greatly improved understanding of the genetic etiology of asthma.

More than 45 studies have attempted to identify gene-gene interactions in asthma, with over 190 significant interactions reported thus far [10]. However, these studies have tended to use candidate gene approaches to narrow the genetic space in which to search for interactions, typically focusing on loci that have been reported to show significant marginal effects (i.e., effects that manifest when loci are assessed individually). To our knowledge, only one asthma association study involving a genome-wide search for gene-gene interactions has been published, with null findings [11]. Although this may be due in part to the lower likelihood of publication of null findings, it appears that the search space of pairwise interactions has been sparsely explored in asthma.

In order to identify interactions involving loci that might not otherwise be considered, we aimed to conduct a genome-wide search for gene-gene interactions in asthma susceptibility. Data from the GABRIEL Consortium genome-wide association study of asthma [12] were obtained for this purpose, which included a total of 4186 cases and 3916 controls. To maintain power in detecting an interaction, we used a two-stage screening and follow-up approach. In this design, the screening stage was used to conduct the exhaustive genome-wide search for interactions, and the follow-up stage was used to assess all interactions with a p-value < 10−5. In addition, a preliminary attempt at replication was made for the most significant interaction identified from the follow-up.

Methods

Study datasets

All data in the current study were obtained from the GABRIEL Consortium [12] (European Genome-phenome Archive Study ID: EGAS00000000077). Approval for the use of this data was obtained from the Gabriel Consortium Data Access Committee. Data from 12 of the GABRIEL study centers were included (Table 1). In this paper, we refer to any dataset particular to an individual study center as a “sub-study” dataset. Case/control status, sex, and genotype information were available, while age and all other information were not available. In general, cases were subjects with self-reported doctor-diagnosed asthma, although some sub-studies had alternative definitions (Table 1). In particular, the AUGOSA and SEVERE sub-studies involved subjects with severe asthma. The MAS, AUGOSA, and SEVERE sub-studies only had cases available. All subjects were genotyped on the Illumina Human610 quad array (N = 582,892 single nucleotide polymorphisms [SNPs]).

Table 1 Sub-studies included from the GABRIEL Consortium

Quality control (QC)

Subjects were excluded if any of the following were true: having a subject call rate < 95 %; having an ambiguous SNP-estimated sex (X-chromosome homozygosity, F, between 0.2 and 0.8); being flagged for removal due to relatedness with other subjects; and being flagged as a principal components outlier. Subject relatedness was determined by calculating pi-hat (proportion identical by descent) between every pair of subjects; one randomly chosen subject of every pair with pi-hat ≥ 0.20 was flagged for removal. Principal components analysis was performed using EIGENSTRAT; a subject was flagged as an outlier if he or she had a value >6 standard deviations on the top 10 components. Both subject relatedness and principal components analysis were assessed in linkage-disequilibrium (LD) pruned datasets (based on pairwise SNP correlations; r2 threshold of 0.5).

SNPs were excluded if any of the following were true: unable to convert from hg18 to hg19 reference genome coordinates; having a SNP call rate <98 % in either cases or controls (for MAS, AUGOSA, and SEVERE, only cases were considered); and having a test for deviation from Hardy-Weinberg equilibrium (HWE) with p-value < 10–4 among controls (for MAS, AUGOSA, and SEVERE, the criteria was p-value < 10−7 among cases).

All QC procedures were performed within each sub-study individually. QC exclusion counts and final subject characteristics are shown in Additional file 1: Tables S1, S2 and S3.

Population stratification

To check for test statistic inflation due to population stratification or other sources, quantile-quantile (QQ) plots of observed vs. expected-under-the-null p-values were evaluated. This was performed for both marginal effect p-values and interaction p-values, within each sub-study (for case-only studies, only interaction QQ plots were evaluated). P-values from the follow-up interaction meta-analysis were also examined via QQ plots. Little or no inflation was noted (Additional file 1: Figure S1). At most, a minor amount of inflation was noted for UFA in the marginal effects, but this did not appear to manifest in the interaction effects.

Analytical methods

Screening analysis

The largest sub-study available, GABRIEL-AS (GABRIEL Advanced Surveys), was used as a screening dataset. All pairwise interactions between SNPS with minor allele frequency (MAF) ≥ 0.20, only including chromosomes 1 through 22, were analyzed for the outcome of asthma. This included N = 301,547 SNPs and N = 4.55 × 1010 interactions. The search space was not trimmed for LD, and thus a portion of these interactions was likely statistically redundant. The analysis was performed using the logistic regression (−lr) function in the CASSI Genome-Wide Interaction Analysis Software package, due to its computational speed advantages when performing epistasis (gene-gene interaction) analyses (version 2.50; https://www.staff.ncl.ac.uk/richard.howey/cassi/index.html). All interactions with a p-value < 10−5 in this analysis were tested again using the --epistasis function in PLINK (version 1.07; http://pngu.mgh.harvard.edu/~purcell/plink/) [13]. This was done in order to confirm that interactions met the same p-value threshold using both methods, since minor differences in p-values between the methods were noted. All interactions with a p-value < 10−5 from both methods (N = 345,034 interactions) were included for the follow-up analysis. At this significance level, the GABRIEL-AS study has greater than 90 % power to detect interactions with an interaction odds ratio (OR) 2.5 or more, for SNPs with MAF 0.20 or more (for additional power estimates, see Additional file 1: Table S4).

Follow-up analysis

The seven next-largest studies containing both cases and controls, including BAMSE, BUSSELTON, EGEA, KMSU, PIAMA, SAPALDIA, and UFA, were used as the follow-up datasets. All interactions that passed the screening analysis were analyzed in each of these datasets, individually, using the --epistasis function in PLINK. Each such analysis involved a logistic regression model that included a main effect variable for each SNP and an interaction variable (defined as the product of the main effects variables), with each SNP variable coded additively (taking on the values 0, 1, or 2, depending on the number of copies of the non-referent allele present). Asthma (log odds of being a case) was the outcome. All studies used the same non-referent allele for each SNP; this was the minor allele among controls in GABRIEL-AS. Interaction p-values originated from a test of the null hypothesis that the interaction variable parameter is equal to zero, after adjusting for the main effects. Gene-gene interaction is represented in these models as deviation from additivity of combined SNP effects on the log-odds scale of disease. SAS version 9.3 (SAS Institute; Cary, NC) was used to obtain full parameter estimates (i.e., including both main effects and interaction terms) of the most significant interactions. Parameter estimates were meta-analyzed via fixed effects or random effects models, across the N = 7 follow-up studies. Heterogeneity was assessed via calculations of I2 and the Q test. If an interaction had an I2 < 50 % and a Q test p-value ≥ 0.10, then fixed effects estimates were used; otherwise, random effects estimates were used. Meta-analyses were performed using R version 3.11 (http://www.r-project.org) and the package “meta”, version 3.7-1 (http://cran.r-project.org/web/packages/meta/). The top ten most significant interactions from the follow-up, as ranked by meta-analysis interaction p-value, are listed in Additional file 1: Table S5. For the follow-up meta-analysis, statistical significance was defined as 0.05 corrected for 345,034 tests of interaction (p < 1.45 × 10−7). At this significance level, an approximate estimation of power indicates that the follow-up analysis has greater than 90 % power to detect interactions with an interaction odds ratio 1.75 or more, for SNPs with MAF 0.20 or more (for additional power estimates, see Additional file 1: Table S4).

Replication analysis

The smallest case-control study (KARELIA) and the case-only studies (MAS, AUGOSA, and SEVERE) were used as replication datasets. In KARELIA, the interactions chosen for replication were analyzed using logistic regression modeling, as described above. In MAS, AUGOSA, and SEVERE, the interactions were analyzed using the adjusted fast epistasis case-only test (−afe-co-only) in CASSI, which is a test of correlation between alleles at two different SNPs. Under the assumption that such a correlation does not exist in controls, the case-only test reflects a test for gene-gene interaction, since allelic co-occurrence might indicate that the alleles jointly affect disease risk [14]. More specifically, an interaction term from a logistic regression model involving binary predictors can be seen to be equivalent to the ratio of the OR of association between alleles in cases to the OR of association between alleles in controls. Therefore, if the OR of association between alleles in controls is one, then the OR of association between alleles in cases is equal to the interaction odds ratio [14]. However, findings from case-only tests must be interpreted with abundant caution, since the “no correlation among controls” assumption is potentially suspect given the effects of linkage disequilibrium [6]. Results from the four replication datasets were not meta-analyzed, due to heterogeneity in the tests used and the phenotypes assessed (AUGOSA and SEVERE involved severe asthma cases, whereas KARELIA and MAS cases were not defined based on severity), as well as due to the unidentified sources of heterogeneity that led to discordance in the directions of effects (Table 2). For the replication analysis, significance was defined as p < 0.05. Given the use of the case-only approach for three of the four replication datasets, results from these analyses were considered preliminary, as any significant findings will require further study using a more robust case-control approach.

Table 2 Model estimates of interaction between 20p13 and 2q14

Assessment of interactions surrounding the top hit in the follow-up analysis

Variants located within intervals spanning 100 kilobases (Kb) upstream and downstream of rs11684871 (chromosome 2) and rs910652 (chromosome 20) were imputed using IMPUTE2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) and 1000 Genomes reference haplotype panels (Phase 1 integrated variant set release v3) [15]. This was done within each sub-study individually, after quality control procedures had been applied. The buffer size was set as 250 Kb, and reference SNPs with MAF < 0.01 among 1000 Genomes subjects of European descent were filtered (ignored) during the imputation. In the screening and follow-up datasets, all QC-passed directly genotyped SNPs within the interval specified above, and all imputed SNPs with certainty ≥ 0.80 and info ≥ 0.50, were included for analysis. SNPs were excluded from analysis if they had a MAF < 0.10 (based on frequencies in GABRIEL-AS controls). Pairwise interactions between chromosome 2 and 20 SNPs were assessed using the PLINK --epistasis function (N = 124,502 interactions). Statistical significance for the follow-up studies in analyzing interactions involving these regions was set as 1.06 × 10−7 (0.05 corrected for 345,034 + 124,502 = 469,536 interactions). For replication in the case-only sub-studies, the CASSI case-only analysis was used, as described above. Since the imputation quality was high (median concordance across the 12 sub-studies: 95.6 for chromosome 2 SNPs, 96.5 for chromosome 20 SNPs), SNPs were analyzed using rounded genotypes, where the highest probability genotype was called as 1 if its probability was ≥ 0.90, and the other two genotypes were called as 0; if no probability was ≥ 0.90, then all three genotypes were set as missing.

Regulatory annotation

In order to identify any possible functional, regulatory sites within the genomic intervals surrounding the SNPs involved in the top interaction (as specified above), we used data from RegulomeDB [16] and GTEx [17]. Specifically, RegulomeDB Version 1.1 was used to identify all genomic positions having a RegulomeDB score of 1 or 2 (i.e., sites that are likely to affect transcription factor binding). GTEx Analysis Version 4 was used to identify all significant eQTL SNPs for the gene targets of ADAM33, HSPA12B, SIGLEC1, and GLI2 (as determined from any tissue type). Only three eQTL SNPs, all for ADAM33, were found (rs609203, rs2853211, and rs554743). These annotations were used to illustrate Fig. 1.

Fig. 1
figure 1

Gene-gene interactions between 20p13 and 2q14. P-values for interactions between variants within 0.1 megabases (Mb) of rs910652 and rs11684871 are represented. Gene annotations are from LocusZoom (http://locuszoom.sph.umich.edu/locuszoom/); regulatory annotations are from RegulomeDB and GTEx (see the Methods section). The side plots show LD structure; stronger colors on these plots indicate higher pairwise LD as measured by D’ (generated via Haploview [http://www.broadinstitute.org/haploview]). Interaction p-values and LD measures from the screening dataset (a) and follow-up datasets (meta-analysis p-values) (b) are shown. The green lines identify correspondence between genomic positions and LD plot positions, for the highlighted variants

Results

Screening analysis

In the screening analysis, all 4.55 × 1010 pairwise interactions among the 301,547 SNPs with MAF ≥ 0.20 were assessed. The most significant interaction from this analysis (rs12289833 [11q25] × rs5027694 [13q13]) had a p-value of 3.43 × 10−10. A total of 345,034 interactions with a p-value < 10−5 from the screening analysis were included for further evaluation in the follow-up analysis.

Follow-up analysis

In the follow-up analysis, no interaction exceeded the pre-specified significance threshold of p < 1.45 × 10−7 after being meta-analyzed across the seven follow-up sub-studies. The top-ranked interaction in the follow-up (rs910652 [20p13] × rs11684871 [2q14]) had a meta-analysis p-value of 1.58 × 10−6, although the interaction odds ratio was not consistent with that of the screening analysis (Fig. 2; Table 2). Interestingly, rs910652 is located approximately 78 kilobases upstream of ADAM33, which is one of the most well-studied susceptibility loci for asthma. The other SNP, rs11684871, is either upstream, or in an intron, of GLI2, depending on which splice variant is considered. Due to the possible involvement of the ADAM33 locus in this interaction, a more detailed investigation of the surrounding pairwise chromosomal space was performed. Specifically, we imputed variants 100 kilobases upstream and downstream of each SNP, and assessed all interactions between these areas (N = 124,502 interactions; Fig. 1). In these areas, the strongest interaction (rs1018491 × rs67959717) had a p-value of 5.78 × 10−8, which exceeded a conservative significance threshold of 1.06 × 10−7 (Table 2; Fig. 1).

Fig. 2
figure 2

Forest plot of effect estimates for the interaction between rs910652 and rs11684871. Parts of this figure were made with RevMan 5.3 (http://tech.cochrane.org/revman)

Replication analysis

Next, we evaluated the rs910652 × rs11684871 and rs1018491 × rs67959717 interactions in the replication analysis (Fig. 2; Table 2). A case-only interaction analysis was performed for MAS, AUGOSA, and SEVERE, since these datasets did not contain controls. Two of these sub-studies, AUGOSA and SEVERE, involved subjects with severe asthma. The rs910652 × rs11684871 interaction was significant at p < 0.05 in MAS (p = 0.042) and approached significance in AUGOSA (p = 0.073). The case-only odds ratios for these studies were consistent in direction with the follow-up studies (confirmed by also performing case-only analyses in the follow-up studies; not shown). The interaction was not significant and in opposite direction in KARELIA and SEVERE. The rs1018491 × rs67959717 interaction was not significant in any replication dataset; it approached significance (p = 0.061) in SEVERE, although the direction was opposite that of the direction in the follow-up studies.

Discussion

The top-ranked gene-gene interaction in the follow-up analysis (rs910652 × rs11684871) involved a SNP (rs910652) located near the ADAM33 gene. ADAM33 was the first gene to be linked to asthma susceptibility, and numerous studies have confirmed associations between this gene and asthma, although the specific causal variants are unclear and there are notable inconsistencies across studies [18]. Despite the fact that rs910652 is located in a flanking gene (HSPA12B), and does not appear to be in strong LD with ADAM33 (Fig. 1), one possible explanation for our findings is that this SNP could be tagging a variant that influences the regulation of nearby genes. This is indirectly suggested by our observation that the strongest interaction in the surrounding region (rs1018491 × rs67959717) involves a SNP (rs1018491) that is located intergenically, closer to ADAM33 and between SIGLEC1 and HSPA12B. This SNP is 38 bases away from a genomic position (chr20:3701170; itself not imputed in the current study) that may affect transcription factor binding (RegulomeDB score: 2b; putative bound protein: TCF12). However, the rs1018491 × rs67959717 interaction was not replicated, which could suggest that the functional variant is actually located elsewhere on chromosome 20. The closest known ADAM33 eQTL (rs609203) is in an intron of the SIGLEC1 gene (Fig. 1), but the strongest interaction involving this SNP was of modest significance (rs609203 × rs4848623; follow-up interaction p-value: 0.0026; data not shown). An alternative explanation is that HSPA12B itself is the functional gene of the interaction on chromosome 20. This gene does not have a clear role in asthma, but one study has reported its association with lung function at the same SNP we identified (rs910652) [19].

The other gene in the identified interaction, GLI2, has been reported to be involved in T-cell function [20, 21] and TGFβ expression [22], which are both well-known components of asthma pathogenesis [23, 24]. In addition, this gene is known to play a role in lung development [25, 26]. Thus, it is highly plausible that GLI2 has biological relevance in asthma. Transcriptional activation of GLI2 depends on SMAD3 [27], which was one of the top asthma-associated genes in the original GABRIEL study [12]. The functional interaction that underlies our observed statistical interaction is unclear, as such an interaction between GLI2 and either ADAM33, SIGLEC1, or HSPA12B has not been described. However, suggestions of a biological relationship between GLI2 and ADAM33 can perhaps be seen in the fact that both of these genes are regulated by TGFβ [18, 27].

A limitation to our results is that the directions of effect of the identified interactions were inconsistent between the screening and follow-up analyses (Table 2). The simplest explanation for this is that the detected signal arose by chance in the screening analysis. An alternative, mechanistic explanation is that the direction of the observed gene-gene interaction depends on an unknown environmental exposure (e.g., when a subject is exposed, the gene-gene interaction is epistatically protective, and when unexposed, it is epistatically deleterious) that has differences in prevalence between the screening and follow-up studies. Although most of the included GABRIEL sub-studies are relatively similar to one another, one notable difference between the GABRIEL-AS (screening) sub-study and the others is that, in GABRIEL-AS, subjects were recruited from rural areas [28], which may have an overall protective effect on asthma [29]. It is possible that different variants on chromosome 20 are involved in this hypothetical environmental dependency, given that the gene-gene interaction signal maximizes at different chromosomal locations between the screening and follow-up studies (Fig. 1).

An additional limitation is that we relied on case-only analysis for replication. This method assumes that correlation between alleles only exists in cases (due to gene-gene interaction affecting the phenotype) and not in controls; this assumption could be violated, particularly if linkage disequilibrium exists between the SNPs. Since the interacting SNPs are on different chromosomes, we believe that this assumption is not unreasonable, but nevertheless a case-control approach is required to confirm the replication. Finally, the screening analysis was powered to detect interaction ORs of 2.50 and above (for MAF 0.2 and above at each SNP); this may be an unrealistically high effect estimate, and the study likely missed interactions of weaker effect and/or less common SNPs from the screening stage.

Conclusions

In summary, using a genome-wide approach of screening for gene-gene interactions, we identified and found suggestive evidence of replication for an interaction in asthma (rs910652 × rs11684871), where one of the participating SNPs (rs910652) is in close proximity to ADAM33 and the other (rs11684871) is located in or near GLI2. Both SNPs are highly plausible asthma risk variants, given known associations between ADAM33 and asthma, a previously reported association between rs910652 and lung function, and the roles of GLI2 in T-cell regulation and lung development. To our knowledge, GLI2 has not been previously implicated in asthma susceptibility. This could be explained by our observations that the GLI2 SNPs did not manifest any marginal effects (Additional file 1: Tables S6-S7), suggesting that these SNPs would be undetected in the traditional association study design that examines each locus individually.

Further investigation is required to confirm the putative interaction identified in the present study, as well as to identify the possible causal loci underlying the interaction. If the interaction is genuine, it could provide new insight into the biological role of ADAM33 or its neighboring genes in asthma susceptibility, and may also emphasize GLI2 as a new gene of investigation for asthma. This study may also inform general strategies for epistasis investigations: given that one putatively interacting locus is in a region previously associated with asthma or lung function (ADAM33/HSPA12B), and the other is highly biologically plausible (GLI2), this study provides evidence in support of investigational strategies that restrict the epistasis search space to features with prior disease relevance.

Abbreviations

CI, confidence interval; Kb, kilobases; MAF, minor allele frequency; Mb, megabases; OR, odds ratio; QC, quality control; QQ, quantile-quantile; SNP, single nucleotide polymorphism