Introduction

Animal models are very useful for identifying genes that affect phenotypes, including diseases. Experimental studies in animal models (e.g., mouse) have an advantage in identifying phenotype-related genes and clarifying their functional roles because experiments can be done with interventions while controlling for the genetic background, age and environment of the animals. There are several approaches for clarifying phenotypic effects of genes (transgenic or knock-out animals, mutagenesis with ENU, RNAi experiments, transcriptome analysis, etc.) (Gondo 2008). Many types of animal models for human disease have been constructed to examine the functional roles of genes. A limitation of this approach is that it is usually uncertain whether the human orthologue of the identified gene has the same functional role in humans.

Genome-wide association study (GWAS) is a powerful tool for dissecting complex traits by identifying loci associated with particular diseases, and the number of GWAS reports has been rapidly increasing. The identified genes or loci could be seeds for functional analysis, risk prediction and personalized medicine. However, the roles of the identified genes in pathogenesis have typically not been clarified, and further study is required (Hindorff et al. 2009). Another limitation of GWAS is that statistical analysis with only common SNPs may miss some pathological genes for which individual genetic differences cannot be captured by proxy common variants. For example, the power to detect causal rare variants would be too small because of low linkage disequilibrium (LD) between the causal and the proxy common variants. It should therefore be useful to inspect genes with moderate p values while simultaneously looking at other information such as biological pathways, gene expression, and evidence from animal models. A translational approach that integrates genetic association studies in humans with experiments in mice thus has potential value for finding additional disease-related genes by taking advantage of both approaches.

Alzheimer’s disease (AD) is a common neurological disease that causes dementia in humans. Aβ accumulation is the central pathology of AD. The molecular pathogenesis of Aβ accumulation in familial AD has been explained by the causative genes APP, PSEN1 and PSEN2 (Hardy and Selkoe 2002; Rogaev et al. 1995; Sherrington et al. 1995). Genetic risk factors have been reported for sporadic AD (APOE, etc.) (Bertram et al. 2007; Lambert et al. 2013; Saunders and Roses 1993; Saunders et al. 1993a, b; Strittmatter et al. 1993). However, the mechanism that leads to the accumulation of Aβ in the early stage of AD is not well understood (Gaiteri et al. 2016).

Among approaches using mouse models of human diseases, transcriptome analysis has an advantage: the transcriptome is well preserved between human and mouse brains (Miller et al. 2010), and this may facilitate translational research from mouse to human. APP Tg mice that reproduce Aβ accumulation in the brain are widely used as model animals of AD. Taking advantage of transcriptome analysis in the mouse model, our previous study (Gan et al. 2015; Morihara et al. 2014) used a genome-wide transcriptome analysis of various mouse strains with different susceptibilities to Alzheimer’s disease. Genes detected by conventional transcriptome analysis include both causative genes and genes affected by disease pathogenesis. To ensure that we detected genes affecting AD pathology, we implemented a two-step approach in our transcriptome analysis. First, we used non-transgenic mouse strains with no Alzheimer pathology and selected the genes with differential expression compared to the low-susceptibility strain. This use of non-transgenic mice selects genes for which differences in expression are based on the genetic backgrounds and not on secondary effects caused by Aβ accumulation. Second, we used APP transgenic mice with mixed genetic backgrounds to find genes associated with accumulation of Aβ. The top genes whose expression levels were highly correlated with accumulation of Aβ may have roles in the accumulation of Aβ in the brain. A further examination of those genes in humans, or an integrated analysis with human data, was desired.

In the current study, to identify novel AD-related genes that cause Aβ accumulation, we took an integrated approach that combines statistics from human GWAS and mouse transcriptome experiments (Fig. 1). First, using the correlation between gene expression level and accumulation of Aβ in the mouse model (Morihara et al. 2014), we obtained a p value for each mouse gene as the significance of the correlation. Second, by utilizing SNP-based statistics from a previous GWAS of human subjects with AD (Hirano et al. 2015), we obtained gene-based statistics from the SNP-based statistics. Third, we combined the results of the two types of analyses using orthologous gene pairs between human and mouse. Then, each gene was evaluated for AD susceptibility by the combined p value calculated from the two types of p values. This integrated analysis detected five significant genes as candidate genes for AD pathogenesis. We examined the expression levels of those genes in human AD subjects who were independent of the GWAS subjects. Two of the five genes showed significantly lower expression levels in human AD patients than in controls, which is consistent with their mouse orthologues, which showed a negative correlation between gene expression level and Aβ accumulation.

Fig. 1

Scheme of integrated analysis of mouse transcriptome and human GWAS. To detect genes affecting AD pathology, we implemented two steps in our transcriptome analysis (green). First, we used non-transgenic mouse strains with no Alzheimer pathology and selected the genes with differential expression in the low-susceptibility strain (DBA/2). This use of non-transgenic mice means that differences in gene expression are based on the genetic backgrounds and not on secondary effects caused by Aβ accumulation. Second, we used APP transgenic mice with mixed genetic backgrounds to find genes associated with accumulation of Aβ (middle left). In mouse brain, the relationship between Aβ accumulation and gene expression was examined, and the p value of the correlation was obtained. Genome-wide association with AD was conducted with human subjects (Hirano et al. 2015), and SNP-based GWAS statistics were converted into gene-based statistics (blue). Both types of gene-based statistics from mouse and human were integrated through orthologous gene pairs, and a combined p value was calculated by the inverse-normal method (also known as Stouffer’s Z score method) without weighting (magenta, see “Materials and methods”). Candidate genes were prioritized by the combined p values. The significant genes were selected for further evaluation. Human hippocampus postmortem samples were used to determine whether the gene is expressed differently between AD patients and controls.

Materials and methods

Gene expression and Aβ accumulation in transgenic mice

Aβ levels in mouse brains and two sets of genome-wide gene expression data in mouse hippocampus were obtained in a previous study (Morihara et al. 2014) (Fig. 1). The first set of genome-wide gene expression data (12 arrays) was from three inbred non-transgenic (non-Tg) mouse strains. We chose the genes that were differentially expressed in the mouse strain with lower susceptibility to AD (DBA/2) compared to the other strains (C57BL/6 and SJL). Because these mice carry no APP transgene and have no Aβ pathology, the difference in expression levels is based on their genetic background and not on secondary effects caused by Aβ accumulation. From the original transcriptome data containing 13,309 probes for 9964 genes, we selected 373 genes that had significant differential expression (Student’s two-tailed t test p < 0.001, FDR = 3.05%) in the DBA/2 strain compared to the C57BL/6 and SJL strains. These 373 genes reflect physiological changes in neurodegeneration and some may be disease-causing.

The second set of genome-wide gene expression data (28 arrays) was from APP transgenic (Tg) mice with mixed genetic backgrounds from different strains (DBA/2, C57BL/6 and SJL). This APP transgene causes Aβ accumulation in the brain and APP transgenic mice (Tg2576) are widely accepted as AD model animals. The accumulated Aβ levels in these mouse brains were measured by ELISA. Statistical significance of correlation between gene expression levels of the 373 genes selected as above and accumulation of Aβ was tested, and the obtained p values were used in the following integrated analysis (see below). All mouse transcriptome datasets used in this study have been deposited in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo) under accession GSE40330.
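As an illustration of testing the correlation between a gene's expression level and Aβ accumulation, the following Python sketch computes a Pearson correlation coefficient and a two-sided permutation p value. The text does not state which correlation test was used, so the permutation test is a stand-in chosen for this example, and the function names, the number of permutations and the example values are all hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # fails if either input is constant


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p value for the correlation of x and y.

    Assumption: a permutation test stands in for whatever test the
    original analysis used; n_perm and seed are illustrative defaults.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical values: expression of one gene and Abeta level in 8 mice.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
abeta = [16.0, 14.0, 12.0, 10.0, 8.0, 6.0, 4.0, 2.0]
r = pearson_r(expression, abeta)
p = permutation_p(expression, abeta)
```

In this toy input the two variables are perfectly negatively correlated, so r is -1 and the permutation p value is small. A negative r corresponds to the direction described later for genes whose expression decreases as Aβ accumulates.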

Gene-based statistics from human GWAS

We used the GWAS statistics of a previous study (Hirano et al. 2015) of 811 AD case individuals and 7504 control individuals with 583,884 autosomal SNPs (Supplementary Fig. 1a). That study used samples belonging to the Hondo cluster (Yamaguchi-Kabata et al. 2008) of the Japanese population, and the association analysis was adjusted for age and gender. Judging from the distribution of the obtained p values (Supplementary Fig. 1b for Q–Q plot; lambda (the genomic inflation factor) = 1.078), we observed no significant confounding effect of subject ancestry. We conducted a principal component analysis with this dataset and obtained principal components (PCs) for the genetic backgrounds of the subjects. However, we did not include any PCs as covariates in the association analysis, because including them did not meaningfully reduce lambda (1.076).
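As a rough illustration of how a genomic inflation factor such as the lambda quoted above can be computed from a list of SNP p values, the following Python sketch converts two-sided p values into 1-df chi-square statistics and divides their median by the theoretical median of that distribution (about 0.455). The median-based estimator and the function name are assumptions for this example; the text does not state which estimator or software produced the reported value.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of the chi-square distribution with 1 degree of freedom (~0.4549),
# obtained as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Median-based genomic inflation factor from two-sided p values.

    Each p value must lie strictly between 0 and 1.
    """
    # A two-sided p value p corresponds to |z| = -inv_cdf(p / 2),
    # and z squared is a 1-df chi-square statistic.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p values mimic a well-calibrated test with no inflation.
uniform_p = [(i - 0.5) / 1001 for i in range(1, 1002)]
lam = genomic_inflation(uniform_p)
```

For evenly spread p values the estimate is essentially 1; values noticeably above 1 would point to inflation of the test statistics, for example from population stratification.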

From the SNP-based GWAS statistics, gene-based statistics were obtained to conduct the integrated analysis with the other gene-based data. There are several available methods for generating gene-based statistics (Bacanu 2012; Christoforou et al. 2012; Lehne et al. 2011; Li et al. 2011; Liu et al. 2010; Neale and Sham 2004). Basically, they address two issues: (1) the number of SNPs varies among genes and (2) SNPs within a gene are not independent because of local LD. The GATES (gene-based association test using extended Simes procedure) method (Li et al. 2011) is one of these methods and is implemented in the KGG system (http://grass.cgs.hku.hk/limx/kgg/). This method does not require simulation, and the KGG system works with a list of SNP p values and LD data. We calculated the gene-based p values from the SNP p value list using the KGG system, with LD data from HapMap JPT (Japanese from Tokyo) genotype data. After examining how the defined gene regions and LD influence the assignment of SNPs to genes (Supplementary Table 1), each SNP was assigned to one or more genes if it was located within the mapped region of the mRNA of the gene, including 3 kb of the 5′ and 3′ flanking regions. In addition, SNPs outside a gene region were assigned to the gene if they were in high LD (r² > 0.8) with SNPs within the gene. Using the KGG system, gene-based statistics were obtained for 30,584 transcripts, a set of all human transcripts. On average, 18.0 SNPs were assigned to each gene.
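To make the idea behind a Simes-type gene-based test concrete, the following Python sketch implements the plain Simes procedure, which is the special case in which all SNPs in a gene are treated as independent. GATES extends this by replacing the SNP count and the rank with effective numbers of independent tests estimated from LD; that LD-based step is not reproduced here. The function name and the example p values are hypothetical, and this is not the KGG implementation.

```python
def simes_gene_p(snp_p_values):
    """Gene-level p value by the plain Simes procedure.

    Simplifying assumption: SNPs are treated as independent. GATES would
    use effective numbers of tests (derived from LD) instead of m and rank.
    """
    ordered = sorted(snp_p_values)
    m = len(ordered)
    if m == 0:
        raise ValueError("at least one SNP p value is required")
    # For the j-th smallest p value, the Simes statistic is m * p_(j) / j;
    # the gene-level p value is the minimum over j.
    candidates = (m * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, min(candidates))


# Hypothetical SNP p values for one gene.
gene_p = simes_gene_p([0.04, 0.01, 0.5])
```

With the three hypothetical p values above, the candidates are 0.03, 0.06 and 0.5, so the gene-level p value is 0.03. Because the statistic scales with the number of SNPs, a gene with many SNPs does not gain an advantage merely from having more tests.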

To include another type of association test, one that aggregates rare and common variants within a gene, we used the SKAT_CommonRare function of SKAT (version 1.3.2.1) (Ionita-Laza et al. 2013; Wu et al. 2011) with default parameters for SKAT-C and Burden-C. For each gene, we used the same SNP set described above: SNPs within the gene region including 3 kb upstream and downstream, and SNPs in LD with those within the gene region. We included age and gender as covariates for calculating the statistics. The obtained p values were integrated with the mouse gene expression p values for each gene (see “Integrated analysis”).

GWAS statistics data from the International Genomics of Alzheimer’s Project (IGAP)

As another dataset for evaluating our methodology, we downloaded GWAS statistics data from the International Genomics of Alzheimer’s Project (IGAP) (Lambert et al. 2013). This GWAS was based on cases and controls of European ancestry. The p value list for the combined set of the GWAS (stage 1; 17,008 cases and 37,154 controls) and a follow-up study (stage 2; 8,572 cases and 11,312 controls for 11,632 SNPs after quality-control filtering) was used after gene-based annotation using Annovar (Wang et al. 2010) with the “refGene” table. We selected 954 genes linked to the top SNPs with P < 0.001 for further examination. Eleven of these genes were in common with the 373 genes selected from the mouse expression experiment data. For these eleven genes, SNPs in the IGAP stage 1 set were assigned to genes in the same way as described above, and we obtained gene-based statistics using the GATES method implemented in KGG system.

Integrated analysis

The mouse transcriptome data and the gene-based statistics from the human GWAS were combined using orthologous gene pairs between human and mouse. The orthology table from Mouse Genome Informatics (http://www.informatics.jax.org) (Shaw 2004) was used to identify orthologous genes between human and mouse. The human and the mouse data were combined for the 373 genes (409 probes for the mouse data). To obtain the combined p value for each gene, we used the inverse-normal method (also known as Stouffer’s z score method) (Stouffer 1949) without weighting. First, z scores for the mouse and human p values (one tailed) were obtained by the inverse of the standard normal cumulative distribution function; then the combined z score was calculated as the sum of the two z scores divided by √2:

$$Z_{\text{C}} = \frac{1}{\sqrt{2}}\left( Z_{\text{MEXP}} + Z_{\text{HGWAS}} \right),$$

where Z_C, Z_MEXP, and Z_HGWAS are the z scores for the combined, mouse expression, and gene-based human GWAS statistics, respectively. Then, the combined p value (one tailed) was obtained from the standard normal distribution using Z_C. Lastly, the combined p value was doubled (two tailed). The R programming language (version 3.5.0) was used for this calculation.
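The calculation described above can be sketched as follows. The original analysis was done in R; this Python version is only an illustration, and it assumes that both inputs are already one-tailed p values. The function name is hypothetical, and the cap at 1 after doubling is an addition for this sketch that the text does not mention.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_combined_p(p_mouse, p_human):
    """Unweighted inverse-normal (Stouffer) combination of two p values.

    Assumes both inputs are one-tailed p values strictly between 0 and 1.
    Returns the doubled (two-tailed) combined p value.
    """
    # z = inverse CDF of (1 - p); by symmetry this equals -inv_cdf(p),
    # which is numerically safer for very small p.
    z_mexp = -_STD_NORMAL.inv_cdf(p_mouse)
    z_hgwas = -_STD_NORMAL.inv_cdf(p_human)
    z_c = (z_mexp + z_hgwas) / sqrt(2.0)
    # One-tailed combined p value: upper tail of the standard normal at z_c.
    p_one_tailed = _STD_NORMAL.cdf(-z_c)
    # Doubling gives the two-tailed value; the cap at 1 is an assumption.
    return min(1.0, 2.0 * p_one_tailed)


# Hypothetical inputs: two one-tailed p values of 0.025 for the same gene.
combined = stouffer_combined_p(0.025, 0.025)
```

Two moderately small p values of 0.025 combine to roughly 0.0056, which shows how concordant but individually modest evidence from mouse and human can yield a much smaller combined p value.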

eQTL analysis

We checked whether the SNPs used in this study for each gene were reported eQTLs, that is, SNPs with alleles associated with the expression level of a gene. For this, we used data in GTEx (The Genotype-Tissue Expression (GTEx) project 2013; https://www.gtexportal.org/home/; version 7; Caucasian) (Aguet et al. 2006).

Results

Previous studies have shown that Aβ accumulation in APP Tg mice with the DBA/2 genetic background was significantly lower than in those with the C57BL/6 and/or SJL backgrounds. This suggests that some genes in DBA/2 suppress Aβ accumulation. To identify these Aβ-controlling genes in DBA/2, we first used non-Tg mice. Using non-Tg mice means that any change in gene expression is based on the genetic background and not on secondary effects caused by Aβ accumulation. In this study, we selected 373 genes whose expression levels were significantly different (Student’s two-tailed t test p < 0.001, FDR = 3.05%) in DBA/2 compared with SJL or C57BL/6 (“Materials and methods”) as potential candidate genes controlling Aβ accumulation.

In addition to these three non-Tg inbred mouse strains, we previously prepared APP Tg mice with mixed genetic background of DBA/2 (lower susceptibility to AD), C57BL/6 and SJL (Morihara et al. 2014). We measured the gene expression profile and levels of Aβ in their brains. In this study, we examined the correlation between the expression levels of these 373 genes and Aβ levels in these APP Tg mice. The p values of these correlations were used for the subsequent integrated analysis.

Gene-based statistics

Using the conventional GWAS approach (Hirano et al. 2015) (811 AD case individuals and 7504 control individuals with 583,884 SNPs on autosomes), we observed six SNPs with genome-wide significance (p < 5.0 × 10⁻⁸) on 19q13, a region that includes the APOE gene (Supplementary Fig. 1a), a well-known risk factor of AD (Saunders and Roses 1993; Saunders et al. 1993a, b), and several adjacent genes. In addition to this strong APOE signal on chromosome 19, there were also a substantial number of SNPs with moderate p values (512 SNPs, p < 0.001), which may include additional causative genes for AD.

To conduct the gene-based integrated analysis with mouse data, we first obtained gene-based statistics from the SNP-based GWAS statistics using the GATES method implemented in the KGG system (Li et al. 2011) (Table 1; “Materials and methods”). With GWAS alone, we did not observe any significant gene other than APOE (p = 2.71 × 10⁻¹⁹) and the surrounding genes in LD with APOE (TOMM40 and PVRL2), although there were additional possible genetic signals of association. Among the 373 candidate genes differentially expressed in the AD-resistant mouse strain, ST6GALNAC4, ARRB1, KCNS1, TNNT1, EBNA1BP2, CSRNP3, and C5orf51 showed the smallest p values (Table 1).

Table 1 Gene-based statistics from human AD GWAS for the 373 genes

As another independent method of obtaining gene-based statistics from SNP-based GWAS, we also conducted SKAT (Ionita-Laza et al. 2013; Wu et al. 2011) with the option of combining common and low-frequency variants. Among the 373 candidate genes differentially expressed in the AD-resistant mouse strain, BOK, ELOVL4, THAP4, ARRB1, ARSJ, TRIM3, and PTPN11 showed the smallest p values (Supplementary Table 2).

We also used GWAS statistics from the International Genomics of Alzheimer’s Project (IGAP) (Lambert et al. 2013) to evaluate the effectiveness of our approach. We selected 954 genes linked to top SNPs with p < 0.001 for examination, 11 of which were in common with the 373 genes selected from the mouse expression experiment data (Supplementary Table 3).

Integrated analysis

First, to evaluate the feasibility of our methodology, we analyzed the IGAP data together with our mouse experiment data. We took the intersection of the two gene sets, 373 genes from the mouse expression analysis and 954 genes that are linked to top SNPs (p < 0.001) in the IGAP dataset (stages 1 and 2 combined), and obtained 11 shared genes. For these genes, we looked at the results from our integrated analysis with GATES (Supplementary Table 3). When combined with the mouse expression data, these genes, which are top hits in IGAP, obtained much better results and also showed significant or moderate p values in our analysis. This result supports the validity of our approach. Therefore, we proceeded to the next analysis: integration of our human GWAS and mouse experiment data for the remaining genes.

Next, we took the results from our GATES analysis of the original GWAS (Hirano et al. 2015) dataset and the mouse expression data and obtained a combined p value for each gene from the two p values (mouse transcriptome analysis and human genetic association) through the inverse-normal method. Five genes showed significant combined p values at a significance level of p < 0.000067 (= 0.05/373/2): LBH (limb bud and heart development), ST6GALNAC4 (ST6-N-acetylgalactosaminide alpha-2,6-sialyltransferase 4), ARSJ (arylsulfatase family, member J), C5orf51, and SHF (Src homology 2 domain-containing F) (Table 2). These five genes had only nominally significant p values in the GWAS alone (gene-based p values ranged from 0.011 to 0.046), and each contained multiple SNPs whose p values differed widely (Supplementary Table 4; Supplementary Fig. 2). However, they were the most significant genes when the human genetic association and mouse transcriptome data were integrated. When we compared our results to the GTEx data, we found that many SNPs, particularly those with p < 0.05 in our human GWAS, are eQTLs linked to LBH and SHF (Supplementary Table 5). Furthermore, they are more relevant in anterior cingulate cortex BA24, cortex, and frontal cortex BA9 tissues, where Aβ accumulation tends to be observed more frequently than in other tissues.
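The significance level quoted above is simple arithmetic, shown here as a short Python sketch together with a filter that keeps genes whose combined p value falls below it. The gene labels and p values in the example are hypothetical placeholders, not results from this study, and the function name is an assumption.

```python
ALPHA = 0.05
N_GENES = 373

# Threshold as written in the text: 0.05 / 373 / 2, about 6.7e-5.
THRESHOLD = ALPHA / N_GENES / 2


def significant_genes(combined_p, threshold=THRESHOLD):
    """Return the gene labels whose combined p value is below the threshold."""
    return sorted(gene for gene, p in combined_p.items() if p < threshold)


# Hypothetical combined p values keyed by placeholder gene labels.
example = {"GENE_A": 1.2e-5, "GENE_B": 6.9e-5, "GENE_C": 0.03}
hits = significant_genes(example)
```

Only the first placeholder gene passes here; a value of 6.9e-5 narrowly misses the threshold of about 6.7e-5.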

Table 2 Top genes in the integrated analysis

Also, as another integrated analysis approach, we integrated p values from our SKAT analysis of the same GWAS and the mouse gene expression analysis for each of the 373 genes (Supplementary Table 6). Nine genes (ARSJ, ELOVL4, THAP4, EXOC2, KLK8, ATXN1, ARRB1, RPS3, and RPAIN) had p < 0.000067 (= 0.05/373/2). Note that these genes have p < 0.05 for both the human GWAS SKAT and mouse gene expression results. Also, by checking GTEx, we found that most of these genes have multiple eQTLs within them (Supplementary Table 7). Within and surrounding the ARSJ gene region, we found multiple promising eQTLs linked to these genes, although none were significant in the GWAS. Most SNPs within and surrounding the ELOVL4 gene region are, interestingly, promising eQTLs of this gene, although this gene itself does not have GWAS hit SNPs. All significant GWAS SNPs near THAP4 are also eQTLs of this gene. For EXOC2, all nearby significant GWAS SNPs are eQTLs, and most SNPs within and surrounding this gene are, interestingly, very strong eQTLs of this gene. For KLK8, we did not observe eQTLs for GWAS hits or SNPs around this gene. The expression level of this gene might be irrelevant in humans, or eQTLs might exist outside of the analyzed region. Although the ATXN1 gene had no significant GWAS SNPs, there are many eQTLs associated with this gene. The ARRB1 gene has a strong overlap between eQTLs and GWAS SNPs with p < 0.05, and RPS3 and RPAIN have several very strong eQTLs, although they did not overlap with the GWAS results.

Gene expression level in human autopsy subjects

To validate biological roles of the five genes identified by our GATES–GWAS and mouse integrated analysis (LBH, ST6GALNAC4, ARSJ, C5orf51, and SHF) in human brain, we examined gene expression level of these genes in the hippocampus of AD patients and control autopsy individuals (sample sizes are 13 and 10, respectively), who were independent of the GWAS subjects. Among the five genes tested, gene expression levels of LBH and SHF were significantly different (FDR < 0.05) (Fig. 2). In both LBH and SHF, gene expression levels were lower in AD patients than control individuals. This observation was in accordance with the expression levels of these genes, which were negatively correlated with the levels of Aβ accumulation in mouse.

Fig. 2

Comparison of gene expression levels in human hippocampus. Gene expression levels for the five significant genes were examined in postmortem human subjects (10 AD patients and 13 control individuals), who were not included in the AD GWAS. The difference in average expression levels between the AD group and the control group was tested with Student’s t test (two-tailed test).

Discussion

To identify genes that cause Aβ accumulation, we conducted an integrated analysis of human genetic association and mouse transcriptome studies, and our results suggest that two genes, LBH and SHF, are novel AD-associated genes. Our results also suggest that the expression levels of LBH and SHF are negatively associated with Aβ accumulation. Both LBH and SHF showed lower expression levels in the hippocampus of pathologically diagnosed AD patients with confirmed excessive Aβ levels than in that of control individuals (Fig. 2). Also, the DBA mouse strain, which suppresses Aβ accumulation (Jackson et al. 2015; Morihara et al. 2014; Sebastiani et al. 2006), had higher gene expression levels of both Lbh and Shf than the other strains. Gene expression levels of both genes in APP Tg mice with mixed genetic backgrounds were negatively correlated with accumulation of Aβ (Table 2).

LBH (limb bud and heart development) is a homolog of mouse Lbh, which is a transcription factor involved in the development of the limb bud and heart (Ai et al. 2008; Briegel and Joyner 2001). LBH was reported as a direct target of the Wnt signaling pathway (Rieger et al. 2010). Though the mechanisms are still unclear, cross-talk between the Wnt pathway and Alzheimer’s disease has been reported (Inestrosa and Arenas 2010). The levels of Wnt signaling in AD patients are low, suggesting that reduced Wnt signaling could be the triggering factor for Aβ production (Inestrosa and Arenas 2010). In a previous GWAS, human LBH was reported to be associated with autoimmune diseases such as rheumatoid arthritis (Okada et al. 2018).

Other information, such as biological pathways and epigenetic data (Gjoneska et al. 2015), would be useful for prioritizing disease-related genes. The detected genes would then provide functional insights that are important for developing therapeutic targets.