Introduction

Suicide death is a major public health problem and a leading cause of death [1]. Complex and heterogeneous risk factors for suicide death include environmental exposures, comorbid clinical conditions, and genetic variation [1,2,3,4,5]. Accumulated evidence suggests that genetic factors play a critical role in suicide risk, with heritability estimated at 30–55% from twin and family studies [6, 7]. Thus, genetic investigations could advance our understanding of the biological basis of suicide risk, leading to the development of more effective prevention strategies.

Well-powered large-scale genome-wide association studies (GWAS) have begun to identify genetic variants significantly associated with suicidal thoughts and behaviors, including death [1, 8,9,10]. Additional independent GWAS have also identified several potential genetic susceptibility loci for suicidal behaviors in genes including NCAN [9] and SOX5 [1] that are related to psychiatric conditions (e.g., schizophrenia and depression). Although GWAS have aided in identifying suicide-related genetic loci, how these loci contribute to suicide risk remains elusive [11].

Regulation of gene expression is critical for brain function [12, 13], with widespread dysregulated gene expression observed in psychiatric disorders associated with suicide [14,15,16]. For instance, a previous study reported that five key genes related to psychiatric diseases have decreased brain expression in individuals who died by suicide [17]. The vast majority of disease-associated genetic variants from human disease GWAS are located in non-coding regulatory regions, some of which may be associated with gene expression, which represent expression quantitative trait loci (eQTLs) [18]. That is, suicide-risk associated single nucleotide polymorphisms (SNPs) may play a significant role in risk of suicide by influencing gene expression in the brain as eQTLs, potentially leading to altered behavior or dysregulating other complex processes.

Integrative studies of GWAS and eQTLs have proven to be a powerful approach to identify novel genetic susceptibility loci with modest effects on various complex diseases [19,20,21,22,23]. The stringent significance thresholds required for GWAS to avoid detecting false positive genetic loci due to multiple testing limit genetic discovery to SNPs with small-to-moderate effects on complex diseases, potentially missing heritability [22]. Genomic association tests prioritizing eQTLs in regulatory regions can be useful in increasing analytic power and allowing discovery of actual mechanisms of risk through investigating only the subset of genome-wide SNPs that are associated with changes in gene expression [22]. The eQTL SNPs can play critical roles in complex trait phenotypes. Indeed, studies of psychiatric disorders integrating GWAS and eQTLs have successfully identified novel genetic loci that were not detected with GWAS alone (e.g., major depressive disorder (MDD) and schizophrenia) [20,

Utah suicide death cohort ascertainment

The Utah Suicide Genetic Risk Study (USGRS) has a sample of >8000 DNA samples from population-ascertained suicide deaths. Suicide deaths have been ascertained through a long-term collaboration with the centralized statewide Utah Office of the Medical Examiner (OME). DNA was extracted from whole blood using state-of-the-art methods (https://ctsi.utah.edu/cores-and-services/ctrc/dna-extraction-facility). This study was approved by Institutional Review Boards from the University of Utah, Intermountain Health, and the Utah Department of Health and Human Services.

Phenotypic electronic health records (EHR) data

Identifiers from suicide deaths were securely transferred from the OME directly to personnel at the Utah Population Database (UPDB, https://uofuhealth.utah.edu/huntsman/utah-population-database). The UPDB is a state-wide database that contains records on over 12 million individuals, including demographics, two decades of health records data, and deep genealogical data. After linking suicide deaths, identifiers were stripped before data were given to the research team to protect privacy and confidentiality. Linked diagnostic electronic health records were from statewide inpatient and ambulatory care encounters through Utah State Health Department records in addition to data from outpatient encounters from the largest two clinical data providers in the state (University of Utah Healthcare and Intermountain Health), representing ~85% of the state’s outpatient encounters. The inpatient and outpatient International Classification of Diseases (ICD-9; https://www.cdc.gov/nchs/icd/icd9.htm and ICD-10; https://www.cdc.gov/nchs/icd/icd10cm.htm) codes were curated within the UPDB to eliminate duplication. For efficient characterization of diagnoses, we collapsed the diagnostic data into interpretable categories using hierarchical classification derived through expert clinical adjudication (Drs. Keeshin, Docherty, and Monson). For this study, we included categories with prior evidence for association with suicide risk (alcohol related disorders, asthma, anxiety, neurodegenerative disorders, bipolar disorder, depression in a broad and narrow sense, all drug related disorders, specific opioid misuse, eating disorders, schizophrenia, pain, sleep disorders, and suicidal ideation).

Whole-genome sequence data of suicide deaths and controls

WGS data were generated on 1053 Utah suicide deaths using Illumina NGS technology with an average read depth of at least 20×. Alignment, variant calling, and joint genotyping of suicide death and control WGS datasets were performed at the Utah Center for Genetic Discovery (UCGD) Core Facility, part of the Health Sciences Center Cores at the University of Utah. The UCGD pipeline called variants using the Sentieon software package [31], which incorporates GATK best practices [32]. Sequence reads were aligned to GRCh38 (Genome Reference Consortium Human Build 38) using BWA-MEM (Burrows-Wheeler Aligner) [33]. The Haplotyper algorithm in Sentieon was used to produce genomic Variant Call Format (gVCF) files. Suicide death gVCF files were combined and jointly genotyped with 1241 control samples from three sources. Six hundred and twenty-two individuals were from the 1000 Genomes Project cohort (1000G) [34]. Five hundred and twelve individuals were from multigenerational Centre d’Etude du polymorphisme humain (CEPH) families [35]. Ninety-six individuals were from a study of longevity of healthy elderly individuals from Utah [36]. The final VCF file with suicide deaths and controls was recalibrated to limit false positive calls.

Ancestry estimation and sample relatedness

We confined our analyses to unrelated suicide deaths and controls that had estimates of at least 90% non-Finnish European (NFE) ancestry. This threshold represents a conservative ancestry estimate, as most USGRS samples are predominantly European. We estimated the ancestry of the samples as a composition of five ancestries (European, African, East Asian, Native American, South Asian) using the 1000 Genomes Project data (https://www.internationalgenome.org/data/) as a reference. We used a modified version of the pipeline presented by Giulio Genovese at https://github.com/freeseek/kgp2anc. First, our dataset was combined with the 1000G phase 3 dataset. SNPs were then pruned using the “--indep-pairphase” command in PLINK 1.9 [37]. PCA was run on the set of pruned SNPs with PLINK 2.0 [38]. Using the known estimated ancestry for AMR [34] and presumed ancestry for most other samples as the basis, we estimated the ancestry of every other sample as a combination of the five known ancestries using linear regression on the space of the top 10 PCs with Mahalanobis distance defined by those top 10 PCs. Estimates of pairwise identity by descent (IBD) were calculated using PLINK 1.9. Pairs of related individuals (third degree or closer) were identified by pi-hat values greater than 0.12. One member of each identified related pair was randomly removed. After filtering, our dataset included 986 suicide deaths and 415 control samples (1000G 332, longevity 61, CEPH 22).
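The relatedness filter described above can be sketched in a few lines. This is a minimal illustration in Python, not the study's pipeline (which used PLINK for the IBD estimates); the function name prune_related, the tuple layout of the pair list, and the fixed random seed are assumptions made for the example.

```python
import random

def prune_related(samples, ibd_pairs, pi_hat_cutoff=0.12, seed=0):
    """Drop one randomly chosen member of each pair with pi-hat above the cutoff.

    samples   : iterable of sample IDs
    ibd_pairs : iterable of (id_a, id_b, pi_hat) tuples (hypothetical layout)
    """
    rng = random.Random(seed)
    kept = set(samples)
    for id_a, id_b, pi_hat in ibd_pairs:
        # Only act on pairs that exceed the cutoff and are still both present.
        if pi_hat > pi_hat_cutoff and id_a in kept and id_b in kept:
            kept.discard(rng.choice([id_a, id_b]))
    return kept

# Toy input: s1-s2 and s2-s5 are related, s3-s4 are not.
pairs = [("s1", "s2", 0.50), ("s3", "s4", 0.05), ("s2", "s5", 0.25)]
unrelated = prune_related(["s1", "s2", "s3", "s4", "s5"], pairs)
```

Because removal is random, the number of retained samples can depend on which member of an early pair is dropped; a pipeline that wants to maximize sample size might instead remove the individual involved in the most related pairs.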

PsychArray genotyping data for confirmation analyses

Additional independent suicide deaths (n = 4657) were genotyped using the Illumina Infinium PsychArray platform (https://www.illumina.com/techniques/microarrays/array-data-analysis-experimentaldesign/genomestudio.htm), which assesses 593,260 single nucleotide polymorphisms (SNPs). Generation, processing, quality control and imputation of genotyping array data from suicide deaths in USGRS has been previously described [1, 5, 9]. We explored the imputed array data to confirm the results of our genomic analysis with WGS data using analysis methods described below.

Brain eQTL data

Comprehensive brain eQTL data analyzed in this study were derived from the GTEx database (Supplementary Table S1). GTEx is a public resource for the study of gene expression and its regulation, built on WGS, whole-exome sequencing, and RNA-seq data [39]. It provides a comprehensive eQTL resource covering 54 healthy tissue sites throughout the human body, including the brain, from approximately 1000 individuals. More detailed information on these data is given in the original study. We considered eQTLs statistically significant at a false discovery rate (FDR)-adjusted p-value < 0.05 for each of 13 brain regions, as described in Fig. 1.
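As an illustration of an FDR criterion of this kind, the sketch below computes adjusted p-values with the Benjamini-Hochberg procedure. The text does not name the adjustment procedure, so the choice of Benjamini-Hochberg is an assumption, and the p-values are invented toy numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_vals = [0.001, 0.008, 0.039, 0.041, 0.60]   # toy values
q_vals = benjamini_hochberg(p_vals)
significant = [q < 0.05 for q in q_vals]
```

In this toy example only the first two tests pass the 0.05 threshold, even though four raw p-values are below 0.05.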

Annotation of regulatory regions

To obtain eQTLs in regulatory regions, we integrated 13 histone modification ChIP-seq datasets (i.e., H2AFZ, H3F3A, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me2, H3K9me3, and H4K20me1), ATAC-seq, and DNase-seq data processed by the Encyclopedia of DNA Elements (ENCODE) project [40]. We first searched for and downloaded experimental results in BED file format for narrow peaks observed in the human brain histone modification data described in the ENCODE project. These peaks carry information on chromatin structure dynamics that marks regulatory regions. Furthermore, we incorporated high-throughput chromosome conformation capture (Hi-C) data, which capture genome-wide chromatin interactions in cell nuclei, to annotate enhancer regions, i.e., regulatory regions distal from transcription start sites [41]. We obtained comprehensive Hi-C results for various cell types, including the brain, from the 3D genome browser, which collects independent studies of chromatin conformation (Hi-C) data [42]. Finally, we annotated robust regulatory regions by overlapping ENCODE histone modification peaks and enhancer regions. We included eQTLs within these annotated regulatory regions as the association test set in this study.
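The overlap step can be sketched as a simple interval test. This is an illustrative Python sketch only: it assumes BED-style 0-based half-open intervals, treats a "robust" regulatory position as one covered by both a histone peak and an enhancer region, and uses invented coordinates and hypothetical function names. A real analysis would use an indexed interval tool rather than a linear scan.

```python
def overlaps(pos, intervals):
    """True if a 0-based position falls inside any half-open [start, end) interval."""
    return any(start <= pos < end for start, end in intervals)

def eqtls_in_regulatory_regions(eqtls, peaks, enhancers):
    """Keep eQTLs lying in both a histone peak and an enhancer region.

    eqtls     : list of (chrom, pos, snp_id)
    peaks     : dict chrom -> list of (start, end)
    enhancers : dict chrom -> list of (start, end)
    """
    return [
        snp_id
        for chrom, pos, snp_id in eqtls
        if overlaps(pos, peaks.get(chrom, []))
        and overlaps(pos, enhancers.get(chrom, []))
    ]

# Toy coordinates, not real annotations.
peaks = {"chr1": [(100, 200), (500, 600)]}
enhancers = {"chr1": [(150, 550)]}
eqtls = [("chr1", 120, "snpA"), ("chr1", 180, "snpB"),
         ("chr1", 520, "snpC"), ("chr2", 180, "snpD")]
selected = eqtls_in_regulatory_regions(eqtls, peaks, enhancers)
```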

Single genetic association test

Our primary analysis in this study used WGS data. Although the WGS dataset includes fewer samples than the genotyping array data, it provides much higher resolution and covers nearly all possible eQTLs, including those in regulatory regions. An unconditional logistic regression model (a generalized linear model, GLM) was fitted in R to test each eQTL from the WGS data for association with suicide death, estimating p-values, odds ratios (ORs), and 95% confidence intervals (CIs). This association test used an additive effect model, adjusting for sex and ancestry principal components (PCs) to account for possible residual effects of population stratification and genomic relatedness. We tested only eQTLs with biallelic genotypes and minor allele frequency (MAF) > 0.05. We excluded any eQTLs for which genotypes were missing in >10% of individuals (missing call rate > 0.1). Furthermore, for each association test, we retained genotypes only from individuals with average read depth 20 and genotype quality score (GQ) > 30.
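The variant-level filters and the conversion from a logistic regression coefficient to an OR with a 95% CI can be sketched as follows. The study fitted its models in R; this Python sketch only illustrates the arithmetic, and the function names, the additive 0/1/2 genotype coding, and the Wald-type interval (z = 1.96) are assumptions for the example.

```python
import math

def passes_qc(genotypes, maf_min=0.05, max_missing=0.10):
    """Apply call-rate and MAF filters to one biallelic variant.

    genotypes : list of alternate-allele counts (0, 1, 2), None when missing
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if missing_rate > max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf > maf_min

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

common = [0, 1, 2, 1, 0, 1, 0, 0, 1, None]   # 1 of 10 missing, MAF = 6/18
rare = [0] * 19 + [1]                         # MAF = 1/40
or_, ci_low, ci_high = odds_ratio_ci(beta=0.0, se=0.1)
```

Under the additive coding, the coefficient for the genotype term is the change in log-odds of suicide death per additional copy of the alternate allele, so its exponential is the per-allele OR.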

After association tests for all eQTLs, we obtained significant index eQTLs with a statistical criterion (FDR < 0.1) after LD clumping, which retained the eQTL with the lowest p-value in each linkage disequilibrium (LD; r2 0.6) block. Next, to verify eQTLs associated with suicide death, we additionally explored genotyping array data from independent USGRS suicide deaths and an independent control sample from the Genome Aggregation Database (gnomAD; v3.1.2) [43]. gnomAD contains aggregated frequency data from various large-scale WGS reference studies including 76,156 whole genomes [43]. We assessed whether (1) allele frequencies of suicide-eQTLs identified by WGS data were consistently different between suicide deaths and controls in the array data, and (2) suicide-eQTLs found in both WGS and array data were also consistently replicated using gnomAD control frequency data. Since gnomAD provides only allele frequencies of the aggregated WGS data, individual genotypes and demographic information were not available from this source. The gnomAD frequencies were calculated from individuals of non-Finnish European ancestry deemed non-neuropsychiatric (NFE-NN), to avoid possible confounding from individuals of other ancestries and/or individuals with neuropsychiatric conditions.
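LD clumping of this kind can be sketched as a greedy procedure: take the variant with the lowest p-value as an index variant, drop everything in LD with it, and repeat. The Python sketch below is an illustration under stated assumptions, not the tool used in the study: variants with pairwise r2 at or above the threshold are treated as belonging to the same block, pairs with no recorded r2 are treated as unlinked, and all identifiers and values are invented.

```python
def ld_clump(p_values, r2, r2_threshold=0.6):
    """Greedy clumping: keep the lowest-p variant of each LD group.

    p_values : dict snp_id -> p-value
    r2       : dict frozenset({snp_a, snp_b}) -> pairwise r^2 (missing pairs = 0)
    """
    index_snps = []
    remaining = sorted(p_values, key=p_values.get)
    while remaining:
        lead = remaining.pop(0)
        index_snps.append(lead)
        # Remove variants in LD with the lead SNP from further consideration.
        remaining = [
            s for s in remaining
            if r2.get(frozenset((lead, s)), 0.0) < r2_threshold
        ]
    return index_snps

# Toy values: rs1/rs2 are in high LD, as are rs3/rs4.
p_values = {"rs1": 1e-6, "rs2": 1e-4, "rs3": 1e-5, "rs4": 0.01}
r2 = {frozenset(("rs1", "rs2")): 0.9,
      frozenset(("rs3", "rs4")): 0.7,
      frozenset(("rs1", "rs3")): 0.1}
index_snps = ld_clump(p_values, r2)
```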

Gene-based analysis using Sherlock integrative analysis

We performed genomic analyses to identify suicide-associated eQTLs in regulatory regions that potentially confer suicide risk by affecting expression of their target genes. The Sherlock integrative framework explores potentially causal relationships between gene expression affected by eQTLs and disease. This strategy has previously identified novel gene associations with psychiatric disorders [15, 44]. The method integrates summary-level eQTL results and SNP association signals from genomic analyses through a Bayesian statistical framework. We used the Sherlock integrative analysis to further evaluate the relationship between suicide risk and gene expression affected by eQTLs by integrating our genetic association results with GTEx eQTL results. For each gene, the Sherlock integrative analysis tool provides a score, the LBF (logarithm of the Bayes factor, which estimates the probability of a gene-suicide relationship), and a p-value. A positive LBF indicates that a specific gene affected by eQTLs is likely associated with suicide risk, while a negative LBF suggests that the gene is not associated. For each genomic analysis result from WGS and array data, we identified genes associated with suicide based on the criteria of LBF > 0 and p < 5e−3. We then retained only HUGO protein-coding genes for which results replicated across WGS and array data.
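The gene selection rule (LBF > 0 and p < 5e−3 in each dataset, replicated across WGS and array data, protein-coding only) amounts to filtering and intersecting sets. The sketch below shows that logic in Python; the gene names, scores, and the dictionary layout are invented for illustration and do not reflect Sherlock's actual output format.

```python
def sherlock_hits(results, lbf_min=0.0, p_max=5e-3):
    """Genes with LBF > lbf_min and p < p_max.

    results : dict gene -> (lbf, p_value)  (hypothetical layout)
    """
    return {g for g, (lbf, p) in results.items() if lbf > lbf_min and p < p_max}

# Toy results for the two genotype sources.
wgs = {"GENE_A": (4.2, 1e-4), "GENE_B": (-1.0, 1e-4), "GENE_C": (2.0, 2e-3)}
array = {"GENE_A": (3.1, 3e-3), "GENE_C": (1.5, 0.02), "GENE_D": (5.0, 1e-5)}
protein_coding = {"GENE_A", "GENE_C", "GENE_D"}

# Keep protein-coding genes that pass both criteria in both datasets.
replicated = sherlock_hits(wgs) & sherlock_hits(array) & protein_coding
```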

Expression analysis of suicide susceptibility genes

The Sherlock integrative analysis method discovers trait-associated genes that have a predicted causality through the linkage between gene expression changes and suicide risk. Therefore, gene expression analysis of suicide deaths compared with control samples could theoretically allow us to verify the genes identified by this gene-based analysis.

We used RNA-seq data from individuals with different psychiatric disorders, drawn from two independent datasets: (1) PsychENCODE [45, 46], including brain samples of autistic individuals (autistic, n = 43) and non-autistic matched controls (n = 65) and of Caucasian individuals with bipolar disorder (BD, n = 145), schizophrenia (SCZ, n = 346), and BD-SCZ matched controls (n = 559), and (2) the Korean mental health (KMH) disorder genomics study [16], comprising whole-blood samples of individuals with major depressive disorder (MDD, n = 39), suicide attempters (SA, n = 56), and healthy controls (n = 87) (Supplementary Table S2). PsychENCODE provides a public resource of transcriptomic data by aggregating RNA-seq generated from different projects. We analyzed the ASD and matched control data generated from the UCLA-autism spectrum disorder (ASD) and Yale-ASD projects, and the BD, SCZ, and matched control data generated from the BrainGVEX, CMC, and CMC-HBCC projects. For PsychENCODE expression data, we downloaded and analyzed the normalized expression matrix file, based on fragments per kilobase of exon per million mapped fragments (FPKM) values, provided by the PsychENCODE database. For the second dataset (KMH), we obtained raw fastq files of all samples, which were individually mapped to the human reference genome (GRCh38). Next, gene expression was estimated as TPM values using RSEM (v.1.3.0) [47]. We then compared expression levels for each group with controls (ASD vs. control, BD vs. control, SCZ vs. control, MDD vs. control, and SA vs. control) using logistic regression with sex and age as covariates. Project of origin (e.g., BrainGVEX and CMC) was additionally included as a covariate to avoid potential bias from different studies. We defined differentially expressed genes as statistically significant at FDR < 0.05.

In addition, we investigated transcriptomic expression datasets measured from brains of individuals who died by suicide, from two independent cohorts (Supplementary Table S2): (1) transcriptomic array data measured from four different brain regions of suicide deaths and deceased controls, with 10 suicide deaths and 7 controls for each of the amygdala, prefrontal cortex, and thalamus regions, and 9 suicide deaths and 7 controls for the hippocampus region (GEO id: GSE66937), and (2) RNA-seq data of suicide deaths (n = 21) and controls (n = 29) (GEO id: GSE101521 [14]). For the array data, we downloaded and analyzed normalized expression values. The RNA-seq data were processed with the GRCh38 human reference genome using the same methods as for the KMH dataset, described above. Due to the relatively small sample size, we considered significant differentially expressed genes to be those with p-value < 0.05 as determined empirically through 1000 repeated randomizations of the data.
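An empirical p-value from repeated randomization can be sketched as a label-permutation test. The text does not specify the test statistic, so the absolute difference in group means used below is an assumption, as are the function name, the fixed seed, the add-one correction, and the toy expression values.

```python
import random

def permutation_p_value(cases, controls, n_perm=1000, seed=0):
    """Two-sided empirical p-value for a difference in group means."""
    rng = random.Random(seed)
    n_cases = len(cases)
    observed = abs(sum(cases) / n_cases - sum(controls) / len(controls))
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_cases, perm_controls = pooled[:n_cases], pooled[n_cases:]
        diff = abs(sum(perm_cases) / n_cases
                   - sum(perm_controls) / len(perm_controls))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)

# Toy expression values for one gene (10 suicide deaths, 7 controls).
suicide_expr = [5.1, 5.4, 5.0, 5.6, 5.3, 5.2, 5.5, 5.1, 5.4, 5.3]
control_expr = [6.0, 6.2, 5.9, 6.1, 6.3, 6.0, 6.2]
p_emp = permutation_p_value(suicide_expr, control_expr)
```

With 1000 randomizations the smallest attainable value under this correction is 1/1001, which bounds the resolution of the empirical p-value.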

Investigation of demographic and phenotypic characteristics of suicide death samples with suicide-risk genetic variants

To further evaluate if there were clinically relevant characteristics in suicide deaths associated with identified genetic variants, such as a specific suicide subtype, we explored the International Classification of Diseases (ICD) diagnostic codes (ICD-9/ICD-10) in EHR data of our analyzed individuals who died from suicide. Details of cohorts that have EHR data are presented in Table 1. We characterized psychiatric phenotypes by aggregating ICD codes in EHR data as previously described [9] for relevant exposures and psychiatric diagnoses. We compared demographic and diagnostic information between suicide deaths with and without any of the genetic findings identified from the previous analyses.

Table 1 Demographic and clinical information for Utah suicide deaths analyzed in this study.

Sex differences

Since gene expression differences in the brain in psychiatric phenotypes and suicide deaths have been characterized by substantial sex differences [48], we performed a secondary expression analysis stratified by sex in the two psychiatric disorder datasets and the two suicide death (SD) datasets to identify additional genes differentially expressed specifically in females or in males. We defined male-specific genes as those with FDR < 0.05 in males but > 0.05 in females, and female-specific genes as those with FDR < 0.05 in females but > 0.05 in males.
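The sex-specific definitions translate directly into a pair of set comprehensions. The Python sketch below uses invented gene names and FDR values, and it assumes that only genes tested in both sexes are classified.

```python
def sex_specific_genes(fdr_male, fdr_female, alpha=0.05):
    """Split genes into male-specific and female-specific sets by FDR.

    fdr_male, fdr_female : dict gene -> FDR from the sex-stratified analyses
    """
    genes = fdr_male.keys() & fdr_female.keys()
    male_specific = {g for g in genes
                     if fdr_male[g] < alpha and fdr_female[g] > alpha}
    female_specific = {g for g in genes
                       if fdr_female[g] < alpha and fdr_male[g] > alpha}
    return male_specific, female_specific

# Toy FDR values: GENE_C is significant in both sexes, so it is in neither set.
fdr_male = {"GENE_A": 0.01, "GENE_B": 0.30, "GENE_C": 0.02}
fdr_female = {"GENE_A": 0.40, "GENE_B": 0.03, "GENE_C": 0.01}
male_only, female_only = sex_specific_genes(fdr_male, fdr_female)
```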