Background

Comorbidity, or the co-occurrence of two or more diseases with each other, is a widespread phenomenon with estimates suggesting that 42% of all patients have at least one comorbidity [1]. For example, comorbid metabolic syndrome, presenting at least two diseases from a cluster of metabolic underlying medical conditions, afflicts 47 million people in the US alone [2]. Comorbidities do not occur randomly, and the excess observation of specific disease-disease co-occurrence in clinical records can imply shared underlying pathophysiological mechanisms [3]. Another recent study showed that 120 disease-trait pairs (e.g., acute lymphoblastic leukemia-mean corpuscular volume) with shared genetic architecture using curated gene association studies significantly co-occurred in electronic health records [4].

Big data science approaches have begun to characterize biomolecular mechanisms that lead to cross-disease relationships; however, these have primarily focused on specific disease-disease hypotheses or required biochemically well-described disease-protein associations as input (e.g., using protein-protein interactions [Full size image

Datasets

The study integrates nine data sources (Table 1). GWAS diseases and their reproducible disease-associated SNPs were downloaded from the NHGRI-EBI GWAS catalog. To extract eQTL associations linking SNPs to expressed genes (RNAs), we downloaded files calculated by Fagny and colleagues [17] from the GTEx project V6 [18] spanning 19 tissues. The authors determined cis- and trans-eQTL at p-value< 0.2 for 19 tissues. These files were chosen to conduct a more in-depth analysis of trans-eQTLs. For clinical data, we acquired the Healthcare Cost and Utilization Project (HCUP) claim datasets. Both the HCUP National Inpatient Sample (NIS13) and the HCUP Nationwide Emergency Department Sample (NEDS13) datasets were employed to compute and ensure the reproducibility of comorbidities. For the definition of the disease bundles and for map** purposes, we integrated several phenotype datasets, including EMBL-EBI Experimental Factor Ontology (EFO) [14], the SNOMED-CT, and the Unified Medical Language System (UMLS) MRCONSO file. Finally, other datasets were downloaded as needed for biological study, such as the Single Nucleotide Polymorphism database (dbSNP) [25] for intragenic (within gene coordinates) and intergenic (between genes) categorization, and HapMap [26], 1000 Genomes Project [27], and LDlink [28] for linkage disequilibrium (LD) in case studies.

Table 1 Data sources

Data preprocessing to define disease bundles and map heterogeneous diseases representation

The HCUP clinical datasets use ICD-9-CM diagnosis codes while the genetic GWAS dataset uses heterogeneous descriptive language for disease names that vary even in related studies of the same trait. To bridge the heterogeneous disease names between datasets, as a first step (Fig. 1a), we performed semi-automated curation and normalization of the disease names into the SNOMED-CT concepts. The use of SNOMED-CT, i.e., a comprehensive ontology organized as a directed acyclic graph (DAG), allowed for the automatic calculation of a semantic relatedness/similarity between the diseases by leveraging the SNOMED-CT hierarchy [23]. Thanks to this procedure, all disease names in the GWAS Catalog were first regrouped into SNOMED-CT classes of proper semantic granularity, i.e., disease-bundles (Methods- Creation of the SNOMED-coded disease-bundles from GWAS terms). Similarly, each ICD-9-CM code [29] relating to a disease in the HCUP datasets was also mapped to the corresponding SNOMED-CT code, facilitating the comparison with the diseases represented in the GWAS studies (Methods- Map** HCUP diseases to disease-bundles).

Creation of the SNOMED-coded disease-bundles from GWAS terms

To focus only on diseases from the GWAS Catalog corresponding to disease phenotypes (e.g., removing phenotypes associated with response to therapy or non-disease traits such as skin color), we kept only GWAS traits under the disease branch of the EMBL-EBI EFO, reducing the number of unique traits from 1622 to 533. Next, a first round of curation was performed by physicians, which further reduced these 533 GWAS disease traits to 481 of interest for purposes of the study. Through the disease-EFO map**, we linked (whenever available) each GWAS disease-trait to external terminology IDs, i.e., SNOMED-CT and ICD-9-CM. To automatically augment the coverage of GWAS trait-to-SNOMED-CT map**, we first mapped the EFO-linked external IDs to UMLS Concept Unique Identifiers (CUIs) using the UMLS MRCONSO table and, second, we obtained the final SNOMED-CT list by retaining all the SNOMED-CT IDs covered under each CUI. After the augmented map**, we linked 431 disease terms to at least one SNOMED-CT ID.

Next, starting from the resulting map**, we defined a list of non-redundant and clinically meaningful disease-bundle candidates by applying two criteria: 1) merging pairs of diseases into a disease-bundle if they were mapped to an identical set of SNOMED IDs; and 2) merging pairs of diseases into a disease-bundle if the disease names were identical after removing the ending parenthetical qualifier of the GWAS term, e.g., “Glaucoma” and “Glaucoma (high intraocular pressure)”. For quality control, the disease-bundles underwent iterative curation by a physician, a geneticist, and a clinical informatician. Following this procedure, we finally selected from the GWAS Catalog a total of 238 disease-bundles associated with 6175 SNPs.

Map** HCUP diseases to disease-bundles (coded in SNOMED-CT).

To compare the disease comorbidity obtained using the HCUP datasets with the disease pairs sharing convergent mechanisms via GWAS and eQTL associations, we had to convert the ICD-9-CM diagnosis codes in the HCUP datasets to the previously identified disease-bundles. Since ICD-9-CM is a classification and SNOMED-CT is a nomenclature, our mapped SNOMED-CT concepts are more comprehensive than ICD-9-CM codes [29]. Therefore, we proceeded by identifying all descendant terms of a SNOMED-CT term associated with a GWAS disease-bundle. Next, the most precise correspondence between SNOMED-CT codes and ICD-9-CM codes were identified using the UMLS dataset. Inconsistent SNOMED-CT-to-ICD-9-CM map**s were detected through a final curation by experts. This procedure enabled the map** of 2454 ICD-9-CM codes into 188 bundles that occur in the two HCUP datasets.

Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs (eQTL RNA overlap)

To identify the effect size and significance of shared genetic risk mechanisms between two diseases, we assessed the statistical overlap of expressed RNAs associated with disease-associated SNPs by eQTL studies (eQTL RNA overlap model, proxy for shared transcript mechanisms). As our focus was specifically on interpretable downstream mechanistic insight through the incorporation of eQTL, we investigated beyond the simple model of SNPs or loci shared by two different diseases (eQTL SNP overlap model). Examples of studies on disease-disease relationships through common shared risk loci (e.g., HLA) can be found in [30,31,1b) and derive a statistical significance (p-value) between any pair of diseases for each tissue. P-values were adjusted for multiple comparisons by False Discovery Rate (FDReRNA) using the Benjamini–Hochberg procedure [34].

In our approach, we retained all disease pairs surpassing the stringent cutoff of 0.05 for the FDR values.

Calculation of disease comorbidity based on HCUP

As mentioned, the comorbidity between bundled diseases from two HCUP datasets, the National Inpatient Sample (NIS13) and the Nationwide Emergency Department Sample (NEDS13), were assessed for robust findings. Two directional comorbidities for every pair of disease-bundles were evaluated separately (Fig. 1c). For each pair of disease-bundles, denoted D1 and D2, we tested the following logistic regression models (E: expectation):

$$ E\left( logit\left({D}_1\right)\right)={\beta}_{10}+{\beta}_{11}{D}_2+{\beta}_{12} race+{\beta}_{13} sex+{\beta}_{14} age $$
(1)
$$ E\left( logit\left({D}_2\right)\right)={\beta}_{20}+{\beta}_{21}{D}_1+{\beta}_{22} race+{\beta}_{23} sex+{\beta}_{24} age $$
(2)

The two models correspond to the comorbidity risk of D1 given D2 and D2 given D1 respectively. Covariates (confounders), such as race, sex, and age, were adjusted in both models. β ij are logistic coefficients of each variable to be estimated from the data. We chose not to use alternative methods, such as graphical modeling [35] and LASSO [36, 37] since graphical models are typically used to understand the joint dependence structure for a set of variables, while LASSO is commonly used for regularization when there are large numbers of potential effects in the model.

In implementation, we collected the most specific disease codes (all five ICD-9-CM digits) mapped to the disease-bundles. Then, for each patient, we determined whether the patient has a billing code among the mapped ICD-9-CM codes, through which the statuses for a pair of disease-bundles were determined. From all patients without missing required data (e.g., missing sex information), the parameters within the two models were estimated and the significance of the directional comorbidity (β11 and β21) was inferred. We adjusted multiple comparisons from both models and across all pairs of disease-bundles using Benjamini-Hochberg procedure to derive the False Discovery Rate [34], which derived two directional FDRs for each pair of disease-bundles (e.g., disease A may cause disease B and not the reverse). Disease pairs with FDR < 0.05 in either directional tests were considered as comorbidities. Larger odds ratio (OR) were estimated from the power of two coefficients (β11 and β21) as the OR of comorbidity between each pair of diseases. These calculations were conducted in both HCUP datasets separately and the best p-value is reported in Additional file 1 as FDRcomorbidity. We further confirmed the ORs from an additional bivariate logistic model within the Zelig R package for the retained disease pairs, where the bivariate logistic model is unidirectional and able to estimate the dependence of the two disease variables subjected to covariates, yielding a mean odds ratio between a pair of diseases [38].

Comparative studies between eQTLs and HCUP

To verify the hypothesis that diseases sharing convergent downstream are more likely to show comorbidities and that an association exists between disease comorbidity and genetic/genomic architectures, we investigated the concordance between disease pairs showing significant downstream eQTL convergence in GWAS (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs) and the pairs of diseases resulted prioritized as comorbid using the HCUP clinical data (Methods- Calculation of disease comorbidity based on HCUP). To this end, we performed a FET to test the enrichment of comorbid disease-pairs with the disease pairs sharing eQTL mechanisms (FETfinal, Fig. 1d). In the FETfinal test, we examined the number of disease pairs resulted statistically significant (FDR < 0.05)/not significant in the molecular dataset and in the clinical comorbid data. Therefore, the FETfinal test was conducted by counting the number of disease pairs under each of the four combinative conditions (contingency table shown in Fig. 1d). The signal robustness was verified across different conditions and datasets, different FDR cutoffs (ranging from 0.02 to 1) were evaluated for both comorbidity and eQTL RNA overlap, and the reproducibility and the enrichment trend were examined with respect to the strength of the cutoffs (Fig. 2). The reproducibility from multiple HCUP datasets was also examined using the comorbidity observed in multiple datasets.

Fig. 2
figure 2

Convergent downstream genetic mechanisms predicted from shared eQTL RNA between disease-pairs are enriched among comorbidities observed in clinical datasets. Vertical axis = odds ratio of overrepresentation of shared molecular mechanisms among clinical comorbidities (Results-Convergent genetic mechanisms between disease-pairs are enriched among comorbid disease). Left bottom axis = FDR cutoffs of comorbidities found in the HCUP clinical datasets (OR > 3; Results- Prioritized comorbidities), right bottom axis = FDR cutoffs of shared molecular mechanisms discovered between two diseases

In addition, we performed a further validation applying the FET test as previously described but excluded the disease pairs involving similar diseases. This way we could assess the significance of diseases not similar resulted sharing convergent mechanisms as well as comorbid.

To identify similar pairs, we computed a clinical-ontology-based semantic similarity (or distance) between the disease-bundle pairs applying Lin’s similarity metric [39] with Sanchez et al.’s information content estimation [40]. This approach takes into account the SNOMED-CT ontological structure, and SNOMED-CT are used as proxy for phenotypic relatedness of diseases. For each bundle pair, a score between 0 and 1 was derived. We considered two diseases similar if their similarity score was higher than 0.9.

Network visualization of the comorbidities sharing intergenic genetic risks through eQTL RNA overlap

First, we created a network representing disease pairs (Fig. 1e) including (i) disease pairs sharing genetic mechanism through eQTL RNA overlap of eQTL associations to disease-associated SNPs (Methods- Statistical overlap of eQTL-associated RNAs between distinct disease-associated SNPs), and (ii) disease comorbidities in at least one of the two HCUP datasets (Methods - Calculation of disease comorbidity based on HCUP). Here, network nodes represent diseases and two nodes (diseases) are linked (network edge) if the related disease pair meet both the above-mentioned comorbidity and eQTL RNA overlap criteria. To every edge, we assigned a weight corresponding to the number of distinct tissues that yielded the significance of the disease pairs. We colored the nodes according to the clinical organ-system classes as defined by Han et al. [19] and adjusted edge width according to the edge-related weight value.

Second, for any interesting overlap** disease pairs, we built a related network to represent the biomolecular mechanisms underlying the comorbidity between the two diseases. Therefore, a network was built for each comorbid pair by connecting each disease in the pair to the related SNPs via GWAS and then associating each SNP to the overlap** RNAs resulted via the eQTL associations. Nodes of the resulting network can represent a disease, a SNP, or a RNA, and edges can correspond to disease-to-SNP or SNP-to-eQTL RNA associations. The biomolecular network of a comorbid disease pair therefore includes the common downstream RNAs (prioritized at FDR < 0.05) between the corresponding prioritized and disease-associated SNPs (Methods- Calculation of disease comorbidity based on HCUP). Irrelevant information for the comorbidity (e.g., insignificant eQTL RNA overlap) that can be derived from other edges is not shown. Note, we grouped as a single locus (LD ≥ 0.8) the SNP pairs associated with the same disease and that are in Linkage Disequilibrium (LD). All networks were visualized using Cytoscape [41].

Curation of prioritized comorbidities

For the comorbidities discovered from HCUP datasets (FDR < 0.05, OR > 3) that have convergent eQTL downstream genes, we conducted a systematic curation of the literature using PUBMED and Google Scholar (Fig. 1f). Disease names were searched in PUBMED and Google Scholar and abstracts of the resulted papers were checked for comorbidity evidence. Full texts were examined if a conclusion of comorbidity could not be concluded in the abstracts. For quality control, three independent curators carried out the curation and resolved. In addition, 15% of random disease pairs (controls), selected among pairs with no comorbidity in either HCUP datasets nor convergent eQTL mechanisms in any GTEx tissue, were added to the curation list (blind to curators). An inter-rater agreement was computed using the Spearman correlation test, while disagreement was thereafter solved under the supervision of an expert physician. Next, we categorized the resulting curation evidence of a disease pair into six levels: 1 = well-performed controlled studies confirming the positive comorbidity, 2 = evidence from studies with important limitation (e.g., small sample size), 3 = anecdotal case reports, 4 = strong evidence of absence of association (well-controlled studies, no significance), 5 = absence of studies in the literature, and 6 = strong evidence for an anti-correlation or non-coexistence of the diseases. Finally, to assess our results, we compared the frequencies of levels of evidence confirming the “prioritized comorbidities sharing molecular mechanisms” with those of the random disease-pair controls using the Chi-square test since multiple levels were involved in the test.