Abstract
Background
Identification of functional non-coding variants and their mechanistic interpretation is a major challenge of modern genomics, especially for precision medicine. Transcription factor (TF) binding profiles and epigenomic landscapes in reference samples allow functional annotation of the genome, but do not provide ready answers regarding the effects of non-coding variants on phenotypes. A promising computational approach is to build models that predict TF-DNA binding from sequence, and use such models to score a variant’s impact on TF binding strength. Here, we asked if this mechanistic approach to variant interpretation can be combined with information on genotype-phenotype associations to discover transcription factors regulating phenotypic variation among individuals.
Results
We developed a statistical approach that integrates phenotype, genotype, gene expression, TF ChIP-seq, and Hi-C chromatin interaction data to answer this question. Using drug sensitivity of lymphoblastoid cell lines as the phenotype of interest, we tested if non-coding variants statistically linked to the phenotype are enriched for strong predicted impact on DNA binding strength of a TF and thus identified TFs regulating individual differences in the phenotype. Our approach relies on a new method for predicting variant impact on TF-DNA binding that uses a combination of biophysical modeling and machine learning. We report statistical and literature-based support for many of the TFs discovered here as regulators of drug response variation. We show that the use of mechanistically driven variant impact predictors can identify TF-drug associations that would otherwise be missed. We examined in depth one reported association—that of the transcription factor ELF1 with the drug doxorubicin—and identified several genes that may mediate this regulatory relationship.
Conclusion
Our work represents initial steps in utilizing predictions of variant impact on TF binding sites for discovery of regulatory mechanisms underlying phenotypic variation. Future advances on this topic will be greatly beneficial to the reconstruction of phenotype-associated gene regulatory networks.
Similar content being viewed by others
Background
A major open problem today is how differences in DNA sequence, e.g., single-nucleotide polymorphisms (SNPs) and variants (SNVs), lead to health-related and other phenotypic differences among individuals. A common approach is to find polymorphisms/variants that are statistically correlated with phenotypic differences, as in genome-wide association studies (GWAS) [1], family-based association tests [2], and expression quantitative trait loci (eQTLs) [3, 4] for phenotype-related genes. However, statistically identified variants may not be functionally related to phenotypes [5], due to a variety of factors including linkage disequilibrium (LD). This problem is particularly pronounced in the case of non-coding variants, which represent the vast majority of GWAS findings [6, 7] and often function by influencing gene regulation. Accurate contextual information about non-coding variants can improve our ability to disambiguate variants causally related to gene expression and phenotype [8, 9] from nearby non-functional SNPs. For example, if we have prior knowledge of a relevant transcription factor (TF), then the presence of a variant within a TF binding site (TFBS) may add to our confidence in the variant’s regulatory potential; the assumption here is that such a variant influences the TF’s binding to that site and consequently the gene regulatory impact of the TF. Advanced techniques for predicting in vivo TF-DNA binding potential from DNA sequence (gkm-SVM [10, 11], DeepBind [12], DeepSEA [13], DeFine [14], and Sasquatch [15]) can facilitate this approach by providing more accurate estimates of a variant’s impact on TF binding. In addition to providing a means for statistically prioritizing those non-coding variants by their likelihood of functionality, this strategy also offers a mechanistic explanation about their function, i.e., their impact on the TF-gene regulatory relationship. For example, Zhang et al. [5] adopted such a strategy: they combined a method for predicting changes in TF binding with multi-omics data to identify a SNP that impacts the binding strength of a TF called GATA3 to modulate breast cancer susceptibility.
It must be noted, however, that the above approach to identify phenotype-related non-coding variants along with their regulatory mechanism is still in its infancy and its sensitivity-specificity tradeoff is not well understood. Reliable mechanistic claims of a SNP mediating a TF’s influence on phenotypic variation often require multiple lines of “-omic” evidence as well as prior knowledge. A related but less-explored opportunity is to examine a collection of variants associated with a phenotype (e.g., from a GWAS study) and test the collection for enrichment of variants predicted to impact TF-DNA binding; such an enrichment can associate the TF, rather than individual variants, with the phenotype. This may give us mechanistic insights of a more global nature, with greater confidence than what the available data allows at the level of individual SNPs. In recent work, we adopted this general strategy to identify transcription factors regulating phenotypic variation across individuals, by combining genotype, gene expression, and phenotype information with genome-wide profiles of TF-DNA binding. The underlying principles were twofold: (1) If a gene’s expression is correlated with phenotype, and a SNP correlated with that gene’s expression (eQTL of the gene) is located within a TFBS, we treated this as (weak) evidence that the TF influences the phenotype via that gene; (2) if such evidence is observed significantly many times, i.e., across many genes, we hypothesized that the TF plays an important regulatory role in phenotypic variation. The assumption is that TF binding is affected by the SNP and this effect underlies the SNP’s correlation with gene expression, which in turn contributes to phenotypic variation. We pursued this line of reasoning in [16, 17] to systematically identify, through statistical testing and probabilistic graphical models, major TFs associated with a specific type of phenotypic variation, viz., differences in cytotoxic response to a particular drug in a panel of cell lines. Our goal in the current work is to test if information about the functional impact of variants on TF binding can improve inferences of TF-phenotype associations.
Two of the three pieces of information considered in the above scheme—(a) strength of SNP association with gene expression (eQTL) and (b) gene expression correlation with phenotype (a transcriptome-wide association study or “TWAS” [18])—are quantified by relatively established procedures. However, the third axis of information crucial to the approach—the evidence that a TF’s binding, and hence its regulatory influence on a gene, is affected by a SNP—is harder to assess. In previous studies, we treated the presence of a SNP inside a ChIP peak of the TF, located within the 50-kbp upstream region of the gene, as such evidence. However, this heuristic has obvious limitations. First, a SNP located within a ChIP peak may not necessarily impact the TF’s binding. This may be addressed by borrowing ideas from previous studies [19, 20] that have used motif and k-mer-based scans within ChIP peaks to identify regulatory SNPs likely to affect that TF’s binding. Second, a TF binding event located further than 50 kbp from the TSS may also exert regulatory influence on a gene, depending on chromatin loo** structures [21]; conversely, every TF binding event located within a modest distance (e.g., 50 kbp) of the TSS does not necessarily have a regulatory influence on the gene. Use of chromatin interaction data sets offers a resolution of this issue [21]. In this work, we address the above limitations of ascribing a regulatory relationship to a (TF, SNP, gene) triplet, through a combination of established and novel methods, with the express goal of aggregating such evidences and combining them with gene-phenotype correlations to discover regulatory mechanisms underlying phenotype variation.
We develop and use a new computational pipeline to identify TFs associated with drug response variation across individuals, building on the ideas outlined above, and performing integrative analysis of genotype, gene expression, and cytotoxicity data on a panel of ~ 300 cell lines, along with TF-ChIP data from ENCODE and TF binding motifs from various databases. We utilize a state-of-the-art, “k-mer”-based machine learning technique to predict the impact of a SNP on TF binding strength. We also develop an alternative method for this task, which uses one or more motifs known to represent a TF’s binding preferences, and combines biophysically inspired modeling and machine learning ideas. Through systematic benchmarking, we find that this motif-based method has a similar predictive ability as the “k-mer”-based technique for predicting allele-specific TF-DNA binding, in contrast to recent reports that leading k-mer-based approaches clearly outperform motif-based approaches [22]. Ultimately, using both k-mer-based and motif-based predictors and utilizing chromatin interaction domains and loops to link variants to genes, we show that modern tools of SNP impact prediction can lead to the discovery of novel regulatory mechanisms underlying phenotypic variation that are missed when not using SNP impact predictors. By aggregating evidence from many SNPs with putative effects on TF binding, we systematically identify TFs that influence individual-level differences in drug sensitivity, for several cytotoxic drugs. We examine one such discovered association more closely, viz., the predicted and experimentally confirmed effect of the TF “E74-like factor 1” (ELF1) on sensitivity to the drug doxorubicin. Our analysis suggests several genes that may be under ELF1 regulation and related to the doxorubicin response pathway.
Results
Selection of methods for predicting impact of SNPs on TF-DNA binding
We first sought a method to predict the impact of a SNP on TF binding (henceforth referred to as the “TFBS-SNP impact prediction task”), with the ultimate goal of utilizing such predictions to discover TF-phenotype relationships. This requires a sensitive method to quantify the strength of binding, since the effect of a typical SNP on a binding site is expected to be relatively modest. Several such methods have been reported in the literature [10, 13, 23], including some that utilize a variety of data types, such as chromatin state profiles [24] and high-resolution DNA accessibility [15, 24], for prediction [10, 13, 23]. To ensure wide applicability, we were specifically interested in a method that can predict TF binding strength from DNA sequence alone, while possibly using ChIP-seq data for the TF for model-training purposes. Existing tools for this scenario rely either on the k-mer composition of sequences [10, 13, 23, 25] or use pre-determined motifs for the TF [26,27,28,29]; recent evaluation [22] on allele-specific binding (ASB) data suggests that the k-mer-based methods have a clear advantage over motif-based methods. However, the motif-based methods tested by Wagih et al. [22] use a relatively rudimentary notion of motif matching, while past work by us [30] and others [29] has contributed more sophisticated biophysical models for this purpose. We compared a representative of leading k-mer-based methods (gkm-SVM [10, 11]) with an advanced motif-based method to determine their relative merits in predicting TF binding strengths and their changes due to SNPs.
We first used the thermodynamics-based method called Sequence To Affinity Prediction (STAP) [30] and trained it on ChIP-seq data for a TF, thereby learning to predict the strength of TF binding (ChIP signal strength) at a putative site from its sequence and the TF’s motif. STAP scores a genomic window, e.g., a few hundred base pairs long—the typical length of a ChIP peak—for its estimated occupancy by a TF, using the latter’s pre-determined motif in a position weight matrix (PWM) form. We have previously used this approach to accurately model ChIP data in D. melanogaster [30] and in mouse ESCs [31], as well as in the human cell line data sets of a recent “DREAM” challenge. However, we recognized that often there are multiple motifs for the same TF in the literature or databases and it is not clear which one of them, if any, is the optimal motif to use for the modeling of binding strengths. We therefore trained separate STAP models for each available motif for a TF and then used a support vector machine (SVM) classifier to combine the binding strength predictions of a TF at a given genomic window, made by those STAP models, into a single score (Fig. 1a). We call this the “MOP” (Motif-based Occupancy Prediction) score. With a means to score a window for its strength of TF binding, we were able to estimate the effect of a SNP by considering a 500-bp window centered on that SNP position, scoring two versions of the window, with the central position being set to either allele of the SNP, and computing the difference (Fig. 1b). We refer to this as the “Delta-MOP” score of the SNP for the TF. Note that this score is tied to the cell type from which ChIP data used in training were obtained.
Figure 1b illustrates the Delta-MOP score with an example. The SNP rs6717613 (G->A) is assigned a Delta-MOP score of 0.45 for the TF ATF2, with the MOP scores of the G and A alleles being 0.29 and 0.74 respectively. Note that six different motifs were available for this TF; for three of these ATF2 motifs, the SNP position coincides with an informative position of the motif and the two alleles define motif matches of differing strengths, while for the remaining three motifs, the two alleles present equally weak or equally strong sites. Hence, it is not clear a priori if this SNP should be considered as impacting binding strength or not, and it is instructive to have the Delta-MOP score provide an affirmative and quantitative answer.
Evaluations of SNP impact prediction scores
We first evaluated methods for prediction of TF binding strength from sequence, since this underlies the prediction of TFBS-SNP impacts. As noted above, the newly developed MOP score, which underlies Delta-MOP, is a generalization of the motif-based STAP method [30,31,32] for predicting a TF’s binding strength. We therefore hoped to confirm that this generalization indeed improves the prediction accuracy. We were also interested in a leading k-mer-based tool for predicting TF binding from sequence. We therefore considered the “gkm-SVM” method, which has been demonstrated to be among the best for this purpose—on par [33] with deep learning-based methods such as DeepBind [12] and DeepSEA [13].
We trained the three methods—STAP, MOP, and gkm-SVM—using the same training data set composed of 800 positive sequences (ChIP peaks of a TF) and 800 negative sequences (non-peaks), and cross-validated them on a set of 400 unseen sequences, balanced between the positive and negative classes. The negative sequences were randomly selected from the ChIP peaks of any other TF aside from the one under consideration (test TF); this is an important distinction from past benchmarks for the task (e.g., a recent “DREAM challenge” [34] and was designed to make the evaluation more specific to the unique binding behavior of the test TF rather than more general properties of TF binding implicit within ChIP data, such as DNA accessibility. Our tests were performed for each of 37 different TFs, selected based on the availability of ChIP-seq data for a well-studied lymphoblastoid cell line (LCL), GM12878, and other relevant criteria (see Additional file 1: Note S1). We noted that MOP and gkm-SVM produce similar accuracy (Fig. 2a, Additional file 2: Table S2) on average across the 37 data sets (TFs), while exhibiting some level of complementarity. MOP shows a clear improvement over STAP (Fig. 2b, paired T-test p value 0.0038, and Additional file 2: Table S2), demonstrating the value of using multiple motifs when available. (Additional file 2: Table S1 tabulates the number of motifs available for each TF.)
We next evaluated the above methods for the TFBS-SNP impact prediction task, by asking if the SNPs with strongest effects on predicted TF binding, henceforth called “binding-change SNPs,” are enriched for allele-specific binding sites (ASB), defined as sites where ChIP-seq read counts are significantly different between alleles [22]. The Delta-MOP score of the previous section is one way to predict binding-change SNPs, but analogous predictions can be made using STAP or gkm-SVM in place of MOP to score binding strengths of the two alleles. We refer to these as “Delta-STAP” and “Delta-gkm-SVM” [10, 11] scores respectively. As a baseline, we also evaluated a fourth method, called “Delta-PWM,” which is a PWM-based scoring metric included in the evaluations by Wagih et al. (We used the “delta raw score” provided by them as this baseline.) We obtained allele-specific binding (ASB) data for 16 TFs in lymphoblastoid cell lines from Wagih et al. [22], and tested whether the four abovementioned methods can accurately discriminate ASB SNPs from non-ASB SNPs (see “Methods”). Performance was measured using the area under the receiver operating characteristic curve (AUROC; ROC curve of RUNX3 is shown in Fig. 2c) and the area under precision-recall curve (AUPRC), following [35]. In AUROC comparisons (Fig. 2d, Additional file 2: Table S3), Delta-MOP appears to have better performance than Delta-STAP (average difference of 0.020, paired T-test p value 0.0013) and Delta-PWM (average difference of 0.025), but not as significantly different from Delta-gkm-SVM (average difference of 0.0043). The median AUROC using Delta-MOP is 0.60 and that using Delta-gkm-SVM is 0.58. Two of the 16 TFs—BHLHE40 and EGR1—had their ASB events predicted with AUROC of ~ 0.7 or greater when using Delta-MOP. These two methods exhibited a fair degree of complementarity in their performance on different TFs (Fig. 2d). In AUPRC comparisons (Fig. 2e), the performance of Delta-MOP is significantly better than that of Delta-PWM (average difference 0.095, paired T-test p value 0.00072), but similar to the other two methods, with the medians of Delta-MOP, Delta-gkm-SVM, and Delta-STAP being 0.39, 0.38, and 0.36 respectively.
To summarize the evaluations reported above, we found that the motif-based method MOP and the k-mer-based method gkm-SVM are equally good predictors of binding strength as well as of allele-specific binding events, with noticeable degree of complementarity to each other, while MOP shows clear improvements over the two other motif-based methods evaluated. We therefore selected Delta-MOP and Delta-gkm-SVM to predict TFBS-SNP impact for the next steps of analysis. It was instructive to find that a motif-based approach (Delta-MOP) is competitive with, and for some TFs better than, the k-mer-based Delta-gkm-SVM method (see “Discussion”). The same conclusions are supported by comparisons with a newer version of Delta-gkm-SVM, called “Delta-ls-gkm” [36], which yields better performance on the ASB prediction than Delta-gkm-SVM, but shows statistically insignificant difference from Delta-MOP (Additional file 1: Note S6).
Discovery of TFs regulating individual variation in cytotoxic drug response
To discover TFs associated with phenotypic variation, we adopted a statistical approach illustrated in Fig. 3. At its heart is a hypergeometric test of the overlap between two sets of SNPs, outlined below.
TF-phenotype association test:
-
(a)
We consider the collection of all SNPs that are located within accessible DNA in the cell type of interest; this is the “universe” set for the test. (Also, see Discussion about this choice.)
-
(b)
We define a subset of SNPs that are likely to impact phenotypic variation through a cis-regulatory effect on a gene’s expression; we refer to this as the “phenotype-associated” SNPs. Specifically, we identify phenotype-associated genes based on significant association between gene expression and the phenotype, and then determine significant eQTL SNPs in the regulatory regions (explained below and in “Methods”) of those genes.
-
(c)
We separately define a subset of SNPs that are likely to affect a particular TF’s binding strength, i.e., the “binding-change” SNPs. Although introduced above, these are now redefined as the SNPs with the greatest Delta-MOP or Delta-gkm-SVM score for that TF, among those located within the TF’s ChIP peaks for the cell type (see “Methods”).
-
(d)
A hypergeometric test is used to test the overlap between phenotype-associated SNPs and binding-change SNPs; a significant overlap is considered as evidence for the TF’s role in regulating phenotypic variation.
We note that the above test, conducted at the level of SNPs, is conceptually similar to that in Hanson et al. [16], with several key differences, the most prominent being our use of TFBS-SNP impact prediction scores as an additional criterion for designating binding-change SNPs. Hanson et al., in contrast, considered all SNPs within the TF’s ChIP peaks as binding-change SNPs. Other important differences are that Hanson et al. performed the statistical test at the gene level and did not use DNA accessibility or enhancer-promoter interaction data.
We used the TF-phenotype association test procedure on a data set of 284 lymphoblastoid cell lines (LCLs) that have previously been assayed for their cytotoxic response (EC50) to each of 24 different treatments, mostly cancer drugs [17]. Gene expression and genotype data are also available for these LCLs. We used ENCODE [37,38,39] ChIP-seq data for 37 TFs in the lymphoblastoid cell line GM12878 (see “Methods”), along with the abovementioned genotype data to identify binding-change SNPs, using Delta-MOP and Delta-gkm-SVM for TFBS-SNP impact prediction. We also repeated the analysis using only one or the other of these methods (see Additional file 2: Table S4, Table S5). To identify phenotype-associated SNPs, we considered genes whose expression levels correlated significantly with EC50 values (of a specific drug) across the panel of LCLs, and used Hi-C data [21] from the GM12878 cell line in step (b) of the above procedure. Here, we defined the regulatory region of a gene to include the chromatin interaction domain to which the gene belongs (see “Methods”), as well as more distal segments predicted to interact with the gene via chromatin “loops” [21].
Assessment of predicted TF-drug associations
A total of 888 TF-drug pairs (24 drugs × 37 TFs) were evaluated; we report in Table 1 all 38 pairs significant at false discovery rate (FDR) of 5% (nominal p value < 0.0021). (The full results are in Additional file 2: Table S6.) We also performed a variant of the above enrichment tests where TFBS-SNP impact prediction was not used; instead, a size-matched set of randomly selected SNPs within ChIP peaks (of the test TF) were chosen for consideration as binding-change SNPs, as was done by Hanson et al. [16]. (We used a size-matched random subset of within-peak SNPs, rather than all such SNPs, so that enrichment levels can be compared.) We repeated this “randomized control” test 100 times and noted how frequently each significant pair in the original analysis had a stronger p value in these randomized controls, reported in Table 1 (column “Impact predictor utility p score”). We note that 21 of the 38 reported pairs have only ≤ 10% chance of being discovered when not using TFBS-SNP impact prediction scores, thereby underscoring the value of such predictions in the procedure. This comparison establishes that impact prediction scores can indeed help identify novel statistical associations, though a rigorous assessment of the sensitivity-precision tradeoff due to their use is not attempted here. In another control experiment, we assigned to each TF a random set of SNPs (size-matched with the binding-change SNP sets above) from the universe of all SNPs within accessible regions and tested all 888 TF-drug pairs. We discovered that, on average, across 100 such randomized control tests, only 1.27 pairs (about 0.14% of the 888 tested) were significant at a nominal p value of 0.0021, the criterion used above for reporting (Table 1), providing further statistical evidence for the low proportion of false positives in our report.
While we showed above that the use of TFBS-SNP impact scores can help predict TF-drug associations that might otherwise not rise above statistical significance, we also needed to convince ourselves that the discovered statistical associations are likely to be biologically true. In the absence of any systematic benchmarks of causal relationships between TFs and drug response, we had to rely on extensive but ad hoc survey of the literature for supporting evidence, following guidelines established in [40]. Out of the 38 significant TF-drug pairs of Table 1, eight were found to have “direct” supporting evidence (Table 2). For seven of these 8 cases, knock-down of the TF has been shown to lead to a significant difference in sensitivity. In one case—the pair ELF1-CDDP—we found published evidence that DNA-bound ELF1 increases CDDP-induced DNA damage at the bound locations, thereby directly and mechanistically implicating the TF’s regulatory activity in response to the drug. Notably, three of the top seven significant pairs (based on p value) have such direct confirming evidence, and these three pairs would have not have been discovered if not using TFBS-SNP impact scores (impact predictor utility p score ≤ 0.1, Table 1). Among the eight pairs with direct evidence, only two (ELF1-CDDP and PML-doxorubicin) would have reasonable chance of being discovered without use of TFBS-SNP impact information (impact predictor utility p scores of 0.11 and 0.14 respectively).
We found six additional pairs to have strongly suggestive evidence of a biological relationship. This includes cases where the TF is a demonstrated regulatory mechanism of the drug’s action (evidence code “Regulation of drug action” in Table 2), is a known regulator of the drug’s target protein or pathway (“Regulatory target direct” in Table 2), or plays a role in sensitivity to a closely related drug (“Sibling drug evidence” in Table 2); see Additional file 1: Note S3 for details. As an example of “regulation of drug action,” SP1-mediated trans-activation of survivin has been shown to reduce doxorubicin sensitivity [41], supporting the pair SP1-doxorubicin. An instance of “regulatory target direct” evidence is provided by the pair REST-rapamycin: REST is known to exert regulatory control over the “mTOR” signaling pathway [42] and this pathway (mTOR = “mammalian target of rapamycin”) is the canonical target of the drug rapamycin [43]. An example of the evidence code “sibling drug evidence” is the pair PML-epirubicin, supported by direct evidence for the role of TF PML in response to the drug doxorubicin, which is closely related to epirubicin [17]: 6MP, 6TG, ARAC, arsenic, carboplatin, CDDP, cladribine, docetaxel, doxorubicin, epirubicin, everolimus, fludarabine, gemcitabine, hypoxia, metformin, MPA, MTX, NAPQI, oxaliplatin, paclitaxel, radiation, rapamycin, TCN, and TMZ. The phenotype, called EC50, represents the concentration at which the drug reduces the population of LCL cells to half of the initial population. Cytotoxicity assays were performed for every one of these drugs using the LCL panel. After initial optimization, cells were treated with a range of concentrations for any given drug tested, followed by incubation for 48 to 72 h. MTS cytotoxicity assays were then performed using Cell Titer 96 AQueous Non-Radioactive Cell Proliferation Assay kit (Promega Corporation, Madison, WI, USA), followed by absorbance measurement at 490 nm in a Safire2 microplate reader (Tecan AG, Switzerland). Cytotoxicity phenotypes were determined by the best fitting curve using the R package “drc” (dose–response curve) [74] based on a logistic model.
Transcription factor binding motifs
Two hundred twenty-five PWMs for 37 TFs were collected from three sources:
-
1)
Twenty-nine PWMs for 27 TFs were collected from ENCODE factor book motifs from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/databasefactorbookMotifPwm.txt.gz [75].
-
2)
One hundred eighty-five PWMs based on ChIP data for 37 TFs from GM12878 cell line were downloaded from the Factorbook website (http://www.factorbook.org/human/chipseq/tf) [76].
-
3)
Twenty-five PWMs for 21 TFs were obtained from HOCOMOCO Human (v10) [77], via the motif library of the MEME software [78].
All motifs are included in Additional file 3.
ChIP-seq and accessibility data
We used ChIP-seq data from the ENCODE project, as summarized in the “Txn Factor” track at the UCSC genome browser (“wgEncodeRegTfbsClusteredWithCellsV3” bed files). Clustered peaks observed in GM12878 cell line were used in this study. We also used genome-wide profiles of ChIP-seq signal values from the ENCODE project (www.encodeproject.org). Signal values are used as numeric measurements of the TF binding strength for training and testing TF-DNA binding prediction. DNaseI hypersensitivity (DHS) uniform peaks for GM12878 cell line (ENCODE project) were downloaded from the UCSC website [37].
Training set generation
MOP, STAP, and gkm-SVM need to be trained on ChIP-seq data using DNA sequences and corresponding ChIP scores. For training purposes, we generated balanced training data sets for each TF, which is composed of positive sequences and the same number of negative sequences. We selected 1000 segments of 500-bp length from each TF’s ChIP peaks as the positive set. (We limited the selection to peaks located within 50 kbp upstream of a protein coding gene and excluded “High Occupancy Target” or HOT regions, i.e., peaks overlap** 6 or more TFs with at least 50% overlap.) We defined a large collection of “negative windows” for a TF to be 500-bp-long segments in the positive sets of other TFs but not bound by the test TF. We then randomly selected 1000 windows from this collection as the negative set for the test TF. DNA sequence and signal value for each window was extracted from the reference genome (hg19) and ChIP-seq data from the ENCODE project. Thus a balanced data set with 2000 windows was generated for each of the 37 TFs. These data sets were further separated into a balanced training set with 1600 windows and a balanced test set of 400 windows. (See Additional file 1: Note S1.)
Prediction of TF-DNA binding
-
1.
STAP: A separate STAP model [30] was trained for each of 225 PWMs (representing 37 distinct TFs) using the balanced training set. Cross validation (80% training, 20% testing) within these training data was used to learn a value for the site energy threshold (“eT”) hyperparameter.
-
2.
gkm-SVM: For each TF, a separate model was trained as recommended by authors [10, 11], with default settings (http://www.beerlab.org/gkmsvm/gkmsvm-tutorial.htm).
-
3.
MOP: The scores of a window reported by STAP models using different motifs of the TF were used as a feature vector representing the window and provided to a support vector machine (SVM). We trained an SVM model (package “e1071” in R [79]) to predict ChIP scores from such feature vectors, using the same training data as above.
To make the binding scores predicted by different methods fall on a comparable scale, we rescaled every score by a linear function so that the predicted binding scores for the 2000 windows in training and testing data range exactly from 0 to 1.
Prediction of TFBS-SNP impact
We first generated a reference genome specific to our LCL genotype data set by starting with the “hg19” reference genome and setting the nucleotide at each SNP location (in the LCL data set) to the major allele of that SNP in the data set. For each SNP, a 501-bp window centered on that SNP was extracted from this LCL-specific reference genome, and two versions of its sequence—one with the minor allele and another with the major allele of that SNP—were used as inputs for TF binding predictors. The absolute value of the difference between predicted binding scores of these two sequences was used as the TFBS-SNP impact score (Delta-STAP, Delta-gkm-SVM or Delta-MOP, depending on the binding prediction method used). In this step, binding predictors trained on all 2000 windows defined above were used. The fourth method for SNP impact prediction, called Delta-PWM, directly uses the “Delta raw scores” for “MEME signif PWM” provided by Wagih et al. [22].
Evaluations on allele-specific binding (ASB) data
ASB SNPs and non-ASB SNPs for lymphoblastoid cell lines were collected from Wagih et al. [22]. Twenty-two of the 37 TFs, for which we have binding predictors, have ASB data for these cell lines. Among these 22 TFs, MEF2A, NFYB, SRF, and USF1 have fewer than 150 annotated (ASB or non-ASB) SNPs, while SP1 and SPI1 did not have associated Delta-PWM data. For these reasons, these six TFs were excluded and we only used the data for the remaining 16 TFs in the ASB evaluation (Additional file 2: Table S3). The TFBS-SNP impact of each ASB and non-ASB SNP was measured by four methods (Delta-MOP, Delta-gkm-SVM, Delta-STAP, and Delta-PWM) as explained above. AUROC and AUPRC values were calculated for each TF-method combination, indicating how well the corresponding impact score can be used to label the ASB and non-ASB SNPs.
Identifying eQTLs in a gene’s regulatory region
We used Hi-C data [21] on 3-D chromatin architecture in the GM12878 cell line to construct the cis-regulatory region of each gene. First, the local “domain” that the gene overlaps with was included in such a region. Second, for each pair of loci connected by a loop, if the gene overlaps with one of the loci, the other locus was included in its cis-regulatory region. For each SNP located within the cis-regulatory region of a gene, the association between genotype and gene expression was calculated following [17], and SNPs with p value < 0.05 were considered as cis-eQTLs for the gene.
TF-drug association tests
Hypergeometric tests were used to identify TFs whose “binding-change SNPs” are enriched in drug response-associated SNPs. We used SNPs in GM12878 DNase-seq narrow peaks [80, 81] as the universe. For each drug, we defined genes whose expression correlates with drug response (EC50) with a correlation p value of 0.05 or lower as “drug response genes” and eQTLs assigned to these genes (see above) as the drug response-associated SNPs. A TF’s “binding-change SNPs” were defined as those with large TFBS-SNP impact score using either MOP or gkm-SVM. In particular, SNPs located within the TF’s ChIP peaks and ranked among the top 300 by Delta-MOP or among top 300 by Delta-gkm-SVM score were called “binding-change SNPs.”
Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files. The code and data used in this work are available on the GitHub repository [82]. This repository includes:
1) The STAP software, which is used to create features used by MOP;
2) Scripts that implement MOP using R (documentation on how to run on a provided sample data set is included as a README);
3) Training and test data used for evaluating MOP, included in the directory “paperData/bindingData”;
4) SNP-genotype data, i.e., SNPs predicted as “binding change SNPs” (among accessible SNPs) for each TF, included in the directory “paperData/bindingChangeSNPs”;
5) SNP-phenotype data, i.e., SNPs that are associated with (are cis-eQTLs of) drug-response genes for each drug, included in the directory “paperData/drugResponseSNPs”;
6) All SNPs used as “universe” in our hypergeometric tests (Fig. 3), i.e., all SNPs found within accessible regions of the genome in the lymphoblastoid cell line GM12878, included in the directory “paperData/allSNPs”.
References
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42(Database issue):D1001–6.
Qian Q, Wang Y, Zhou R, Yang L, Faraone SV. Family-based and case-control association studies of DRD4 and DAT1 polymorphisms in Chinese attention deficit hyperactivity disorder patients suggest long repeats contribute to genetic risk for the disorder. Am J Med Genet B Neuropsychiatr Genet. 2004;128B(1):84–9.
Li Q, Seo JH, Stranger B, McKenna A, Pe'er I, Laframboise T, et al. Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell. 2013;152(3):633–41.
West MA, Kim K, Kliebenstein DJ, van Leeuwen H, Michelmore RW, Doerge RW, et al. Global eQTL map** reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics. 2007;175(3):1441–50.
Zhang Y, Manjunath M, Zhang S, Chasman D, Roy S, Song JS. Integrative genomic analysis predicts causative cis-regulatory mechanisms of the breast cancer-associated genetic variant rs4415084. Cancer Res. 2018;78(7):1579–91.
Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol. 2012;30(11):1095–106.
Kumar V, Westra HJ, Karjalainen J, Zhernakova DV, Esko T, Hrdlickova B, et al. Human disease-associated genetic variation impacts large intergenic non-coding RNA expression. PLoS Genet. 2013;9(1):e1003201.
Marbach D, Lamparter D, Quon G, Kellis M, Kutalik Z, Bergmann S. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nat Methods. 2016;13(4):366–70.
Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet. 2016;48(5):481–7.
Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014;10(7):e1003711.
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics. 2016;32(14):2205–7.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.
Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.
Wang M, Tai C, E W, Wei L. DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Res. 2018;46(11):e69.
Schwessinger R, Suciu MC, McGowan SJ, Telenius J, Taylor S, Higgs DR, et al. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res. 2017;27(10):1730–42.
Hanson C, Cairns J, Wang L, Sinha S. Computational discovery of transcription factors associated with drug response. Pharmacogenomics J. 2016;16(6):573–82.
Hanson C, Cairns J, Wang L, Sinha S. Principled multi-omic analysis reveals gene regulatory mechanisms of phenotype variation. Genome Res. 2018;28(8):1207–16.
Cloney R. Complex traits: integrating gene variation and expression to understand complex traits. Nat Rev Genet. 2016;17(4):194.
Cavalli M, Pan G, Nord H, Wallerman O, Wallen Arzt E, Berggren O, et al. Allele-specific transcription factor binding to common and rare variants associated with disease and gene expression. Hum Genet. 2016;135(5):485–97.
Huang D, Ovcharenko I. Identifying causal regulatory SNPs in ChIP-seq enhancers. Nucleic Acids Res. 2015;43(1):225–36.
Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin loo**. Cell. 2014;159(7):1665–80.
Wagih O, Merico D, Delong A, Frey BJ. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. Preprint at https://www.biorxiv.org/content/10.1101/253427v1 (2018).
Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26(7):990–9.
Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21(3):447–55.
Zeng H, Hashimoto T, Kang DD, Gifford DK. GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics. 2016;32(4):490–6.
Svetlichnyy D, Imrichova H, Fiers M, Kalender Atak Z, Aerts S. Identification of high-impact cis-regulatory mutations using transcription factor specific random Forest models. PLoS Comput Biol. 2015;11(11):e1004590.
Zuo C, Shin S, Keles S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics. 2015;31(20):3353–5.
Kumar S, Ambrosini G, Bucher P. SNP2TFBS - a database of regulatory SNPs affecting predicted transcription factor binding site affinity. Nucleic Acids Res. 2017;45(D1):D139–D44.
Thomas-Chollier M, Hufton A, Heinig M, O'Keeffe S, Masri NE, Roider HG, et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nat Protoc. 2011;6(12):1860–9.
Cheng Q, Kazemian M, Pham H, Blatti C, Celniker SE, Wolfe SA, et al. Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy. PLoS Genet. 2013;9(8):e1003571.
He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, et al. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS One. 2009;4(12):e8155.
Khoueiry P, Girardot C, Ciglar L, Peng PC, Gustafson EH, Sinha S, et al. Uncoupling evolutionary changes in DNA sequence, transcription factor occupancy and enhancer activity. eLife. 2017;6:e28440.
Qin Q, Feng J. Imputation for transcription factor binding predictions based on deep learning. PLoS Comput Biol. 2017;13(2):e1005403.
DREAM Challenges. ENCODE-DREAM in vivo transcription factor binding site prediction challenge 2016 [Available from: http://dreamchallenges.org/project/encode-dream-in-vivo-transcription-factor-binding-site-prediction-challenge.
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.
Lee D. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics. 2016;32(14):2196–8.
ENCODE. Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. The UCSC Table Browser data retrieval tool. Nucleic acids research. 2004;32(suppl_1):D493–D6.
Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010;26(17):2204–7.
Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17(5):405–24.
Lee BS, Oh J, Kang SK, Park S, Lee SH, Choi D, et al. Insulin protects cardiac myocytes from doxorubicin toxicity by Sp1-mediated transactivation of survivin. PLoS One. 2015;10(8):e0135438.
Cho E, Moon SM, Park BR, Kim DK, Lee BK, Kim CS. NRSF/REST regulates the mTOR signaling pathway in oral cancer cells. Oncol Rep. 2015;33(3):1459–64.
Ballou LM, Lin RZ. Rapamycin and mTOR kinase inhibitors. J Chem Biol. 2008;1(1–4):27–36.
Song C, Wang L, Wu X, Wang K, **e D, **ao Q, et al. PML recruits TET2 to regulate DNA modification and cell proliferation in response to chemotherapeutic agent. Cancer Res. 2018;78(10):2475–89.
Gupta K, Anand G, Yin X, Grove L, Prochownik EV. Mmip1: a novel leucine zipper protein that reverses the suppressive effects of Mad family members on c-myc. Oncogene. 1998;16(9):1149–59.
Zheng D, Wu W, Dong N, Jiang X, Xu J, Zhan X, et al. Mxd1 mediates hypoxia-induced cisplatin resistance in osteosarcoma cells by repression of the PTEN tumor suppressor gene. Mol Carcinog. 2017;56(10):2234–44.
Yang Y, Huang W, Qiu R, Liu R, Zeng Y, Gao J, et al. LSD1 coordinates with the SIN3A/HDAC complex and maintains sensitivity to chemotherapy in breast cancer. J Mol Cell Biol. 2018;10(4):285–301.
** KL, Park JY, Noh EJ, Hoe KL, Lee JH, Kim JH, et al. The effect of combined treatment with cisplatin and histone deacetylase inhibitors on HeLa cells. J Gynecol Oncol. 2010;21(4):262–8.
Beyer U, Kronung SK, Leha A, Walter L, Dobbelstein M. Comprehensive identification of genes driven by ERV9-LTRs reveals TNFRSF10B as a re-activatable mediator of testicular cancer cell death. Cell Death Differ. 2016;23(1):64–75.
Gueugnon F, Cartron PF, Charrier C, Bertrand P, Fonteneau JF, Gregoire M, et al. New histone deacetylase inhibitors improve cisplatin antitumor properties against thoracic cancer cells. Oncotarget. 2014;5(12):4504–15.
Steelman LS, Franklin RA, Abrams SL, Chappell W, Kempf CR, Basecke J, et al. Roles of the Ras/Raf/MEK/ERK pathway in leukemia therapy. Leukemia. 2011;25(7):1080–94.
Wang S, Konorev EA, Kotamraju S, Joseph J, Kalivendi S, Kalyanaraman B. Doxorubicin induces apoptosis in normal and tumor cells via distinctly different mechanisms. Intermediacy of H (2) O (2)- and p53-dependent pathways. J Biol Chem. 2004;279(24):25535–43.
Ravizza R, Gariboldi MB, Passarelli L, Monti E. Role of the p53/p21 system in the response of human colon carcinoma cells to doxorubicin. BMC Cancer. 2004;4:92.
Otsuka Y, Tanaka T, Uchida D, Noguchi Y, Saeki N, Saito Y, et al. Roles of cyclin-dependent kinase 4 and p53 in neuronal cell death induced by doxorubicin on cerebellar granule neurons in mouse. Neurosci Lett. 2004;365(3):180–5.
Fan C, Zheng W, Fu X, Li X, Wong YS, Chen T. Strategy to enhance the therapeutic effect of doxorubicin in human hepatocellular carcinoma by selenocystine, a synergistic agent that regulates the ROS-mediated signaling. Oncotarget. 2014;5(9):2853–63.
Kuida K. Caspase-9. Int J Biochem Cell Biol. 2000;32(2):121–4.
Kang MR, Park KH, Yang JO, Lee CW, Oh SJ, Yun J, et al. miR-6734 up-regulates p21 gene expression and induces cell cycle arrest and apoptosis in colon cancer cells. PLoS One. 2016;11(8):e0160961.
Zhou J, Wei Y, Liu D, Ge X, Zhou F, Jiang XY, et al. Identification of beta1,4GalT II as a target gene of p53-mediated HeLa cell apoptosis. J Biochem. 2008;143(4):547–54.
Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet. 2005;76(1):8–32.
Mirny LA. Nucleosome-mediated cooperativity between transcription factors. Proc Natl Acad Sci U S A. 2010;107(52):22534–9.
Karczewski KJ, Tatonetti NP, Landt SG, Yang X, Slifer T, Altman RB, et al. Cooperative transcription factor associations discovered using regulatory variation. Proc Natl Acad Sci U S A. 2011;108(32):13353–8.
Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, et al. Variation in transcription factor binding among humans. Science. 2010;328(5975):232–5.
Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011;27(12):1696–7.
Wang J, Zhuang J, Iyer S, Lin XY, Greven MC, Kim BH, et al. Factorbook.org: a wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 2013;41(Database issue):D171–6.
Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, et al. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013;41(Database issue):D195–202.
Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015;47(8):955–61.
Moyerbrailean GA, Kalita CA, Harvey CT, Wen X, Luca F, Pique-Regi R. Which genetics variants in DNase-Seq footprints are more likely to alter binding? PLoS Genet. 2016;12(2):e1005875.
Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010;5(9):e12776.
ENCODE project. ENCODE: Encyclopedia of DNA Elements [Available from: https://www.encodeproject.org.
Deplancke B, Alpern D, Gardeux V. The genetics of transcription factor DNA binding variation. Cell. 2016;166(3):538–54.
Gene Expression Omnibus. [Available from: http://www.ncbi.nlm.nih.gov/geo.
Wang L. Human Variation Panel: Gene Expression and Genotype 2010 [Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24277.
Niu N, Qin Y, Fridley BL, Hou J, Kalari KR, Zhu M, et al. Radiation pharmacogenomics: a genome-wide association approach to identify radiation response biomarkers using human lymphoblastoid cell lines. Genome Res. 2010;20(11):1482–92.
Ritz C, Baty F, Streibig JC, Gerhard D. Dose-response analysis using R. PLoS One. 2015;10(12):e0146021.
UCSC Genome Browser. UCSC genome annotation database 2009 [Available from: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database.
Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012;22(9):1798–812.
Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, et al. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016;44(D1):D116–25.
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(Web Server issue):W202–8.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: misc functions of the department of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.6–7; 2015.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, et al. ENCODE data in the UCSC genome browser: year 5 update. Nucleic Acids Res. 2013;41(Database issue):D56–63.
**e X, Hanson C, Sinha S. MOP: motif-based occupancy prediction data sets. Github.https://github.com/UIUCSinhaLab/MOP
Acknowledgements
The authors wish to thank Keith Stewart and Neal Cohen, as well as the Mayo Clinic Center for Individualized Medicine, the Interdisciplinary Health Sciences Institute, the Todd and Karen Wanek Program for Hypoplastic Left Heart Syndrome, and the Mayo Clinic & Illinois Alliance for Technology-Based Healthcare.
Funding
This work was supported in part by Mayo Clinic/Illinois Grand Challenge and in part by the National Institutes of Health (grant R01GM114341 to SS).
Author information
Authors and Affiliations
Contributions
XX collected the data, performed the data analysis, developed software, and performed literature survey. CH helped with the data analysis. XX and SS wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Supplementary notes. This file contains seven supplementary notes, with each accompanied by short descriptions. (PDF 637 kb)
Additional file 2:
The file includes seven supplementary tables. All supplementary tables are included as sheets in this Excel file. Legends for these tables are provided here. (1) Table S1. The number of motifs for each TF. (2) Table S2. Accuracy values (correlation coefficient) of binding strength predictions for each TF. (3) Table S3. The AUROCs and AUPRCs of TFs in the ASB enrichment using four different TF binding predictors. (4) Table S4. TFs discovered regulating drug response using Delta-MOP alone to predict binding-change SNPs. (5) Table S5. TFs discovered regulating drug response using Delta-gkm-SVM alone to predict binding-change SNPs. (6) Table S6. The results of drug associated enrichment tests for all 888 TF-drug pairs. (7) Table S7. Mediator genes in ELF1-doxorubicin Analysis. SNPs in the intersection of TF-drug hypergeometric tests are: (1) eQTLs associated to drug response genes; (2) “binding-change SNPs” of the TF. DRGs associated with those SNPs are potentially mediators regulated by the TF. Here we list all mediator genes in ELF1-Docorubicin analysis. (XLS 149 kb)
Additional file 3:
Transcription factor binding motifs. We obtained 239 PWMs from three different data sources: (1) factor book motifs processed by ENCODE UCSC; (2) motifs downloaded from Factorbook; (3) HOCOMOCO Human (v10) motifs. STAP models are trained separately for each motif. We excluded a motif if the maximum prediction of the corresponding STAP model is too small (< 0.5). The remaining 225 PWMs, which were used in further analysis, are included in Additional file 3. (TXT 72 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
**e, X., Hanson, C. & Sinha, S. Mechanistic interpretation of non-coding variants for discovering transcriptional regulators of drug response. BMC Biol 17, 62 (2019). https://doi.org/10.1186/s12915-019-0679-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12915-019-0679-8