Background

A major open problem today is how differences in DNA sequence, e.g., single-nucleotide polymorphisms (SNPs) and variants (SNVs), lead to health-related and other phenotypic differences among individuals. A common approach is to find polymorphisms/variants that are statistically correlated with phenotypic differences, as in genome-wide association studies (GWAS) [1], family-based association tests [2], and expression quantitative trait loci (eQTLs) [3, 4] for phenotype-related genes. However, statistically identified variants may not be functionally related to phenotypes [5], due to a variety of factors including linkage disequilibrium (LD). This problem is particularly pronounced in the case of non-coding variants, which represent the vast majority of GWAS findings [6, 7] and often function by influencing gene regulation. Accurate contextual information about non-coding variants can improve our ability to disambiguate variants causally related to gene expression and phenotype [8, 9] from nearby non-functional SNPs. For example, if we have prior knowledge of a relevant transcription factor (TF), then the presence of a variant within a TF binding site (TFBS) may add to our confidence in the variant’s regulatory potential; the assumption here is that such a variant influences the TF’s binding to that site and consequently the gene regulatory impact of the TF. Advanced techniques for predicting in vivo TF-DNA binding potential from DNA sequence (gkm-SVM [10, 11], DeepBind [12], DeepSEA [13], DeFine [14], and Sasquatch [15]) can facilitate this approach by providing more accurate estimates of a variant’s impact on TF binding. In addition to providing a means for statistically prioritizing those non-coding variants by their likelihood of functionality, this strategy also offers a mechanistic explanation about their function, i.e., their impact on the TF-gene regulatory relationship. For example, Zhang et al. [5] adopted such a strategy: they combined a method for predicting changes in TF binding with multi-omics data to identify a SNP that impacts the binding strength of a TF called GATA3 to modulate breast cancer susceptibility.

It must be noted, however, that the above approach to identify phenotype-related non-coding variants along with their regulatory mechanism is still in its infancy and its sensitivity-specificity tradeoff is not well understood. Reliable mechanistic claims of a SNP mediating a TF’s influence on phenotypic variation often require multiple lines of “-omic” evidence as well as prior knowledge. A related but less-explored opportunity is to examine a collection of variants associated with a phenotype (e.g., from a GWAS study) and test the collection for enrichment of variants predicted to impact TF-DNA binding; such an enrichment can associate the TF, rather than individual variants, with the phenotype. This may give us mechanistic insights of a more global nature, with greater confidence than what the available data allows at the level of individual SNPs. In recent work, we adopted this general strategy to identify transcription factors regulating phenotypic variation across individuals, by combining genotype, gene expression, and phenotype information with genome-wide profiles of TF-DNA binding. The underlying principles were twofold: (1) If a gene’s expression is correlated with phenotype, and a SNP correlated with that gene’s expression (eQTL of the gene) is located within a TFBS, we treated this as (weak) evidence that the TF influences the phenotype via that gene; (2) if such evidence is observed significantly many times, i.e., across many genes, we hypothesized that the TF plays an important regulatory role in phenotypic variation. The assumption is that TF binding is affected by the SNP and this effect underlies the SNP’s correlation with gene expression, which in turn contributes to phenotypic variation. 
We pursued this line of reasoning in [16, 17] to systematically identify, through statistical testing and probabilistic graphical models, major TFs associated with a specific type of phenotypic variation, viz., differences in cytotoxic response to a particular drug in a panel of cell lines. Our goal in the current work is to test if information about the functional impact of variants on TF binding can improve inferences of TF-phenotype associations.

Two of the three pieces of information considered in the above scheme, namely (a) strength of SNP association with gene expression (eQTL) and (b) gene expression correlation with phenotype (a transcriptome-wide association study or “TWAS” [18]), are quantified by relatively established procedures. However, the third axis of information crucial to the approach (the evidence that a TF’s binding, and hence its regulatory influence on a gene, is affected by a SNP) is harder to assess. In previous studies, we treated the presence of a SNP inside a ChIP peak of the TF, located within the 50-kbp upstream region of the gene, as such evidence. However, this heuristic has obvious limitations. First, a SNP located within a ChIP peak may not necessarily impact the TF’s binding. This may be addressed by borrowing ideas from previous studies [19, 20] that have used motif and k-mer-based scans within ChIP peaks to identify regulatory SNPs likely to affect that TF’s binding. Second, a TF binding event located further than 50 kbp from the TSS may also exert regulatory influence on a gene, depending on chromatin looping structures [21]; conversely, not every TF binding event located within a modest distance (e.g., 50 kbp) of the TSS has a regulatory influence on the gene. Use of chromatin interaction data sets offers a resolution of this issue [21]. In this work, we address the above limitations of ascribing a regulatory relationship to a (TF, SNP, gene) triplet, through a combination of established and novel methods, with the express goal of aggregating such evidence and combining it with gene-phenotype correlations to discover regulatory mechanisms underlying phenotype variation.

We develop and use a new computational pipeline to identify TFs associated with drug response variation across individuals, building on the ideas outlined above, and performing integrative analysis of genotype, gene expression, and cytotoxicity data on a panel of ~ 300 cell lines, along with TF-ChIP data from ENCODE and TF binding motifs from various databases. We utilize a state-of-the-art, “k-mer”-based machine learning technique to predict the impact of a SNP on TF binding strength. We also develop an alternative method for this task, which uses one or more motifs known to represent a TF’s binding preferences, and combines biophysically inspired modeling and machine learning ideas. Through systematic benchmarking, we find that this motif-based method has a similar predictive ability as the “k-mer”-based technique for predicting allele-specific TF-DNA binding, in contrast to recent reports that leading k-mer-based approaches clearly outperform motif-based approaches [22]. Ultimately, using both k-mer-based and motif-based predictors and utilizing chromatin interaction domains and loops to link variants to genes, we show that modern tools of SNP impact prediction can lead to the discovery of novel regulatory mechanisms underlying phenotypic variation that are missed when not using SNP impact predictors. By aggregating evidence from many SNPs with putative effects on TF binding, we systematically identify TFs that influence individual-level differences in drug sensitivity, for several cytotoxic drugs. We examine one such discovered association more closely, viz., the predicted and experimentally confirmed effect of the TF “E74-like factor 1” (ELF1) on sensitivity to the drug doxorubicin. Our analysis suggests several genes that may be under ELF1 regulation and related to the doxorubicin response pathway.

Results

Selection of methods for predicting impact of SNPs on TF-DNA binding

We first sought a method to predict the impact of a SNP on TF binding (henceforth referred to as the “TFBS-SNP impact prediction task”), with the ultimate goal of utilizing such predictions to discover TF-phenotype relationships. This requires a sensitive method to quantify the strength of binding, since the effect of a typical SNP on a binding site is expected to be relatively modest. Several such methods have been reported in the literature [10, 13, 23], including some that utilize a variety of data types, such as chromatin state profiles [24] and high-resolution DNA accessibility [15, 24], for prediction [10, 13, 23]. To ensure wide applicability, we were specifically interested in a method that can predict TF binding strength from DNA sequence alone, while possibly using ChIP-seq data for the TF for model-training purposes. Existing tools for this scenario rely either on the k-mer composition of sequences [10, 13, 23, 25] or use pre-determined motifs for the TF [26,27,28,29]; recent evaluation [22] on allele-specific binding (ASB) data suggests that the k-mer-based methods have a clear advantage over motif-based methods. However, the motif-based methods tested by Wagih et al. [22] use a relatively rudimentary notion of motif matching, while past work by us [30] and others [29] has contributed more sophisticated biophysical models for this purpose. We compared a representative of leading k-mer-based methods (gkm-SVM [10, 11]) with an advanced motif-based method to determine their relative merits in predicting TF binding strengths and their changes due to SNPs.

We first used the thermodynamics-based method called Sequence To Affinity Prediction (STAP) [30] and trained it on ChIP-seq data for a TF, thereby learning to predict the strength of TF binding (ChIP signal strength) at a putative site from its sequence and the TF’s motif. STAP scores a genomic window, e.g., a few hundred base pairs long—the typical length of a ChIP peak—for its estimated occupancy by a TF, using the latter’s pre-determined motif in a position weight matrix (PWM) form. We have previously used this approach to accurately model ChIP data in D. melanogaster [30] and in mouse ESCs [31], as well as in the human cell line data sets of a recent “DREAM” challenge. However, we recognized that often there are multiple motifs for the same TF in the literature or databases and it is not clear which one of them, if any, is the optimal motif to use for the modeling of binding strengths. We therefore trained separate STAP models for each available motif for a TF and then used a support vector machine (SVM) classifier to combine the binding strength predictions of a TF at a given genomic window, made by those STAP models, into a single score (Fig. 1a). We call this the “MOP” (Motif-based Occupancy Prediction) score. With a means to score a window for its strength of TF binding, we were able to estimate the effect of a SNP by considering a 500-bp window centered on that SNP position, scoring two versions of the window, with the central position being set to either allele of the SNP, and computing the difference (Fig. 1b). We refer to this as the “Delta-MOP” score of the SNP for the TF. Note that this score is tied to the cell type from which ChIP data used in training were obtained.

Fig. 1

Process of scoring TFBS-SNP impact and identifying a TF’s “binding change SNPs.” a We build a STAP model to predict TF binding at a DNA segment, separately for every available motif from ENCODE, FactorBook, and HOCOMOCO that represents the TF. For a given sequence, each motif-specific STAP model outputs a score indicating the occupancy of the TF on the sequence. An SVM model then combines STAP scores from all motifs of the TF to compute a combined score of the TF’s binding to the sequence; this is called the “MOP” score. b “Delta-MOP” score of a SNP is defined as the absolute value of the difference between the MOP scores of the major and minor allele sequences, constructed from the 501-bp sequence centered on the SNP location. In this example, SNP rs6717613 (G->A) is found to have a Delta-MOP score of 0.45 for the TF ATF2, which is the difference of MOP scores between the major and minor alleles (0.29 and 0.74 respectively). MOP scores were based on combining scores for six different ATF2 motifs (logos shown). The Delta-MOP score in this example can be qualitatively understood in terms of matches of the core binding site (top) to each of the six ATF2 motifs, whose STAP scores are shown separately for the two alleles (bottom). The core site’s match to motifs ATF2-1, ATF2-2, and ATF2-6 changes in strength between the two alleles. For instance, the SNP falls on the 10th position of motif ATF2-1, which prefers an “A,” and the change from “G” (major allele) to “A” (minor allele) is interpreted as a change in strength of motif match. On the other hand, the core site does not have a strong match to ATF2-3 or ATF2-4, in either allelic form, while motif ATF2-5 overlaps the core site but not the SNP position. The Delta-MOP score combines these different pieces of information in a principled manner to compute an overall score of the impact of rs6717613 on ATF2 binding

Figure 1b illustrates the Delta-MOP score with an example. The SNP rs6717613 (G->A) is assigned a Delta-MOP score of 0.45 for the TF ATF2, with the MOP scores of the G and A alleles being 0.29 and 0.74 respectively. Note that six different motifs were available for this TF; for three of these ATF2 motifs, the SNP position coincides with an informative position of the motif and the two alleles define motif matches of differing strengths, while for the remaining three motifs, the two alleles present equally weak or equally strong sites. Hence, it is not clear a priori if this SNP should be considered as impacting binding strength or not, and it is instructive to have the Delta-MOP score provide an affirmative and quantitative answer.
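
The allele-swap scoring described above can be illustrated with a minimal Python sketch. It assumes only that some function maps a sequence window to a binding score; the names `delta_score` and `toy_scorer` are hypothetical, and the toy scorer is a crude stand-in for a trained MOP or gkm-SVM model, not the model used in this work.

```python
from typing import Callable


def delta_score(window: str, snp_index: int, allele_a: str, allele_b: str,
                score_window: Callable[[str], float]) -> float:
    """Absolute difference in predicted binding between the two allelic windows."""
    if not 0 <= snp_index < len(window):
        raise ValueError("snp_index must fall inside the window")
    # Build the two versions of the window, differing only at the SNP position.
    seq_a = window[:snp_index] + allele_a + window[snp_index + 1:]
    seq_b = window[:snp_index] + allele_b + window[snp_index + 1:]
    return abs(score_window(seq_a) - score_window(seq_b))


def toy_scorer(seq: str) -> float:
    """Hypothetical stand-in scorer: fraction of positions starting a 'TGAC' match."""
    hits = sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "TGAC")
    return hits / max(len(seq) - 3, 1)


# Toy 10-bp window with the SNP at index 5; a real window would span ~500 bp.
window = "AAATGGCAAA"
print(delta_score(window, 5, "G", "A", toy_scorer))
```

Passing the scorer as an argument keeps the same differencing step usable with any sequence-based predictor, which is how the Delta-STAP and Delta-gkm-SVM analogues differ from Delta-MOP.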

Evaluations of SNP impact prediction scores

We first evaluated methods for prediction of TF binding strength from sequence, since this underlies the prediction of TFBS-SNP impacts. As noted above, the newly developed MOP score, which underlies Delta-MOP, is a generalization of the motif-based STAP method [30,31,32] for predicting a TF’s binding strength. We therefore hoped to confirm that this generalization indeed improves the prediction accuracy. We were also interested in a leading k-mer-based tool for predicting TF binding from sequence. We therefore considered the “gkm-SVM” method, which has been demonstrated to be among the best for this purpose—on par [33] with deep learning-based methods such as DeepBind [12] and DeepSEA [13].

We trained the three methods (STAP, MOP, and gkm-SVM) using the same training data set composed of 800 positive sequences (ChIP peaks of a TF) and 800 negative sequences (non-peaks), and cross-validated them on a set of 400 unseen sequences, balanced between the positive and negative classes. The negative sequences were randomly selected from the ChIP peaks of any other TF aside from the one under consideration (test TF); this is an important distinction from past benchmarks for the task (e.g., a recent “DREAM challenge” [34]) and was designed to make the evaluation more specific to the unique binding behavior of the test TF rather than more general properties of TF binding implicit within ChIP data, such as DNA accessibility. Our tests were performed for each of 37 different TFs, selected based on the availability of ChIP-seq data for a well-studied lymphoblastoid cell line (LCL), GM12878, and other relevant criteria (see Additional file 1: Note S1). We noted that MOP and gkm-SVM produce similar accuracy (Fig. 2a, Additional file 2: Table S2) on average across the 37 data sets (TFs), while exhibiting some level of complementarity. MOP shows a clear improvement over STAP (Fig. 2b, paired T-test p value 0.0038, and Additional file 2: Table S2), demonstrating the value of using multiple motifs when available. (Additional file 2: Table S1 tabulates the number of motifs available for each TF.)
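
As a minimal sketch of the two statistics used in this comparison, the following Python code computes a Pearson correlation coefficient between measured and predicted scores for one TF, and a paired t statistic over per-TF correlation coefficients of two methods. The function names and toy values are hypothetical; converting the t statistic to a p value requires a t distribution with n - 1 degrees of freedom, which is not included here.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation between measured ChIP scores and predicted scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def paired_t_statistic(cc_a, cc_b):
    """t statistic for the paired differences cc_a[i] - cc_b[i] (one pair per TF)."""
    n = len(cc_a)
    if n != len(cc_b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    diffs = [a - b for a, b in zip(cc_a, cc_b)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean_d / math.sqrt(var_d / n)


# Toy values only: per-TF CCs for two hypothetical methods on three TFs.
print(paired_t_statistic([0.5, 0.6, 0.7], [0.4, 0.4, 0.6]))
```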

Fig. 2

a, b Comparison of three TF binding predictors. We compared MOP with STAP and gkm-SVM. The performance of each model is measured by the Pearson correlation coefficient (CC) between ChIP score and predicted binding score on a test set of 400 sequences that are not used in model training. Performance evaluation is performed for each of 37 data sets (for different TFs). a MOP performs as well or better than STAP (using the best motif when multiple motifs are available) for 26 of the 37 data sets, with their average CC being 0.39 and 0.36 respectively. b MOP performs as well or better than gkm-SVM for 21 of 37 TF data sets examined, with average CC of the two methods being 0.39 and 0.37 respectively. c–e Evaluation of TFBS-SNP impact prediction methods. Four different methods of binding change prediction (Delta-MOP, Delta-gkm-SVM, Delta-STAP, and Delta-PWM) were evaluated for their ability to predict allele-specific binding (ASB) events from non-ASB events, for each of 16 data sets based on ChIP-seq data for different TFs. Performance was measured using AUROC as well as AUPRC. ROC curve of RUNX3 using “Delta-MOP” as impact predictor is shown in (c). The last two rows show pairwise comparison of Delta-MOP and each of the other three methods based on AUROC (d) and AUPRC (e) achieved by the methods on the same data set

We next evaluated the above methods for the TFBS-SNP impact prediction task, by asking if the SNPs with strongest effects on predicted TF binding, henceforth called “binding-change SNPs,” are enriched for allele-specific binding sites (ASB), defined as sites where ChIP-seq read counts are significantly different between alleles [22]. The Delta-MOP score of the previous section is one way to predict binding-change SNPs, but analogous predictions can be made using STAP or gkm-SVM in place of MOP to score binding strengths of the two alleles. We refer to these as “Delta-STAP” and “Delta-gkm-SVM” [10, 11] scores respectively. As a baseline, we also evaluated a fourth method, called “Delta-PWM,” which is a PWM-based scoring metric included in the evaluations by Wagih et al. (We used the “delta raw score” provided by them as this baseline.) We obtained allele-specific binding (ASB) data for 16 TFs in lymphoblastoid cell lines from Wagih et al. [22], and tested whether the four abovementioned methods can accurately discriminate ASB SNPs from non-ASB SNPs (see “Methods”). Performance was measured using the area under the receiver operating characteristic curve (AUROC; ROC curve of RUNX3 is shown in Fig. 2c) and the area under precision-recall curve (AUPRC), following [35]. In AUROC comparisons (Fig. 2d, Additional file 2: Table S3), Delta-MOP appears to have better performance than Delta-STAP (average difference of 0.020, paired T-test p value 0.0013) and Delta-PWM (average difference of 0.025), but not as significantly different from Delta-gkm-SVM (average difference of 0.0043). The median AUROC using Delta-MOP is 0.60 and that using Delta-gkm-SVM is 0.58. Two of the 16 TFs—BHLHE40 and EGR1—had their ASB events predicted with AUROC of ~ 0.7 or greater when using Delta-MOP. These two methods exhibited a fair degree of complementarity in their performance on different TFs (Fig. 2d). In AUPRC comparisons (Fig. 2e), the performance of Delta-MOP is significantly better than that of Delta-PWM (average difference 0.095, paired T-test p value 0.00072), but similar to the other two methods, with the medians of Delta-MOP, Delta-gkm-SVM, and Delta-STAP being 0.39, 0.38, and 0.36 respectively.
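
To make the AUROC criterion concrete, the following Python sketch computes it directly from its pairwise definition: the probability that a randomly chosen ASB SNP receives a higher impact score than a randomly chosen non-ASB SNP, with ties counted as one half. The function name and toy data are hypothetical, and this quadratic-time form is chosen for clarity rather than speed; AUPRC is not covered here.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    scores: impact prediction scores (e.g., Delta scores), one per SNP.
    labels: truthy for ASB SNPs, falsy for non-ASB SNPs.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = [s for s, is_asb in zip(scores, labels) if is_asb]
    negatives = [s for s, is_asb in zip(scores, labels) if not is_asb]
    if not positives or not negatives:
        raise ValueError("need at least one ASB and one non-ASB SNP")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties contribute one half
    return wins / (len(positives) * len(negatives))


# Toy example: two ASB SNPs (label 1) and two non-ASB SNPs (label 0).
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An AUROC of 0.5 corresponds to random ranking, which gives context for the median values of about 0.6 reported above.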

To summarize the evaluations reported above, we found that the motif-based method MOP and the k-mer-based method gkm-SVM are equally good predictors of binding strength as well as of allele-specific binding events, with noticeable degree of complementarity to each other, while MOP shows clear improvements over the two other motif-based methods evaluated. We therefore selected Delta-MOP and Delta-gkm-SVM to predict TFBS-SNP impact for the next steps of analysis. It was instructive to find that a motif-based approach (Delta-MOP) is competitive with, and for some TFs better than, the k-mer-based Delta-gkm-SVM method (see “Discussion”). The same conclusions are supported by comparisons with a newer version of Delta-gkm-SVM, called “Delta-ls-gkm” [36], which yields better performance on the ASB prediction than Delta-gkm-SVM, but shows statistically insignificant difference from Delta-MOP (Additional file 1: Note S6).

Discovery of TFs regulating individual variation in cytotoxic drug response

To discover TFs associated with phenotypic variation, we adopted a statistical approach illustrated in Fig. 3. At its heart is a hypergeometric test of the overlap between two sets of SNPs, outlined below.

Fig. 3

Process of identifying TFs regulating phenotypic variation. A hypergeometric test is used to test the overlap between a TF’s “binding change SNPs,” based on presence within ChIP peaks from ENCODE and high Delta-MOP score, and “phenotype-associated SNPs,” i.e., eQTLs of genes whose expression correlates with phenotype, located within cis-regulatory regions of the gene identified by Hi-C data. A TF is considered significant to the phenotype if the FDR q value is below 0.05

TF-phenotype association test:

(a) We consider the collection of all SNPs that are located within accessible DNA in the cell type of interest; this is the “universe” set for the test. (Also, see Discussion about this choice.)

(b) We define a subset of SNPs that are likely to impact phenotypic variation through a cis-regulatory effect on a gene’s expression; we refer to this as the “phenotype-associated” SNPs. Specifically, we identify phenotype-associated genes based on significant association between gene expression and the phenotype, and then determine significant eQTL SNPs in the regulatory regions (explained below and in “Methods”) of those genes.

(c) We separately define a subset of SNPs that are likely to affect a particular TF’s binding strength, i.e., the “binding-change” SNPs. Although introduced above, these are now redefined as the SNPs with the greatest Delta-MOP or Delta-gkm-SVM score for that TF, among those located within the TF’s ChIP peaks for the cell type (see “Methods”).

(d) A hypergeometric test is used to test the overlap between phenotype-associated SNPs and binding-change SNPs; a significant overlap is considered as evidence for the TF’s role in regulating phenotypic variation.
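
The overlap test in steps (a) to (d) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's code: it assumes a one-sided (enrichment) tail, the function names are hypothetical, and the exact binomial-coefficient sum is practical only for modest set sizes.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, n_phenotype, n_binding, n_overlap):
    """P(X >= n_overlap) when n_binding SNPs are drawn without replacement from a
    universe that contains n_phenotype phenotype-associated SNPs."""
    if not (0 <= n_phenotype <= universe_size and 0 <= n_binding <= universe_size):
        raise ValueError("subset sizes must lie between 0 and universe_size")
    if not 0 <= n_overlap <= min(n_phenotype, n_binding):
        raise ValueError("overlap cannot exceed either subset size")
    total = comb(universe_size, n_binding)
    upper = min(n_phenotype, n_binding)
    tail = sum(
        comb(n_phenotype, k) * comb(universe_size - n_phenotype, n_binding - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def tf_phenotype_p(universe, phenotype_snps, binding_snps):
    """Apply the test to sets of SNP identifiers, restricted to the universe."""
    phen = phenotype_snps & universe
    bind = binding_snps & universe
    return hypergeom_enrichment_p(len(universe), len(phen), len(bind), len(phen & bind))


# Toy example: 10 SNPs in the universe, 4 phenotype-associated, 3 binding-change, 2 shared.
universe = set(range(10))
print(tf_phenotype_p(universe, {0, 1, 2, 3}, {0, 1, 9}))
```

Restricting both subsets to the universe before counting keeps the test well defined when a SNP list includes positions outside accessible DNA.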

We note that the above test, conducted at the level of SNPs, is conceptually similar to that in Hanson et al. [16], with several key differences, the most prominent being our use of TFBS-SNP impact prediction scores as an additional criterion for designating binding-change SNPs. Hanson et al., in contrast, considered all SNPs within the TF’s ChIP peaks as binding-change SNPs. Other important differences are that Hanson et al. performed the statistical test at the gene level and did not use DNA accessibility or enhancer-promoter interaction data.

We used the TF-phenotype association test procedure on a data set of 284 lymphoblastoid cell lines (LCLs) that have previously been assayed for their cytotoxic response (EC50) to each of 24 different treatments, mostly cancer drugs [17]. Gene expression and genotype data are also available for these LCLs. We used ENCODE [37,38,39] ChIP-seq data for 37 TFs in the lymphoblastoid cell line GM12878 (see “Methods”), along with the abovementioned genotype data to identify binding-change SNPs, using Delta-MOP and Delta-gkm-SVM for TFBS-SNP impact prediction. We also repeated the analysis using only one or the other of these methods (see Additional file 2: Table S4, Table S5). To identify phenotype-associated SNPs, we considered genes whose expression levels correlated significantly with EC50 values (of a specific drug) across the panel of LCLs, and used Hi-C data [21] from the GM12878 cell line in step (b) of the above procedure. Here, we defined the regulatory region of a gene to include the chromatin interaction domain to which the gene belongs (see “Methods”), as well as more distal segments predicted to interact with the gene via chromatin “loops” [21].

Assessment of predicted TF-drug associations

A total of 888 TF-drug pairs (24 drugs × 37 TFs) were evaluated; we report in Table 1 all 38 pairs significant at false discovery rate (FDR) of 5% (nominal p value < 0.0021). (The full results are in Additional file 2: Table S6.) We also performed a variant of the above enrichment tests where TFBS-SNP impact prediction was not used; instead, a size-matched set of randomly selected SNPs within ChIP peaks (of the test TF) were chosen for consideration as binding-change SNPs, as was done by Hanson et al. [16]. (We used a size-matched random subset of within-peak SNPs, rather than all such SNPs, so that enrichment levels can be compared.) We repeated this “randomized control” test 100 times and noted how frequently each significant pair in the original analysis had a stronger p value in these randomized controls, reported in Table 1 (column “Impact predictor utility p score”). We note that 21 of the 38 reported pairs have only ≤ 10% chance of being discovered when not using TFBS-SNP impact prediction scores, thereby underscoring the value of such predictions in the procedure. This comparison establishes that impact prediction scores can indeed help identify novel statistical associations, though a rigorous assessment of the sensitivity-precision tradeoff due to their use is not attempted here. In another control experiment, we assigned to each TF a random set of SNPs (size-matched with the binding-change SNP sets above) from the universe of all SNPs within accessible regions and tested all 888 TF-drug pairs. We discovered that, on average, across 100 such randomized control tests, only 1.27 pairs (about 0.14% of the 888 tested) were significant at a nominal p value of 0.0021, the criterion used above for reporting (Table 1), providing further statistical evidence for the low proportion of false positives in our report.
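
The “randomized control” comparison described above can be sketched as an empirical score: the fraction of size-matched random draws of within-peak SNPs whose test p value is smaller than the observed one. In this Python sketch the function name, the `test_p` callable (which should return the enrichment p value for a candidate binding-change set), and the fixed seed are all hypothetical choices made for illustration.

```python
import random


def utility_p_score(observed_p, within_peak_snps, n_binding, test_p,
                    n_controls=100, seed=0):
    """Fraction of random controls with a p value smaller than observed_p.

    within_peak_snps: SNPs inside the TF's ChIP peaks (the sampling pool).
    n_binding: size of the original binding-change SNP set (size matching).
    test_p: callable mapping a set of SNPs to an enrichment p value.
    """
    rng = random.Random(seed)
    pool = sorted(within_peak_snps)   # fixed order makes the draws reproducible
    smaller = 0
    for _ in range(n_controls):
        control = set(rng.sample(pool, n_binding))
        if test_p(control) < observed_p:
            smaller += 1
    return smaller / n_controls


# Toy usage with a placeholder test that always returns p = 0.5.
print(utility_p_score(0.01, range(50), 5, lambda snps: 0.5))
```

A score of 0.1 or less then means that at most 10 of 100 random within-peak subsets produced a stronger association than the impact-score-based subset.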

Table 1 Significant TF-drug associations. Thirty-eight TF-drug pairs were discovered as significant at false discovery rate (FDR) of 5% (nominal p value < 0.0021). p value of the hypergeometric tests are shown in the third column. The fourth column (“Impact predictor utility p score”) shows an empirical p value for each association, computed by repeating the hypergeometric test using a size-matched random subset of SNPs within ChIP peaks (rather than SNPs with greatest TFBS-SNP impact scores) 100 times and counting how frequently the test p value in these random controls is smaller than that observed in the original test for that TF-drug pair

While we showed above that the use of TFBS-SNP impact scores can help predict TF-drug associations that might otherwise not rise above statistical significance, we also needed to convince ourselves that the discovered statistical associations are likely to be biologically true. In the absence of any systematic benchmarks of causal relationships between TFs and drug response, we had to rely on an extensive but ad hoc survey of the literature for supporting evidence, following guidelines established in [40]. Out of the 38 significant TF-drug pairs of Table 1, eight were found to have “direct” supporting evidence (Table 2). For seven of these eight cases, knock-down of the TF has been shown to lead to a significant difference in sensitivity. In one case, the pair ELF1-CDDP, we found published evidence that DNA-bound ELF1 increases CDDP-induced DNA damage at the bound locations, thereby directly and mechanistically implicating the TF’s regulatory activity in response to the drug. Notably, three of the top seven significant pairs (based on p value) have such direct confirming evidence, and these three pairs would not have been discovered without TFBS-SNP impact scores (impact predictor utility p score ≤ 0.1, Table 1). Among the eight pairs with direct evidence, only two (ELF1-CDDP and PML-doxorubicin) would have a reasonable chance of being discovered without use of TFBS-SNP impact information (impact predictor utility p scores of 0.11 and 0.14 respectively).

Table 2 TF-drug pairs with supporting evidence. This table lists the 18 TF-drug pairs (among the 38 pairs shown in Table 1) that have supporting literature evidence. We defined four different evidence types based on the type of evidence, as explained in text

We found six additional pairs to have strongly suggestive evidence of a biological relationship. This includes cases where the TF is a demonstrated regulatory mechanism of the drug’s action (evidence code “Regulation of drug action” in Table 2), is a known regulator of the drug’s target protein or pathway (“Regulatory target direct” in Table 2), or plays a role in sensitivity to a closely related drug (“Sibling drug evidence” in Table 2); see Additional file 1: Note S3 for details. As an example of “regulation of drug action,” SP1-mediated trans-activation of survivin has been shown to reduce doxorubicin sensitivity [41], supporting the pair SP1-doxorubicin. An instance of “regulatory target direct” evidence is provided by the pair REST-rapamycin: REST is known to exert regulatory control over the “mTOR” signaling pathway [42] and this pathway (mTOR = “mammalian target of rapamycin”) is the canonical target of the drug rapamycin [43]. An example of the evidence code “sibling drug evidence” is the pair PML-epirubicin, supported by direct evidence for the role of TF PML in response to the drug doxorubicin, which is closely related to epirubicin.

The 24 treatments assayed [17] were: 6MP, 6TG, ARAC, arsenic, carboplatin, CDDP, cladribine, docetaxel, doxorubicin, epirubicin, everolimus, fludarabine, gemcitabine, hypoxia, metformin, MPA, MTX, NAPQI, oxaliplatin, paclitaxel, radiation, rapamycin, TCN, and TMZ. The phenotype, called EC50, represents the concentration at which the drug reduces the population of LCL cells to half of the initial population. Cytotoxicity assays were performed for every one of these drugs using the LCL panel. After initial optimization, cells were treated with a range of concentrations for any given drug tested, followed by incubation for 48 to 72 h.
MTS cytotoxicity assays were then performed using Cell Titer 96 AQueous Non-Radioactive Cell Proliferation Assay kit (Promega Corporation, Madison, WI, USA), followed by absorbance measurement at 490 nm in a Safire2 microplate reader (Tecan AG, Switzerland). Cytotoxicity phenotypes were determined by the best fitting curve using the R package “drc” (dose–response curve) [74] based on a logistic model.
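
To illustrate what an EC50 means in a logistic dose-response model, the following Python sketch evaluates a four-parameter log-logistic curve. The parameterization and the function name are assumptions made for illustration; the exact model form fitted with “drc” is not specified above, and no curve fitting is performed here.

```python
def four_param_logistic(dose, lower, upper, ec50, hill):
    """Response at a given dose under a four-parameter log-logistic curve.

    lower/upper: asymptotic responses at very high and zero dose.
    ec50: dose giving a response halfway between the two asymptotes.
    hill: positive steepness parameter (larger means a sharper transition).
    """
    if dose < 0 or ec50 <= 0 or hill <= 0:
        raise ValueError("dose must be >= 0; ec50 and hill must be > 0")
    return lower + (upper - lower) / (1.0 + (dose / ec50) ** hill)


# With survival scaled from 1 (untreated) to 0, the response at dose == ec50 is one half.
for dose in (0.0, 1.0, 2.0, 4.0, 8.0):
    print(dose, four_param_logistic(dose, lower=0.0, upper=1.0, ec50=2.0, hill=1.5))
```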

Transcription factor binding motifs

Two hundred twenty-five PWMs for 37 TFs were collected from three sources:

1) Twenty-nine PWMs for 27 TFs were collected from ENCODE factor book motifs from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/factorbookMotifPwm.txt.gz [75].

2) One hundred eighty-five PWMs based on ChIP data for 37 TFs from GM12878 cell line were downloaded from the Factorbook website (http://www.factorbook.org/human/chipseq/tf) [76].

3) Twenty-five PWMs for 21 TFs were obtained from HOCOMOCO Human (v10) [77], via the motif library of the MEME software [78].

All motifs are included in Additional file 3.

ChIP-seq and accessibility data

We used ChIP-seq data from the ENCODE project, as summarized in the “Txn Factor” track at the UCSC genome browser (“wgEncodeRegTfbsClusteredWithCellsV3” bed files). Clustered peaks observed in GM12878 cell line were used in this study. We also used genome-wide profiles of ChIP-seq signal values from the ENCODE project (www.encodeproject.org). Signal values are used as numeric measurements of the TF binding strength for training and testing TF-DNA binding prediction. DNaseI hypersensitivity (DHS) uniform peaks for GM12878 cell line (ENCODE project) were downloaded from the UCSC website [37].

Training set generation

MOP, STAP, and gkm-SVM need to be trained on ChIP-seq data using DNA sequences and corresponding ChIP scores. For training purposes, we generated a balanced data set for each TF, composed of positive sequences and the same number of negative sequences. We selected 1000 segments of 500-bp length from each TF’s ChIP peaks as the positive set. (We limited the selection to peaks located within 50 kbp upstream of a protein-coding gene and excluded “High Occupancy Target” or HOT regions, i.e., peaks overlapping 6 or more TFs with at least 50% overlap.) We defined a large collection of “negative windows” for a TF to be 500-bp-long segments in the positive sets of other TFs but not bound by the test TF. We then randomly selected 1000 windows from this collection as the negative set for the test TF. The DNA sequence and signal value for each window were extracted from the reference genome (hg19) and the ENCODE ChIP-seq data, respectively. Thus, a balanced data set with 2000 windows was generated for each of the 37 TFs. These data sets were further separated into a balanced training set with 1600 windows and a balanced test set of 400 windows. (See Additional file 1: Note S1.)
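The sampling and splitting described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `build_balanced_set`, the representation of windows as plain identifiers, and the fixed random seed are all assumptions made for the example.

```python
import random

def build_balanced_set(positive_windows, negative_pool, n_per_class=1000,
                       train_fraction=0.8, seed=0):
    """Sample a balanced data set and split it into balanced train/test parts.

    positive_windows: candidate windows from the TF's own ChIP peaks.
    negative_pool: windows from other TFs' positive sets not bound by this TF.
    Returns (train, test), each a list of (window, label) pairs.
    Hypothetical helper; names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Draw the same number of positives and negatives without replacement.
    positives = rng.sample(positive_windows, n_per_class)
    negatives = rng.sample(negative_pool, n_per_class)
    # Split each class separately so both parts stay balanced.
    n_train = int(round(n_per_class * train_fraction))
    train = ([(w, 1) for w in positives[:n_train]]
             + [(w, 0) for w in negatives[:n_train]])
    test = ([(w, 1) for w in positives[n_train:]]
            + [(w, 0) for w in negatives[n_train:]])
    return train, test
```

With the default arguments, this yields 1600 training windows and 400 test windows, each half positive and half negative, matching the sizes stated above.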

Prediction of TF-DNA binding

  1. STAP: A separate STAP model [30] was trained for each of 225 PWMs (representing 37 distinct TFs) using the balanced training set. Cross validation (80% training, 20% testing) within these training data was used to learn a value for the site energy threshold (“eT”) hyperparameter.

  2. gkm-SVM: For each TF, a separate model was trained as recommended by the authors [10, 11], with default settings (http://www.beerlab.org/gkmsvm/gkmsvm-tutorial.htm).

  3. MOP: The scores of a window reported by STAP models using different motifs of the TF were used as a feature vector representing the window and provided to a support vector machine (SVM). We trained an SVM model (package “e1071” in R [79]) to predict ChIP scores from such feature vectors, using the same training data as above.

To make the binding scores predicted by different methods fall on a comparable scale, we rescaled every score by a linear function so that the predicted binding scores for the 2000 windows in training and testing data range exactly from 0 to 1.
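One linear function with this property is min-max scaling. The sketch below assumes that is the intended transformation (the text specifies only that the map is linear and that the 2000 scores span exactly 0 to 1); the function name `fit_unit_rescaler` is hypothetical.

```python
def fit_unit_rescaler(reference_scores):
    """Return a linear function mapping min(reference_scores) to 0 and max to 1.

    reference_scores: predicted binding scores for the reference windows
    (here, the 2000 training and test windows of one TF and one method).
    Assumption: min-max scaling is the linear map used.
    """
    lo, hi = min(reference_scores), max(reference_scores)
    if hi == lo:
        raise ValueError("all reference scores are identical; cannot rescale")
    span = hi - lo

    def rescale(score):
        # The same linear map can be applied to scores of new sequences,
        # which may then fall slightly outside [0, 1].
        return (score - lo) / span

    return rescale
```

Returning the fitted function, rather than only the rescaled list, makes it possible to apply the identical transformation to scores of sequences outside the reference windows.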

Prediction of TFBS-SNP impact

We first generated a reference genome specific to our LCL genotype data set by starting with the “hg19” reference genome and setting the nucleotide at each SNP location (in the LCL data set) to the major allele of that SNP in the data set. For each SNP, a 501-bp window centered on that SNP was extracted from this LCL-specific reference genome, and two versions of its sequence—one with the minor allele and another with the major allele of that SNP—were used as inputs for TF binding predictors. The absolute value of the difference between predicted binding scores of these two sequences was used as the TFBS-SNP impact score (Delta-STAP, Delta-gkm-SVM or Delta-MOP, depending on the binding prediction method used). In this step, binding predictors trained on all 2000 windows defined above were used. The fourth method for SNP impact prediction, called Delta-PWM, directly uses the “Delta raw scores” for “MEME signif PWM” provided by Wagih et al. [22].
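The window extraction and scoring for the three Delta scores derived from binding predictors can be illustrated with the sketch below. The function names, the 0-based coordinate convention, and the `predict_binding` callable (standing in for a trained STAP, gkm-SVM, or MOP predictor) are assumptions made for the example.

```python
def allele_windows(genome_seq, snp_pos, major, minor, flank=250):
    """Return (major_window, minor_window), each 2*flank + 1 bp long.

    genome_seq: sequence of one chromosome of the LCL-specific reference.
    snp_pos: 0-based position of the SNP (an assumed convention).
    """
    start, end = snp_pos - flank, snp_pos + flank + 1
    if start < 0 or end > len(genome_seq):
        raise ValueError("window extends beyond the sequence")
    left = genome_seq[start:snp_pos]
    right = genome_seq[snp_pos + 1:end]
    return left + major + right, left + minor + right

def snp_impact(predict_binding, genome_seq, snp_pos, major, minor, flank=250):
    """Absolute difference in predicted binding between the two alleles."""
    major_win, minor_win = allele_windows(genome_seq, snp_pos, major, minor, flank)
    return abs(predict_binding(major_win) - predict_binding(minor_win))
```

With `flank=250` the windows are 501 bp long and centered on the SNP, as described above.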

Evaluations on allele-specific binding (ASB) data

ASB SNPs and non-ASB SNPs for lymphoblastoid cell lines were collected from Wagih et al. [22]. Twenty-two of the 37 TFs for which we have binding predictors have ASB data for these cell lines. Among these 22 TFs, MEF2A, NFYB, SRF, and USF1 have fewer than 150 annotated (ASB or non-ASB) SNPs, while SP1 and SPI1 have no associated Delta-PWM data. We therefore excluded these six TFs and used only the data for the remaining 16 TFs in the ASB evaluation (Additional file 2: Table S3). The TFBS-SNP impact of each ASB and non-ASB SNP was measured by four methods (Delta-MOP, Delta-gkm-SVM, Delta-STAP, and Delta-PWM) as explained above. AUROC and AUPRC values were calculated for each TF-method combination, indicating how well the corresponding impact score separates ASB from non-ASB SNPs.
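For one TF-method combination, the two metrics can be computed as in the sketch below. Two assumptions apply: AUROC is computed through its pairwise-comparison (Mann-Whitney) form, and AUPRC is approximated by average precision, which is one common estimator; the text does not state which estimator was used.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random ASB SNP outscores a random non-ASB SNP.

    Ties count as half a win. Quadratic in the number of SNPs, which is
    acceptable for an illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_precision(pos_scores, neg_scores):
    """Mean of the precision values at the rank of each positive (ASB) SNP.

    Tied scores are ordered with positives first (Python's sort is stable),
    which is slightly optimistic when ties are present.
    """
    labelled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labelled.sort(key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(labelled, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / len(pos_scores)
```

Here `pos_scores` would hold the impact scores of ASB SNPs and `neg_scores` those of non-ASB SNPs for a single TF and a single method.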

Identifying eQTLs in a gene’s regulatory region

We used Hi-C data [21] on 3-D chromatin architecture in the GM12878 cell line to construct the cis-regulatory region of each gene. First, the local “domain” that the gene overlaps with was included in such a region. Second, for each pair of loci connected by a loop, if the gene overlaps with one of the loci, the other locus was included in its cis-regulatory region. For each SNP located within the cis-regulatory region of a gene, the association between genotype and gene expression was calculated following [17], and SNPs with p value < 0.05 were considered as cis-eQTLs for the gene.
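The two rules for assembling a gene’s cis-regulatory region can be sketched as follows. The interval representation (half-open `(chrom, start, end)` tuples), the function names, and the representation of a loop as a pair of anchor intervals are assumptions made for the example; the association test itself is not shown.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def cis_regulatory_region(gene, domains, loops):
    """Collect the intervals that make up a gene's cis-regulatory region.

    gene: (chrom, start, end) of the gene.
    domains: list of Hi-C domain intervals.
    loops: list of (anchor_a, anchor_b) interval pairs connected by a loop.
    """
    # Rule 1: include every domain the gene overlaps.
    region = [d for d in domains if overlaps(gene, d)]
    # Rule 2: if the gene overlaps one loop anchor, include the other anchor.
    for anchor_a, anchor_b in loops:
        if overlaps(gene, anchor_a):
            region.append(anchor_b)
        if overlaps(gene, anchor_b):
            region.append(anchor_a)
    return region

def snps_in_region(snps, region):
    """Keep SNPs, given as (chrom, pos) with 0-based pos, inside any interval."""
    return [s for s in snps
            if any(overlaps((s[0], s[1], s[1] + 1), r) for r in region)]
```

The SNPs returned by `snps_in_region` are the candidates that would then be tested for association with the gene’s expression.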

TF-drug association tests

Hypergeometric tests were used to identify TFs whose “binding-change SNPs” are enriched in drug response-associated SNPs. We used SNPs in GM12878 DNase-seq narrow peaks [80, 81] as the universe. For each drug, we defined genes whose expression correlates with drug response (EC50) with a correlation p value of 0.05 or lower as “drug response genes” and eQTLs assigned to these genes (see above) as the drug response-associated SNPs. A TF’s “binding-change SNPs” were defined as those with a large TFBS-SNP impact score using either MOP or gkm-SVM. In particular, SNPs located within the TF’s ChIP peaks and ranked among the top 300 by Delta-MOP score or among the top 300 by Delta-gkm-SVM score were called “binding-change SNPs.”
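For a single TF-drug pair, the enrichment test can be sketched as below. The example assumes a one-sided (upper-tail) test, which is the usual form for enrichment but is not stated explicitly in the text; the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, n_drug_snps, n_binding_change, n_overlap):
    """Upper-tail hypergeometric p value, P(X >= n_overlap).

    universe_size: number of SNPs in the universe (SNPs in DNase-seq peaks).
    n_drug_snps: drug response-associated SNPs within the universe.
    n_binding_change: the TF's binding-change SNPs within the universe.
    n_overlap: SNPs that belong to both sets.
    Assumption: a one-sided test for enrichment.
    """
    max_overlap = min(n_drug_snps, n_binding_change)
    total = comb(universe_size, n_binding_change)
    # Sum the probabilities of observing n_overlap or more shared SNPs.
    tail = sum(
        comb(n_drug_snps, k)
        * comb(universe_size - n_drug_snps, n_binding_change - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow in the individual terms; only the final ratio is converted to a float.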