Background

The advent of the assay for transposase-accessible chromatin using sequencing (ATAC-seq) [1] and, more recently, its single-cell counterpart, scATAC-seq [2], have brought about the current ubiquity of chromatin accessibility data across numerous cell types from diverse organisms, tissue samples, and disease states. Chromatin accessibility maps various kinds of genomic elements, including regulatory elements such as gene promoters and intronic and intergenic enhancers that are occupied by transcription factors (TFs) as well as structural elements such as CTCF and cohesin binding sites that may anchor 3D chromatin loops. The DNA sequence signals underlying regions of open chromatin are therefore complex: while a single assay allows us to create an atlas of tens of thousands of accessible “peaks” in a given cell type, we expect that dozens of TFs occupy overlap** subsets of these peaks due to the presence of their cognate binding sites or those of cofactors. A key problem in regulatory genomics is interpreting the regulatory information encoded in all chromatin accessible peaks, namely TF binding sites and the TF-specific “regulatory grammars” that allow TFs to bind at different locations within the same peak.

Traditional methods for identifying TF binding sites in chromatin accessible regions involve performing searches and enrichment analyses with a library of known TF motifs, each encoded as a position-specific weight matrix (PWMs). These standard approaches are useful in finding strong signals but are confounded by the problem of redundant or missing motifs, the near-identity of motifs for closely related factors, and the inherent limitation of using weight matrices to define binding sites when more subtle binding sequence signals may be present. De novo motif discovery can be underpowered when the sequence signal is complex; for example, if an important TF binds a small fraction of accessible sites, enrichment-based motif discovery may fail to identify the corresponding binding motif.

To address the limitations of PWMs, a range of supervised machine learning methods using k-mer representations have been used to train sequence models to predict or decipher chromatin accessibility. The first such methods were k-mer based SVMs [3, 4], which accurately discriminate between accessible sites and negative (flanking or random genomic) sequences but are more difficult to interpret in terms of constituent TF signals; feature attribution methods have recently been introduced to extract explanatory sequence patterns from gapped k-mer SVM models [5]. Other approaches include SeqGL [6], which trains a group lasso logistic regression model on ATAC-seq data, where k-mer groups correspond to TF binding patterns; BindSpace [7], a latent semantic embedding method for TF SELEX-seq data that enables multi-class identification of the TF signals in genomic sequences; and a topic model approach based on discovering combinatorial binding of TFs from ChIP-seq data [8].

In parallel work, a range of deep learning models have been applied to chromatin accessibility and other epigenomic data sets. Popular methods use a one-hot encoding of DNA sequence and train convolutional neural networks (CNNs) [9,10,11,12] to predict epigenomic signals. While these methods have made impressive strides, there is still an interpretability issue, especially for chromatin accessibility data, which contains numerous binding patterns for distinct motifs, as opposed to TF binding data (e.g., ChIP-seq, ChIP-nexus, CUT&RUN), where one might hope to identify a smaller number of binding patterns for the targeted TF as well as its cofactors. Even in this latter setting, a complex process of extracting sequence patterns through feature attribution and aggregating them into motifs may be required for interpretation [13].

Broadly speaking, the regulatory genomics field uses chromatin accessibility as measured by ATAC-seq to map candidate gene regulatory elements, including enhancer elements for genes. We note that not all accessible elements are enhancers, and not all elements involved in gene regulation are accessible (e.g., repressive elements may not be associated with open chromatin). Acknowledging these caveats, the next steps in decoding gene regulatory programs are to (i) associate ATAC-seq peaks with target genes and (ii) identify the TFs that might be binding peak regions and therefore regulating target genes. Problem (i) is not addressed here but is the subject of a wide range of current research activities, including the use of chromosome conformation capture assays to map 3D promoter-enhancer interactions and of single-cell multiomic data to enable correlation of peak accessibility and gene expression across individual cells. Problem (ii) is what we address here, by decoding the TF binding signals in specific candidate regulatory elements (peaks). Solving both problems will lead to mechanistic insight into the TF networks that regulate individual genes and gene expression programs.

In this work, we develop a deep learning approach based on Dirichlet variational autoencoders (VAE) for modeling chromatin accessibility data, using a k-mer representation of genomic sequences as input (Fig. 1a). VAEs are a family of machine learning models that learn probability distributions with latent variables. Similar to autoencoders, they learn representations of the input data by compressing the input via a ‘bottleneck’ layer in the neural network. A VAE achieves this compression in a probabilistic manner, whereby the encoder transforms the input x into parameters describing a probability distribution, which it samples from to get the latent representation z. The decoder then reconstructs the input from the latent representation z, with the goal of making the output x as close as possible to the input x by minimizing the reconstruction error. In the Dirichlet VAE or ‘topic model’ setting, we assume that the input bag of k-mers from a peak is generated by multiple ‘topics’ (we refer to individual components of the latent space as ‘latent dimension’, ‘topic’ or ‘latent factor’), which can correspond to binding signatures of TFs or other sequence signals in the data. The VAE formulation of latent variable models uses advances in neural network learning and enables efficient training on large data sets using backpropagation of gradients.

Fig. 1
figure 1

a Generative model captured by BindVAE. b The decoder weights \(\theta \in \mathbb {R}^{100 \times 112800}\) from the GM12878 model, where each row i contains the 8-mer distribution learned for the latent dimension i. Since visualizing the weights of all 112800 8-mers is difficult, the weights for a subset of the 8-mers are shown. This subset of 8-mers was picked to contain the top 20 8-mers of each latent dimension. See the text for details. c Motif analysis of each latent dimension shows that different binding specificities are captured by each latent factor in GM12878. Column 2 shows the name of the TF assigned by Algorithm 1, column 3 shows the CIS-BP motif corresponding to the TF, and the last column shows the motif that was found by HOMER using de novo motif discovery for the TF

Here we show that the Dirichlet VAE model captures a useful representation of chromatin accessible elements, where the k-mer distributions encoded in the latent space can often be interpreted as binding patterns for TFs. We further present an algorithm to interpret the latent space that uses HT-SELEX probes. We show that our model learns cooperative binding signals for pairs of TFs, and we find that our model learns different TFs for distinct cell types, consistent with whether a TF is expressed or not.

Results

BindVAE: a Dirichlet variational autoencoder to deconvolve sequence signals

Each input example to BindVAE is the bag of DNA k-mers in one chromatin accessible region as shown in Fig. 1a. We describe our k-mer representation in detail in the “Methods” section. The generative model underlying the VAE is based on the observation that each peak is a combination of DNA sequence patterns from the following categories: (a) binding sites for one or more TFs; (b) low complexity regions; (c) genomic background; and (d) cleavage bias from the enzyme used in DNA fragmentation (or tagmentation). We thus surmise that the representation z, learned for each peak, should have latent dimensions (topics) that correspond to these categories, and in particular we are interested in topics that correspond to the binding signals of individual TFs. We assume that the membership of the peak in these categories follows a Dirichlet distribution. That is, ziDirichlet(α). Each category (topic) in turn is represented as a multinomial distribution over k-mers. Therefore, we can think of a TF-related topic as representing a more general model of the TF’s binding preferences, with the highest probability k-mers in the multinomial corresponding to the preferred binding signals. In Fig. 1a, the latent dimension topic 1 (red color), parameterized by θ1, contributes k-mers that capture the binding preferences of TF1, while the blue-colored topic 2 might contribute k-mers representative of the genomic background. Note that the k-mer occurrences in a given DNA sequence are not independent, due to the overlap in the k-mers at successive locations of the input DNA sequence – the true generative process is thus not exactly multinomial. However, multinomial distributions have been successfully used in language models too, where the words that appear in a sentence are usually not independent of each other.

In general we find that the topics learned are not distributed mixtures of different signals: rather each latent component tends to capture a coherent pattern of k-mers for a single TF. This is a key advantage since, given an input DNA sequence, our model can disentangle it into dimensions containing TF-specific binding patterns and dimensions containing other types of content surrounding the binding sites.

We use k-mers with wildcards in our representation, meaning that we extend the alphabet of nucleotides with a wildcard character that matches all bases. This modeling choice is motivated by our previous k-mer based machine learning works including BindSpace [7], SeqGL [6], and our early string kernel work for epigenomic data [3] that anticipated the widely-used gkm-SVM method [14]. All these approaches use short k-mers (e.g. k=8) with some kind of inexact matching, such as matching to wildcards. Because TF binding signals are degenerate, the use of wildcards that capture a larger number of binding instances has consistently proven useful in machine learning models of regulatory sequences.

The dimension of our latent space or the width of the bottleneck layer, which we call ‘M’, is 100. In this way, the model encodes 100 topics, each corresponding to the binding signal of a TF or another enriched sequence pattern. For downstream quantitative evaluation and analysis, we mainly consider the 100-dimensional latent representations of inputs and the learned decoder parameters θ, which guide the reconstruction of the input from the latent vector z. Details on VAEs, Dirichlet VAEs, the training approach we use, and hyper-parameter tuning are provided in the “Methods” section.

In the following sections, we present qualitative and quantitative analysis of the VAE models learned on ATAC-seq peaks from: GM12878, a human B lymphoblastoid cell line; A549, a lung epithelial cell line; and three other cell types discussed in Additional file 1: Table S1. We show the sequence motifs learned for TFs by summarizing the various dimensions of the latent space, project DNA sequences from other assays into the latent space for interpretation, locate cooperative binding signals, and correlate learned TFs to expression data.

BindVAE learns diverse k-mer binding patterns

To summarize the DNA sequence patterns captured by BindVAE, we visualize the learned weights of the top 20 k-mers for each latent dimension. These weights are obtained from the decoder parameters of BindVAE: \(\theta \in \mathbb {R}^{M \times D}\), where M is the size of the latent representation and D is the number of k-mers in our input representation. Figure 1b shows this visualization where the x-axis has a total of 1068 k-mers and the y-axis shows the latent dimensions i∈[0,100]. The set of 1068 k-mers was constructed by first selecting the top 20 8-mers for each latent dimension and then computing the union of the selected 8-mers. The (i,j)th entry of the matrix is θij. Based on the diagonal-heavy structure of the matrix, we can say that the model learns diverse patterns. The off-diagonal blocks show that similar k-mer importance patterns are learned for some of the dimensions. We find that these correspond to TFs from the same family.

Overall, we see two types of latent factors: ones that capture unique patterns such as dimension #0 and #72 (see enlarged inset below Fig. 1b) and ones that capture redundant patterns such as #69 and #83. The latent factors of the former type can be thought of as forming a basis, with each axis roughly corresponding to a different binding signature. Redundant dimensions are groups of dimensions that capture very similar k-mer distributions, i.e., have high weights for very similar k-mers, possibly due to binding sites of paralogous TFs.

Motifs discovered de novo by our model

While Fig. 1b shows that diverse patterns are learned, we asked if these patterns represent coherent TF binding patterns. To answer this, we summarized the patterns learned by the model by doing a motif analysis for each latent dimension. Ideally, we would like to obtain motifs directly from the parameters of our model, i.e. using the distribution \(\vec {\theta _{m}}\) for a latent factor m. However, the use of relatively short k-mers (8-mers) to represent the input, and the use of wildcards, limit the length and the accuracy of PWMs directly obtained from \(\vec {\theta _{m}}\).

Given the caveats above, we adopted a postprocessing procedure that can use higher-order k-mers to get more accurate binding patterns. First, we generate the set of all possible 10-mer DNA sequences, S10mers (total of 524,800 10-mers without wildcards). The model trained on the ATAC-seq peaks is applied on S10mers, where the input for each 10-mer example is the 8-mer counts vector of size D. The latent representation vector \(\vec {z}_{l}\) for the lth 10-mer is its latent score vector. For a given latent dimension k, we can rank all the 10-mers based on the latent scores: zlklS10mers. Let the top 200 10-mer sequences in this ranking be S200={l1,l2...l200}. We then construct a PWM from the DNA sequences in S200 using MEME [15] and render it using SeqLogo [16] for consistent coloring. These results are tabulated in Fig. 1c for GM12878, with the third column showing the learned motifs. The second column shows the name of the TF assigned to the latent dimension by our procedure described in Algorithm 1 (discussed in the next section) and the fourth column shows the corresponding CIS-BP [17] motif. For brevity, we show the top few motifs in Fig. 1c, that were selected based on the p-values assigned by Algorithm 1.

We observe that the motifs learned by BindVAE are similar to the motifs from CIS-BP, which are based on in vitro and in vivo studies of individual TF binding. Since our input representation relies on 8-mers, the model is biased towards learning shorter motifs more accurately. We also find that TFs with long motifs are split across multiple dimensions, for instance RUNX3 is split across dimensions #60 and #39, and CTCF is split across dimensions #11 and #4.

Overall BindVAE finds motifs between 6 and 10 in width (average width 7.4). Our results illustrate that our VAE-based model learns binding preferences of representative TFs de novo.

Map** the TFs to dimensions

So far we showed that BindVAE is able to learn k-mer patterns of TF binding motifs. In order to validate that these learned motifs (Fig. 1c) are indeed meaningful to TF binding, we incorporated in vitro TF binding information derived from the HT-SELEX (high-throughput systematic evolution of ligands by exponential enrichment) study by Jolma et al. [18] In each TF experiment the HT-SELEX assay produces, at the end of several cycles of TF-bound oligonucleotide selection, oligomers that have a high affinity to bind the specific TF. The enriched probes thus represent TF binding preferences outside the cellular context. The HT-SELEX experiments in Jolma et al. [18] span hundreds of TFs from various TF families, out of which we use 296 TFs.

We use the oligomer probes that were enriched in each TF’s HT-SELEX experiment to map the individual latent dimensions i∈[1…M] from the bottleneck layer of BindVAE to TFs thereby making them amenable to biological understanding and facilitating further analyses. We project enriched probes into the latent space generated by the ATAC-seq peaks; i.e. we do inference on the probes using the BindVAE model trained on GM12878 peaks. Next, we use the procedure outlined in Algorithm 1 (please refer to the “Methods” section), where we compare the set of latent score vectors {z} of one TF’s HT-SELEX probes to those of the probes from all other TFs. A TF t is mapped to a dimension m, generating the map** mt, if m ranks t’s probes higher than the probes from all other TFs.

We find that our algorithm produces all types of map**s between TFs and dimensions: one-to-one, one-to-many, many-to-one, and many-to-many. Homologous proteins or multiple members of a subfamily are mapped to the same dimension due to the similarity of their binding preferences. For instance, in the GM12878 model, the T-box family of TFs are assigned to the same latent dimension. Some TFs might be assigned to several dimensions for two reasons: (1) they have a long motif or (2) some homolog of that TF appears in the data but does not exist in our set of 296 TFs, which is limited by the HT-SELEX experiments. Another consequence of this limitation is that some latent dimensions are not assigned to any TF even if there is an enrichment in the DNA sequence pattern that they capture.

Map** peaks to TFs

We next ‘assign’ each ATAC-seq peak in the input data to the top 3 TFs for downstream analysis and visualization. Given the M-dimensional latent score vector zi of the ith peak, and the map** F from dimensions to TFs, the assigned TFs are given by: F(arg maxd∈[1⋯M]zid), i.e., the 3 TFs corresponding to the highest latent scores. In general, we find that each peak’s representation is spread across multiple latent dimensions. For example, a 30bp region of a peak from our GM12878 dataset in Fig. 2a contains TFs from two different families: ETS1 and another unmapped TF from latent factor #93.

Fig. 2
figure 2

a TFs and the corresponding ‘latent scores’, assigned to a 30bp region of a random peak from our GM12878 dataset. b Heatmap of the latent space obtained by our model upon doing inference on ≈10,000 SELEX probes from 48 TF experiments. Each row is a SELEX probe, with the probes being colored by their TF. There are 200 probes per SELEX experiment. The green boxes highlight the FOX family of TFs and are showing the corresponding probes’ latent scores in dimension 96 (X96). c, d The top fifty 8-mers from two k-mer distributions learned by BindVAE are shown. These were obtained by sorting the decoder parameters encoding the k-mer distributions, namely \(\theta _{m} \in \mathbb {R}^{112800}\) for the following two topics/ dimensions m=72 and m=7. The 8-mers have been aligned using multiple sequence alignment and the ∗ symbols show the wildcards from our k-mer representation. In d, the 8-mers in the two red boxes correspond to roughly the prefix (top box) and suffix (bottom box) of the motif

Latent factors capturing non-TF related patterns

In addition to TF-specific patterns being learned, some latent dimensions capture information pertaining to low complexity regions, genomic background, or Tn5 transposase cleavage bias. In Additional file 1: Fig. S1 we show how the top k-mers from latent factor #37 encode genomic background regions with GC-rich patterns, and in Additional file 1: Fig. S2 we show peaks with low complexity regions being assigned to the same topic. In Additional file 1: Fig. S3 we show peaks of two types, demonstrating the disentanglement done by BindVAE on peaks with low-complexity regions and TFBS. Further details of these experiments are in Additional file 1: Section 1.

Projecting HT-SELEX probes into the latent space

From the previous sections we can conclude that BindVAE learns diverse and coherent patterns that map to TF binding preferences. In the following sections we explore what this entails for input DNA sequences of various lengths and types.

Just as in the previous section, we project the HT-SELEX probes into the latent space generated by the ATAC-seq peaks and visualize the results. Since HT-SELEX experiments use short probes that are 20bp long and capture in vitro binding affinities, they have very distinct patterns across different TFs. The heatmap in Fig. 2b shows the latent space for 10,000 probes, one probe per row, with the rows grouped by the TF that each probe was enriched for. There are 200 enriched probes (and hence rows) per TF experiment. We show 42 dimensions, out of the 100-dimensional latent space (M=100), chosen based on whether a dimension contains any signal over the 10,000 probes. We see that a block diagonal structure emerges in this matrix, which indicates that several latent dimensions are orthogonal to each other and show distinct binding patterns. Note that each red square block corresponds to 200 probes.

Further, TFs from a family share the same latent space due to the similarity in their DNA binding sites. For instance dimension #67 shows signal for several forkhead box (FOX) proteins: FOXJ2, FOXL1, FOXO4. Similarly, dimension #30 contains DNA binding preferences for T-box TFs TBR1, TBX19, TBX20, TBX21, and EOMES. In addition to T-box proteins, we see that dimension #30 also captures binding preferences for ZSCAN4, which is a C2H2 zinc finger. Dimension #21 encodes bHLH (basic helix-loop-helix) binding preferences for TCF3 and TCF4. In order to further understand the model’s grou** of similar TFs into a single latent factor, we compare their CIS-BP motifs in Additional file 1: Fig. S4. We observe that the TF binding PWMs are very similar for each group and hence binding sites for these TFs get projected into the same region of the latent space by our model.

Top 8-mers learned for NFKB1 and TFAP4

We illustrate the quality of the learned posterior distribution pθ, which is defined by the encoder parameters \(\theta \in \mathbb {R}^{M \times D}\), by sorting the decoder parameters θm for each latent dimension m and showing the top 50 corresponding 8-mers. Figure 2c shows the 8-mers from dimension m=72 which is mapped to NFKB1 by Algorithm 1. For ease of interpretation, we align the 50 8-mers using Clustal Omega [19] and also show the CIS-BP motif corresponding to the mapped TF. NFKB1 is a palindromic motif of length ∼13 and TFAP4 has a motif of length 10. For TFAP4, we see that the top 8-mers learned by our model can be partitioned into two groups based on their distinct patterns, with one group matching the beginning of the motif (pattern TCAGC) and the other matching the end (AGCTGT), as shown in Fig. 2d. In general, we observe that a single latent factor can capture the binding signature of TFs with motif length ≤ 10. TFs with longer dimer motifs are also captured by a single latent factor, for instance NFKB1 in Fig. 2c and VDR in Additional file 1: Fig. S5. When the TF motif is longer than 10, we typically find the binding signature to be split across two latent factors – with one capturing the beginning and the other capturing the end of the motif, as seen for CTCF in Additional file 1: Fig. S5.

Cell-type specific TF patterns

We find cell-type specific differences in the TFs whose binding signals are recovered by BindVAE, consistent with the different TFs expressed in these cell types (see Fig. 3d). There is some evidence that we also find differences in motifs for the same TFs between two cell lines, in cases where these TFs are expressed in both. We show some examples in Additional file 1: Table S2 for the two cell lines GM12878 and A549. We also compare all the k-mer distributions learned for both cell types by visualizing them together in the PCA plot shown in Additional file 1: Fig. S6. We apply PCA to the decoder parameters \(\theta \in \mathbb {R}^{100 \times 112800}\), where each row is one k-mer distribution.

Fig. 3
figure 3

a, b Analysis of cooperative binding signals learned by the model trained on GM12878 peaks by scoring HT-SELEX probes from individual TFs and CAP-SELEX probes from pairs of TFs. The y-axis shows the distribution of latent scores of individual probes for the TF experiments shown on the x-axis. In the brackets below each TF label, we show the p-value of assigning to the latent dimension, that particular TF, using Algorithm 1. Cooperative binding motifs can be enriched while the individual TF motif may not be enriched, as seen for MYBL1, MAX in a and TBX21 in b that have a p-value of 1. cleft: CAP-SELEX based motif for the TF pair binding from Jolma et al. [20] and right: Motif learned by BindVAE for FOXJ3:TBX21 cooperative binding d TFs found in each cell type by our model and their ‘accessibility scores’ obtained by summing over all ATAC-seq peaks showing the possible extent of accessibility in the two cell lines: GM12878 (light blue) and A549 (orange). See text for details. e Venn diagram showing the overlap in the TFs found de novo by the VAE and HOMER in GM12878 and the extent of their overlap with all expressed genes. We show HOMER results run with the set of 296 motifs from Jolma et al. [18] (selex restricted) where HOMER only uses the TFs from our HT-SELEX set of 296 TFs

Cooperative binding signals in GM12878

We find that multiple latent dimensions show k-mer patterns from diverse TF families, possibly indicating the presence of cooperative binding sites in the ATAC-seq peaks. Since the co-occurrence of multiple binding patterns might merely suggest that binding sites of two different TFs are present in a peak and not necessarily binding cooperatively, we use CAP-SELEX data [20] to validate this hypothesis. Jolma et al. [20] developed CAP-SELEX, an in vitro assay for studying interactions between pairs of DNA-bound TFs, using DNA sequences of length 40bp.

We analyze two latent dimensions: #60 in Fig. 3a and #67 in Fig. 3b, which show cooperative binding signals for MYBL1-MAX and FOXJ3-TBX21 respectively. In order to show the presence of cooperative binding patterns, we compare the latent scores of enriched CAP-SELEX probes from pairwise TF experiments to the latent scores of enriched HT-SELEX probes from individual TF experiments. We show the distribution of the latent scores along the y-axis, and the x-axis shows the source TF (or TF pair) experiment for each distribution. We also show the distribution of latent scores for probes from all other TFs, which shows that the pairwise signals are significantly higher than average. This is also indicated by the p-value (shown in brackets below each TF label) assigned by Algorithm 1. For example, the p-value of map** dimension #67 to the TF pair FOXJ3-TBX21 is 2e−62, while that for all ‘other TFs’ is 1.0.

We further analyze the motif captured by dimension #67 for the TF pair FOXJ3-TBX21 in Fig. 3c by constructing a PWM from 15-mers ranked high by this dimension. Since the number of all possible 15-mers is prohibitively large, we sample 10% of them and run inference on these. We show the motif obtained by this process on the right side of Fig. 3c. On the left, we show the motif published by Jolma et al. [20] (using their algorithm) for FOXJ3-TBX21 cooperative binding. We show the motif learned for MYBL1-MAX in Additional file 1: Table S3, where we also show the 13 other TF pair motifs found by BindVAE. Note that combinations of TFs with a successful CAP-SELEX experiment are limited after filtering for quality metrics as described in BindSpace [7]. Composite motifs that differ from the concatenation of individual motifs are found for ≈70 pairs of TFs, for which we can assess enrichment in the latent dimensions. Given the sparsity of CAP-SELEX data, we are unable to present a more comprehensive analysis of cooperative binding.

We further examine the results to understand if there is any intrinsic source of bias in CAP-SELEX probes or motif patterns. All examples of cooperative motifs that we find involve TFs with distinct binding motifs, i.e. the same nucleotide pattern is not repeated for each TF in the pair, which could confound cooperative motif analysis. Next, in Fig. 3a and Additional file 1: Fig. S7 (top) we see that the cooperative binding motif is enriched, while the individual TF motifs are not. Conversely, in Additional file 1: Fig. S7 (bottom) we show an example where the individual motif is enriched, while the cooperative motif is not. This shows that CAP-SELEX probes do not intrinsically achieve higher latent scores on account of either of the individual motifs being present.

Accessibility patterns in different and similar cell types

In Fig. 3d we show the proportion of binding sites that we predict across various TFs in the two cell types in our study. Each bar represents one TF (or latent dimension m) and the height of the bar, which we call the ‘accessibility score’ is obtained by summing the latent scores (or topic proportions) zim for all peaks i: \({\sum _{i \in \textrm {all peaks}}} z_{im}\) for each topic m. We only plot the latent dimensions m that were successfully mapped to TFs using Algorithm 1. The bars are colored by cell type, blue for GM12878 and orange for A549, overlap** areas appearing in grey. The bars are sorted in increasing order with respect to accessibility scores of TFs from GM12878. We see different accessibility patterns in these two cell types, and Gene Ontology (GO) term enrichment using PANTHER of the distinct TFs found for each cell type shows biological processes that are specific to each. For instance, we find ‘myoblast fate commitment’ and ‘calcineurin-NFAT signaling cascade’ being enriched in the TFs found for A549, whereas ‘regulation of CD8-positive, alpha-beta T cell differentiation’ and ‘intracellular steroid hormone receptor signaling pathway’ are enriched for GM12878.

We also illustrate the difference in activation patterns for similar cell types by training separate models on samples coming from the same cell type, but different donors: one male donor and one female donor for naive B cells. We plot the accessibility scores for TFs found by both models (Additional file 1: Fig. S8) and find that for similar cell types or replicates, there is overlap in the TF patterns.

Since BindVAE is a probabilistic model that optimizes the likelihood of the data, there is some variance/uncertainty across experimental runs. This is unlike autoencoders since the latent vector here is sampled from the Dirichlet posterior distribution. In addition, there is the variance from the randomized initialization of model parameters. We quantify this uncertainty by training models 5 times using identical hyperparameters, on data from only the female donor sample of naive B cells. We plot the accessibility scores for the union of all TFs found, along with the standard deviation across the 5 repeat runs. While there is variance in the exact TFs found across the repeat runs, if we visualize the k-mer distributions for two of the replicate experiments in Additional file 1: Fig. S6 (top right), we see that they are quite similar. Compare this to the k-mer distributions learned for two distinct cell types in Additional file 1: Fig. S6 (bottom left) or for the same cell-type but samples coming from different donors in Additional file 1: Fig. S6 (top left).

Comparison with de novo motif discovery methods

We compare BindVAE with three other motif discovery approaches: HOMER [21], MEME [15], and GADEM [22]. The last two only produce motifs and are hence run in conjunction with a motif-matching approach, Tomtom [23] to find the matching TFs. We find that BindVAE has a higher precision but often worse recall and F1 score (Additional file 1: Table S4a, S4b), while BindVAE+Tomtom, where we replace Algorithm 1 with Tomtom, has a higher recall and F1. We find that Algorithm 1 trades off precision for lower recall (see details in Additional file 1: Section 2, where we also compare run-times of the approaches). We find that BindVAE is one of the faster methods, particularly since the prior motif discovery approaches do not scale well to large-scale NGS datasets and the efficiency is dependent on the e-value/p-value threshold used. Here, we briefly present our observations with HOMER on the GM12878 dataset.

HOMER [21] is an unsupervised motif discovery algorithm that uses differential enrichment to find motifs. It compares peaks to background DNA sequences and tries to identify patterns with mismatches and gaps that are specifically enriched in the peaks relative to the background. We compare our unsupervised VAE-based method to HOMER on de novo motif detection in GM12878. Figure 3e shows the overlap in the TFs found by both approaches and how many of these are expressed in GM12878. Since the TFs detected by BindVAE are restricted by our post-processing algorithm that relies on HT-SELEX data, we present the HOMER results in a similar configuration for a fair comparison: ‘SELEX restricted’, where we use the HT-SELEX PWMs published by Jolma et al. [18] for the 296 TFs that we consider.

We find that HOMER finds a total of 99 motifs that are mapped to known TFs within our HT-SELEX set of 296 TFs. BindVAE finds 122 motifs from latent dimensions mapped by Algorithm 1. Looking at expressed TFs, 60/99 (≈60%) found by HOMER and 73/122 (≈60%) found by BindVAE are expressed. Therefore HOMER finds fewer TFs overall than BindVAE, and a similar proportion are expressed in GM12878 for both methods. We provide a table of the learned and expressed TFs in the Additional file 1: Table S5. We also present a detailed comparison with HOMER for varying p-value and e-value cut-offs in Additional file 1: Table S4b.

Projecting ChIP-seq peaks into the latent space

In Fig. 4a,b, we show a two-dimensional projection using Uniform Manifold Approximation and Projection (UMAP) [24] of the learned representations for ChIP-seq peaks from GM12878 for three transcription factors with distinct binding preferences. There are 1000 peaks from each ChIP-seq experiment, and each peak is colored by the TF. Figure 4a shows proteins from different families: CTCF, a C2H2 zinc finger, MAFK a bZIP TF, and ELK1 that contains an ETS domain. Here we see that our model learns distinct embeddings for TFs that have different binding preferences. However, embeddings of CTCF peaks are more distributed due to the ability of its zinc finger domains to bind to heterogenous DNA sequences. Figure 4b shows FOS a bZIP TF, IRF4 an IRF family transcription factor and SPI1 that contains an ETS domain. We find that embeddings of peaks from FOS are distinctly clustered while those from IRF4 and SPI1 largely overlap despite a low overlap in the genomic regions of the peaks (34 out of 1000 peaks overlap). This is possibly because IRF4 binds DNA weakly but cooperative binding with factors such as SPIB in B cells increases binding affinity [25]. We notice a similar pattern of overlap with BATF and JUND which are both bZIP TFs that form heterodimers while binding DNA [26].

Fig. 4
figure 4

a UMAP projection of the latent representation for ChIP-seq peaks from CTCF, ELK1 and MAFKb UMAP projection of the latent representation for ChIP-seq peaks from FOS, IRF4 and SPI1. For each TF, 1000 peaks are shown. c Heatmap showing 38,496 ATAC-seq peaks from GM12878 cell line, that overlap with ChIP-seq peaks from 11 different TFs. For a TF t, the peaks have been sorted by the latent score of the dimension that was mapped to t by Algorithm 1 d CIS-BP motifs for TFs shown in a and b show distinct motifs for all TFs. e Matrix of GM12878 ATAC-seq peaks ordered as per the matrix ranked by latent factor scores (topic scores) from Figure 4c. Rows are peaks and columns are TFs. Each entry i,j in the i-th row and j-th column of the matrix is the i-th peak’s ChIP-seq annotation for TF j

To analyze whether our model learns meaningful representations, we use ChIP-seq as a source of ground truth and verify whether known binding sites for a given TF are transformed to the same latent dimension by our model. We look at the intersection of the TFs in our HT-SELEX set (296 TFs) and those for which we have reliable ChIP-seq data, which gives us 11 TFs. For each of these TFs, we find all ATAC-seq peaks that have an overlap of at least 50bp with any ChIP-seq peak and plot the latent dimension that was assigned to that TF. In Table 1 we show the number of overlap** peaks between the ChIP-seq experiment and our GM12878 dataset. We show the resultant matrix for these 11 TFs by depicting it as a heatmap in Fig. 4c. The heatmap shows a centered log ratio (CLR) transform of the latent representations, with rows representing peaks and columns representing TFs. Our approach gives us a total of ≈38,496 peaks across the 11 TFs which we sort by their membership, with peaks that belong to RUNX3 being shown at the top of the heatmap as it has the largest number of mapped peaks.

Table 1 Statistics of ChIP-seq data used to plot Figure 4c showing the number of overlap** peaks

We show the TF ChIP-seq experiment connected with each row of Fig. 4c, in a second heatmap in Additional file 1: Fig. S14, that indicates the ChIP-seq ground truth label assigned to the ATAC-seq peak via overlap (i.e. a minimum of 50bp overlap between a ChIP-seq and ATAC-seq peak to assign the TF label) in the same order as Fig. 4c. We see that some of the co-binding relationships found by BindVAE: such as NRF1 and MAX are explained by the overlap in ChIP-seq peaks.

Discussion

Supervised deep learning methods for the prediction of TF occupancy data and chromatin accessibility are numerous, ranging from early deep convolutional neural network based models such as DeepSEA [10] and Basset [9] to more recent approaches usually mirroring advances in deep learning methods for natural language processing, such as the LSTM-based DanQ [27], Basenji using dilated CNNs [11], DeepSite [28], and DNA-BERT [29]. Naturally, it is possible to train discriminative deep learning models on sequence data to predict chromatin accessibility either as a binary label (open/closed) or in the regression setting, as many authors have done (e.g. the Basset and Basenji models [9, 11]). While these models have produced highly accurate predictions of TF occupancy, interpreting supervised models requires attribution of the learned parameters to the output labels—which is often not a robust process as many works in the broader deep learning literature have shown. For example, in medical image analysis, image features that are irrelevant for clinical analysis of the specimen can be used by the model to improve the prediction accuracy based objective function. In regulatory genomics, methods to interpret the sequence information captured by deep discriminative sequence models (e.g. DeepLIFT [30]) require detailed feature attribution over a large input window (such as 500bp to 1Mbp), and in general these methods do not generalize across cell types. Other approaches to improve the interpretability of these models (e.g. use of a Fourier prior in training [31]) are still in their infancy. Interpretation of supervised deep sequence models has been more successful where models are trained on high-resolution TF occupancy data [12], since the underlying motif grammar is less complex than that for chromatin accessibility data.

Motif-matching methods such as FIMO [32] have been popular in biological studies due to the wide range of TFs, ease of application, and inherent unsupervised nature where the available PWMs can be applied on any DNA sequence. However, these approaches may return hundreds of motif hits for genomic regions of the size used in ATAC-seq analysis, i.e. 100bp to 200bp regions under peak summits.

Given the limitations of supervised models for interpreting chromatin accessibility, we explored an unsupervised deep learning model that learns binding patterns given open chromatin regions derived from ATAC-seq and is thus complementary to existing work. There are currently ≈1500 DNase-seq and ATAC-seq datasets spanning hundreds of cell and tissue types on the ENCODE portal, whereas datasets on TF binding experiments such as ChIP-seq are restricted to a few TFs per cell type, with the exception of a few highly profiled ENCODE cell lines. Hence, unsupervised or semi-supervised approaches may be desirable for decoding the TF binding landscape on less studied cell types.

Generative models, such as VAEs, are also more interpretable by design. For example, BindVAE seeks to capture individual TF binding signals as latent dimensions in the model, as well as disentangling other sequence signals like low-complexity k-mer signatures. More broadly, generative models learn universal representations of the data that can then be used in downstream task-specific applications. We propose that such methods are the next step in the evolution of machine learning modeling on genomics data since supervised models are typically not transferable to other prediction tasks. Further, unlike recent language-based deep learning models, our VAE based model is easily scalable for training on data from several cell types and can also be run in a multi-processor computational environment without GPU support, albeit at lower efficiency.

The input to BindVAE uses 8-mers with wildcards, which allows us to interpret the learned latent factors. There is a rich literature on the use of k-mers for representing DNA and protein sequences, from early works using oligonucleotide frequencies [33] and k-mer based string kernels using support vector machines [4, 34]. We rely on the results from these prior works that show the robustness of this representation.

While we lose the exact positional information due to our choice of bag-of-words as input, we do not lose any context information as the surrounding DNA sequences are still part of the input representation; for this broader sequence context, the exact position of a specific k-mer in the 200bp peak may not be important for a successful encoding. Secondly, the bag-of-words representation actually gives us flexibility – the input DNA sequence can be of arbitrary size, and we can apply the same model on inputs of varying size from diverse in vivo and in vitro data. This is not true for a one-hot-encoding based input representation used by other deep learning models, which requires the DNA sequence inputs to be of a fixed constant size. We find that there is a trade-off between the complexity of the input representation and the interpretability of the resulting model. Our choice of a bag-of-words input makes the model readily interpretable by giving us an easy way to map components of the latent space to distributions over k-mers. Lastly, it would be straightforward to add positional information in our model the way transformers do, by adding position encodings.

Our model also does not use the real-valued peak accessibility as an input and therefore does not capture the relative frequency of binding patterns that are in less accessible regions vs. more accessible regions. However, we note that many computational methods for finding TF binding signals in bulk ATAC-seq make the same simplifying assumption of treating reproducible peaks as positive examples—without retaining accessibility values—and perhaps using inaccessible regions or shuffled genomic sequences as a negative background. For example, virtually all PWM-based motif finders (HOMER, MEME) are blind to real-valued accessibility information, and even the widely-used discriminative model gkm-SVM trains with binary rather than real-valued labels. That said, one straightforward extension to include accessibility information in BindVAE’s k-mer count based input would be to scale the counts based on accessibility of the region that each k-mer comes from.

We found that our VAE based model can learn distinct binding patterns from ATAC-seq peaks without any TF labels. Of the 102 distinct patterns learned over the latent dimensions, we found specific patterns for some TFs and were able to map the latent factors to unique TFs. In contrast, for others, we found a coarser pattern that corresponds to one of several TFs from a family, such as T-box proteins. Paralogous TFs are difficult to learn as separate factors due to the highly similar patterns in their binding sites, which cannot be captured uniquely by a distribution over 8-mers with wildcards in an unsupervised fashion. Our model also learns combinations of patterns for TFs that co-occur within peaks and that are involved in cooperative binding, and analyzing these patterns produced composite motifs. Using higher-order k-mers in our model can improve the coverage over longer motifs. However, this would increase the input dimension of BindVAE significantly and thereby increase computation overhead substantially.

Methods

We begin this section by introducing some terminology and notation used throughout the paper. We use z to refer to the latent variable, x to refer to the input variable, \(\vec {z}\) or \(\vec {z}_{i}\) to refer to an instantiation of the latent variable, i.e. to a latent vector corresponding to an input \(\vec {x}_{i}\), and \(\vec {z}_{ik}\) to refer to a component of the latent vector. M is the size of the latent space or the number of latent dimensions, i.e. the size of the bottleneck layer of the VAE. D is the size of the input and – since our model is an autoencoder – also the size of the output space.

Variational autoencoders

Variational autoencoders (VAEs) [35, 36] are latent variable models that combine ideas from approximate Bayesian inference (variational inference) and deep neural networks, resulting in a framework that can use backpropagation-based training.

Let x represent the data and z be the latent variable. VAEs express the joint distribution p(x,z)=p(z)p(x|z) where p(z) is a prior distribution over z, i.e zp(z), and pθ(x|z) is the likelihood function. In the context of neural networks, pθ(x|z) is the probabilistic decoder that generates data x given latent variables z, with the goal of reproducing \(\hat {\mathbf {x}}\) that is close to x. Since estimating the true posterior distribution pθ(z|x) is often intractable, an approximate posterior distribution (also known as the variational distribution) qϕ(z|x) is used, which is formulated by the probabilistic encoder in the neural network model. The encoder outputs zqϕ(z|x)=qϕ(z|η) where η=MLP(x) is computed from the observation x by a multi-layer perceptron (MLP). Figure 1a shows a neural network depiction of this model.

Loss function: VAEs optimize the parameters ϕ and θ of the encoder and decoder jointly by maximizing the evidence lower bound (ELBO) using stochastic gradient descent. The ELBO is the variational lower bound on the marginal log-likelihood of the data logpθ(x) and is given by:

$$\begin{array}{*{20}l} \log p_{\theta}(\mathbf{x}) & = { \sum\limits_{i=1}^{N} \log p_{\theta}(x_{i})} \\ & \geq \sum\limits_{i=1}^{N} \mathbb{E}_{q_{\phi}(z_{i}|x_{i})} [\log p_{\theta}(x_{i}|z_{i})] - KL(q_{\phi}(z_{i}|x_{i}) \| p(z_{i})) \end{array} $$
(1)

The first term is the reconstruction loss, i.e. the error in reconstructing x using z, and the second term is the Kullback-Leibler (KL) divergence between the posterior distribution qϕ(z|x) and the prior distribution p(z). The likelihood term logpθ(xi|zi) tries to maximize the probability of reconstructing the input xi from zi and is formulated as a multinomial distribution, with the decoder parameters θ containing the probability vectors for each component of the distribution. The reconstruction term tries to improve the quality of the reconstruction without regard to the properties of the latent space, while the KL term acts like a regularizer and constrains the latent representations within the space imposed by the prior distribution.

Since our goal is to incorporate k-mer distributions in the model, we use a Dirichlet distribution as a prior on the latent variables instead of the more prevalent normal distribution used in Gaussian VAEs, which are difficult to interpret as the bottleneck layer z and can take arbitrary values. On the other hand, a Dirichlet distribution will only allow non-negative latent variables; therefore the value taken by each latent dimension m given a particular x can be considered a ‘membership’, with larger values indicating stronger membership.

However, VAEs with a Dirichlet prior cannot be trained using the explicit reparameterization trick [

Datasets

ATAC-seq data from cell lines

We downloaded the publicly available GM12878 ATAC-seq dataset from the GEO database (accession GSE47753Footnote 1). We took only samples generated using 50,000 cells. Since replicate 1 has much higher sequencing depth than the other replicates, we combined replicates 2, 3, and 4 to obtain a second replicate (renamed replicate 2). We followed the ENCODE ATAC-seq processing pipeline (https://www.encodeproject.org/atac-seq/) for pre-processing. Raw fastq files were adapter-trimmed using Trimmomatic and aligned to hg19 genome using Bowtie2 with default settings. PCR duplicates were then removed using Picard MarkDuplicates and Tn5 shifts are adjusted for. Peak calling was performed for each replicate using macs2 with parameters: -nomodel -shift -37 -extsize 73. Finally, IDR was performed with the idr package and reproducible peaks were called with an IDR cutoff of 0.05. We identified a total of 76,218 reproducible peaks in the GM12878 ATAC-seq dataset. We downloaded the publicly available A549 ATAC-seq dataset from the ENCODE portal (accession ENCFF548PSNFootnote 2). The details of the three other datasets used are in Additional file 1: Table S1.

HT-SELEX

HT-SELEX (High-Throughput Systematic Evolution of Ligands by EXponential enrichment) is an in vitro experimental protocol that involves an iterative procedure that starts with an initial library of random oligonucleotides (oligos) of fixed length of either 20 or 40bp. Since this binding happens outside a cellular environment, it represents a TF’s intrinsic DNA-binding preferences. At every iteration of the procedure, the input pool of oligos compete to bind to the TF. Oligos that do not bind at all or bind weakly are washed out from the pool while the rest are amplified using PCR. A sample of the amplified pool is sequenced to allow for computational analysis while the rest of the pool is used as input for the subsequent selection round. In this way, at the end of each round there are more high-affinity oligos in the pool than before, while non-binders and weaker binders are gradually eliminated.

We used the filtered HT-SELEX probes from Yuan et al. [7] for training our models. Briefly, this dataset contains HT-SELEX data sequenced in Jolma et al. [18] (ENA accession ERP001824) and in Yang et al. [39] (ENA accession ERP016411), which together constitute 547 experiments for 461 human or mouse TFs. The experiments that were filtered out were those showing: 1) poor consistency of 8-mer enrichment in consecutive HT-SELEX cycles, 2) low number of enriched probes, or 3) low diversity of probe enrichment. For each remaining experiment, the top 2,000 enriched 20-bp probes were selected per experiment. The filtered dataset contains 325 high quality experiments covering 296 TFs.

ChIP-seq data: GM12878

Conservative and optimal IDR (irreproducible discovery rate) thresholded ‘narrowpeak’ files of ENCODE ChIP–seq data were downloaded from the ENCODE portalFootnote 3 for GM12878. We only downloaded ChIP-seq datasets which had a green ‘Audit category’ at the time of writing. We excluded experiments with ‘Audit category’ = orange or red as these have insufficient read length, insufficient read depth, poor library complexity, or partially characterized antibody. This gave us 82 experiments.

CAP-SELEX

Consecutive affinity-purification systematic evolution of ligands by exponential enrichment (CAP-SELEX) [20] is an approach to identify TF pairs that bind cooperatively to DNA. It is again an in vitro assay based on a consecutive affinity-purification protocol coupled with enrichment of bound ligands. For a given pair of TFs, say TF1 and TF2, we downloaded the probes from cycle 4 and selected candidate probes for cooperative binding by picking frequent probes where the PWM model for the pair TF1 and TF2 (see the supplementary material from Jolma et al. [20]) was found. We used the MAST algorithm [40] for motif matching with a high e-value cut-off of 10, due to the relatively short length of 40bp of the probes. We selected the top 1000 probes (or fewer, as found) and ran inference on them to obtain their latent representations. Further details about this dataset can be found in the Additional file 1.

RNA-seq expression data

For GM12878, we downloaded gene expression data from the ENCODE portal with accession numbers: ENCFF906LSJ and ENCFF630BDD. We consider a gene to be expressed if it has an average expression of 0.05 RPKM or higher over the two datasets. For A549 we downloaded ENCFF203NNS, and for the T cell female donor sample we downloaded ENCSR336VTKFootnote 4; we used a RPKM/FPKM cut-off of 0.05 to decide whether a gene is expressed or not.

HOMER motif analysis

We first processed the set of HT-SELEX PWMs from Jolma et al. into a database format that is used as input. HOMER was then run in the de novo motif discovery mode by invoking the perl script findMotifsGenome.pl in the following manner:

./findMotifsGenome.pl peak-file.bed hg19/hg38 output-dir -S 1000 -p 10 -size given -len 6,8,10,12 -noknown -mset jolma -e 0.1