Background

Next-generation sequencing technologies and tiling microarrays are frequently employed for genome-wide identification of regulatory elements and chromatin modifications. These applications generate vast numbers of experimental data points, which are compiled into extensive sets of genomic loci representing the units of biological activity measured in the particular assay. Researchers must then discern functionally-relevant results from these large-scale datasets, a process that poses significant bioinformatic challenges for research groups with limited computational support. For example, a common aim of transcription factor location analysis is to determine the relationship between ChIP-enriched loci and annotated genes; identifying the cis-regulatory elements occupied by the factor can reveal the set of genes it is likely to regulate across the genome. Correlating global transcription factor binding-site occupancy with target genes quickly becomes intractable in the absence of software tools to automate aspects of large-scale data analysis.

Sequence patterns occurring repeatedly among enriched loci are indicative of regulatory elements such as transcription factor-binding sites, and can often be identified by DNA motif analysis. Successful motif discovery relies on a set of candidate loci that exclude extraneous sequences while still containing the binding site consensus; however, since many peak-finding utilities merge overlap** areas of enrichment, the resulting peaks tend to be much larger than the actual binding sites. Peak regions often comprise more than one functional element (e.g. co-located transcription factor-binding sites or chromatin modifications), and these must be distinguished into individual loci in order to accurately interpret experimental results. The ability to subdivide composite peak regions into a finer-resolution set of individual binding sites (subpeaks) can improve the accuracy of sequence motif analysis.

Here we describe PeakAnalyzer, a set of standalone tools for the automated post-processing of large-scale chromatin profiling data. The programs are able to identify discrete enrichment peaks from loci corresponding to transcription factor binding or chromatin modification, retrieve individual peak sequences and annotate experimental data against various classes of functional elements, such as genes, CpG islands, regulatory features or DNase I hypersensitive sites. Results can also be compared across multiple datasets to report overlap** features, as well as those unique to a given experimental sample. The software is freely available and flexible in implementation, providing both high-performance solutions for pipeline integration and a GUI version for desktop users.

Implementation

Program description

PeakAnalyzer comprises two main utilities: PeakSplitter and PeakAnnotator. PeakSplitter accurately subdivides experimentally-derived peak regions containing more than one site of signal enrichment, optionally retrieving genomic DNA sequences corresponding to subpeak summit regions. This procedure facilitates more detailed analysis of individual subpeaks (Figure 1). PeakAnnotator scans the target genome to identify and report functional elements proximal to peak loci and contains three main subroutines: Nearest Downstream Gene (NDG), Transcription Start Site (TSS) and Overlap Data Sets (ODS).

Figure 1
figure 1

Experimentally-derived peak regions subdivided into individual sites of signal enrichment. A) Region of Oct4 binding in the Abca4 gene locus, subdivided into six individual loci by the PeakSplitter utility. The canonical binding sequences of Oct4, Nanog and Sox2 are concentrated in the center peaks, while potentially novel motifs or sites of associated co-factor binding may be present in the adjacent three. Peak splitting is equally applicable to the broader regions of enrichment generated from chromatin modification experiments, where signal peaks are more variable in size and shape. Depicted is a region of H3K4me3 methylation on chromosome 16 [29], subdivided into four discrete modification sites.

The function NDG locates the nearest downstream genes on both strands and calculates their distances. If the peak region intersects a gene, the program determines if the overlap is within an exon, intron, 5' UTR or 3' UTR. Multiple transcripts or genes overlap** a given location are all reported, providing a means to identify putative bi-directional promoters where the peak is proximal to genes on both strands. TSS locates the nearest transcriptional start site relative to each locus, scanning both downstream or upstream of the experimental peak to account for transcription initiation on either the sense or antisense strand. The ODS function calculates the overlap in positions/peaks between datasets, where peak loci intersecting by at least one nucleotide on either strand are reported. To compute a P-value of overlap enrichment, a random dataset is generated having peak lengths and chromosomal distribution matching the experimental dataset; the overlap between experimental and artificial loci is then determined, and through successive iterations a P-value representing the statistical significance of experimental signal over random is calculated.

Software distribution and input requirements

PeakAnalyzer is implemented as a unified Java program encompassing the software components described above. Equivalent versions of PeakSplitter and PeakAnnotator are also implemented in C++ and Java so that users can choose a distribution suited to their particular requirements. Core facilities processing numerous datasets have the option to incorporate the faster C++ version into a Unix pipeline, whereas the Java implementations can either be run as separate command-line utilities or as a single cross-platform desktop application using an intuitive graphical interface.

PeakAnalyzer requires only a single peak file and a feature annotation file in BED or GTF format; complete annotation files for the current builds of the human (HG19) and mouse (MM9) genomes are provided with the software distribution. The input files required by PeakSplitter are those commonly generated by peak-finding programs: a .bed-formatted peak file containing chromosome start and end locations of signal enrichment loci, and a .wig signal file describing the size and shape of each peak.

Algorithm implementation

PeakSplitter

We adopted the peak-splitting approach proposed by Fejes et al. [1] and implemented as the function subpeaks in recent versions of their FindPeaks tool. The method identifies multiple peaks within a given locus and accurately subdivides those containing more than one site of signal enrichment. In addition to incorporating the algorithm into PeakAnalyzer we provide a standalone version as the PeakSplitter utility, thereby enabling its application to signal loci called by any such program (e.g., [29]). Local maxima are identified in the peak region by scanning for relative peak heights, where those of adjacent maxima are compared and the lowest value is multiplied by a user-adjustable parameter to arrive at the read depth required for subpeak division. Binding sites are most likely to appear at or near subpeak summit regions, and these sequences can be retrieved directly from the Ensembl database [10].

PeakAnnotator

The PeakAnnotator component scans the target genome to identify and report functional elements proximal to peak loci. Rather than comparing each peak with all possible features, PeakAnnotator uses a combination of binary search and a modified version of the nested containment list (NCList) algorithm (see below and [11]) to rapidly identify proximal features among the full set of annotated elements. Proof of correctness of the algorithms described below and a discussion of their runtime complexity can be found in Additional file 1.

Generating a containment list

Determining the set of intersecting genomic regions across multiple experiments and data sources is not straightforward, because for a given dataset the regions queried may not be contiguous and some regions may be embedded within others. Thus, when sorting the regions by start position, the corresponding end positions could be out of sequence. This is more likely to be the case in higher eukaryotes where some loci encode overlap** genes.

The NCList algorithm constitutes a solution to this problem [11]. In this method the set of genomic regions is partitioned into a primary category of positionally-independent loci, and all remaining loci are segregated into a second category. We adopted this approach in our algorithm, where for each gene in the list PeakAnnotator creates a sublist of all genes containing it. A pseudocode description of the process is listed in Additional file 1, Figure S1.

Finding proximal downstream genes

The NDG utility determines the most proximal non-overlap** downstream genes on both strands. If a gene intersects a signal peak it will be stored in a separate list of overlap** genes. For simplicity, we define here a gene that is transcribed from the forward strand pos_gene, and a gene transcribed from the reverse strand neg_gene. The algorithm works as follows: the first non-overlap** gene located 3' to the peak, G 3' , is found using a binary search strategy such that G 3'-start > Peak end . If G 3' is a pos_gene, it is the closest downsteam gene on the forward strand; if not, genes located downstream to G 3' are visited until a pos_gene is found.

Next, the first gene located upstream to G 3' that does not overlap with the current experimental peak is found, termed G 5' . If G 5' is a neg_gene it has the potential to be the closest downstream gene on the reverse strand. However, if G 5' is contained within another gene transcribed from the reverse strand, this gene is potentially closer to or even intersecting the current peak. Hence, the next step is to determine the closest neg_gene and overlap** genes in the set of G 5' and the gene(s) containing G 5' . If G 5' is a pos_gene, genes located upstream are visited until a neg_gene is found. Finally, the closest downstream neg_gene is searched within the set of that gene and those containing it.

Finding proximal transcription start sites

The TSS function works as follows: the first gene located downstream to the peak's central position, G 3' , is found using a binary search strategy, and its distance to the current peak is calculated. Genes located downsteam to G 3' are visited until a gene that starts downstream of the G 3' locus is found. The gene having the lowest distance from the signal peak is then marked as the closest downstream gene. Next, the first gene upstream to G 3' , termed G 5' , whose end position < G 5'-start (i.e., G 5' = G3'-1 ) is found. Its distance, and the distance of all genes that contain it, is calculated in order to find the nearest upstream gene. The one representing the minimal absolute distance to the peak among the set of proximal downstream and upstream genes will be reported.

Finding overlap** data sets

The ODS function operates on two sets of peaks, denoted here S 1 and S 2 , and iterates over all loci in S 1 to find those intersecting by at least one nucleotide with loci in S 2 . For each locus Ln in S 1 , the first non-overlap** peak from S 2 located 3' to L1, termed L2 3' , is found using a binary search strategy such that L2 3'-start > L1 end . The algorithm then searches upstream of L2 3' to determine if any peak intersects L1, until the first locus in S 2 , termed L2 5' , is found having coordinates outside the boundaries of L1. Peaks containing L2 5' can potentially overlap L1, and are also considered.

Results and Discussion

To illustrate typical applications of PeakAnalyzer, we analyzed the genome-wide binding profiles of a series of transcriptional regulators (Ctcf, E2f1, Esrrb, Klf4, c-Myc, n-Myc, Nanog, Oct4, Stat3, Smad1, Sox2, Suz12, Tcfcp2l1 and Zfx) in mouse embryonic stem (ES) cells, determined using the ChIP-seq method [12]. We obtained the primary data from the NCBI GEO database (series GSE11431), mapped the sequencing reads to the mouse genome assembly using the Bowtie alignment program [13], and detected significant peaks of signal enrichment with MACS [2]. Subsequent analyses were performed on the set of chromatin-binding regions from each of these re-processed ChIP-seq datasets.

Identification and subdivision of signal peaks

In characterizing the binding patterns of each transcription factor, we first used the PeakSplitter utility to partition regions of signal enrichment into individual binding loci. The numbers of putative binding sites resolved for each factor before and after processing are summarized in Table 1. As illustrated in Figure 2, the number of original signal peaks roughly correlates with the number of subpeaks found by PeakSplitter. For some transcription factor proteins (Ctcf, Stat3, Nanog, Oct4 and Sox2), the total number of subpeaks is close to the original number identified; this suggests the presence of either a single regulatory element bound at each locus, or a small cluster of binding sites such that the combined distribution of peak regions is too uniform to be accurately partitioned. However, the binding profiles of Etf1 and Esrrb produced large numbers of additional subpeaks, where more than twice the original number of Etf1 binding sites were identified.

Figure 2
figure 2

Correlation between the total number of peaks and subpeaks for 13 transcription factors and the Polycomb group protein Suz12. PeakSplitter was used to partition the regions of signal enrichment of each factor into individual binding loci. Significant agreement is observed between the numbers of putative binding sites resolved before and after processing (Pearson correlation = 0.93). Peak profiles corresponding to Oct4, Sox2 and Nanog are co-localized on the plot, indicating similar patterns of chromatin association across the genome.

Table 1 Characteristics of transcription factor-bound peaks and subpeaks

A logical assumption when interpreting ChIP-seq data is that wider areas of signal enrichment may contain greater numbers of individual binding sites than narrow peak regions. To test this idea, we plotted the lengths of the original peaks resolved for each transcription factor relative to the numbers of subpeaks identified in each case by PeakSplitter (Figure 3). Broader peak areas were indeed subdivided into greater numbers of subpeaks, indicating the presence of composite binding loci. However, individual factors were found to exhibit varying length profiles within peak groups that were partitioned into the same numbers of subpeaks. For example, for a given number of subpeaks produced by PeakSplitter, Etf1 binding sites appear to be considerably longer than those of Zfx. This would indicate that the distance between co-localized DNA binding sequences specific to each transcription factors is different, an observation that may be related to the size of each transcription factor protein complex when co-factors are bound.

Figure 3
figure 3

Length of signal enrichment peaks relative to subpeaks derived from the same loci. Five representative transcription factors are shown with ChIP-seq binding regions ranging from 9,000 (c-Myc) to 82,000 (Esrrb). These were subdivided into maximally 10 subpeaks per locus, and the lengths of the original regions plotted against the number of subpeaks identified.

Genome-wide annotation of transcription factor binding sites

Binding sites identified from ChIP-based experiments are usually categorized relative to genomic features, such as the frequency of binding to promoters, enhancers, gene structures or unannotated intergenic regions. Of primary interest in determining transcription factor targets is the location of binding sites relative to known transcriptional start sites. The relationship between promoter occupancy and differential gene expression can often identify genes directly regulated by a factor, but can also provide insight into the mechanisms by which it mediates transcriptional activation or repression. For example, factors that bind close to transcriptional start sites have been proposed to promote gene expression by stabilizing the association of general transcription factors at the core promoter elements; factors that bind to distal regions, either upstream or downstream of a gene locus, may regulate transcription by mediating, through a chromatin loo** mechanism, the protein-protein contacts between distal complexes and the general transcriptional machinery bound at the promoter.

Here we used PeakAnalyzer to assign the genome-wide binding sites resolved for each of the 13 transcription factors to target genes, and profiled these interactions based on the distance between binding sites and gene loci. In [12], binding sites were assigned to target genes based on 17,762 annotated mouse promoters [6, 24] and SWEMBL [25]. Default parameters were used in each case, with the exception of FindPeaks where a height threshold of 5 was applied to the output. The number of peaks reported by each peak caller is presented in Table 2, along with the peak length distribution. All peak callers except SISSRs report peak regions with median lengths between 261 (USeq) and 1189 (FindPeaks).

Table 2 Numbers of peaks and length distributions reported by various peak-calling programs

We then applied PeakSplitter to subdivide the peak regions called by each program, and compared the number of subpeaks reported both with and without filtering based on minimum read depth. Such filtering is generally necessary to exclude spurious peaks in regions where sparse read map** contributes to low-level background signal. The numbers of resulting subpeaks and their length distributions are listed in Table 3. The relative numbers of peaks differ significantly when the unprocessed .wig signal was used as input. Interestingly though, the peak length distributions are nearly identical across different methods.

Table 3 Numbers and length distributions of subpeaks

We next examined the agreement between the output of each method by comparing the overlap between the reported peaks and subpeaks. A non-redundant list of peak loci was created by merging overlap** regions output by each program; the resulting numbers reflect how many called a peak within each site. The intersection is represented in Figure 8. FindPeaks and SWEMBL reported the highest numbers of peaks not supported by other methods, whereas USeq called the lowest number of peaks overall and is excluded from the figure for clarity. The relative overlap between the remaining five methods is similar when considering either the original peaks (Figure 8A) or subpeaks (8B).

Figure 8
figure 8

Comparison of peak-calling methods. Overlap of genomic regions identified by five different peak-calling algorithms, correlating the peaks (A) and subpeaks (B) generated from each on the Oct4 ChIP-seq dataset. The numbers of peaks/subpeaks shared between methods is similar, although FindPeaks and SWEMBL generated the highest numbers of unique calls.

Finally, we used these results to determine whether peak subdivision enhances motif discovery. The merged peak and subpeak datasets were divided into groups of 300 sequences and used as input to MEME. Since individual peak summit information is lost when regions called by different programs are merged, we used the entire peak sequences for motif analysis rather than regions flanking the summit. Following this analysis the canonical Oct4 binding sequence was not identified in any of the datasets containing the original peaks. After PeakSplitter was applied the motif was found in all of the subpeak datasets, aside from one instance where an Oct4 half-site was reported. These results indicate that subdividing signal peaks is essential for accurate motif discovery, independent of the original peak-calling method used.

Conclusions

Regulatory elements identified through functional genomic assays are commonly determined based on signal peaks from tiling array fluorescence data or aligned reads from massively parallel sequencing. In order to interpret the results of such experiments, they must be considered in context with genes and regulatory elements in proximity to peak regions. Methods to automate the functional annotation of chromatin binding and modification loci can greatly ease characterization of their biological significance in genome-wide analyses.

A variety of tools are available for processing the primary data generated by ChIP-seq experiments, such as map** sequence reads to a reference genome and identifying areas of significant enrichment. However, this is not the case for downstream analysis and data integration. Existing solutions that address these issues either rely on the transfer of large datasets via the Web for remote processing [26], require local installation of target genome databases [27], or operate within a specific computing environment [28].

PeakAnalyzer is a standalone solution amenable to a wide range of applications, including comparison of data generated on different experimental platforms. The software can accept any genomic loci as input and therefore can be used to process datasets spanning various methods, such as ChIP-seq, ChIP-chip, DamID, MeDIP and bisulfite sequencing. The PeakAnnotator component facilitates the automated annotation of numerous experimental results, and obviates the need to import large datasets into a genome browser for manual visualization and assessment.

Subdividing genomic loci with PeakSplitter is particularly useful for discerning individual binding sites that may be present in aggregate peak regions, and in extracting candidate sequences for motif analysis. We observe an increase in both accuracy and efficiency in motif search when ChIP data are processed by PeakSplitter. Partitioning broad signal peaks into discrete loci enriches the dataset for sequences containing transcription factor-binding sites and other regulatory elements, and can enhance the discovery of new consensus motifs by providing a more focused set of candidate sequences for alignment and/or model building.

Availability and requirements