Background

Studies estimate that up to 85% of the human genome is pervasively transcribed by RNA polymerase II (Pol II), resulting in a plethora of RNA products [1,2,3,4]. Many of these transcripts belong to well-established categories, such as messenger RNAs (mRNAs) which are characterized by the presence of 5′ cap, coding sequence (CDS), and poly(A) tail. Other transcripts are categorized as long non-coding RNAs (lncRNAs), generally defined as RNA molecules longer than 200 nt with little coding potential. Currently, lncRNAs are divided into three major groups depending on their genomic location relative to protein-coding genes: promoter upstream transcripts (PROMPTs), produced up to 2.5 kb upstream of active transcription start sites (TSSs) [5]; enhancer RNAs (eRNAs), bi-directionally transcribed from enhancer DNA elements [6, 7]; and large intervening non-coding RNAs (lincRNAs), located in intergenic regions, distal from protein-coding genes, and regulated as independent transcriptional units [8]. Gene and transcript annotations for the human genome are continuously updated, and their assignment to specific biotype categories can change across reference databases [9]. In particular, in the past decade, efforts towards the identification and characterization of novel lncRNA genes have been made, either through computational predictions or functional assays [10, 11]. Despite such endeavors, a marked proportion of RNA-seq reads from human cells still maps to unannotated, ostensibly intergenic portions of the human genome [12]. It is therefore often challenging to understand whether such reads originate from independent transcription units or are associated with annotated genes.

Many well-characterized lncRNAs, such as the X-inactive-specific transcript ** to the human genome yielded > 2.5 billion uniquely mapped reads (Fig. 1 and Additional file 1: Table S1, S2, and S3). We employed StringTie [34] to generate preliminary annotations of the transcriptional units expressed within each dataset. We then merged the results into a comprehensive transcriptomic assembly across the entire dataset and also included all genes present in the GENCODE reference annotation [35]. Finally, we employed a custom pipeline (see the “Methods” section) to annotate transcripts expressed in intergenic regions and to define their relationship with annotated genes (Figs. 1 and 2a; see the “Methods” section). We defined transcriptional units (TU) as products of transcription from intergenic portions of the genome, which can either take place as an independent event or in association with features in the reference annotation.

Fig. 1
figure 1

Flow chart of the data analysis pipeline. Schematic describing the main data processing steps, intermediate and final outputs of the analysis pipeline, applied to RNA-seq (left side) and other sequencing (NET-seq, right side) data. Procedures (blue) and tools (orange) are indicated

Fig. 2
figure 2

General features of newly identified transcriptional units (TUs). a Schematic representation of the gene-associated (green) and independent (yellow) transcriptional units annotated in this study. b Upper panels: genome browser views of nuclear RNA-seq signals in HeLa cells, for example, TUs (red and blue indicate RNA-seq reads map** to the sense and antisense strands, respectively). Lower panels: genomic annotations of pre-existing genes and newly identified TUs; horizontal line divides the features on the sense (S) and antisense (A) orientations. Coverage is reported at 1× depth (reads per genome coverage (RPGC)). c Comparison of the number of annotated and newly identified transcripts detected in the current RNA-seq dataset (TPM ≥ 1). d Comparison of transcript lengths. e Proportions of uniquely map** RNA-seq reads originating from different transcript types for the whole cell (left) and nuclear (right) subcellular fractions of HeLa cells. f Distributions of expression levels of annotated and newly identified TUs for the chromatin-associated (left panel) and nucleoplasm (right panel) subcellular fractions of HeLa cells

We classified TUs into two broad groups based on their genomic location relative to existing gene annotations (Fig. 2a, b). (i) Gene-associated TUs are those showing continuity of transcription from the body of annotated genes. These were further divided into upstream of gene (UoG), downstream of gene (DoG), and linker of genes TUs (LoG). (ii) Independent TUs are > 10 kb away from existing gene annotations and so classed as purely intergenic. Although we use the term “independent TUs” for the sake of clarity, it is possible that some might in the future end up annotated as new genes that produce functional non-coding RNAs or perhaps even protein-coding mRNAs. Indeed, up to 80% (1453) of these independent TUs overlap with transcriptional products annotated in public databases (e.g., NONCODE [36], CHESS [37], and RefSeq [38]; Additional file 1: Table S4), while a number of the independent TUs we identified (373) have not been reported in any of the aforementioned databases, thus indicating that our results, on top of being supported by external sources, also expand the current catalog of transcribed RNAs. In total, we classified 7411 TUs covering ~ 5.6% of the human genome (Fig. 2c). Both gene-associated and independent TUs are of comparable lengths to previously annotated long non-coding RNA genes, ranging from 1 kb (the minimum length threshold for a TU) to hundreds of kb (Fig. 2d).

Assessing the expression levels in HeLa cells, it is apparent that the relative abundance of TUs is higher in the nucleus (4.28% of mapped reads) compared with the cytoplasm (1.34% of mapped reads; Fig. 2e). Moreover, within the nucleus, TUs tend to be chromatin-associated (5.2% of mapped reads) rather than the nucleoplasm (1.21%) (Fig. 2e). Intriguingly, we noticed that about 80% of these reads mapped to features linked to transcription of previously annotated loci (i.e., gene-associated TUs), while the remainder belongs to independent TUs (Fig. 2e). We also compared the normalized expression levels of TUs with annotated genes (Fig. 2f). Protein-coding transcripts tend to be the most highly expressed within the chromatin-associated and nucleoplasmic compartments; however, in these subcellular fractions, both gene-associated and independent TUs are more highly expressed than annotated lncRNAs. Additionally, we found independent TUs tend to undergo less splicing than lncRNAs (Additional file 1: Figure S1B). In agreement with previous reports [30, 39], DoGs and LoGs show the highest expression among TUs, suggesting that levels of transcription outside annotated loci primarily depend on the activity of annotated upstream features (Fig. 2f).

To investigate the properties of TUs in greater detail, we focused further analysis to those with the strongest evidence:

  • Both the gene-associated TU and neighboring genes must have TPM expression ≥1 (Additional file 1: Figure S1D) and length ≥ 5 kb (to avoid overlaps when assessing meta-profiles).

  • UoG, LoG, and DoG TUs must be associated with a protein-coding gene (Additional file 1: Figure S1C), thus reducing the chance of including poorly annotated genes with relatively unreliable start and end genomic coordinates (e.g., pseudogenes).

  • Independent TUs must be ≥10 kb from any annotated feature on the same strand orientation to ensure that they are not transcribed as part of a known gene (Additional file 1: Figure S1A).

These filtering criteria left 1604 gene-associated TUs (88 UoG, 1329 DoG, 187 LoG) and 571 independent TUs. As controls, we paired gene-associated TUs with their corresponding protein-coding genes, and we identified 3462 lncRNA genes in a similar size range to independent TUs.

Gene-associated transcription breaks gene boundaries

Next, we sought to understand the transcriptional origins of gene-associated TUs. Here, we focus on data from HeLa cells unless stated otherwise, as it is the cell type with the largest variety of measurements.

Figure 3 displays meta-profiles of diverse transcriptional measurements aligned to the start and ends of TUs and protein-coding genes (Figure S2 for LoGs). All categories of TUs display clear RNA-seq coverage in the nuclear and chromatin-associated fractions, but in contrast to protein-coding genes, the signal is virtually lost in the cytoplasm (Figs. 3a and S2A). There is a clear jump in expression levels at the gene boundaries upon transition between the TU and associated gene, but TUs nonetheless display remarkably high relative expression levels in the nuclear and chromatin compartments.

Fig. 3
figure 3

Meta-profiles of transcriptional measurements around gene-associated TUs. Meta-profiles of transcriptional measurements plotted relative to the start positions of UoGs and their associated protein-coding genes (left-hand panels) and relative to the end positions of DoGs and their associated genes (right-hand panels). a RNA-seq measurements in different subcellular compartments. b CAGE-seq measurements in the sense and antisense strands. c NET-seq measurements for different Pol II CTD modifications. d ChIP-seq measurements for histone marks and EP300 occupancy-associated transcriptional activities

Since the novel TUs are transcribed by RNA Pol II, we asked if the unannotated TSSs initiating UoG transcription have previously been detected through cap analysis of gene expression (CAGE). To this end, we used annotated CAGE peaks derived from a large collection of cell lines and tissues [40, 41]. UoGs display a slight enrichment of CAGE peaks at the start site, but we could hardly detect any signal for DoGs and LoGs (Figs. 3b and S2B); this suggests that whereas UoGs show evidence of independent transcriptional initiation, DoGs and LoGs are most likely generated from transcriptional readthrough of the upstream gene. The modest CAGE signal for UoGs (detected for 48 out of 98 UoGs) suggests that they are not efficiently capped, in contrast to mRNAs initiating at annotated start sites of protein-coding genes (Fig. 3b). This indicates that the majority of intergenic TUs might be designated as substrates for exonucleases and prone to degradation [42, 43].

Figure 3c (and Additional file 1: Figure S3A) shows prominent Pol II occupancies at the start sites of UoGs, albeit at lower levels than at the TSS of associated genes. Together with the CAGE data, this possibly indicates the formation of a pre-initiation complex (PIC) and therefore the existence of unannotated, upstream TSSs. Active transcription of TUs is supported by mammalian native elongating transcript sequencing (NET-seq) data, which identifies nascent RNA fragments attached to transcriptionally engaged RNA Pol II [39]. NET-seq is capable of differentiating between distinct transcriptional stages by map** nascent RNAs associated with different patterns of RNA Pol II C-terminal heptad repeat domain (CTD) phosphorylation. The annotated and UoG TSSs display similar NET-seq profiles, thus suggesting that the TUs are not the result of stochastic Pol II binding but rather the outcome of coordinated transcriptional initiation events. Indeed, the profile for tyrosine-1 (Y1P) phosphorylated Pol II—a hallmark of TSS-paused protein-coding gene transcripts [17]—displays the highest signal at the start positions of both UoGs and protein-coding genes, with the former having a less pronounced peak and a broader distribution of the signal. Moreover, serine-5 (S5P) phosphorylated Pol II, which is mainly associated with TSS events such as co-transcriptional cap** and early transcriptional elongation [44], follows a pattern similar to the total and Y1P profiles around these regions.

Threonine-4 (T4P) phosphorylation is a hallmark of terminating Pol II and causes a characteristic NET-seq signal near transcription end sites (TESs) of protein-coding genes [17]. Among protein-coding genes, the T4P profile peaks immediately after canonical TESs and remains high, while gradually decreasing towards the end of the associated DoG (Fig. 3c). This observation implies that although Pol II is poised to terminate after encountering the canonical TES, actual Pol II detachment might occur several kilobases downstream, as previously reported [45]. LoGs represent a special case, in which a high T4P signal after the TES of the upstream gene is maintained throughout the intergenic space only to peak again at the TSS of the downstream gene (Additional file 1: Figure S2C and S3B). This suggests either that transcription of LoGs joins two adjacent transcripts thereby generating pseudo-bicistronic nascent RNAs or, alternatively, that Pol II reaches the downstream gene and reinitiates transcription from a T4P state. In both cases, the downstream gene is potentially dependent on the transcription and by extension the promoter state of its upstream gene, thus implying the existence of co-regulation.

Finally, we examined the ChIP-seq profiles for four histone marks associated with transcriptional activity, as well as the histone acetyltransferase EP300 (Fig. 3d and Additional file 1: Figure S4A). Epigenetic modifications such as H3K4me3 and H3K27ac, which are associated with active promoters and enhancers respectively [46, 47], are enriched at both protein-coding and UoG start sites (Fig. 3d). Furthermore, the tri-methylated forms of H3K27 and H3K9, commonly found at transcriptionally silenced regions [46, 47], are depleted. Interestingly, the histone acetyltransferase EP300, which regulates transcription of genes via chromatin remodeling, shows a comparable enrichment in binding at both UoG TSSs and annotated TSSs (Fig. 3d). EP300 is also known as a transcriptional coactivator, due to its ability to bind to transcription factors and the transcription machinery, and consequently activates transcription [46, 47]. Therefore, the presence of this protein more than 5 kb (size used for selecting the intergenic features) upstream of the canonical TSS is intriguing, as it suggests that transcription from the upstream intergenic regions is not merely the consequence of stochastic initiation events, but rather a concerted and precisely regulated process.

Transcription from deep intergenic regions

Next, we focused on independent TUs. We noticed a number of similarities between these elements and the 3426 control lncRNAs. Specifically, for both classes, we could detect RNA-seq signal upstream of the TSS and downstream of the TES (Fig. 4a). This is probably due to the sub-optimal annotation of these reference positions, a challenging task considering the intrinsically low level of expression of such transcripts [48, 49]. Interestingly, the CAGE signal displays equal enrichment in both orientations around the TSS of lncRNAs, possibly indicating that most of these RNAs originate from divergent transcription (Fig. 4b). The NET-seq profiles show similar enrichment patterns for total RNA Pol II and the CTD modifications TSSs and TESs of lncRNAs and independent TUs (Fig. 4c and Additional file 1: Figure S3C). Finally, the H3K9me3 and H3K27ac profiles around the TSSs of both lncRNAs and independent TUs resemble those of protein-coding genes and UoGs (Figs. 3d and 4d), highlighting equivalent chromatin statuses. Thus, based on the transcriptional and related measurements, independent TUs appear to be bona fide lincRNAs that eluded reference annotation.

Fig. 4
figure 4

Meta-profiles of transcriptional measurements around independent TUs. Meta-profiles of transcriptional measurements plotted relative to the start and end positions of independent TUs and control long non-coding RNA genes. ad as in Fig. 3

Rapid degradation of chromatin-associated intergenic RNAs

We showed that both gene-associated and independent TUs are widespread across the genome, and their expression levels in the nucleus are comparable to those of annotated genes. Moreover, analysis of the transcribed loci did not highlight distinctive characteristics that explain why TUs are found only in the chromatin cellular compartment. Therefore, we hypothesized that there may be differences in the control of retention and stability of these transcripts.

First, we compared the expression levels of annotated RNAs and intergenic TUs between chromatin-associated and nucleoplasmic fractions. We found that unspliced protein-coding and long ncRNA transcripts tend to be equally distributed between the two fractions, whereas TUs, in particular, DoGs and LoGs, are preferentially confined to the chromatin-associated fraction (Fig. 5a and Additional file 1: Figure S5A).

Fig. 5
figure 5

Impact of nuclease depletion on TU expression. a Expression levels of protein-coding genes and TUs in the chromatin and nucleoplasm fractions. b Relative nucleoplasmic-to-chromatin expression levels in response to EXOSC3 knockdown and control siLuc treatments. c Expression levels in CSTF2+CSTF2T and CPSF3 knockdowns relative to control in the chromatin fraction. d Expression levels in XRN2 knockdown (via activation of the auxin-inducible degron system) and basal (uninduced; minus auxin) treatments relative to unmodified XRN2 control in the nuclear fraction. p values were calculated using the two-sided Wilcoxon rank sum test, with asterisks indicating statistical significance at the following thresholds: nsp > 0.05; *p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001; ****p ≤ 0.0001

The scarcity of these transcripts in the nucleoplasm suggests that they exert their function, if any, bound to the chromatin fraction or that they are transcriptional by-products that are rapidly degraded. We examined recently published RNA-seq data following knockdown or depletion of proteins involved in the processing and degradation of transcriptional products: specifically, EXOSC3 [17], CSTF2 (and its paralog CSTF2T), CPSF3 (also known as CPSF73) knockdowns in HeLa cells [39], and XRN2 depletion in HCT116 cells [50].

EXOSC3 is part of the RNA exosome complex; it possesses 3′ to 5′ exoribonuclease activity, and it is involved in eliminating transcriptional by-products. Known substrates include non-coding transcripts, such as promoter-upstream transcripts (PROMPTs), mRNAs with processing defects [51, 52], and most prominently rRNA and snoRNAs, as part of their normal processing and maturation in the nucleolus [53]. The EXOSC3 knockdown had little or no effect on transcripts of protein-coding genes and their associated DoGs and LoGs; however, there is a marked effect on the stability of lncRNAs, UoGs, and independent TUs in the nucleoplasmic fraction (Fig. 5b and Additional file 1: Figure S5B). Moreover, the accumulation of these transcriptional products, caused by the loss of a functional nuclear RNA exosome, is more dramatic in the nucleoplasm than in the chromatin fraction, suggesting that they are generally targeted post-transcriptionally and cleared once they move away from the chromatin environment.

Since we observed a predominant chromatin retention (Fig. 5a) and no effect of EXOSC3 knockdown on DoGs and LoGs (Fig. 5b and Additional file 1: Figure S5B), we hypothesized that other mechanisms must regulate these TUs. We examined the factors involved in processing the terminal regions of nascent transcripts: CSTF2 (and its paralog CSTF2T), implicated in 3′ end cleavage and polyadenylation of pre-mRNAs; CPSF3 (also known as CPSF73), a 3′ end-processing endonuclease; and XRN2, an exoribonuclease with 5′ to 3′ activity. Indeed, knockdowns of CPSF3 and of CSTF2+CSTF2T lead to increased levels of DoGs and LoGs (Fig. 5c and Additional file 1: Figure S5C), suggesting that degradation of these transcripts is strongly dependent on the correct processing of the 3′ end of nascent transcripts. Downstream of cleavage at the poly(A) signal by the CPSF/CSTF complex, the remaining 3′ by-product is depleted by the 3′→5′ exonuclease XRN2 [54]. Hence, we evaluated the expression of these transcripts in XRN2 depletion [50] to assess whether LoGs, like DoGs, are coupled with 3′ end processing of the upstream gene. XRN2 depletion greatly increased the expression of DoGs and LoGs, while leaving other transcript types unchanged (Fig. 5d and Additional file 1: Figure S5D), thus indicating that XRN2 activity indeed regulates DoG and LoG abundance.

Discussion

In 2010, van Bakel and colleagues explored transcribed unannotated transcripts by using tiling arrays and poly(A)+ RNA-seq technologies. Despite their study proving valuable to better characterize intergenic fragments associated with transcription of annotated genes, the methods used limited their detection potential. Indeed, the authors were able to identify a large number of novel alternative exons, although the functional implications of the intergenic transcripts they defined remained elusive [55]. Here, we take advantage of the large amounts of nuclear and chromatin subcellular fraction RNA-seq and NET-seq data, which allow a deeper and more comprehensive detection of short-lived long RNAs, and investigate their stability and degradation mechanisms. As a result, we were able to expand the current catalog of intergenic RNAs that are either associated with the transcription of annotated genes or produced as independent transcriptional units.

Non-canonical transcription upstream of genes

To date, transcription upstream of canonical genes has been reported as a consequence of bidirectional transcription from neighboring promoters or enhancers, with the transcript being generated in the antisense direction. In contrast to these transcripts, the UoGs identified here originate from the same strand as the associated downstream genes, thus limiting the possibility that these are products of enhancer- or promoter-derived divergent transcription. The presence of CAGE peaks on opposite strands around the beginning of these transcripts suggests that a minor fraction could instead originate from convergent transcription [56]. Either way, transcription close and across the canonical promoter region of the respective gene is expected to result in the regulatory impact of UoG units, such as altering chromatin accessibility, or recruitment of Pol II and co-factors. Our study highlights examples in widespread used cell lines that can be studied in depth leveraging further genome-wide transcription data (such as Hi-C) or mechanistic analysis through genome editing.

Non-canonical transcription downstream of genes

Studies have recently highlighted the presence of widespread transcription of intergenic regions downstream of protein-coding genes in mice and humans in response to heat shock, osmotic stress, or oxidative stress [30, 31]. Although this form of transcriptional readthrough has been ascribed to the mammalian stress response, here, we found evidence for such behavior in unstimulated, normally proliferating cell lines. We observed two categories of readthrough, which are characterized by distinctive patterns of Pol II CTD phosphorylation. In the first group, DoGs arise from the transcription of canonical genes that then continues for a few to hundreds of kilobases across intergenic space (DoG), as previously reported [17, 30, 39]. These are marked by the sharp increase in threonine 4 phosphorylation of Pol II (T4P) after the annotated TES, and the gradual and eventual loss of Pol II binding with distance. In the second group, Pol II continues transcribing to the next gene (LoG), thus hinting at the possibility of polycistronic transcription in higher eukaryotes [57]. In this case, the T4P signal does not fade, suggesting that most Pol II continues transcribing until it reaches the downstream gene. It is not clear whether Pol II proceeds uninterrupted through the next gene or reinitiates a separate transcriptional event. Co-regulation of genes in close proximity on the same chromosome has previously been described [58], and the existence of LoGs could be one of the factors explaining such observations.

Functional consequences of non-canonical transcription on canonical genes

Although intergenic transcription has been commonly considered a consequence of pervasive transcription and, therefore, having no apparent functional role, accumulating evidence indicates that such processes can have major repercussions on the activities of neighboring genes [14]. Indeed, the effect of lncRNA transcription on gene activation or repression has been reported by a few studies [15, 59,60,61]. Interestingly, this phenomenon does not seem to be restricted to lncRNAs but also extends to protein-coding mRNAs and, potentially, to all transcriptional events [21]. Gene-associated RNAs can recruit chromatin remodelers that are able to maintain an open chromatin state or act as binding platforms for protein complexes at gene-proximal sites, such as the transcriptional factor Yin and Yang 1 (YY1) [62] and the MLL complex subunit WD repeat-containing 5 (WDR5) [63]. As a result, transcription upstream and downstream of annotated genes that we identified in this study might be functionally important for maintaining an open chromatin state and for the correct expression of neighboring genes. That these transcripts are tightly associated with chromatin and are rapidly degraded by nuclear surveillance processes, suggesting that their functions do not go beyond the course of transcription. For example, DoGs are highly sensitive to XRN2-mediated degradation; it has been previously reported that this protein promotes transcriptional termination at protein-coding genes via the torpedo mechanism model, in which the exonuclease degrades the gene-associated RNA until it reaches the elongation complex, consequently causing its termination [64,65,66]. Hence, transcription of very long DoGs might underlie a longer engagement of RNA Pol II and, consequently, its inability to readily detach from DNA and restart transcription elsewhere and its contribution to maintain chromatin in an open state.

These mechanisms of transcription-associated chromatin regulation are not necessarily confined to intergenic regions linked to previously annotated genes but might be broadened to independent TUs. However, since these regions are usually found in gene-poor portions of the genome, their products are more likely to exert their functional role in trans. This and their similarities in terms of transcriptional activity, chromatin state, and degradation patterns to lncRNAs support the hypothesis that independent TUs could be novel lncRNA loci.

Conclusions

In summary, we assembled publicly available RNA-seq data to identify and classify intergenic transcripts based on their expression and location relative to annotated genes. We showed that gene-associated and independent RNAs have characteristic patterns of transcription and that they are highly sensitive to nuclear degradation processes. Our data are consistent with recently reported chromatin remodeling and gene expression regulatory mechanisms associated with transcription. Collectively, the results expand the current categories in gene annotation and provide the tools to further investigate the underappreciated role of intergenic transcription as a function of gene expression and regulation.

Methods

Read alignment and post-processing

Sequencing quality checks were performed on all experiments using FastQC [67]. Adaptor sequences were removed using TrimGalore (v0.4.4_dev) [68] with default parameters. Reads were filtered against human rRNA and tRNA sequences obtained from the NCBI using Bowtie2 (v2.3.3.1) [69] with the option --sensitive-local. Reads that failed to align were mapped with STAR (v2.5.3a) [70] to UCSC hg38/GRCh38 genome assembly using GENCODE (v27) gene annotation [35] as a reference, with the following parameters: --twopassMode Basic --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --outFilterMultimapNmax 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --outFilterType BySJout --outSAMattributes All --outSAMtype BAM SortedByCoordinate, and specific options for gapped (--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000) and ungapped (--alignIntronMax 1 --alignMatesGapMax 300) alignments. PCR duplicates were removed using Picard MarkDuplicates (v2.18.3) with default parameters. Quantification of expression was performed using QoRTs (v1.3.0) [71] and the GENCODE (v27) gene annotation [35].

Genomic coverage tracks

Deduplicated unique alignments were converted to stranded normalized coverage bigWig files using deeptools (v3.0.2) [72] with --normalizeUsing CPM --binSize 20 --smoothLength 60 options, and --filterRNAstrand for the selection of forward and reverse strands.

De novo transcriptome assembly

Deduplicated uniquely mapped reads were assembled into a de novo annotation GTF using StringTie (v1.3.4c) [34] with the GENCODE (v27) gene annotation [35] as a reference, and the following parameters: -f 0.2 -g 100 -j 3 -t. The individual annotation GTFs from all wild-type RNA-seq datasets (no treatment condition was used to annotate the intergenic regions) were then used as input for StringTie with the --merge option to generate a non-redundant set of predicted transcripts. The output, which consists of a GTF file with merged gene models, was filtered using the gffcompare utility [73] with the -C option to discard predicted transcripts that were fully contained within larger annotated regions.

Identification of intergenic transcriptional units

A custom pipeline written in R was used to process the GTF file generated as described above. The pipeline performs several steps, the first of which is the discrimination of the purely intergenic regions (i.e., defined using the sequencing data) from the known features (i.e., already present in the GENCODE gene annotation). This operation is performed by the setdiff() function from the GenomicFeatures R package [74] on the gffcompare-generated GTF and GENCODE reference annotation files. Intergenic regions with length ≤ 1 kb are discarded. Although these can represent true signals, the low number of supporting reads observed in these small regions can be seen as a consequence of misannotation or misalignment. The remaining are further divided into “gene-associated” and “independent” transcriptional units (“gene-associated TUs” and “independent TUs,” respectively) based on whether they originate from annotated genes, thus showing transcriptional continuity with the gene body, or from regions devoid of annotated features, and therefore, they are considered independent events of transcription. Gene-associated transcriptional units are assigned to different sub-groups depending on their position and connection to the neighboring gene(s):

  • Upstream of gene (UoG): the unit is located upstream of the associated gene.

  • Downstream of gene (DoG): the unit is located downstream of the associated gene.

  • Linker of genes (LoG): the unit is located between two genes and transcriptionally associated with them.

To confirm the co-occurrence of the annotated gene(s) and gene-associated features, their expression is re-assessed across the RNA-seq datasets within R using a customized countOverlapsUnion() function. Features are considered co-transcribed if expressed (TPM ≥ 1) in the same cell line and in at least two datasets. At this level, the categorization is also re-evaluated and, if necessary, TUs can be re-assigned to the proper sub-group (e.g., a linker of gene TU whose downstream gene is not expressed will become a downstream of gene TU).

Independent TU comparison with public databases

Expressed independent TUs were used as input for BEDtools intersect to calculate the individual overlaps with annotated features retrieved from NONCODE [36], CHESS [37], and RefSeq [38] databases, using the options -s (stranded) and -f (to test the fractions of overlaps from 0.1 to 1). The number of independent TUs not annotated in any of the databases was obtained by performing sequential intersections (adding the options -wa -v).

Splicing analysis

Deduplicated unique alignments were parsed using samtools [75] view, and gapped alignments (i.e., reads encompassing known or putative splice junctions) were extracted based on their CIGAR information (i.e., whether or not it contained “N”). Reads were then assigned to “long ncRNA” or “independent TU” features using the countOverlapsUnion() function from the GenomicFeatures R package [74]. For each dataset, the fraction of junction reads was calculated over the total number of deduplicated unique reads.

Selection of HeLa TUs and metadata profiles

Since the large majority of data available for validation derived from HeLa cells, we decided to focus our analysis of intergenic features only to those expressed in this cell line. Therefore, we generated a set of annotated genes and gene-associated and independent TUs where each feature had an average expression of ≥1 TPM across the HeLa RNA-seq datasets. In addition, we required the gene-associated features to be connected to annotated protein-coding genes, thus reducing the chance to include poorly annotated genes for which start and end genomic coordinates are not reliable (e.g., pseudogenes). We retained only the independent TUs located ≥10 kb from any annotated feature on the same strand orientation, to ensure that their transcription is not directly linked to known genes. Finally, we discarded features with length < 5 kb to avoid signal overlaps between the start and end positions in metadata profiles.

The metadata profiles were generated using the CPM-normalized coverage bigWig files (see “Genomic coverage tracks” section) and a custom wrapper of the ScoreMatrixBin() function from the genomation R package [76]. The wrapper function is used to facilitate strand splitting, centering, and resizing (i.e., ±5 kb from region start or end position), binning (i.e., 200 bins over the 10-kb window), and normalization and averaging of the signal. When not specified in the figure legend, normalization was performed by dividing the bins of each feature (or group of features in case of paired annotated gene and its gene-associated TU) by the value of the bin with the higher count across the region.

Epigenetic modification profiles

We collected the “fold change over control” and merged replicates ChIP-seq bigWig files from ENCODE. The list of epigenetic modifications and associated accession numbers can be found in Additional file 1: Table S3. The ChIP-seq signals across the regions of interest were calculated using the wrapper function described in the previous section.

CAGE peak profiles

We retrieved the hg38 CAGE reprocessed data [40] from the FANTOM Consortium [77]. The density of the CAGE peaks (phases 1 and 2) was calculated using the wrapper function described in the “Selection of HeLa TUs and metadata profiles” section, without applying any normalization.

Quantification of expression and degradation

We collected data from cells in wild-type/untreated conditions and after the knockdown of several proteins from different sources (Additional file 1: Table S2). The datasets were processed as described in the “Read alignment and post-processing” section. Deduplicated uniquely mapped reads were loaded into R using the GenomicAlignments R package [74], and the expression of the features quantified with the countOverlapsUnion() function. The estimateSizeFactorForMatrix function from the DESeq2 R package [78] was used to normalize the counts for each group of experiments. The ggpubr R package was used to visualize the results and perform the statistical tests (i.e., two-sided Wilcoxon rank sum test).