Background

Eukaryotic genomes encompass thousands of open reading frames (ORFs). The vast majority are so-called “noncanonical” ORFs (nORFs) excluded from genome annotations because of their short length, lack of evolutionary conservation, and perceived irrelevance to cellular physiology [1,2,3]. The development of RNA sequencing (RNA-seq) [4] and ribosome profiling [5, 6] has revealed genome-wide transcription and translation of nORFs across species ranging from yeast to humans [6,7,8,9,10,11,12,13,14]. Recent studies have characterized individual nORFs that form stable peptides and impact phenotypes, including cell growth [10, 13, 15], cell cycle regulation [16], muscle physiology [17,18,19], and immunity [20,21,22]. Unraveling the cellular, physiological, and evolutionary implications of nORFs has become an active area of research [14, 23].

Many nORFs have evolved de novo from previously noncoding regions [24,25,26]. Thus, the study of nORFs overlaps with the study of de novo gene birth as a source of evolutionary innovation, and findings in one area could improve our understanding of the other. For instance, Sandmann et al. measured physical protein interactions for hundreds of peptides translated from nORFs and proposed that short linear motifs present in young de novo nORFs could mediate how nORFs impact essential cellular processes [26]. Other studies observed a gradual integration of evolutionarily young ORFs into cellular networks and showed they could gain essential roles [27,28,29]. These studies support an evolutionary model whereby pervasive expression of nORFs generates the raw material for de novo gene birth [24, 25].

The biological interpretation of nORF expression is complex. Some studies suggest that the transcription or translation of nORFs could be attributed to expression noise [30,31,32], whereby non-specific binding of RNA polymerases and ribosomes to DNA and RNA might cause promiscuous transcription or translation, respectively. How do nORFs become expressed in the first place? There are multiple hypotheses on how de novo ORFs gain the ability to become transcriptionally regulated [33]. One possibility is the emergence of novel regulatory regions along with or following the emergence of an ORF (ORF-first), as was shown for specific de novo ORFs in Drosophila melanogaster [34], codfish [35], human [36, 37], and chimpanzee [36]. Alternatively, ORFs may emerge on actively transcribed loci such as near enhancers [38] or on long noncoding RNAs [39], as was shown for de novo ORFs in primates [40] and for de novo ORFs upstream or downstream of transcripts containing genes [37] (transcription-first) [41,42,43]. Transcription has a ripple effect causing coordinated activation of nearby genes [44, 45]. Thus, de novo ORFs that emerge near established genes or regulatory regions may acquire transcriptional regulation by “piggybacking” [45] on the pre-existing regulatory context [41, 46]. This piggybacking could predispose de novo ORFs to be involved in similar cellular processes as their neighbors, which in turn would help with characterization. To date, the fraction of nORFs that are transcriptionally regulated and contribute to cellular phenotypes is unknown for any species.

An obstacle to studying nORF expression at scale is their detection, as nORF expression levels are typically low and reliant on specific conditions [24, 36]. Recent studies demonstrated that integrating omics data [14, 47,48,49] could effectively address detection issues. For example, Wacholder et al. [14] recently discovered around 19,000 translated nORFs in Saccharomyces cerevisiae by massive integration of ribosome profiling data. This figure is three times larger than the number of canonical ORFs (cORFs) annotated in the yeast genome. These translated nORFs have the potential to generate peptides that affect cellular phenotypes but are almost entirely uncharacterized.

Coexpression is a well-established approach for studying transcriptional regulation through the massive integration of RNA-seq data. Coexpression refers to the similarity between transcriptional profiles of ORF pairs across numerous samples. Coexpression has been used successfully to identify new gene functions [50, 51], disease-related genes [22, 52, 53], and for studying the conservation of the regulatory machinery [51, 54] or gene modules [55] between species. Based on the assumption that genes involved in similar pathways have correlated expression patterns, coexpression can reveal relationships between genes and other transcribed genetic elements [56, 57]. Most coexpression studies have focused on cORFs, but the abundance of publicly available RNA-seq data represents a tractable avenue to interrogate the transcriptional regulation of thousands of nORFs at once using coexpression approaches [47, 58,59,60,

To investigate how these seemingly conflicting attributes impact the organization of the coexpression network, we analyzed two global network properties: diameter, which is the longest shortest path between any two ORFs; and transitivity, which is the tendency for ORFs that are coexpressed with a common neighbor to also be coexpressed with each other. The incorporation of nORFs in the full network led to a larger diameter relative to the canonical-only network (Fig. 2E). This is in sharp contrast with the null expectation, set by 1000 degree-preserving simulations, whereby random incorporation of nORFs decreases network diameter. The full coexpression network is thus much less compact than expected by chance, suggesting that nORFs tend to be located at the periphery of the network. Network transitivity decreased with the incorporation of nORFs compared to the canonical-only network, but to a lesser extent than expected by chance (Fig. 2F). This suggests that despite their low degree and peripheral locations, the connections formed by nORFs are structured and may form noncanonical clusters.
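The two network properties defined above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the five-node network, its node names, and its edges are hypothetical, and the degree-preserving null simulations are not reproduced here.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hop_counts(adj, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def diameter(adj):
    """Longest shortest path between any two nodes (assumes a connected network)."""
    return max(max(hop_counts(adj, node).values()) for node in adj)

def transitivity(adj):
    """Fraction of connected triples (two neighbors of one node) that are closed."""
    closed, triples = 0, 0
    for node, nbs in adj.items():
        k = len(nbs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
    return closed / triples if triples else 0.0

# Hypothetical toy network: three coexpressed cORFs plus two peripheral nORFs.
canonical_edges = [("cA", "cB"), ("cB", "cC"), ("cA", "cC")]
full_edges = canonical_edges + [("cC", "n1"), ("n1", "n2")]

canonical_net = build_adjacency(canonical_edges)
full_net = build_adjacency(full_edges)

print("canonical-only:", diameter(canonical_net), transitivity(canonical_net))
print("full network:  ", diameter(full_net), transitivity(full_net))
```

In this toy case, attaching the two nORFs at the periphery raises the diameter from 1 to 3 and lowers the transitivity from 1.0 to 0.5, the same direction of change described for the full network.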

To investigate this hypothesis, we inspected the ratio of nORFs and cORFs among the cluster assignments from WGCNA hierarchical clustering of the full coexpression matrix (Additional file 2: Fig. S6). Strikingly, we observed a bimodal distribution of clusters, with approximately half of the clusters consisting mostly of nORFs and the other half containing mostly cORFs (Fig. 2G). We conclude that nORFs exhibit a unique and non-random organization within the coexpression network, simultaneously connecting to all cORFs while also forming entirely new noncanonical transcription modules.

Coexpression profiles reveal most nORFs are transcriptionally associated with genes involved in cellular transport and homeostasis

To determine whether nORFs are transcriptionally associated with specific cellular processes, we performed gene set enrichment analyses [77] (GSEA) on their coexpression partners. GSEA takes an ordered list of genes, in this case sorted by coexpression level, and seeks to find if the higher ranked genes are preferentially annotated with specific GO terms. For each cORF and nORF, we ran GSEA to detect if their highly coexpressed partners were preferentially associated with any GO terms (Additional file 2: Fig. S12). Almost all ORFs (99.9%), whether cORF or nORF, had at least one significant GO term associated with their coexpression partners at BH adjusted FDR < 0.01, suggesting that nORFs are engaged in coherent transcriptional programs. We then calculated, for each GO term, the number of cORFs and nORFs with GSEA enrichments in this term (Additional file 6: Table S5). These analyses identified specific GO terms that were significantly more (16 terms, BH adjusted FDR < 0.001, odds ratio > 2, Fisher’s exact test; Fig. 3A, Additional file 7: Table S6) or less (23 terms, BH adjusted FDR < 0.001, odds ratio < 0.5, Fisher’s exact test; Fig. 3B, Additional file 7: Table S6) prevalent among the coexpression partners of nORFs relative to those of cORFs. Most of the GO terms that were significantly enriched among the coexpression partners of nORFs were related to cellular homeostasis and transport (Fig. 3A) while most of the GO terms significantly depleted among the coexpression partners of nORFs were related to DNA, RNA, and protein processing (Fig. 3B). Running the same GSEA pipeline with Kyoto Encyclopedia of Genes and Genomes (KEGG) [78] annotations yielded consistent results (Additional file 2: Fig. S13, Additional file 8: Table S7, Additional file 9: Table S8). Approximately half of nORFs were coexpressed with genes involved in each of homeostasis (GO:0042592, 53%), monoatomic ion transport (GO:0006811, 49%), and transmembrane transport (GO:0055085, 47%). The nORFs transcriptionally associated with the parent term “transport” (n = 2718, GO:0006810, GSEA BH adjusted FDR < 0.01) were 1.6 times more likely to contain a predicted transmembrane domain than other nORFs (p = 1.3e−4, Fisher’s exact test; Fig. 3C), in line with potential transport-related activities. These findings reveal a strong and previously unsuspected transcriptional association between nORFs and cellular processes related to homeostasis and transport.
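The per-term comparison between nORFs and cORFs can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this study: the term names and counts are hypothetical, each term is summarized as a 2×2 table (nORFs with and without a GSEA enrichment in the term, cORFs with and without), and the two-sided Fisher’s exact test and Benjamini-Hochberg (BH) adjustment are written out in plain Python so that the example is self-contained.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))

def odds_ratio(a, b, c, d):
    """Sample odds ratio; infinite when the denominator is zero."""
    return (a * d) / (b * c) if b * c else float("inf")

def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def classify_terms(counts, fdr=0.001, enriched_or=2.0, depleted_or=0.5):
    """Label each term as enriched, depleted, or neither among nORF partners."""
    terms = list(counts)
    adjusted = benjamini_hochberg([fisher_exact_two_sided(*counts[t]) for t in terms])
    labels = {}
    for term, q in zip(terms, adjusted):
        ratio = odds_ratio(*counts[term])
        if q < fdr and ratio > enriched_or:
            labels[term] = "enriched"
        elif q < fdr and ratio < depleted_or:
            labels[term] = "depleted"
        else:
            labels[term] = "neither"
    return labels

# Hypothetical counts: (nORFs with term, nORFs without, cORFs with, cORFs without)
toy_counts = {
    "term_transport": (50, 50, 10, 90),
    "term_rna_processing": (10, 90, 50, 50),
    "term_other": (30, 70, 30, 70),
}
toy_labels = classify_terms(toy_counts)
print(toy_labels)
```

The thresholds mirror those stated in the text (BH adjusted FDR < 0.001, odds ratio > 2 for enrichment, odds ratio < 0.5 for depletion).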

Fig. 3

Biological processes associated with nORF transcriptional regulation. A,B Biological processes that are more (A) (odds ratio > 2, n = 16 terms) or less (B) (odds ratio < 0.5, n = 23 terms) transcriptionally associated with nORFs than cORFs (y-axis ordered by nORF enrichment proportion from highest to lowest, BH adjusted FDR < 0.001 for all terms, Fisher’s exact test, GO term enrichments were detected using gene set enrichment analyses (GSEA), error bars: standard error of the proportion). C nORFs that are highly coexpressed with genes involved in transport are more likely to have predicted transmembrane (TM) domains as determined by TMHMM [79] compared to nORFs that are not (odds ratio = 1.6, Fisher’s exact test p = 1.3e−4; error bars: standard error of the proportion). D nORFs and cORFs that are Sfp1 or Hsf1 targets are more likely to be downregulated when Sfp1 or Hsf1 are deleted compared to ORFs that are not targets (Sfp1: cORFs: p < 2.2e−16; nORFs: p = 2.8e−9; Hsf1: cORFs: p < 2.2e−16; nORFs: p = 9.9e−13; Fisher’s exact test, error bars: 95% confidence interval of the odds ratio; dashed line shows odds ratio of 1; RNA abundance data from SRA accession SRP159150 and SRP437124 [80], respectively). E nORFs that are regulated by TFs are more likely to be coexpressed with genes involved in processes related to known functions of that TF. Created with BioRender.com

Hsf1 and Sfp1 nORF targets are part of protein folding and ribosome biogenesis transcriptional programs, respectively

Overall, our analyses relating coexpression to TF binding (Fig. 1D) and functional enrichments (Fig. 3A,B) suggest that nORF expression is regulated rather than simply the consequence of transcriptional noise. To further investigate this hypothesis, we sought to identify regulatory relationships between specific TFs and nORFs. We reasoned that if nORFs are regulated by TFs in similar ways as cORFs, then genetic knockout of the TFs that regulate them should impact their expression levels as it does for cORFs [81]. We focused on two transcriptional activators for which both ChIP-exo [68] and knockout RNA-seq data [80] were publicly available: Sfp1, which regulates ribosome biogenesis [82], and Hsf1, which regulates heat shock and protein folding responses [83].

For both cORFs and nORFs, knockout of Sfp1 or Hsf1 was more likely to trigger a significant decrease in expression when the ORF’s promoter was bound by the respective TF according to ChIP-exo evidence (Fig. 3D). The statistical association between TF binding and knockout-induced downregulation was as strong for nORFs as it was for cORFs, consistent with nORFs having similar mechanisms of transcriptional activation (Sfp1: cORFs odds ratio = 11.1, p < 2.2e−16; nORFs odds ratio = 21.8, p = 2.8e−9, Fisher’s exact test; Hsf1: cORFs odds ratio = 12.7, p < 2.2e−16; nORFs odds ratio = 12.1, p = 9.9e−13, Fisher’s exact test). Therefore, the nORFs whose promoters are bound by these TFs, and whose expression levels decrease upon deletion of these TFs, are likely genuine regulatory targets of these TFs. By this stringent definition, our analyses identified 9 nORF targets of Sfp1 (and 34 cORF targets) and 19 nORF targets of Hsf1 (and 39 cORF targets). The coexpression profiles of these Sfp1 and Hsf1 nORF targets were preferentially associated with genes involved in processes directly related to the known functions of Sfp1 and Hsf1 (Additional file 10: Table S9). For example, according to our GSEA pipeline, the coexpression profiles of 9 Sfp1 nORF targets revealed preferential associations with genes involved in “ribosomal large subunit biogenesis” and those of 7 Sfp1 nORF targets with genes involved in “regulation of translation” (Fisher’s exact test, BH adjusted p-value < 6.7e−4 for both terms). Similarly, 13 Hsf1 nORF targets were preferentially associated with genes involved in “protein folding” (Fisher’s exact test, BH adjusted p-value = 5.7e−9). These results show that nORF expression can be actively regulated by TFs as part of coherent transcriptional programs (Fig. 3E).

De novo ORF expression and regulation are shaped by genomic location

Previous literature has shown that many nORFs arise de novo from previously noncoding regions [24, 26]. We wanted to investigate how these evolutionarily novel ORFs acquire expression and whether their locus of emergence influences this acquisition. To define which ORFs were of recent de novo evolutionary origins, we developed a multistep pipeline combining sequence similarity searches and syntenic alignments (Fig. 4A). cORFs were considered conserved if they had homologs detectable by sequence similarity searches with BLAST in budding yeasts outside of the Saccharomyces genus or if their open reading frames were maintained within the Saccharomyces genus [14]. cORFs and nORFs were considered de novo if they lacked homologs detectable by sequence similarity outside of the Saccharomyces genus and if less than 60% of syntenic orthologous nucleotides in the two most distant Saccharomyces branches were in the same reading frame as in S. cerevisiae. These criteria aimed to identify the youngest de novo ORFs. Overall, we identified 5624 conserved cORFs and 2756 de novo ORFs including 77 de novo cORFs and 2679 de novo nORFs (Fig. 4B). In general, the coexpression patterns of de novo ORFs (Additional file 2: Fig. S14) were similar to those of nORFs (Fig. 3A,B).
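The decision rules described above can be summarized in a short Python sketch. It is a simplification and not the actual pipeline: the function and argument names are hypothetical, the BLAST and synteny results are assumed to be precomputed, and the 60% criterion is assumed to apply to each of the two most distant Saccharomyces branches separately.

```python
FRAME_THRESHOLD = 0.60  # from the text: "less than 60%" of syntenic nucleotides in frame

def classify_orf(is_canonical, has_outside_homolog, frame_maintained_in_genus,
                 distant_frame_fractions):
    """Return 'conserved', 'de novo', or 'unclassified' for one ORF.

    is_canonical: True for cORFs, False for nORFs.
    has_outside_homolog: homolog detected outside the Saccharomyces genus.
    frame_maintained_in_genus: reading frame maintained within the genus.
    distant_frame_fractions: fraction of syntenic orthologous nucleotides in
        the same reading frame, one value per distant branch (assumed layout).
    """
    # Only cORFs are considered for the conserved class.
    if is_canonical and (has_outside_homolog or frame_maintained_in_genus):
        return "conserved"
    # Both cORFs and nORFs are considered for the de novo class.
    if not has_outside_homolog and all(
        fraction < FRAME_THRESHOLD for fraction in distant_frame_fractions
    ):
        return "de novo"
    return "unclassified"

# Hypothetical inputs for illustration.
print(classify_orf(True, True, True, [0.95, 0.90]))     # cORF with distant homolog
print(classify_orf(False, False, False, [0.10, 0.25]))  # young nORF
print(classify_orf(False, False, False, [0.10, 0.80]))  # frame kept in one branch
```

ORFs that satisfy neither set of criteria are returned as unclassified rather than forced into one of the two groups.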

Fig. 4

Expression, coexpression, and biological processes similarity of de novo ORFs with respect to genomic orientations. A Pipeline used to reclassify ORFs as conserved or de novo. cORFs were considered for both conserved and de novo classification while nORFs were only considered for de novo classification. Conserved ORFs were determined by either detection of homology outside of Saccharomyces or reading frame conservation within Saccharomyces (top). De novo ORFs were determined by evidence of translation, lack of homology outside of Saccharomyces, and lack of a homologous ORF in the two most distant Saccharomyces branches (bottom). Created with BioRender.com. B Counts of cORFs and nORFs that emerged de novo. C Genomic orientations of de novo ORFs that cannot transcriptionally piggyback off a neighboring conserved ORF (cannot share promoter with neighbor, pink shading) or can transcriptionally piggyback off a neighboring conserved ORF (possible to share promoter with neighbor, green shading). Created with BioRender.com. D Counts of de novo ORFs that are within 500 bp of a conserved ORF in different genomic orientations; ORFs further than 500 bp are classified as independent. E De novo ORFs in orientations that can piggyback have higher RNA expression levels than de novo ORFs in orientations that cannot piggyback (Cliff’s Delta d = 0.4). Only de novo ORFs in a single orientation are considered (dashed box in panel D). Dashed line represents the median expression of independent de novo ORFs. F De novo ORFs in orientations that can piggyback have higher coexpression with neighboring conserved ORFs compared to de novo ORFs in orientations that cannot piggyback (Cliff’s Delta d = 0.43). Dashed line represents median coexpression of de novo-conserved ORF pairs on separate chromosomes. G De novo ORFs in orientations that can piggyback are more likely to be transcriptionally associated with genes involved in the same biological processes as their neighboring conserved ORFs than de novo ORFs in orientations that cannot piggyback (Cliff’s Delta d = 0.31). Dashed line represents median functional enrichment similarities of de novo-conserved ORF pairs on separate chromosomes. (For panels E, F, and G: Mann-Whitney U test, ****: p < 2.2e−16)

We hypothesized that the locus where de novo ORFs arise may influence their expression profiles through “piggybacking” off their neighboring conserved ORFs’ pre-existing regulatory environment. To investigate this hypothesis, we categorized de novo ORFs based on their positioning relative to neighboring conserved ORFs. The de novo ORFs further than 500 bp from all conserved ORFs were classified as independent. The remaining de novo ORFs were classified as either upstream or downstream on the same strand (up same or down same), upstream or downstream on the opposite strand (up opposite or down opposite), or as overlapping on the opposite strand (antisense overlap) based on their orientation to the nearest conserved ORF (Fig. 4C,D). We categorized the orientations as being able to piggyback or unable to piggyback based on their potential of sharing a promoter with neighboring conserved ORFs, with down opposite and antisense overlap as orientations that cannot piggyback and up opposite, up same, and down same as orientations that can piggyback (Fig. 4C). The piggybacking hypothesis predicts that de novo ORFs that arise in orientations that can piggyback would be positively influenced by the regulatory environment provided by the promoters of neighboring conserved ORFs, resulting in similar transcription profiles as their neighbors and increased expression relative to de novo ORFs that do not benefit from a pre-existing regulatory environment.
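The orientation categories can be made concrete with a small Python sketch. Several details are assumptions made for illustration and are not taken from the study: ORFs are represented as (start, end, strand) tuples with inclusive coordinates, upstream and downstream are defined relative to the transcription direction of the conserved ORF, distance is measured as the gap between the closest ORF ends, and same-strand overlaps (not among the categories listed above) are returned as unclassified.

```python
CAN_PIGGYBACK = {"up opposite", "up same", "down same"}
CANNOT_PIGGYBACK = {"down opposite", "antisense overlap"}

def classify_orientation(de_novo, conserved, max_gap=500):
    """Classify a de novo ORF relative to its nearest conserved ORF.

    Both ORFs are (start, end, strand) tuples on the same chromosome,
    with start <= end and strand either '+' or '-'.
    """
    d_start, d_end, d_strand = de_novo
    c_start, c_end, c_strand = conserved
    same_strand = d_strand == c_strand

    if d_start <= c_end and c_start <= d_end:  # the two intervals overlap
        return "unclassified" if same_strand else "antisense overlap"

    de_novo_is_left = d_end < c_start
    gap = (c_start - d_end - 1) if de_novo_is_left else (d_start - c_end - 1)
    if gap > max_gap:
        return "independent"

    # The 5' end of a '+' strand ORF is on the left; of a '-' strand ORF, on the right.
    upstream = de_novo_is_left if c_strand == "+" else not de_novo_is_left
    position = "up" if upstream else "down"
    return f"{position} {'same' if same_strand else 'opposite'}"

# Hypothetical coordinates for illustration.
conserved_plus = (1000, 2000, "+")
for orf in [(2100, 2200, "+"), (800, 900, "-"), (2100, 2200, "-"),
            (1500, 1600, "-"), (3000, 3100, "+")]:
    label = classify_orientation(orf, conserved_plus)
    print(orf, label, label in CAN_PIGGYBACK)
```

In the study, a de novo ORF is independent only if it lies further than 500 bp from all conserved ORFs; the sketch assumes the nearest conserved ORF has already been selected.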

We considered three metrics to assess piggybacking: RNA expression level, measured as median TPM over all the samples analyzed, coexpression with neighboring conserved ORF, and biological process similarity with neighboring conserved ORF. To calculate biological process similarity between two ORFs, we used significant GO terms at FDR < 0.01 determined by coexpression GSEA for each ORF (Additional file 2: Fig. S12) and calculated the similarity between these two sets of GO terms using the relevance method [84]. If two ORFs are enriched in the same specialized terms, their relevance metric would be higher than if they are enriched in different terms or in the same generic terms. We found that de novo ORFs in orientations that can piggyback tend to have higher expression (focusing only on ORFs that could be assigned a single orientation, dashed box in Fig. 4D, Cliff’s Delta d = 0.4; Fig. 4E), higher coexpression with their neighbor (Cliff’s Delta d = 0.43; Fig. 4F), and higher biological process similarity (Cliff’s Delta d = 0.31; Fig. 4G), compared to ORFs in orientations that cannot piggyback (p < 2.2e−16 Mann-Whitney U test for all). Thus, all three metrics supported the piggybacking hypothesis.
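Cliff’s Delta, the effect size reported for these comparisons, is the probability that a value drawn from one group exceeds a value drawn from the other, minus the probability of the reverse. A minimal Python sketch with hypothetical expression values follows; it illustrates the statistic only and does not reproduce the data or code of this study.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x is below every y) to +1 (every x is above every y).
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical median TPM values for two groups of de novo ORFs.
can_piggyback = [3.0, 4.0, 5.0]
cannot_piggyback = [1.0, 2.0, 3.0]

print(cliffs_delta(can_piggyback, cannot_piggyback))
```

Because the statistic depends only on the ordering of values across groups, it pairs naturally with the rank-based Mann-Whitney U test used here. The all-pairs loop is quadratic in group size, which is acceptable for a sketch but slow for very large groups.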

Closer examination revealed a more complex situation. First, the immediate neighbors of de novo ORFs in orientations that can piggyback were rarely among their strongest coexpression partners (only found in the top 10 coexpressed partners for 15% of down same, 4.5% of up same, 3% of up opposite ORFs). Therefore, emergence near a conserved ORF in a piggybacking orientation influences, but does not fully determine, the transcription profiles of de novo ORFs. Transcriptional regulation beyond that provided by the pre-existing regulatory environment may exist. Second, while ORFs in all three orientations that can piggyback displayed increased coexpression and biological process similarity with their neighbors relative to background expectations (Additional file 2: Fig. S15A-B), only down same de novo ORFs displayed increased RNA expression levels (Additional file 2: Fig. S15C). The expression levels of up same de novo ORFs were statistically indistinguishable from independent de novo ORFs, while those of up opposite de novo ORFs were significantly lower than those of independent de novo ORFs (Additional file 2: Fig. S15C). Down same de novo ORFs also showed stronger coexpression and biological process similarity with their conserved neighbors than up same and up opposite de novo ORFs (Additional file 2: Fig. S15A-B). Therefore, the transcription of down same de novo ORFs appeared most susceptible to piggybacking.

To understand the molecular mechanisms leading to the differences in expression, coexpression and biological process similarity between the orientations that can piggyback, which all have the potential to share a promoter with their neighboring conserved ORF, we investigated which actually do by analyzing transcript architecture. Using a publicly available TIF-seq dataset [69], we defined down same or up same ORFs as sharing a promoter with their neighbor if they mapped to the same transcript at least once. We defined up opposite ORFs as sharing a promoter with their neighbor if their respective transcripts did not have overlapping TSSs, as would be expected for divergent promoters [85]. According to these criteria, 84% of down same (n = 174), 64% of up same (n = 368), and 66% of up opposite (n = 185) de novo ORFs share a promoter with their neighboring conserved ORFs (Additional file 2: Fig. S16). Among all de novo ORFs that arose in orientations that can piggyback, those that share promoters with neighboring conserved ORFs displayed higher expression levels than those that do not (down same: d = 0.75, p = 1.06e−8; up same: d = 0.38, p = 1.23e−7; up opposite: d = 0.3, p = 2.9e−3 Mann-Whitney U test, d: Cliff’s Delta; Fig. 5A). We also observed a significant increase in coexpression and biological process similarity between de novo ORFs and their neighboring conserved ORFs when their promoters are shared compared to when they are not (coexpression: down same: d = 0.28, p = 2.99e−9; up same: d = 0.31, p < 2.2e−16; up opposite: d = 0.27, p = 2.1e−7; biological process similarity: down same: d = 0.24, p = 5.5e−7; up same: d = 0.108, p = 3.78e−3; up opposite: d = 0.24, p = 6.1e−6, d: Cliff’s Delta, Mann-Whitney U test; Fig. 5B, C, respectively). Hence, sharing a promoter led to increases in the three piggybacking metrics for the three orientations.
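For the same-strand orientations, the promoter-sharing criterion can be illustrated with the following Python sketch. It rests on assumptions that are not taken from the study: transcript isoforms and ORFs are given as (start, end, strand) tuples, an ORF is considered to map to a transcript when it is fully contained in it on the same strand, and all coordinates are hypothetical. The criterion for up opposite ORFs, which relies on TSS overlap, is not covered.

```python
def contains(transcript, orf):
    """True if the ORF lies entirely within the transcript on the same strand."""
    t_start, t_end, t_strand = transcript
    o_start, o_end, o_strand = orf
    return t_strand == o_strand and t_start <= o_start and o_end <= t_end

def shares_promoter(transcripts, de_novo, conserved):
    """Same-strand criterion: both ORFs map to one transcript at least once."""
    return any(contains(t, de_novo) and contains(t, conserved) for t in transcripts)

def shared_transcript_fraction(transcripts, de_novo, conserved):
    """Fraction of transcripts containing the de novo ORF that also contain
    the conserved ORF; None if no transcript contains the de novo ORF."""
    with_de_novo = [t for t in transcripts if contains(t, de_novo)]
    if not with_de_novo:
        return None
    shared = sum(1 for t in with_de_novo if contains(t, conserved))
    return shared / len(with_de_novo)

# Hypothetical locus: a conserved ORF followed by a down same de novo ORF.
conserved_orf = (1000, 2000, "+")
de_novo_orf = (2100, 2200, "+")
isoforms = [
    (950, 2300, "+"),   # spans both ORFs
    (2050, 2300, "+"),  # de novo ORF only
    (950, 2050, "+"),   # conserved ORF only
    (900, 2400, "-"),   # opposite strand, contains neither
]

print(shares_promoter(isoforms, de_novo_orf, conserved_orf))
print(shared_transcript_fraction(isoforms, de_novo_orf, conserved_orf))
```

The second function returns a per-pair proportion of shared transcripts, the kind of quantity that can be compared between orientations.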

Fig. 5

Effects of promoter sharing on expression, coexpression, and biological process similarities of de novo ORFs. A De novo ORFs that share a promoter with neighboring conserved ORFs, as determined by TIF-seq transcript boundaries, have significantly higher expression levels than de novo ORFs that do not. Considering only ORFs in a single orientation. Dashed line represents the median expression of independent de novo ORFs. B De novo ORFs that share a promoter with neighboring conserved ORFs have higher coexpression with their neighbors than de novo ORFs that do not share a promoter. Dashed line represents median coexpression of de novo-conserved ORF pairs on separate chromosomes. C De novo ORFs that share a promoter have more similar functional enrichments with neighboring conserved ORFs than de novo ORFs that do not share a promoter. Dashed line represents median functional enrichment similarities of the background distribution of de novo-conserved ORF pairs on separate chromosomes. D Down same de novo ORFs share a promoter with neighboring conserved ORFs significantly more often than up same ORFs. E Conserved ORFs with downstream de novo ORFs have a significant increase in expression compared to conserved ORFs with upstream de novo ORFs. F Existence of transcription termination factors (Pcf11 or Nrd1) in between conserved ORFs and nearby downstream de novo ORFs leads to fewer shared transcripts. G Transcript isoforms (gray) at an example locus where there are no transcription termination factors present between conserved ORF YBL015W (pink) and downstream de novo ORF chr2:195794-195847(+) (blue). H Transcript isoforms (gray) at an example locus where there is Pcf11 transcription terminator present (red line) between conserved ORF YPR034W (pink) and downstream de novo ORF chr16:641385-641534(+) (blue). All detected transcript isoforms on these loci are plotted for G and H. (For all panels: ****: p ≤ 0.0001, ***: p ≤ 0.001, **: p ≤ 0.01, *: p ≤ 0.05, ns: not-significant; Mann-Whitney U test)

Further supporting the notion that down same ORFs are particularly prone to piggybacking, the down same de novo ORFs that share a promoter with their conserved neighbors displayed much higher expression levels, and higher coexpression and biological process similarity with their conserved neighbor, than up same or up opposite ORFs that also share a promoter with their conserved neighbors (expression: down same vs up same: d = 0.58; down same vs up opposite: d = 0.55; coexpression: down same vs up same: d = 0.29, down same vs up opposite: d = 0.38; biological process similarity: down same vs up same: d = 0.37, down same vs up opposite: d = 0.45; d: Cliff’s Delta, p < 2.2e−16 for all comparisons, Mann-Whitney U test). This could be due to down same ORFs’ tendency to share promoters more often than up same ORFs, as a larger proportion of transcripts containing down same ORFs also contain a conserved ORF (down same vs up same: Cliff’s Delta d = 0.26, Mann-Whitney U test p < 2.2e−16; Fig. 5D), or higher expression levels of conserved ORFs that have down same ORFs on their transcripts compared to conserved ORFs with up same or up opposite piggybacking ORFs (down same vs up same: d = 0.2, p = 5.4e−3; down same vs up opposite: d = 0.34, p = 6.5e−4, Mann-Whitney U test, d: Cliff’s Delta; Fig. 5E).

Based on these results, we reasoned that transcriptional readthrough could be the molecular mechanism underlying the efficient transcriptional piggybacking of down same de novo ORFs. To investigate this hypothesis, we examined the impact of the transcription termination factors Pcf11 and Nrd1 on the frequency of transcript sharing between a conserved ORF and its downstream de novo ORF. Analyzing publicly available ChIP-exo data [68], we found that the presence of termination factors between conserved ORFs and their downstream de novo ORFs resulted in a notably lower percentage of shared transcripts (Cliff’s Delta d = −0.39, p = 1.59e−10, Mann-Whitney U test; Fig. 5F). As an illustration, consider the genomic region on chromosome II from bases 194,000 to 196,000, containing the conserved ORF YBL015W and a downstream de novo ORF (positions 195,794 to 195,847). No termination factor is bound to the intervening DNA between these two ORFs. This pair has high coexpression, with ρ = 0.96, and we observed that nearly all transcripts in this region containing the de novo ORF also include YBL015W (Fig. 5G). In contrast, the genomic region on chromosome XVI from 639,000 to 641,800, containing the conserved ORF YPR034W and a downstream de novo ORF (positions 641,385 to 641,534), does have Pcf11 bound between the pair, and as expected, none of the transcripts in this region contain both YPR034W and the de novo ORF, which have poor coexpression as a result (ρ = 0.1; Fig. 5H). We conclude that sharing a transcript via transcriptional readthrough is the major transcriptional piggybacking mechanism for down same de novo ORFs.
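The terminator comparison can be illustrated with a brief Python sketch. It is a simplified illustration rather than the analysis performed here: ORFs are (start, end) intervals on one chromosome, termination factor binding is reduced to single hypothetical positions, and the shared-transcript fractions are invented values.

```python
from statistics import median

def intervening_region(conserved, de_novo):
    """Inclusive coordinates of the gap between two ORFs, or None if they overlap or abut."""
    (c_start, c_end), (d_start, d_end) = conserved, de_novo
    if c_end + 1 < d_start:
        return (c_end + 1, d_start - 1)
    if d_end + 1 < c_start:
        return (d_end + 1, c_start - 1)
    return None

def terminator_between(conserved, de_novo, binding_sites):
    """True if any termination factor binding position falls in the gap."""
    region = intervening_region(conserved, de_novo)
    if region is None:
        return False
    lo, hi = region
    return any(lo <= site <= hi for site in binding_sites)

# Hypothetical pairs: (conserved ORF, de novo ORF, fraction of shared transcripts).
pairs = [
    ((1000, 2000), (2100, 2200), 0.90),
    ((5000, 6000), (6150, 6250), 0.10),
    ((9000, 9800), (9900, 9990), 0.75),
    ((12000, 13000), (13200, 13300), 0.00),
]
binding_sites = [6080, 13100]  # hypothetical Pcf11 or Nrd1 binding positions

with_term = [f for c, d, f in pairs if terminator_between(c, d, binding_sites)]
without_term = [f for c, d, f in pairs if not terminator_between(c, d, binding_sites)]

print("median shared fraction, terminator present:", median(with_term))
print("median shared fraction, terminator absent: ", median(without_term))
```

The two resulting groups of fractions are what a Mann-Whitney U test and Cliff’s Delta would then compare.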