Introduction

Retrotransposons are transposable elements that replicate via an RNA intermediate [1]. They often make up a substantial fraction of the host genome, occupying more than 40% of the human genome [2] and more than 50% of the maize genome [3]. Retrotransposons play a role in genome evolution [4] and can ultimately impact gene expression. However, our understanding of the phylogenetic diversity of retrotransposons and their role in genome evolution is largely based on model organisms such as Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Mus musculus and Bombyx mori. Animals living in marine environments, and in the deep sea in particular, have been underrepresented in studies of transposable elements. For this reason, we explored the genome of the deep-sea tubeworm Lamellibrachia luymesi (Siboglinidae, Annelida) [5], which relies on chemoautotrophic endosymbionts to inhabit hydrocarbon seeps in the Gulf of Mexico.

Retrotransposons are usually classified into two categories: LTR retrotransposons and non-LTR retrotransposons. Long terminal repeat (LTR) retrotransposons are transposable elements characterized by long terminal repeats (LTRs) flanking an internal coding region. LTR retrotransposons often serve as a model for the study of retroviruses [6], because the two are structurally similar and phylogenetically related [7]. The main distinguishing characteristic is the presence of an envelope (env) gene in retroviruses, which is absent in LTR retrotransposons. LTR retrotransposons are classified into three superfamilies (Copia, Gypsy and Bel-pao), which differ in the arrangement of the protein domains encoded within the pol gene [8]. The two most common LTR retrotransposon superfamilies, Copia and Gypsy, are found in almost all eukaryotic lineages sampled to date [9]. These superfamilies differ in distribution, abundance and diversity depending on the element type and the host taxon being considered [10].

LTR retrotransposons (Fig. 1) comprise two long terminal repeats, each ranging from a few hundred bases to more than 5 kb and usually starting with 5’-TG-3’ and ending with 5’-CA-3’, together with a target site duplication (TSD) of 4-6 bp, a polypurine tract (PPT), a primer binding site (PBS) and the gag and pol genes located between the two LTRs [11, 12]. The gag gene encodes a structural protein that is essential for assembly of virus-like particles, while the pol gene encodes four protein domains: a protease (PR), which cleaves the Pol polyprotein; a ribonuclease H (RH), which cleaves the RNA in the DNA-RNA hybrid; a reverse transcriptase (RT), which copies the retrotransposon RNA into cDNA; and an integrase (INT), which integrates the cDNA into the genome. Occasionally, an additional open reading frame (aORF) may be present downstream or upstream of the gag-pol genes, in sense or antisense orientation [13, 14]. aORFs in the sense orientation encode proteins with certain structural and functional similarities to the env domain of retroviruses, and hence are sometimes called env-like domains [15, 16]. The env domain encodes a protein that binds the cellular receptor, facilitates the early steps of the virus-cell interaction and drives the fusion of the viral and host cell membranes [17]. In contrast, the function of aORFs in the antisense orientation is not clearly known; however, studies carried out so far suggest that they may play a regulatory role in retrotransposition [16, 18, 19].

Fig. 1

Structure of an LTR retrotransposon. Gag - group-specific antigen gene; TSD - target site duplication; PR - aspartic protease gene; RT - reverse transcriptase gene; RH - ribonuclease H gene; INT - integrase gene; PBS - primer binding site; PPT - polypurine tract. The LTR retrotransposon structure was drawn using Adobe Illustrator.

In previous reports, retroelements have been identified in marine organisms including sea urchins [20], coral endosymbionts [21] and crustaceans [22]. However, to the best of our knowledge, there has been minimal effort to characterize the LTR retrotransposons present in deep-sea (>200 m) animals or in annelids. Available studies [5, 23, 24] tend to consider transposable elements only in the context of their contribution to genome composition, rather than assessing the elements and their evolution in detail. Of particular interest, Li et al. assessed Lamellibrachia luymesi van der Land & Norrevang 1975, a deep-sea annelid. L. luymesi is a vestimentiferan tubeworm that forms bush-like aggregations at hydrocarbon seeps in the Gulf of Mexico. These animals lack a digestive tract and host sulfide-oxidizing, horizontally transmitted bacterial symbionts for nutrition and growth [5, 25, 26, 27]. Li et al. reported that 2.52% of the genome consisted of LTR retroelements. However, the goal of that analysis was to determine how much of the genome’s DNA was derived from repetitive elements using RepeatModeler [28] and RepeatMasker [29]. Their approach included altered copies such as truncated elements and solo LTRs, giving a comprehensive view of the composition of the L. luymesi genome rather than an exploration of LTR retroelement biology. In the current study, we further characterized and classified the LTR retrotransposons present in the genome of L. luymesi to shed light on the representation of LTR retrotransposon superfamilies and to improve understanding of the potential function and structure of intact elements. In addition, we estimated the insertion times of these elements to determine whether they result from recent or ancient events.

We hypothesized the possible presence of unknown LTR-retrotransposon families in marine organisms or unsampled animal lineages. This work represents an important step towards the characterization of LTR retrotransposons in marine systems (70% of the biosphere) and in unexplored animal lineages (e.g., annelids).

Results

Identification and classification of LTR retrotransposons

A total of 223 intact LTR retrotransposons (Supplementary Tables 1, 2) were identified in the 688 Mb L. luymesi genome by screening and adjusting LTR candidates from LTRharvest and LTR_Finder using the modules of LTR_retriever (Fig. 2). Of the 223 intact LTR retrotransposons identified by LTR_retriever, 171 were classified as Gypsy, 1 as Copia and 51 as unknown.

Fig. 2

Bioinformatics pipeline for annotation of LTR retrotransposons in L. luymesi.

To further classify these elements, TEsorter was used to search their internal regions against the Gypsy database (GYDB). Elements matching at least one domain profile in GYDB were classified. All 171 Gypsy elements and the 1 Copia element classified by LTR_retriever were also classified as Gypsy and Copia, respectively, by TEsorter. In addition, of the 51 elements classified as unknown by LTR_retriever, TEsorter classified 7 as Gypsy, 2 as Bel-pao and 1 as Copia. The rest were not classified. In total, TEsorter therefore classified 182 of the 223 intact LTR retrotransposons identified by LTR_retriever (Supplementary Table 2).

Further analyses were carried out on the remaining 41 elements not classified by TEsorter. The internal regions of these unclassified elements were manually searched against PFAM [30] and the Conserved Domains Database (CDD) [31] to identify any domains they contained. Results showed that 24 of the elements lacked domains matching any known profiles in the databases, 10 had domains that were unrelated to LTR retrotransposons (e.g., a transmembrane receptor or a coagulation-inhibition site), while the remaining 8 had only RT domains (Supplementary Table 1). To further verify and classify these elements, we used the REXdb-metazoan database option of TEsorter. We also performed a manual hmmscan search using GYDB HMM profiles. The REXdb-metazoan option classified these elements as LINEs (long interspersed nuclear elements), while no match was found in the GYDB HMM profile scan. Because these 41 elements could not be classified accurately, they were excluded from further analysis.

Summary details of the 182 LTR retrotransposons used for downstream analysis, comprising 178 Gypsy, 2 Bel-pao and 2 Copia elements, are shown in Table 1.
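The tallies reported in this section can be reconciled with a few lines of arithmetic. The sketch below is purely illustrative: the variable names are ours, and the counts are copied from the text rather than recomputed from the data.

```python
from collections import Counter

# Counts reported by LTR_retriever for the 223 intact elements.
ltr_retriever = Counter({"Gypsy": 171, "Copia": 1, "unknown": 51})

# Elements left as "unknown" by LTR_retriever that TEsorter (GYDB) classified.
tesorter_rescued = Counter({"Gypsy": 7, "Bel-pao": 2, "Copia": 1})

# Final per-superfamily totals: LTR_retriever assignments plus rescued elements.
final = Counter({k: v for k, v in ltr_retriever.items() if k != "unknown"})
final.update(tesorter_rescued)

classified = sum(final.values())
unclassified = sum(ltr_retriever.values()) - classified

print(dict(final))
print("classified:", classified, "unclassified:", unclassified)
```

The totals agree with the 178 Gypsy, 2 Copia and 2 Bel-pao elements carried forward, and with the 41 elements set aside.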

Table 1 Summary of LTR retrotransposons in L. luymesi

Structural characterization

Of the 182 identified LTR retrotransposons, 32 elements had all domains (Gag and Pol: RT, INT, RH, PR) present, with the remainder having at least one domain present. Of the 178 Gypsy elements, 30 had a complete set of domains; both Bel-pao elements had a complete set of domains, and both Copia elements lacked a complete set. Further analysis of the position of these elements relative to coding elements showed that 26.4% overlapped with coding elements, 46.2% were located < 5 kb from coding elements, 10.4% were located 5-10 kb away and the remaining 17% were more than 10 kb away.
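The proximity classes can be expressed as a simple binning rule. This is a minimal sketch, not the procedure used in the study: it reads the first non-overlapping class as elements within 5 kb of a coding element, and the function name, the use of 0 bp to denote overlap and the exact boundary handling are all assumptions made for illustration.

```python
def proximity_bin(distance_bp: int) -> str:
    """Classify an element by distance (bp) to the nearest coding element.

    A distance of 0 is taken to mean the element overlaps a coding element
    (an assumption of this sketch).
    """
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp == 0:
        return "overlapping"
    if distance_bp < 5_000:
        return "<5 kb"
    if distance_bp <= 10_000:
        return "5-10 kb"
    return ">10 kb"


# Percentages as reported in the text; they should account for all elements.
reported = {"overlapping": 26.4, "<5 kb": 46.2, "5-10 kb": 10.4, ">10 kb": 17.0}
total = round(sum(reported.values()), 1)
print(total)
```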

The target site duplications flanking the identified LTR retrotransposons ranged from 3 to 5 bp in length, with the majority being 5 bp. Palindromic motifs detected in the elements include TGCA, TACA, TATA, TCGT, TGAA, TGAC, TGAT and TTAT, with 89% of the LTR retrotransposons having the TGCA motif. In addition, the identified LTR retrotransposons differed substantially in length, ranging from 1389 to 8866 bp, while the length of the LTRs ranged from 103 to 1468 bp (Supplementary Table 2).

Estimation of insertion time

Insertion times of LTR retrotransposons in the L. luymesi genome suggest that most elements were inserted around 1.0 million years ago (MYA; Fig. 3). The oldest complete retrotransposon observed was a Gypsy element, inserted around 2 MYA. Interestingly, 50 Gypsy elements showed 100% LTR identity, suggesting that they were inserted into the genome very recently. However, calculations of insertion times used a substitution rate of 1.3 × 10⁻⁸ substitutions per bp per year, the LTR_retriever default, which is based on the rice genome. Although these insertion time estimates for L. luymesi should therefore be viewed with caution, decreasing the rate two- or three-fold still suggests insertion times within the last few million years.
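Because insertion time is computed as T = K/(2μ) (see Methods), an estimate scales inversely with the assumed substitution rate. The sketch below illustrates this sensitivity for the oldest element; the function name is hypothetical and the calculation is a rescaling of the reported estimate, not a re-analysis of the data.

```python
DEFAULT_RATE = 1.3e-8  # substitutions per bp per year (rate used in the text)


def rescale_insertion_time(t_years: float, fold_decrease: float,
                           rate: float = DEFAULT_RATE) -> float:
    """Re-estimate an insertion time after lowering the rate by a given fold."""
    k = 2 * rate * t_years                   # divergence implied by the estimate
    return k / (2 * (rate / fold_decrease))  # T = K / (2 * mu) with the new rate


for fold in (1, 2, 3):
    t = rescale_insertion_time(2e6, fold)
    print(f"rate / {fold}: {t / 1e6:.1f} MYA")
```

Under these assumptions, an element dated at about 2 MYA would be dated at about 4 or 6 MYA with a two- or three-fold lower rate.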

Fig. 3

Insertion time distribution of intact LTR-RT in L. luymesi genome. Chart was generated using GraphPad Prism.

Phylogenetic analysis of LTR-retrotransposons

Phylogenetic analysis corroborated the assignments made by TEsorter. However, weak internodal support limited inferences about evolutionary relationships. Final family assignments were based on the placement of elements within strongly supported monophyletic lineages representing gene families (Fig. 4 for the RT domain, Fig. 5 for the RH domain and Fig. 6 for the INT domain). Because the domains may have non-concordant evolutionary histories, they were not combined into a single phylogenetic analysis. Naming conventions based on the phylogenetic analyses are described in the Methods section.

Fig. 4

RT domain phylogenetic tree. RT phylogenetic tree was generated in IQtree with the LG + F + R6 model. Tree lines are color-coded according to the superfamily above it. Elements in red are elements identified in the genome of L. luymesi.

Fig. 5

RnaseH domain phylogenetic tree. RnaseH phylogenetic tree was generated in IQtree with the LG + R7 model. Tree lines are color-coded according to the superfamily above it. Elements in red are elements identified in the genome of L. luymesi.

Fig. 6

INT domain phylogenetic tree. INT phylogenetic tree was generated in IQtree with the LG + R7 model. Tree lines are color-coded according to the superfamily name above it. Elements in red are elements identified in the genome of L. luymesi.

For Gypsy elements, phylogenetic analysis of the RT, RH and INT sequences showed that some elements fall into recognized families such as CSRN1 [32], Gmr1 [33] and Mag [...]

Classification of discovered LTR retrotransposons

Classification of LTR retrotransposons depends on the presence and order of protein domains within the pol gene [11] (Fig. 1). LTR_retriever classified each LTR retrotransposon candidate by identifying its conserved protein domains using profile hidden Markov models (pHMMs) of LTR retrotransposon domains from the Pfam database [30]. Elements returning ambiguous pHMM matches were classified as unknown.

To refine the classification, we employed TEsorter v1.2.5 [59], which translated the nucleotide sequences of LTR retrotransposon candidates in all six frames and searched the translations against HMM profiles obtained from existing protein databases of mobile elements, specifically REXdb [14] and the Gypsy database of mobile genetic elements (GYDB) [60]. For each domain of a sequence, only the hit with the highest score was retained. Classification into superfamilies and families was based on hits of the pol and gag genes to the curated databases. Elements with no domain hits were not classified.

For this step, FASTA sequences of LTR retrotransposon candidates were first extracted using the call_by_seq_list.pl script from the LTR_retriever package. The extracted sequences were then input into TEsorter (parameters = ‘-db gydb, -st nucl and -p 10’) for further classification.
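The best-hit rule described above (for each domain of a sequence, only the highest-scoring hit is retained) can be sketched as follows. This is not TEsorter's own code: the Hit record, its field names and the example identifiers and scores are hypothetical and serve only to show the selection logic.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    element: str   # element identifier (hypothetical)
    domain: str    # domain type, e.g. "RT", "INT", "RH"
    profile: str   # name of the matching HMM profile (hypothetical)
    score: float   # hit score; higher is better


def best_hits(hits: Iterable[Hit]) -> Dict[Tuple[str, str], Hit]:
    """Keep only the highest-scoring hit for each (element, domain) pair."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.element, hit.domain)
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    return best


# Made-up example: two competing RT hits and one INT hit for the same element.
hits = [
    Hit("elem1", "RT", "profileA", 120.5),
    Hit("elem1", "RT", "profileB", 98.0),
    Hit("elem1", "INT", "profileC", 75.2),
]
kept = best_hits(hits)
for (element, domain), hit in sorted(kept.items()):
    print(element, domain, hit.profile, hit.score)
```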

Naming conventions

To facilitate communication, naming conventions were created for the LTR retrotransposon families and elements identified in this study. Gypsy families were designated LGF (Lamellibrachia Gypsy Family) followed by a unique number (e.g., LGF1, LGF2), Copia families were designated LCF (Lamellibrachia Copia Family) followed by a unique number (e.g., LCF1), and Bel-pao families were designated LBF (Lamellibrachia Bel-pao Family) followed by a unique number (e.g., LBF1). Individual LTR retrotransposons were designated LLXY#, where LL stands for L. luymesi, XY denotes the first two letters of the superfamily to which the element belongs and # denotes the element number (e.g., LLGY1 represents a Gypsy element).
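The naming scheme can be illustrated with a short sketch. The function names are hypothetical, and only the LLGY prefix for Gypsy elements is given explicitly in the text; the prefixes produced for Copia and Bel-pao elements are inferred here from the stated "first two letters of the superfamily" rule.

```python
# Family prefixes as defined in the text.
FAMILY_PREFIX = {"Gypsy": "LGF", "Copia": "LCF", "Bel-pao": "LBF"}


def family_name(superfamily: str, number: int) -> str:
    """Build a family name such as LGF1 from a superfamily and family number."""
    return f"{FAMILY_PREFIX[superfamily]}{number}"


def element_name(superfamily: str, number: int) -> str:
    """Build an element name of the form LLXY#, e.g. LLGY1 for a Gypsy element."""
    return f"LL{superfamily[:2].upper()}{number}"


print(family_name("Gypsy", 2))
print(element_name("Gypsy", 1))
```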

Phylogenetic analysis

Phylogenetic analysis was used to further validate the family-level assignment of these elements and to assess the evolutionary position of the L. luymesi LTR retrotransposon candidates. For this purpose, amino acid sequences of the INT, RT and RH domains were extracted from the LTR retrotransposon candidates following the guidelines of the TEsorter package. Gag and protease (PR) sequences were excluded from the analyses because their variability prevents reliable alignment [61, 62].

To infer phylogenetic trees, amino acid sequences of the INT, RH and RT domains from other organisms were obtained from GYDB and recent studies [47, 53, 63], and aligned using MAFFT v7.407 [64] with the amino acid sequences of the INT, RT and RH domains from the LTR retrotransposons found in the L. luymesi genome. Each of the three domains was analyzed separately; a combined analysis was not performed because of differences in taxon sampling and because the domains may have distinct evolutionary histories. Maximum likelihood with bootstrap analysis was used to construct phylogenetic trees in IQtree v1.6.12 [65] with the following parameters: ‘-bb 100000, -nt AUTO, --runs 5’. The substitution model selected by IQtree was LG+R7 for the INT domain tree, LG+F+R6 for the RT domain tree and LG+R7 for the RH domain tree. Phylogenetic trees were mid-point rooted, visualized and edited using Figtree v1.4.2 [66].

Estimation of insertion time

Time since initial insertion of LTR retrotransposon candidates was estimated using scripts implemented in the LTR_retriever package. Insertion times were calculated as T = K/(2μ), where K is the divergence between the two LTRs of an element estimated under the Jukes-Cantor model, K = −3/4 × ln(1 − 4d/3) [67], d is the proportion of sites that differ between the two LTRs, and μ is the neutral mutation rate, set at 1.3 × 10⁻⁸ mutations per bp per year [68].
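The calculation can be written out as a short sketch. This is an illustration of the formula above rather than the LTR_retriever implementation; the function names are ours, and taking d as one minus the fractional identity between the two LTRs is an assumption of the sketch.

```python
import math


def jukes_cantor_distance(d: float) -> float:
    """Jukes-Cantor corrected divergence K = -3/4 * ln(1 - 4d/3)."""
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


def insertion_time(ltr_identity: float, mu: float = 1.3e-8) -> float:
    """Insertion time in years, T = K / (2 * mu).

    ltr_identity is the fractional identity between the two LTRs (0-1);
    d is taken as 1 - ltr_identity (assumption of this sketch).
    """
    d = 1 - ltr_identity
    return jukes_cantor_distance(d) / (2 * mu)


# Example with a made-up identity value of 98%.
t = insertion_time(0.98)
print(f"{t / 1e6:.2f} million years")
```

Identical LTRs (100% identity) give K = 0 and therefore an insertion time of zero, which is why such elements are interpreted as very recent insertions.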