Background

Transposable elements (TEs) are selfish genetic elements that replicate independently from the replication of the host genome [1, 2]. They are widespread and ubiquitous across all branches of the eukaryotic tree of life and, although showing a remarkable sequence diversity across organisms, the conservation of common catalytic domains responsible for their replication suggests that their emergence could be traced back to the eukaryotic most recent common ancestor or even predate it [3].

TE classification is not straightforward, although many efforts have been made to reconcile their diversity within a systematic framework. Two main classes are generally recognized: class I, which includes all TEs replicating via RNA intermediates, and class II, which comprises TEs moving via DNA intermediates [4]. This latter distinction still represents the only unambiguous classification of TEs. Conversely, the within-class diversity is much more complicated to analyze, since it can be assessed with both mechanistic and homology-based criteria [5]. For example, considering the way TEs replicate and reintegrate, all class I elements use a “copy-and-paste” mechanism, while class II elements exhibit several models: the classical “cut-and-paste,” the “peel-and-paste” (also known as rolling-circle replication), or even the “self-synthesizing” model (reviewed in [5]). The current classification scheme, which is also implemented in the main TE database, Repbase [6], is based on homology and structural similarities [7]. Class I elements mainly include long terminal repeat (LTR) elements and long interspersed nuclear elements (LINEs, also indicated as non-LTR elements), which encode a reverse transcriptase (RT), an endonuclease (EN), and other domains used to reintegrate in the host genome. Class II elements, on the other hand, include terminal inverted repeat (TIR) elements, Helitrons, and Mavericks (also known as Polintons). In addition, both classes include non-autonomous elements (short interspersed nuclear elements, SINEs, and miniature inverted-repeat transposable elements, MITEs), usually smaller TEs that do not code for the enzymes necessary for replication/reintegration but parasitize those encoded by their autonomous counterparts [7]. Beyond this commonly accepted scheme, further classification efforts are less clear. Generally speaking, for coding TEs, the clustering pattern obtained from a phylogenetic analysis of their ORF(s) is taken as an indication of which groups of elements should be considered possible families or clades [5].

Although a common approach, the phylogenetic framework has limitations in this context both because of the sometimes unclear homology of TE ORFs and the genomic turnover of paralogous TE lineages blurring the phylogenetic signal [8].

The replicative dynamics of TEs may themselves impact their phylogenetic clustering: in fact, based on studies of mutation distribution in non-autonomous class I Alu sequences in the human genome, two distinct models have been formulated to explain how TEs replicate [9]. The first model, named “master gene model,” implies that one or a few copies give rise to all other copies in the genome, producing new, so-called families each time a master copy mutates. In this way, new families are generated in different timeframes. On the contrary, in the other model, termed “transposon model,” each new copy can produce other copies, with the outcome that several families are produced at nearly the same time.

The rate at which TEs replicate can be a function of several different factors, including the ability of the host genome to limit their uncontrolled proliferation. In particular, the successful invasion of a genome by TEs can depend on a complex interplay among TE features, host genome biology, repression mechanisms interfering with TE functionality, and the extent of selective pressures on the outcome of TE insertions [10]. Despite this, some TE lineages managed to reach very high copy numbers in the host genomes, apparently escaping such controlling mechanisms. A suitable model to explain these dynamics has been formulated on the well-studied human SINE family Alu and on their autonomous counterparts L1 LINEs. These elements show several subfamilies that evolved following a master gene model in different hominid lineages during the last few million years. However, their origin seems to predate their species-specific expansions by far, with little or no transposition for tens of millions of years. Han et al. [1), a pattern consistent with an ongoing transposition of these elements [35].

Table 1 Spearman’s rho correlation coefficients between family-based LINE transcript levels and number of insertions

We also added SINE families in the same RepeatMasker run, and we obtained their reliable genome occurrence in the 13 species selected for in-depth SINE mining (see the “Richness, diversity, and distribution of RT-containing LINEs” and “Construction of a manually curated library for LINE, SINEs, and DDE/D-related transposons” sections). SINE genome occurrence can vary greatly between and within species belonging to different bivalve orders (Table 2). The genomes of A. marissinica (6.02%), T. granosa (3.69%), S. broughtonii (4.37%), and B. platifrons (4.68%) host a relatively high number of SINEs, whereas we observed a great reduction in the genomes of C. gigas (0.08%) and S. glomerata (0.31%). Different SINE types successfully colonized different bivalve genomes: the Deu family was found to be dominant in A. marissinica (72% of the overall SINE complement), C. sinensis (94%), and S. broughtonii (55%), while the V family is dominant in the B. platifrons genome (67%) and the Meta family in S. constricta (54%) and S. grandis (50%). In T. granosa, both Deu and V families occupy a considerable proportion of the overall SINE complement (30% and 46%, respectively). Finally, we did not find any evidence of a significant correlation between SINE and LINE genomic occurrence (Spearman’s rho = 0.31, p = 0.33).

Table 2 Percentage of genome occurrence of different SINE types in the 13 selected bivalves

Discussion

A comprehensive TE annotation for bivalves

The phylum Mollusca shows a high level of organism diversity and includes species that are important for both their ecological and economic value. Although genomic studies are accumulating and comparative analyses are becoming more common for these organisms, a deep analysis of the mobilome is still limited to single genomes or to a few comparative studies with only a handful of species [24, 25]. As could be expected, this also resulted in a scarce representativity of molluscan TEs in the public databases which makes their automated annotation less reliable. As previously shown, high-quality, manually curated repeat libraries are considered necessary for a consistent, reliable repeat annotation and characterization in novel genomes [26, 27]. In the present analysis, we decided to focus our efforts on bivalves, which represent 27 out of the 39 analyzed genomes, due to the recent, increasing genome sequencing efforts for this class. The inclusion of five gastropod genomes, representative of their major lineages, together with two cephalopods, one polyplacophoran genome, and three annelids allowed us to identify the major shifts in TE composition that occurred during molluscan evolutionary history. To overcome the limitations of automatically generated TE sequence libraries, we set up a pipeline which included both automated, ORF-based extraction and classification and manual curation approaches and that has been used consistently across the analyzed genomes. In particular, the manual curation process allowed us to provide the first freely available and manually curated repeat library for bivalves, comprising DDE/D, LINEs, and a subset of SINE elements for a total of 1609 elements comprising all identified LINEs, with the exception of the low copy number R2 superfamily and 12 different DDE/D-related superfamilies. These new genomic resources could help future genome annotation projects and shed novel insight into TE evolutionary dynamics in bivalves. 
On the other hand, the ORF-based approach allows us to confidently characterize both LINE and DDE/D-related TE complements. As a comparison, concerning LINEs in the Repbase library v. 20181026, 1031 sequences are deposited for molluscs, with 796 of them belonging to the well-characterized C. gigas. Fifty-nine of these are annotated as LINEs and, more specifically: one R2, two CR1, 12 CR1-Zenon, 14 L1-Tx1, and 27 RTE-X. In the present analysis, we also found multiple Proto2, RTE-BovB, and L2/L2-2 elements. Regarding DDE/D transposons, out of 422 total sequences coming from Repbase for C. gigas, 92 possess an ORF longer than 300 amino acids, and they belong to 13 different superfamilies. With our approach, we managed to identify ORF-derived signatures from all of them, with the exception of Zator, Merlin, and Sola1, for each of which only two sequences are deposited in Repbase. Overall, these results suggest that our ORF-based approach flexibly captures most of the diversity of coding TEs in non-model species.

We also paid particular attention to filtering out possible misannotations from the automatically generated TE sequence libraries, such as the inclusion of repetitive genes, tandem repeats, and degenerate and low-copy number families, which are hard to correctly annotate and classify. This approach is probably quite conservative; indeed, in some instances, it provided different estimates of the overall TE content compared to published genome papers. For example, in Mytilus edulis, our study estimated 47% of TE content vs the 56% provided in [36]; the same holds for S. glomerata (42% in the present study vs 45% in [37]) and for A. granulata (18% here vs 23% in [38]). In other instances, though, our analysis provided almost the same estimates as in the previous analyses, as in M. coruscus (49% here vs 47% in [39]), A. immaculata (41% here vs 40% in [40]), and M. mercenaria (51% here vs 49% in [41]).

TEs have been shown to be one of the major contributors to genome size evolution in metazoan lineages, such as insects [42] and vertebrates [43], and in angiosperms as well [44]. Our analyses provided further support for this hypothesis, finding a positive correlation between TE content and assembly size also in molluscs. Across bivalves, the TE content varies greatly, ranging from ~ 20% in the Pectinida M. yessoensis up to ~ 60% in the Mytilida M. philippinarum. Different sequencing technologies and sequencing depths could potentially contribute to such differences; however, it must be noted that we also observed a high TE content for Illumina-sequenced genomes, such as those of M. philippinarum and B. platifrons. Of particular interest is the low TE load found across all analyzed Pectinida species. In fact, this order includes the most TE-poor bivalve species, with almost twofold less TE content compared to Mytilida and Ostreida. Similar occurrences of interspersed repeats were already observed for this lineage during whole genome sequencing projects [45,46,47,48], and transposable elements hosted by M. yessoensis were found to be generally less active in recent times compared to what was observed in the Pacific and pearl oysters [46]. This low TE activity was suggested to be the reason behind their conserved genome architecture, which could resemble that of bilaterian ancestors [46]. However, as well described in birds, low TE content and apparent lack of activity could also originate from nonallelic homologous recombination, which could physically remove TEs and other repetitive regions from the genome without implying a general genomic stability [49].

Concerning class I elements, LTR elements in general occupy a low proportion of host genomes as previously observed by [24], while we found LINE elements as the richest retroelements. They contribute from 1.61 to 10.84% respectively in C. virginica and M. coruscus genomes using automatically generated TE sequence libraries and between 6.18% for A. marissinica and 0.82% for P. fucata using manually curated libraries. A similar scenario occurs also for SINE elements, whose genome coverage can greatly vary between different bivalve species using both automatic and manually curated libraries. In both instances, we identify the genomes of A. marissinica, B. platifrons, and Arcida as richer in SINEs compared to other analyzed bivalves, but we did not find any evidence of a general increase of the SINE complement coupled with an increase of their autonomous counterparts LINEs.

Class II and RC elements generally outnumber other TEs, especially in bivalves, where DNA elements were found to be significantly enriched compared to all retroposons. This is strikingly different from what is observed in mammals, where retroposons constitute the most successful TE group, but similar to what is observed in actinopterygian fishes, where class II elements greatly dominate the overall TE content [43]. Moreover, we found that non-autonomous counterparts (MITEs) occupy a considerable proportion of host genomes, suggesting a high proliferation of small, non-autonomous copies. Among the richest superfamilies of DDE/D ORF-derived signatures in bivalve genomes, we identified TcMariner and hAT lineages. Interestingly, the same superfamilies were also found to be the richest in ORF signatures in all other analyzed molluscs and to be ubiquitous even when using the automatically generated TE sequence libraries. Both TcMar and hAT superfamilies were found to be anciently expanded across cephalopods in a recent study [25], possibly suggesting that their high representation is a plesiomorphic state of molluscs. On the other hand, we could identify notable examples of bivalve-specific expansion, such as for Academ and RC elements. The former seems to be poorly represented in non-bivalve genomes, with only a few ORFs identified in the annelids C. teleta and H. robusta and few insertions annotated in non-bivalve molluscs when using automatically generated libraries. RC elements can occupy up to 12% of the genome in the analyzed Crassostrea species. As a comparison, RC elements have a patchier distribution in arthropod genomes, generally contributing to a smaller extent to genome size, with only a few lineage-restricted expansions (e.g., Drosophila and Musca domestica [42, 50]). Also in plants, where they were first discovered, they are usually less represented, covering a maximum of 6% of the maize genome [

Conclusions

In the present study, we performed the first comparative analysis of transposable element evolutionary dynamics across molluscs with a particular emphasis on bivalves, an ecologically and economically important group. Despite genomic resources still being limited to few representative species compared to other clades, such as insects, the relatively low taxon sampling allowed us to deeply characterize for the first time their LINE and class II DDE/D-related complement. Moreover, because a high-quality repeat library is essential for the analyses of new genomes, our reference set of classified LINEs and DDE/D elements can be used to improve genome annotations and/or to easily classify novel elements across other lophotrochozoans. We also want to emphasize the necessity to extend similar analyses to other classes of transposons, empowering the scientific community with novel and high-quality genomic resources. While TEs have been hypothesized to be involved in the evolution of multiple bivalve genomic oddities, such as high levels of gene presence-absence variation [79] and of hemizygosity [80], the ability to identify their possible role deeply and consistently in shaping bivalve genome evolution will be limited as long as the great majority of elements are unclassified, fragmented, or not freely accessible for the scientific community.

With our approach, we discovered a diverse set of LINEs and DDE/D that were likely already greatly diversified in the most recent common ancestor of bivalves. The restricted emergence of the bivalve-rich Proto2, RTE-X, CR1-Zenon, and Academ elements could have contributed to bivalve fast radiation providing novel raw genomic material for their diversification. Moreover, we found that this LINE diversity seems to be maintained across extant species by an equally diverse set of potentially contemporary active families that could follow a stealth driver model of evolution. Indeed, multiple families seem to be able to survive and co-exist for a long period of time in the host genome without triggering the evolution of sequence-specific repression mechanisms, resembling what was previously observed in multiple non-mammalian vertebrates such as lizards and fishes. Finally, despite their relatively low genome occurrence, several LINE superfamilies/clades/types emerged, and others contracted in a lineage-specific manner during the diversification of bivalves. Therefore, this highly diverse LINE complement, despite being less represented than class II elements, is a rather dynamic portion of bivalve genomes and can play important roles in local adaptations and lineage-specific evolutionary dynamics.

Methods

Genomic resources and phylogeny construction

Thirty-six molluscs and three annelid genomes were downloaded from publicly available resources (NCBI, GigaDB, Dryad, MolluscDB, dbSROG, and Phaidra, see Additional file 1: Table S1), giving preference to bivalve assemblies representative of their major clades. Concerning molluscs, we selected 27 genomes belonging to bivalves, five to gastropods, two to cephalopods, and one to the polyplacophoran A. granulata. The species tree was manually reconstructed following the phylogenetic relationships found in recent phylogenomic studies [81,82,83,84] as well as the reference phylogeny presented in MolluscDB [85].

Mining and annotation of interspersed repeats

For each analyzed genome, we compiled species-specific repeat libraries using a combination of structural and homology-based methods. RepeatModeler v. 2.0.1 [86] with the LTR pipeline extension which includes the structural-based LTRharvest [87] and LTR_retriever packages [88], MITE Tracker [89], and HelitronScanner v. 1.1 [2: Fig. S1.

ORF-based annotation of RT containing LINEs and class II DDE/D elements

To obtain a more precise picture of the representation of different superfamilies and clades of both LINEs and DDE/D class II elements, we applied an ORF-based extraction and classification pipeline. Firstly, insertion sites resulting from RepeatCraft analyses were extracted with the bedtools suite [98] together with 1000 bp at both ends to correct for possible partial/fragmented annotations due to the likely incomplete status of automatically generated consensus sequences [26]. ORFfinder was then used to identify and extract non-overlapping open reading frames (-n) with a required methionine as the start codon and a minimum ORF length of 300 amino acids (i.e., 900 nucleotides; -ml 900). To further characterize both class II DDE/D-related transposons and LINE elements, we used an HMM-based approach. For the former, we started from the amino acid sequences corresponding to DDE/D domains found in the 17 superfamilies described in [31]. All sequences from each superfamily (namely hAT, Tc1/Mariner, PIF/Harbinger, CMC, Merlin, MULE, P, Kolobok, Novosib, Sola1, Sola2, Sola3, PiggyBac, Transib, Academ, Ginger, Zator) were downloaded and separately aligned with MAFFT v. 7.475 [99] (E-INS-i strategy), and from each alignment, we built a superfamily-specific HMM profile using the hmmbuild function from the HMMER3 package [100]. The collection of all 17 profiles was then used as a target database for hmmscan homology searches (E-value < 1E − 5) against all extracted ORFs, provisionally annotating each element based on the corresponding best hit. To avoid misclassification of Ginger elements due to their high homology to Gypsy-encoded integrases [101] and to confirm the classification of all ORFs, we additionally blasted all significant hits against the full RepeatPeps library (Blastp; E-value 1E − 05), imitating a reciprocal best-hit approach. Sequences with a best hit against a different superfamily compared to our previous HMM-based classification were considered misclassified and discarded.
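The flank extension and ORF filter described above can be illustrated with a minimal Python sketch. It is not the ORFfinder implementation: the function names (`extend_interval`, `find_orfs`) are hypothetical, only the standard stop codons are considered, overlapping ORFs in different frames are not resolved, and we assume that the 900-nucleotide minimum includes the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extend_interval(start, end, contig_len, flank=1000):
    """Extend a 0-based, end-exclusive interval by `flank` bp on both sides,
    clamped to the contig boundaries."""
    return max(0, start - flank), min(contig_len, end + flank)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_nt=900):
    """Return (strand, start, end) tuples for ATG-initiated ORFs that end at a
    stop codon and span at least `min_nt` nucleotides (stop codon included,
    which is an assumption of this sketch).

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand.
    ORFs that run off the end of the sequence without a stop are ignored.
    """
    orfs = []
    strands = (("+", seq.upper()), ("-", reverse_complement(seq).upper()))
    for strand, s in strands:
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_nt:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


# Toy sequence (not real data): Met + 300 Ala codons + stop = 906 nt.
toy = "ATG" + "GCT" * 300 + "TAA"
orfs = find_orfs(toy)
```

In the toy example, the single forward-strand ORF passes the length filter, whereas a shorter ORF would be dropped.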

For LINE elements, we started with an RPSblast search on the same set of extracted and translated ORFs against the complete CDD database (E-value < 1E − 05). Sequences with a significant hit against RT-related profiles were considered as putative retrotransposons (see Additional file 28: Table S8 for a list of CDD entries). To distinguish between LTR- and LINE-derived RT-containing ORFs, all LINE and LTR elements from the RepeatPeps library were extracted and separately aligned with MAFFT v. 7.475 (L-INS-i strategy) together with the seed sequences of the RVT_1 Pfam HMM profile (PF00078) to manually identify the boundaries of the RT domain. We extracted LINE and LTR RTs from the resulting alignments, and we built two class-specific HMM profiles with the hmmbuild function from the HMMER3 package. The two profiles were then used as a target database for hmmscan (E-value < 1E − 5) homology searches of our previously identified RT-containing ORFs. Sequences with the best hit against the LTR-specific RT profile were considered as putative LTR elements and therefore discarded from subsequent analyses. LINE elements were considered autonomous when both RT and EN domains (see Additional file 28: Table S8 for a list of CDD entries) were present on the same ORF (i.e., with no intervening stop codons). Sequences missing the EN domain were classified as RT-only LINEs.
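The decision rule described in this paragraph can be summarized as a small Python sketch. The function name, the label strings, and the choice of the lowest E-value as the "best hit" are illustrative assumptions; parsing of the actual hmmscan and RPSblast outputs is left out.

```python
def classify_rt_orf(line_rt_evalue, ltr_rt_evalue, has_en_domain,
                    max_evalue=1e-5):
    """Classify an RT-containing ORF from its hits against the two
    class-specific RT profiles.

    line_rt_evalue / ltr_rt_evalue: E-value of the hit against the LINE or
    LTR RT profile, or None when there is no hit.
    has_en_domain: True when an EN domain lies on the same ORF as the RT.
    """
    hits = {"LINE": line_rt_evalue, "LTR": ltr_rt_evalue}
    significant = {name: ev for name, ev in hits.items()
                   if ev is not None and ev < max_evalue}
    if not significant:
        return "unclassified"
    # Assumption: the best hit is the one with the lowest E-value.
    best = min(significant, key=significant.get)
    if best == "LTR":
        return "putative_LTR_discarded"
    return "autonomous_LINE" if has_en_domain else "RT_only_LINE"


# Hypothetical E-values, for illustration only.
label = classify_rt_orf(1e-40, 1e-12, has_en_domain=True)
```

Here `label` is `"autonomous_LINE"`, because the LINE-specific profile gives the better hit and an EN domain is present on the same ORF.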

To test the interplay between assembly quality and the ability to identify RT-containing and autonomous LINEs as well as DDE/D-related transposons, we checked for a correlation between the number of identified elements and contig/scaffold N50 with Spearman’s rank correlation tests.
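For readers unfamiliar with the statistic, Spearman’s rho is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below is a plain-Python illustration with hypothetical helper names; it returns only the coefficient (no p-value), and the numbers are toy values rather than data from this study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy values only: element counts vs. N50 for five imaginary assemblies.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

In practice, a statistics package would be used to obtain the associated p-value as well.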

All confirmed LINEs (regardless of being autonomous or RT-only) and DDE/D-containing transposons were clustered at the nucleotide level using CD-HIT and following the 80–80 rule (same parameter set used for repeat library construction). Therefore, hereafter, we will refer to clusters as groups of TEs related by high nucleotide homology along their coding sequence to distinguish them from the canonical transposon families which ideally should take into consideration the elements along their entire length [7].
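As an illustration, one common reading of the 80–80 rule is that two sequences are grouped when they share at least 80% nucleotide identity over at least 80% of the shorter sequence. The sketch below encodes that reading as a simple predicate; the function name is hypothetical, and the exact CD-HIT options used in the pipeline are not restated here, so the thresholds should be taken as assumptions.

```python
def passes_80_80(identity, aligned_len, len_a, len_b,
                 min_identity=0.80, min_coverage=0.80):
    """Return True when an alignment between two sequences satisfies the
    assumed 80-80 criterion.

    identity: fraction of identical positions in the alignment (0 to 1).
    aligned_len: number of aligned positions of the shorter sequence.
    len_a, len_b: full lengths of the two sequences.
    """
    shorter = min(len_a, len_b)
    if shorter == 0:
        return False
    coverage = aligned_len / shorter
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical alignment: 85% identity over 900 bp of a 1000-bp sequence.
same_cluster = passes_80_80(0.85, 900, 1000, 1200)
```

A pair failing either threshold would be kept in separate clusters under this reading.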

For LINE elements only, we additionally refer to clusters with fewer than five members as “low-copy number clusters” and to sequences that did not fall in any cluster as “singleton clusters.” For class II elements, we avoided such a classification because non-autonomous members of a family can replicate through the genome by parasitizing their autonomous counterparts. Moreover, while the presence of a complete ORF can give some first insight into which superfamilies/clades could have been more active in recent/mid times, it must be noted that this approach is not able to identify non-autonomous elements and thus greatly underestimates the number of short class II transposons.
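The size-based labels used for LINE clusters can be expressed as a short Python sketch. We assume here that a sequence not falling in any cluster is represented as a cluster of size one, as in typical clustering output; the function name and the "multi-copy" label for the remaining clusters are ours.

```python
def label_line_cluster(n_members, low_copy_threshold=5):
    """Label a LINE cluster by its number of member sequences.

    Assumption: unclustered sequences are passed in as size-1 clusters.
    """
    if n_members < 1:
        raise ValueError("a cluster must contain at least one sequence")
    if n_members == 1:
        return "singleton"
    if n_members < low_copy_threshold:
        return "low-copy"
    return "multi-copy"


# Toy cluster sizes, for illustration only.
labels = [label_line_cluster(n) for n in (1, 3, 5, 42)]
```

With these toy sizes, `labels` is `["singleton", "low-copy", "multi-copy", "multi-copy"]`.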

Tree-based classification of ORF-containing LINE elements

ORF-containing LINE elements were classified using a phylogenetic approach. We adopted the superfamily classification scheme proposed by [7] and the clade classification proposed by [29], as in [102], while we use the “type” term to refer to the RepeatMasker or Dfam classification schemes [103]. Starting from previously identified clusters (> 5 members), we extracted the amino acid sequence of the RT domain based on the coordinates of the RPSblast hits. RT segments were aligned with MAFFT v. 7.475 (G-INS-i strategy) and cleaned from columns with gaps in more than 50% of the sequences using TrimAl [104]. Cons from the EMBOSS package [105] was then used to build a consensus sequence from the resulting alignment, setting the plurality parameter to 3. RT consensus sequences were then aligned together with reference LINE sequences from [29] and a subset of LTR and LINE elements from the RepeatPeps library, using MAFFT with the G-INS-i strategy. Poorly aligned sequences were removed from the alignment using TrimAl (-resoverlap 0.75 -seqoverlap 80). Because of the short RT domain, the deep divergence time of LINE superfamilies, and the consequent difficulties in identifying stable LINE phylogenies (e.g., [29, 30, 106]), we used a combination of neighbor-joining, unconstrained maximum likelihood (ML), and constrained ML tree inferences. Each topology was then statistically tested in an ML framework to produce a confident phylogeny useful for LINE classification. We performed (a) a neighbor-joining (NJ) clustering with Clearcut v. 1.0.9 [107], reshuffling the distance matrix and using a traditional neighbor-joining algorithm (--shuffle and --neighbor options, respectively); (b) 5 unconstrained ML tree searches with IQ-TREE v. 2.1.3 [108] and the corresponding best-fit evolutionary model identified by ModelFinder2 [109]; and 6 constrained ML tree searches forcing either (c) the full NJ topology (FullNJ constraint, one run) or (d) only the monophyly of LINE superfamilies, as inferred by the NJ tree, with the exception of the Jockey and I superfamilies, which were constrained in a single, comprehensive monophyletic clade (SupFAM constraint, 5 runs). For the unconstrained and the SupFAM-constrained ML tree inferences (analyses b and d, respectively), nodal support was estimated with 1000 UltraFastBootstrap replicates [110]. All ML topologies were tested using the Kishino-Hasegawa test [111], Shimodaira-Hasegawa test [112], expected likelihood weights [113], and approximately unbiased (AU) test [114]. As an additional confirmation of our classification and to avoid the inclusion of Penelope-like elements, we (a) blasted each consensus RT (blastp; E-value < 1E − 5) against all protein sequences from the RepeatPeps library, extracting the best hit for each query sequence, and (b) used the online implementation of RTClass1 [29] on a random subset of 111 RT sequences covering all identified clades. Low-copy number clusters, singletons, and clusters removed by TrimAl were classified based on the Blastp best hit (E-value < 0.05) against tree-based classified clusters and the whole RepeatPeps library for competing purposes. For the low-copy clusters, one representative (i.e., the longest) sequence was used. For bivalve species, and excluding the poorly represented R2 superfamily, the correlation between the number of RT-containing LINEs and the number of clusters in each identified LINE clade was tested for each superfamily separately with Spearman’s rank correlation tests.
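The column-cleaning step applied to the RT alignments (removal of columns with gaps in more than 50% of the sequences) can be sketched in a few lines of Python. This is an illustration of the criterion only, not a re-implementation of TrimAl; the function name is hypothetical, and the toy alignment uses nucleotide letters for readability although the same logic applies to amino acids.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: list of equal-length aligned sequences (strings).
    A column with exactly max_gap_fraction gaps is kept.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    keep = [
        col for col in range(length)
        if sum(row[col] == gap for row in alignment) / n <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Toy alignment: the third column has gaps in 3 of 4 sequences and is dropped.
trimmed = drop_gappy_columns(["AC-T", "A--T", "ACGT", "A--T"])
```

The second column, with gaps in exactly half of the sequences, is retained because the criterion is strictly "more than 50%".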

Additional prediction of SINEs in a subset of selected species

To gain a first insight into the SINE composition of bivalves, we selected 13 species (namely, A. marissinica, C. sinensis, C. gigas, S. glomerata, T. granosa, S. broughtonii, M. coruscus, B. platifrons, S. constricta, S. grandis, P. maximus, M. yessoensis, M. nervosa) representative of Venerida, Ostreida, Arcida, Mytilida, Adapedonta, Pectinida, and Unionida, to mine additional SINE candidates using SINE_Scan v1.1.1 [115]. This software collects and validates SINE candidates based on copy number across the genome, presence of target site duplications (TSDs), and tRNA-related heads. All representative elements were merged with consensus sequences classified as SINEs by RepeatModeler in the corresponding species-specific repeat library (see the “Class-level mollusc mobilome characterization using automatically generated TE sequence libraries” section) and subjected to manual validation and curation as described in the following section. After this process, curated consensus sequences were annotated using the RepeatClassifier utility from the RepeatModeler package.

Manual curation of LINEs, SINEs, and DDE/D-related transposons

We selected a subset of the previously identified LINE, SINE, and DDE/D-containing clusters for manual refinement, following the guidelines in [27]. For LINEs, we selected all clusters with at least one autonomous element (i.e., encoding an ORF with both RT and EN domains without interrupting stop codons) and five other sequences (autonomous and/or RT-only), while for DDE/D elements, we required only the presence of at least five elements in the corresponding cluster. These criteria were chosen in order to prioritize the manual curation of sequences that likely possess one or more autonomous copies across the genome and thus could potentially have been recently mobilized or mobilize their non-autonomous counterparts. Members of LINE and DDE/D-related clusters were aligned at the nucleotide level using MAFFT (--auto strategy). CIAlign [116] was then used to remove insertions found in less than 50% of the sequences and to construct a nucleotide consensus sequence (--remove-insertions and --make-consensus options). To this set of preliminary LINE and DDE/D consensus sequences, we also added all the aforementioned SINEs, and all sequences were subjected to a “blast-extend-extract” process with a minimum required query coverage and identity of 70%, extending each hit by 3 kb, extracting the top 25 hits for each query sequence, and building a preliminary consensus sequence using CIAlign. Resulting alignments were manually inspected to (i) identify structural features (e.g., microsatellites at the 3′ end for LINEs and SINEs, 5′ truncations for LINEs, and terminal inverted repeats and superfamily-specific motifs for DDE/D elements), (ii) identify the boundaries of the elements, searching for TSDs whenever possible, (iii) identify domain signatures using the CDD web server, and (iv) correct and extend the consensus sequence as far as possible. Additionally, for SINEs only, we also required (a) the presence of a detectable tRNA-related region at the 5′ end, predicted with tRNAscan-SE (sequence source: mixed; score cutoff 0.01 [117]), and (b) the presence of a central domain and/or a tail region after the tRNA-related head. It must be noted that the presence of TSDs to confirm the boundaries of the element was only required for SINEs and class II superfamilies that exhibit them (thus excluding for example the SPY group from the PIF-Harbinger superfamily; see [120] and bowtie2 [121] to align reads on extracted insertions. Raw counts were then normalized by the length of the corresponding family consensus sequences, and TPM values were calculated. Log2-transformed normalized counts were tested for a correlation with the number of previously identified 3′ anchored insertions with a minimum length of 100 bp for the corresponding family for each species, tissue, and biological replicate separately.
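The length normalization and TPM calculation mentioned above follow the standard transcripts-per-million definition: raw counts are divided by feature length, and the resulting rates are rescaled to sum to one million. The Python sketch below illustrates this with hypothetical family names and toy counts; it assumes one length (in bp) per family consensus and is not the code used in the analysis.

```python
def tpm(raw_counts, lengths_bp):
    """Compute TPM values from raw read counts and feature lengths.

    raw_counts: dict mapping family name -> raw read count.
    lengths_bp: dict mapping family name -> consensus length in bp.
    """
    # Step 1: length-normalize the counts (reads per base).
    rates = {fam: raw_counts[fam] / lengths_bp[fam] for fam in raw_counts}
    total = sum(rates.values())
    if total == 0:
        return {fam: 0.0 for fam in rates}
    # Step 2: rescale so that all values sum to one million.
    return {fam: rate / total * 1e6 for fam, rate in rates.items()}


# Toy values only: two imaginary families with equal per-base coverage.
values = tpm({"famA": 100, "famB": 300}, {"famA": 1000, "famB": 3000})
```

Because both toy families have the same number of reads per base, each receives half of the total (500,000 TPM), even though their raw counts differ threefold.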