Background

Chromosomal mutations, such as polyploidization and chromosomal rearrangement, can lead to speciation, adaptation, and diversification [1,2,3,4,5]. Extant species are ancient polyploids descended from a common ancestor that experienced at least one whole-genome duplication (WGD) [6]. Core eudicots descended from an ancient whole-genome triplication event (γ) [7]. Chromosomal evolution shapes chromosome size, structure, composition, and number [8]. Karyotype evolution can leave the chromosomal structure unstable, for example in the fusion and fission regions produced by rearrangement and in centromere regions that are gained or lost through WGD or chromosome fusion [9]. Transposable elements may fill and stabilize these unstable regions in the chromosomes [10]. Therefore, reconstructing the ancestral karyotype and analysing the distribution of transposable elements are crucial for untangling local adaptation and speciation.

Previous approaches for ancestral karyotype reconstruction and projection defined contiguous ancestral regions based on collinearity among genomes. This method results in gaps in the projections and reveals unrefined karyotype details [11,80] to identify the diversity in karyotype evolution and chromosomal rearrangement. Homologous dot-plots between Q. glauca and the other oak species were plotted with the ACEK karyotype mapping results. CD-HIT [81] was used with the “-c 0.8 -aS 0.8 -d 0” parameters to remove redundant protein sequences before constructing phylogenetic trees. Then, OrthoFinder v2.5.4 [82] was used with the “-S diamond -M msa” parameters to identify orthologs and construct a maximum likelihood (ML) phylogenetic tree. With “-M msa”, trees are inferred from multiple sequence alignments (MSA); MAFFT v7.515 [83] and FastTree v2.1.11 [84] were run with default parameters to build the alignments and infer the ML trees.

LTR-RTs identification and annotation

We used EDTA v1.9.6 [85] (Extensive de-novo TE Annotator), a comprehensive pipeline that integrates the results of several current LTR prediction tools, such as LTR_FINDER [86], LTRharvest [87], and LTR_retriever [88], to build a highly reliable non-redundant TE database, and annotated repeated sequences with RepeatMasker [89]. We ran EDTA.pl with the “-species others -step all -anno 1 -sensitive 1” parameters to obtain the TE database for each oak genome. The protein domains of the elements belonging to different lineages of the Copia and Gypsy superfamilies were analyzed against REXdb [27], as implemented in TEsorter v1.2.5.2 [59]. Recombination between the two LTRs of an element deletes its internal region, which removes the intact LTR-RT and leaves a solo LTR [61, 62]. We extracted solo LTRs from the annotation file generated by RepeatMasker within EDTA.

To explore LTR-RT amplification and the disparity in evolution among oak species, we used the formula T = (1 − identity) / (2µ) to calculate the transposition time of LTR-RTs, where identity is the sequence similarity between the 5’ and 3’ LTRs obtained from the EDTA analysis and µ is the base substitution rate. The substitution rate of Q. lobata, 1.01 × 10⁻⁸ [49], was used as the oak substitution rate in this study. To investigate the historical dynamics of different lineages of Copia and Gypsy, we extracted the RT protein domain sequences of the diverse lineages in these superfamilies with the concatenate_domains.py script in TEsorter [59]. After the sequences were aligned with MAFFT v7.515 [83], ML phylogenetic trees were constructed with FastTree v2.1.11 [84] and visualized with iTOL [90].
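As a minimal sketch of the dating formula above, the Python function below computes T for a single element. It assumes that identity is supplied as a fraction between 0 and 1 and that µ is expressed per site per year, so that T comes out in years; neither unit is stated explicitly in the text. The function name and the example identity value are illustrative only.

```python
def insertion_time(identity: float, mu: float = 1.01e-8) -> float:
    """Estimate LTR-RT insertion time as T = (1 - identity) / (2 * mu).

    identity: similarity between the 5' and 3' LTRs, as a fraction (assumption).
    mu: substitution rate; the default is the Q. lobata rate used in the text,
        assumed here to be per site per year.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be a fraction between 0 and 1")
    if mu <= 0:
        raise ValueError("mu must be positive")
    return (1.0 - identity) / (2.0 * mu)


if __name__ == "__main__":
    # Hypothetical element whose two LTRs are 98% identical.
    print(f"{insertion_time(0.98):.0f}")
```

Under these assumptions, an element with 98% LTR identity dates to roughly 9.9 × 10⁵ years, and identical LTRs give T = 0.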

LTR-RTs associated with genes

We analyzed the number and function of genes that overlap with LTR-RTs. The LTR-RTs overlapping with gene and promoter regions were identified using the “intersect” function of BEDtools v2.30.0 [91]. Protein sequences of the genes whose gene body or promoter region overlapped with LTR-RTs were extracted. GO enrichment analysis of the extracted genes was carried out using the eggNOG-mapper [92] online tool and the R package clusterProfiler [93]. The metabolic pathways were annotated with KAAS [94] and visualized with the R package ggplot2 [95].
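The overlap step can be illustrated with a small pure-Python sketch. It is not the BEDtools implementation; it only shows the interval test that an intersect operation rests on, assuming BED-style 0-based, half-open coordinates and a minimum overlap of 1 bp. The feature names and coordinates are hypothetical, and promoter intervals would have to be supplied by the caller because the text does not define their length.

```python
from collections import defaultdict


def overlapping_pairs(ltrs, genes):
    """Return (ltr_name, gene_name) pairs that share at least 1 bp.

    Each feature is a tuple (chrom, start, end, name) in 0-based,
    half-open coordinates, as in BED files.
    """
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        genes_by_chrom[chrom].append((start, end, name))

    pairs = []
    for chrom, l_start, l_end, l_name in ltrs:
        for g_start, g_end, g_name in genes_by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if l_start < g_end and g_start < l_end:
                pairs.append((l_name, g_name))
    return pairs


if __name__ == "__main__":
    # Hypothetical features for illustration only.
    ltrs = [("chr1", 100, 200, "LTR1"), ("chr1", 500, 600, "LTR2")]
    genes = [("chr1", 150, 300, "geneA"), ("chr1", 600, 700, "geneB")]
    print(overlapping_pairs(ltrs, genes))
```

The nested loop is quadratic per chromosome, which is acceptable for a demonstration; a sorted sweep or an interval tree would be the usual choice at genome scale.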

We used transcriptome data from the leaf, inflorescence, and stem of Q. glauca from the NCBI SRA database (BioProject: PRJNA868092) to evaluate the impact of LTR-RTs on the expression of adjacent genes. Hisat2 v2.2.1 [96], Samtools v1.13 [97], and StringTie v2.2.1 [98] were used to align the transcriptome reads to the reference genome, sort and index the alignment files, and obtain the read counts, respectively. Gene expression level was quantified in TPM (transcripts per million). Paralogous genes were detected using BLAST v2.12.0 [80]. Expression levels of paralogous genes with and without an overlapping LTR-RT were compared. We further analyzed the impact of LTR-RT insertion on the expression level of resistance genes (R-genes), as the evolution of R-genes is widely considered to be affected by LTR-RT insertion.
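For readers unfamiliar with the unit, the sketch below shows the standard definition of TPM: each gene's read count is divided by its length, and the resulting rates are rescaled to sum to one million. It is an illustration of the definition, not a reproduction of StringTie's internal calculation; the function name, gene names, counts, and lengths are hypothetical.

```python
def tpm(counts, lengths):
    """Convert read counts to TPM.

    counts: dict mapping gene -> read count.
    lengths: dict mapping gene -> transcript length in bp.
    """
    # Length-normalized rate for each gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that all values sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical counts and lengths for two paralogs.
    values = tpm({"paralogA": 100, "paralogB": 200},
                 {"paralogA": 1000, "paralogB": 1000})
    for gene, value in values.items():
        print(gene, round(value, 1))
```

Because TPM values within a sample always sum to one million, they describe relative abundance, which suits comparisons between paralogs within the same tissue.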

LTR-RTs distribution

LTR/Copia and LTR/Gypsy elements are usually interspersed with tandem repeats and enriched in plant centromere regions. Following previous research [79, 99], we used Q. glauca as a reference and scanned for regions with a higher density of tandem repeats, LTR/Copia, and LTR/Gypsy, a higher GC content, and a lower gene density. The densities of genes, tandem repeats, LTR/Copia, and LTR/Gypsy were calculated using BEDtools v2.30.0 [91], with the parameters “-w 1000000 -s 200000” to make interval “windows” and “-counts -F 0.5” to compute the coverage. The GC content of the Q. glauca genome was calculated with seqkit [100] using the same sliding window size. Data were visualized with R scripts.
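The windowing scheme can be sketched in Python as follows. The defaults correspond to the 1 Mb window and 200 kb step given above. The code illustrates the idea only and is not guaranteed to match the exact output of BEDtools or seqkit; in particular, truncating the final windows at the chromosome end and excluding non-ACGT bases from the GC denominator are assumptions, and the function names are hypothetical.

```python
def sliding_windows(length, width=1_000_000, step=200_000):
    """Yield (start, end) windows in 0-based, half-open coordinates.

    Windows start every `step` bp and are truncated at the chromosome end.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    start = 0
    while start < length:
        yield start, min(start + width, length)
        start += step


def gc_fraction(seq):
    """Return the GC fraction of seq, ignoring non-ACGT characters (assumption)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def window_gc(seq, width=1_000_000, step=200_000):
    """Return a list of (start, end, gc_fraction) for each sliding window."""
    return [(s, e, gc_fraction(seq[s:e]))
            for s, e in sliding_windows(len(seq), width, step)]


if __name__ == "__main__":
    # Toy sequence with a tiny window so the result is easy to inspect.
    for row in window_gc("GGCCAATTGC", width=4, step=2):
        print(row)
```

Overlapping windows smooth the density profile, so a region that straddles a window boundary still appears as a single peak.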

To predict potential centromere regions, we first followed the Telomeres_and_Centromeres [99] method: tandem repeats (TRs) were detected with TRF v4.09.1 [101] using the “2 7 7 80 10 50 500 -f -d -m” parameters, and TRF2GFF (https://github.com/Adamtaranto/TRF2GFF) was used to merge the annotated results. We then screened high-frequency repeat units in each chromosome, using IGV v2.16.1 [102] to visualize the density of genome annotation, LTR-RTs, and repeat units. In IGV, potential centromere regions showed low peaks for genome annotation and TEs and high peaks for repeat units. Second, Juicebox v1.11.08 [103] was used to inspect the Hi-C heat map of the Q. glauca [40] genome. Third, Centromics (https://github.com/zhangrengang/Centromics) and the CentroMiner tool of quarTeT v1.1.1 [78] were used with default parameters to predict the potential centromere regions.