Background

The fall webworm, Hyphantria cunea Drury (Erebidae: Hyphantria), is a polyphagous pest species in forest and agricultural ecosystems, where its larvae feed on the leaves of most deciduous trees [1]. When trees are infested, the fall webworm consumes nearly all leaves and causes great ecological and economic damage to the forest industry [2]. H. cunea is also an invasive pest: it is native to North America but has spread globally in the past seven decades [3]. Behavioural, physiological and ecological adaptations present in this species are believed to contribute to its rapid spread.

First, the fall webworm has an extremely wide range of host plants and has been reported to forage on more than 600 plant species, covering nearly all types of deciduous trees, especially mulberry, boxelder, walnut, sycamore, apple, plum, cherry, and elm [4]. Insect host selection is regulated by the chemosensory systems [5], especially for polyphagous herbivores [6,7,8]. Insect chemosensory systems consist of several gene families, including the odorant receptor (OR), gustatory receptor (GR), ionotropic receptor (IR), chemosensory protein (CSP) and odorant binding protein (OBP) families. These genes encode proteins that participate in host plant detection and sexual communication [9,10,11,12]. Previous investigations have suggested that large expansions in chemosensory gene families are a possible adaptation mechanism that enables polyphagy in the lepidopteran insect Spodoptera frugiperda [13] and other taxa such as Apis mellifera, Bombyx mori and Bemisia tabaci [9, 14,15,16,17,

Results

Overview of genome assembly and annotation

The genome survey with k-mer analysis (Figure S1) showed a small peak at depth = 22, which represented the heterozygous sequences; the average k-mer depth was 45, and the peak at depth = 90 indicated the repetitive sequences. As a result, the tentative genome size of H. cunea was estimated at 563.96 Mb, with a low heterozygosity of 0.23% and repetitive elements making up 36.20% of the whole genome (Figure S1 and Table S1).
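The reading of the k-mer histogram above can be illustrated with a minimal Python sketch. It assumes the common survey heuristic that genome size is roughly the total number of k-mers divided by the main peak depth, and that heterozygous and repetitive peaks sit near half and twice that depth; the exact procedure used in this study is not stated in the text. The function names, the tolerance and the total k-mer count are hypothetical.

```python
def estimate_genome_size(total_kmers: int, main_peak_depth: float) -> float:
    """Estimate genome size (bases) as total k-mers / main (homozygous) peak depth."""
    if main_peak_depth <= 0:
        raise ValueError("main_peak_depth must be positive")
    return total_kmers / main_peak_depth


def classify_peak(depth: float, main_peak_depth: float, tol: float = 0.15) -> str:
    """Label a histogram peak by its depth relative to the main peak.

    The tolerance `tol` is an arbitrary illustrative choice.
    """
    ratio = depth / main_peak_depth
    if abs(ratio - 0.5) <= tol:
        return "heterozygous"
    if abs(ratio - 1.0) <= tol:
        return "homozygous"
    if ratio >= 2.0 - tol:
        return "repetitive"
    return "unclassified"


# Peak depths 22, 45 and 90 are the values reported in the text.
main_depth = 45
for depth in (22, 45, 90):
    print(depth, classify_peak(depth, main_depth))

# Hypothetical total k-mer count, NOT taken from the study.
toy_total_kmers = 4_500_000_000
print(estimate_genome_size(toy_total_kmers, main_depth) / 1e6, "Mb")
```

With the reported depths, the peak at 22 falls near half of the main depth and the peak at 90 at twice the main depth, which matches the interpretation given in the text.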

The generated genome assembly of H. cunea comprises a 559.30 Mb sequence with a 3.09 Mb contig N50. It contains 198.97 Mb of repetitive elements that occupy 35.71% of the genome. After correction with RNA sequencing data from 12 samples of different tissues and stages of H. cunea, we obtained 15,319 genes using three gene prediction strategies (Figure S2), 94.42% of which could be annotated and enriched by the GO and KOG databases (Figures S3 and S4); the distribution of Nr homologous genes with the H. cunea genome in insect species is shown in Figure S5. Moreover, 637 tRNAs, 71 rRNAs, 48 miRNAs and 300 pseudogenes were predicted from the Rfam and miRBase databases by the Infernal, tRNAscan-SE and GenBlastA software (Table S2). Further analyses showed that 94.54% and 92.96% of eukaryotic conserved genes were found in the genome of H. cunea by CEGMA and BUSCO, respectively, suggesting that the genome sequence we obtained was largely complete (Tables S3-S6). The genome of H. cunea has one of the longest contig N50 values among the Lepidoptera genomes sequenced so far; the top four are Operophtera brumata (6.38 Mb) [40], Spodoptera frugiperda (5.6 Mb) [48], Papilio bianor (5.5 Mb) [49] and H. cunea (3.09 Mb), further supporting the high quality of the genome sequence of H. cunea (Table 1). Homology analysis of the H. cunea genome led to the identification of 2142 pairs of one-to-one single-copy orthologs among twelve species. This ortholog dataset was used for the further studies described below. Only 27 genes were specific to H. cunea, which is the smallest species-specific number among the eight lepidopteran species (Fig. 1a).
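For readers unfamiliar with the contig N50 statistic used above to compare assemblies, the following minimal Python sketch shows the standard definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths in the example are toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the cumulative sum to half the total is the N50 contig.
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [10, 9, 8, 7, 6, 5, 4, 3, 2]
print(n50(toy_contigs))
```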

Table 1 Overview of sequenced lepidopteran genomes
Fig. 1

Overview of the H. cunea genome. a Types and numbers of homologous gene families among twelve species. b Maximum likelihood phylogenetic analysis among twelve insect species based on genomic data. The twelve species are Apis mellifera, Bombyx mori, Drosophila melanogaster, Helicoverpa armigera, Hyphantria cunea, Operophtera brumata, Papilio machaon, Papilio polytes, Papilio xuthus, Pieris rapae, Tribolium castaneum and Plutella xylostella. The numbers next to the nodes are the estimated node ages in millions of years (scale is 50.0 million years), and the colored box below indicates the geochronologic scale from Permian to Neogene

Phylogeny of Lepidoptera

RAxML was used to construct a maximum likelihood phylogenetic tree using the 2142 single-copy orthologs among twelve insects whose genome sequences were available; eight Lepidoptera were included, while Hymenoptera (A. mellifera), Coleoptera (T. castaneum) and Diptera (D. melanogaster) were used as outgroups. The results showed that all nodes were supported by strong bootstrap values of 100%, and the topology of the higher taxa was consistent with those of previous phylogenetic studies [50, 51]. The results revealed that Lepidoptera was closer to Diptera, while Hymenoptera was located at the basal branch and formed a single clade (Fig. 1b). Within Lepidoptera, Papilionoidea (butterflies) formed a single clade, and P. xylostella (Yponomeutoidea) was separated from other moth taxa (Noctuoidea, Geometroidea and Bombycoidea). H. cunea was shown to be most closely related to H. armigera, which also belongs to the superfamily Noctuoidea. These results are in agreement with those obtained from the phylogenetic studies of Lepidoptera based on morphology and molecular data [52, 53]. The phylogenetic analysis indicated that Lepidoptera diverged from Diptera approximately 244.60 million years ago, which is consistent with the previously reported divergence time [50]. Within Lepidoptera, the divergence between the moths and butterflies in our study was placed in the Paleogene period, which is consistent with Kawahara’s work; moreover, Geometroidea and Bombycoidea were closely related, and they were grouped together with Noctuoidea as well [54]. H. cunea and H. armigera were estimated to have diverged at the Eocene-Oligocene boundary, with a divergence time of approximately 32.07 million years ago. The period from the late Eocene to the early Oligocene has been considered an important transition time and a link between the archaic world of the tropical Eocene and the more modern ecosystems of the Miocene [55].

Expansion of chemosensory and detoxification gene families

To further explore host adaptation, the H. cunea gene families related to chemosensory abilities (ORs, GRs, IRs, CSPs and OBPs) were studied. With the combination of de novo assembly, homology-based search and RNA sequencing annotation, 72 ORs, 46 GRs, 66 OBPs, 20 CSPs and 21 IRs were identified in the H. cunea genome (Table 2). This result increased the number of chemosensory genes in H. cunea from the previous identifications via antennal transcriptome studies, which reported 52 ORs, 9 GRs, 30 OBPs, 17 CSPs and 14 IRs [56]. For the gene families related to detoxification, 32 UGTs, 25 GSTs, 75 CCEs, 95 ABCs and 109 P450s were identified using the same strategy as above (Table 2). The numbers of chemosensory and detoxification genes in H. cunea were further compared with those of some lepidopteran insects (Table 2) [46].

Table 2 The number of chemosensory and detoxification genes of H. cunea and other insects

Gene family expansion/contraction analyses showed that the CSP, CCE, GST and UGT gene families were expanded in H. cunea compared to the tested lepidopteran species, as the divergence sizes were all significantly lower than the species sizes for these genes (Table 3). CSPs contribute to transportation, sensitivity and possibly the selectivity of the insect olfactory system [10]. In our study, an expansion of CSPs was detected, suggesting that they might be related to host plant selection in H. cunea, although further testing is required. Among the detoxification gene families, the UGT, CCE and GST families were found to be expanded in H. cunea (Table 3). Other studies have also found that GSTs and CCEs were greatly expanded in some polyphagous species of Noctuoidea, such as H. zea, H. armigera and S. litura [45, 46].

Table 3 Gene families expanded in H. cunea as calculated by CAFE

Other major expanded gene families were hemolymph protein [57], cecropin A [58], serine protease [59], apolipophorins [60], DNA helicase [61], insulin-like growth factor [62], and yolk proteins [63, 64] (Table 3). These gene families are reported to be involved in immunity, growth and development, biomacromolecule metabolism and reproduction in insects [57,58,59, 65,66,67,68].

DEG analysis in different stages and tissues

To further study the chemosensory and detoxification gene families that were found to be expanded, transcriptome studies on these genes were performed to explore their expression profiles in different developmental stages and tissues. The analysis of differential gene expression by pairwise comparison led to the identification of 8232 DEGs among the RNA samples from different stages (eggs, second instar larvae, fourth instar larvae, pupae, and male and female adults) and 7733 DEGs among the RNA samples from different tissues (head, thorax, leg, abdomen, antenna, and female sexual glands). These two DEG datasets were then combined, and the duplicated sequences were removed to create a final dataset of 10,348 DEGs (Table S7). The relative expression levels of these DEGs in different tissues and stages, as indicated by log10FPKM values, are shown as a box plot in Figure S6, and the numbers of alternative splicing events are shown in Figure S7. The expression of the DEGs in the expanded gene families (CSPs, GSTs, CCEs and UGTs) was converted into expression heatmaps, presented in Fig. 2, to better compare their expression levels in different tissues and developmental stages. Nine of the 20 CSPs were grouped together and specifically expressed in the antennae, four CSPs were highly expressed in pupae relative to other stages, and two CSPs were specifically expressed in the sex gland (Fig. 2a). In the expanded detoxification gene families CCE, GST and UGT, some genes were highly expressed in the fourth larval instar (Fig. 2b, c and d), which is the peak period of H. cunea foraging behavior [1].

Fig. 2

Expression heatmap of the expanded gene families in different tissues and stages. The colors represent the level of gene expression from low (purple) to high (yellow) as shown by log10 (FPKM + 0.000001). E (eggs), L2 (second-instar larvae), L4 (fourth-instar larvae), P (pupae), F (female adults), M (male adults); An (antenna), H (head), T (thorax), Ab (abdomen), L (leg), and Sg (female sexual gland). a Expression heatmap of the CSP gene family. The transcripts that were grouped together and specifically expressed in the antennae are EVM0003477, EVM0000271, EVM0002854, EVM0002317, EVM0009785, EVM0004106, EVM0004147, EVM0005331 and EVM0008758. The transcripts that were highly expressed in pupae are EVM0014103, EVM0014851, EVM0003329 and EVM0014642. The transcripts that were specifically expressed in the sex gland are EVM0010490 and EVM0014431. b Expression heatmap of the CCE gene family. c Expression heatmap of the GST gene family. d Expression heatmap of the UGT gene family. The heatmaps were constructed with HemI (windows_1_0_win32_x64) using the DEGs of each expanding family. The expression levels are presented by the color bar from low expression as green with negative numbers and high expression level as yellow with positive numbers

Positive selection on genes related to nutrient metabolism and detoxification

Next, a positive selection analysis based on homologous genes was performed on the genome of H. cunea to gain a better understanding of the mechanisms of its host selection. The branch-site model showed that 39 genes were under significant positive selection pressure (LRT, p < 0.05), of which 13 were nutrient regulation genes reported to be involved in the metabolism of lipids, carbohydrates, vitamins and amino acids (Table 4). Many studies have shown that nutrient regulation in herbivorous insects is shaped by natural selection [69, 70]. Significant positive selection pressure was also detected in HcunP450 (EVM0009687), a member of the major detoxification-related gene family P450 (Table 4), consistent with a previous study reporting that P450s could mediate insect resistance to many classes of insecticides [71]. HcunP450 was most similar to the cytochrome P450 CYP306A1 of the cotton bollworm H. armigera (AID54855.1), with 81.63% identity at the amino acid level. The expression of HarmP450 CYP306A1 was found to be induced by 2-tridecanone and to mediate cotton bollworm development [72]. The CYP306A1 gene family was also shown to play an essential role in ecdysteroid biosynthesis during insect development [73] and in fluoride resistance of B. mori [74]. Thus, the positive selection of HcunP450 CYP306A1 might reflect the rapid development of insecticide resistance in H. cunea. However, it remains to be determined whether this selection is caused by long-term host adaptation or by rapid evolution due to the extensive use of insecticides in recent years.

Table 4 Genes under significant positive selection (LRT, p < 0.05)

Compositional diversity of the gut microbiota

Our gut microbiota sequencing of H. cunea yielded 8.65 GB of valid data after filtering of H. cunea genome sequences and produced 28,846,959 clean reads and 151,448 contigs with a total length of 520.68 Mb after de novo assembly (Table S8). Based on the alignment of sequencing reads to the NCBI RefSeq database, the microorganism composition was annotated (Table S9) and analyzed, and the microbes were grouped into taxonomic categories from kingdom to species level. We found 324 kingdoms, 135 phyla, 13 classes, 244 orders, 157 families, 200 genera, and 78 species in the larval gut of H. cunea.

At the phylum level, the H. cunea gut microbiota was dominated by Proteobacteria (71.33% of the total midgut bacteria contigs), followed by Euryarchaeota and Firmicutes (8.40 and 6.10% of the contigs, respectively) and to a lesser extent, Tenericutes, Actinobacteria, Cyanobacteria and Bacteroidetes; other phyla were less than 1% of the total contigs (Fig. 3a). At the class level, Gammaproteobacteria, Betaproteobacteria, Halobacteria and Clostridia comprised 77% of the contigs (Fig. 3b), while Enterobacteriales, Halobacteriales and Burkholderiales comprised 60% of all contigs at the order level (Fig. 3c). The three most abundant families were Enterobacteriaceae, Halobacteriaceae and Burkholderiaceae (50.86, 6.16, and 4.58% of total contigs, respectively) (Fig. 3d). At the genus level, microorganisms were rich in Klebsiella, Halovivax and Burkholderia (37.92, 4.75 and 4.32% of total contigs, respectively) (Fig. 3e). Klebsiella oxytoca was the most abundant species in the midgut of H. cunea, followed by Halovivax ruber, Mannheimia haemolytica, and Burkholderia vietnamiensis (Fig. 3f).

Fig. 3

Taxonomic-category-based microbiota composition in the H. cunea larval midgut. a Microbiota proportion at the phylum level. b Microbiota proportion at the class level. c Microbiota proportion at the order level. d Microbiota proportion at the family level. e Microbiota proportion at the genus level. f Microbiota proportion at the species level

Functional annotation of the leaf-eating caterpillar gut metagenome

Our metagenomic analysis led to the identification of 102,787 nonredundant protein-coding genes with an average length of 300 bp (30.80 Mb total length) in the microbiota of the H. cunea larval gut. Gene functional annotation based on KEGG pathways showed that the most abundant function in the metagenome was metabolic function, representing 45.16% of all KEGG functions in the H. cunea gut microbiota.

KEGG iPath 2 analysis showed that the metabolic activities of the H. cunea gut bacteria were associated with digestion, nutrition and detoxification, including metabolism of energy, carbohydrates, amino acids, lipids, cofactors, vitamins, glycans, xenobiotics, and terpenoids. The most enriched functions within these activities were “Folding, sorting and degradation”, representing 15.35% of all KEGG pathways, followed by “Signal transduction” (11.08%). The nutrient metabolism functions that could be provided by gut microbiota were “Carbohydrate metabolism” (8.83%), “Amino acid metabolism” (7.09%), “Energy metabolism” (6.59%), “Nucleotide metabolism” (4.55%), “Lipid metabolism” (3.90%), “Glycan biosynthesis and metabolism” (1.93%) and “Metabolism of cofactors and vitamins” (2.72%). In addition, genes in the gut microbiota were found with functions related to “Xenobiotics biodegradation and terpenoid metabolism” (2.09%) and “Biosynthesis of other secondary metabolites” (0.31%) (Figs. 4 and S9 and Table S10).

Fig. 4

Metagenomic genes in the midgut of H. cunea larvae. a KEGG pathway functional annotations of metagenomic genes. b Functional enrichments of metagenomic genes. The X-axis indicates the proportion of genes annotated in each category

A total of 336 enzymes associated with cellulose and hemicellulose hydrolysis were identified in the intestinal flora of H. cunea based on the Carbohydrate-Active EnZyme (CAZy) database, including 42 auxiliary activities (AAs), 68 carbohydrate binding modules (CBMs), 75 carbohydrate esterases (CEs), 68 glycoside hydrolases (GHs), 82 glycosyltransferases (GTs) and one polysaccharide lyase (PL) (Figure S8 and Table S11). The results indicate that the gut microbes of H. cunea were most likely involved in cellulose degradation. By sequence alignment, we also predicted 55, 256 and 236 genes possibly encoding for glutathione S-transferases, esterases, and P450s, respectively (Table S12).

Silk-web-related genes

Notably, one gene family related to silk production, the Kazal-type serine proteinase inhibitors (KSPIs) [75], showed an expansion among the tested orthologous gene groups, which implies that silk-related genes might also play a role in environmental adaptation during the larval development of H. cunea. Hence, we performed further studies on silk-web-related genes.

The silk gland is a long paired organ of the fall webworm. It specializes in the synthesis and secretion of silk proteins (Fig. 5a) and quickly atrophies after the onset of adulthood. The anatomy of the silk gland in the fall webworm is quite similar to that of B. mori, and consists of three functionally distinct regions: the anterior silk gland (ASG), middle silk gland (MSG) and posterior silk gland (PSG) [76]. Thirty-three silk-gland-related genes were identified in H. cunea (Table 5) through a homology search against silk gland genes reported from other Lepidoptera in previous studies [76,77,78], including 3 silk protein genes, 4 silk regulation genes and 26 protease inhibitor genes. In B. mori, the silk protein is composed of a ((Fib-H)-(Fib-L))6-P25 fibroin complex held together by the protein sericin [79]. Here, three fibroin structure genes, HcunFib-H, HcunFib-L and HcunP25, were identified in the H. cunea genome, whereas no sericin genes were annotated. However, some silk regulation genes, such as silk gland factors (SGFs), fibroin-modulator-binding protein-1 (FMBP-1) and fibroinase, were identified in the H. cunea genome. Moreover, genes for several protease inhibitors, such as kazal-type serine protease inhibitors, pacifastin-related serine protease inhibitor (pacifastin), phosphatidylethanolamine-binding protein, alpha-2-macroglobulin (A2M), cysteine proteinase inhibitor, carboxypeptidase inhibitor, cystatin, serpins and proteasome inhibitors, were identified.

Fig. 5

Silencing results of HcunFib-H and phenotype analysis. a Anatomy of the larval silk gland of H. cunea. ASG indicates anterior silk gland, MSG indicates middle silk gland and PSG indicates posterior silk gland. b RT-qPCR results 4 days after dsRNA injection for RNAi. Statistical differences were evaluated by t-tests. *** p < 0.001, n.s., not significant. Data indicate the means + SEM, N = 5. The expression levels were normalised by the expression of β-actin gene in different treatment samples. c & d Diameter of the silk ball after RNAi; statistical differences were evaluated by t-tests. *** p < 0.001, n.s., not significant, scale bar = 0.5 mm. e & f Comparison of silk spinning between wild-type and dsHcunFib-H-injected insects

Table 5 Identification of silk-web-related genes

Silencing of silk fibroin genes and phenotype analysis

Because silk is a structural material and plays a crucial role in the survival of many insects, the extraordinary mechanical properties of silk are often explained in adaptive terms [80]. Fibroin is the key component of silk; it determines both the quantity and the structure of a silk web [81]. Here, three structural protein genes, HcunFib-H, HcunFib-L and HcunP25, were chosen for RNAi experiments to explore the mechanism of web production in H. cunea because of their involvement in silk production. We targeted these three fibroin genes for silencing and measured their expression levels by qRT-PCR 4 days after injection (Fig. 5b). In comparison with the noninjected groups, there were no significant changes in the expression of GFP, HcunFib-L and HcunP25 (p > 0.05), while the relative expression of HcunFib-H was dramatically decreased (p < 0.001). The different expression levels among the three genes might be one of the reasons for the differences in the RNAi outcome, as the expression of HcunFib-H was much higher than that of HcunFib-L and HcunP25.

Within 10 days after the injection, the average diameters of the silk balls in the different treatments were as follows: that of the noninjected wild type was 3.45 ± 0.12 mm; dsGFP-injected, 3.39 ± 0.14 mm (p = 0.64 > 0.05); dsHcunFib-L-injected, 3.41 ± 0.15 mm (p = 0.79 > 0.05); dsHcunP25-injected, 3.25 ± 0.14 mm (p = 0.18 > 0.05) and dsHcunFib-H-injected, 1.67 ± 0.09 mm (p < 0.0001) (Fig. 5c and d). There was a significant decrease in the quantity of silk in the dsHcunFib-H injected group, which was consistent with the dramatic decrease in the gene expression of HcunFib-H of the dsHcunFib-H-injected group after RNAi. The silencing of the silk structure protein gene Fib-H led to less silk production and damaged the leaf-silk shelter structure of the fall webworm by breaking the silk-leaf connections (Fig. 5e and f), suggesting that HcunFib-H contributes significantly to the formation of fibroin, to related web-producing behaviors and to the silk-web-related adaptations of H. cunea.

Discussion

In this study, the genome of the fall webworm obtained by PacBio sequencing was of high integrity. Compared with other publicly available lepidopteran genomes (Plutella xylostella, Papilio polytes, Papilio machaon, Papilio xuthus, Pieris rapae, H. armigera, O. brumata, B. mori, S. frugiperda and P. bianor), the H. cunea genome possesses a comparatively long contig N50 (shorter only than those of O. brumata, S. frugiperda and P. bianor). The large genome size of O. brumata could be explained to a large extent by its higher repeat content: the O. brumata genome contains 53.5% repetitive elements (compared with 35.7% in the H. cunea and 38.4% in the B. mori genomes) [40]. However, the large genome of H. cunea is more likely to be caused by a larger average intron size, because the average intron size of the H. cunea genome was 1491 bp, much larger than the 1082 bp of B. mori and the 139 bp of O. brumata; the mechanism is worthy of further study. A similar phenomenon was also reported in the Locusta migratoria genome [82].

According to the phylogeny of Lepidoptera, H. cunea and H. armigera were estimated to have diverged at the Eocene-Oligocene boundary. From the late Eocene to the early Oligocene, with the end of a continuous cooling event [83], deciduous trees that were better able to cope with large temperature changes began to overtake evergreen tropical species [84]. In North America, where H. cunea is native, litchi and cashew nut were the dominant trees in the early Oligocene [85]. With the expansion of temperate deciduous forests during this epoch, the food sources of the fall webworm increased, which might have contributed to the expansion of the host range of H. cunea.

The CSP, CCE, GST and UGT gene families were expanded in H. cunea compared to the tested Lepidopteran species, similar expansions of the chemosensory gene family have also been detected in other insect genomes [13, 154] and Benchmarking Universal Single-Copy Orthologs (BUSCO v2.0) [155], were used to assess the completeness of the WTDBG assembly.

Repeats and noncoding RNAs

A specific repetitive sequence database was used to predict repeat sequences. A de novo repeat library of H. cunea was constructed by LTR_FINDER v1.05 [156], MITE-Hunter [157], RepeatScout v1.0.5 [158] and PILER-DF v2.4 [159]; it was then classified by PASTE Classifier [160] and combined with the Repbase transposable element library to form the final library. Afterward, RepeatMasker v4.0.6 [161] was used to find the homologous repeats in the final library. tRNAscan-SE v1.3.1 [162] was used to search for tRNA coding sequences. rRNAs and microRNAs were identified by Infernal v1.1 [163] based on the Rfam and miRBase databases. The pseudogenes were predicted in two steps: first, GenBlastA v1.0.4 [164] was applied to identify candidate pseudogenes by homology searching against the genome data; second, GeneWise v2.4.1 [165] was used to search for premature termination and frameshift mutations in the candidate pseudogenes.

Gene prediction and functional annotation

To identify protein-coding sequences, a combination of ab initio gene prediction, homology-based prediction and unigene-based methods was used as the annotation pipeline. Genscan [166], Augustus v2.4 [167], GlimmerHMM v3.0.4 [168], GeneID v1.4 [169] and SNAP [170] were used to predict the protein-coding sequences. Four species (Amyelois transitella, Bombyx mori, Helicoverpa armigera and Plutella xylostella) were used to complete homology-based gene prediction with GeMoMa v1.3.1 [171]. Reference transcriptome assembly was performed by HISAT v2.0.4 and StringTie [172], and gene prediction was performed by TransDecoder v2.0 [173] and GeneMarkS-T v5.1 [174]. The de novo transcriptome was completed by PASA v2.0.2 [175]. Finally, all the results from the three gene prediction methods (GeMoMa, TransDecoder v2.0 and GeneMarkS-T) were integrated by EVidenceModeler (EVM) v1.1.1 [176] and annotated by PASA v2.0.2.

Gene functions were assigned according to the best-match BLASTp alignments in the NR databases, KOG, TrEMBL and Kyoto Encyclopedia of Genes and Genomes (KEGG). GO annotations were obtained by Blast2GO based on the results of alignment to NR. Moreover, we also performed enrichment analyses of the Clusters of Orthologous Groups of proteins (COG), GO terms and KEGG pathways.

Orthologous gene families

The most recent genome sequences of twelve sequenced insects (Apis mellifera, Bombyx mori, Drosophila melanogaster, Helicoverpa armigera, Papilio machaon, Papilio polytes, Papilio xuthus, Pieris rapae, Plutella xylostella, Tribolium castaneum, Hyphantria cunea, and Operophtera brumata) were used to infer gene orthology and construct the phylogenetic tree; the details of the genomic datasets used in this study are shown in Table S13. After downloading the annotated coding sequences from NCBI, the longest protein sequence per gene was extracted to perform a best reciprocal hit (BRH) analysis by all-v-all BLAST with an E-value cut-off of 1E-05, and orthologous genes among the twelve species were identified by OrthoMCL 5 [177].
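The best reciprocal hit idea underlying this step can be sketched in a few lines of Python. This is only an illustration for two species: OrthoMCL additionally normalizes scores and clusters the hit graph, which is not reproduced here. The tuple layout, function names and toy hits are assumptions made for the example; only the E-value cut-off of 1E-05 comes from the text.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (query, subject, e_value, bit_score)
Hit = Tuple[str, str, float, float]


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among hits passing the E-value cut-off."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, bits in hits:
        if evalue > max_evalue:
            continue  # discard weak hits
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit],
                         max_evalue: float = 1e-5) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hits with made-up identifiers and scores.
a_vs_b = [("a1", "b1", 1e-30, 200.0), ("a1", "b2", 1e-10, 80.0),
          ("a2", "b2", 1e-20, 150.0), ("a3", "b3", 1e-3, 30.0)]
b_vs_a = [("b1", "a1", 1e-30, 200.0), ("b2", "a1", 1e-12, 90.0),
          ("b3", "a3", 1e-3, 30.0)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, only a1 and b1 choose each other; a2 selects b2, but b2 prefers a1, so that pair is not reciprocal, and the a3/b3 hit fails the E-value cut-off.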

Phylogenetic tree and divergence times

The longest open reading frames (ORFs) for the longest transcript pairs across the twelve species were extracted by a Perl script, and the ORFs in each orthologous set were aligned using PRANK [178] with the following parameters: -f=fasta -F -codon -noxml -notree -nopost. The alignment for each locus was trimmed by Gblocks v 0.91b [179] (parameters: -t=c -b3=1 -b4=6 -b5=n) to reduce the rate of false positive predictions by filtering out sequencing errors, incorrect alignments and non-orthologous regions based on codons [180]. After trimming, alignments of less than 120 bp were removed. The single-copy orthologous genes were concatenated into one supergene, and the best amino acid substitution model was estimated. RAxML v. 8.0.26 [181] was used to construct the phylogenetic tree based on the supergene under the LG + I + G + F model with 1000 bootstrap replicates. The divergence times among species were estimated by r8s v. 1.7.1 [182] with a node dating approach that used three fossil records to calibrate the most recent common ancestors. The three fossil records used in this study were the oldest definitive beetle (Coleopsis archaica gen. et sp. nov., 298.9 to 295.0 Ma), the oldest fossil Diptera (such as Anisinodus crinitus n. gen., n. sp., 247.2 to 242.0 Ma) and the oldest fossil Rhopalocera (Praepapilio colorado n. g., n. sp., P. gracilis n. sp., and Praepapilioninae. Riodinella nympha n. g., n. sp., 46.2 to 40.4 Ma), respectively [183,184,185].
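The length filter and concatenation step described above can be illustrated with a short Python sketch. It assumes that each trimmed alignment is held in memory as a dictionary from species name to aligned sequence; the function name and this data layout are hypothetical, and only the 120 bp threshold is taken from the text.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]],
                           species: List[str],
                           min_length: int = 120) -> Dict[str, str]:
    """Drop alignments shorter than min_length and join the rest into one supergene per species."""
    parts: Dict[str, List[str]] = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        (length,) = lengths
        if length < min_length:
            continue  # alignment too short after trimming
        for sp in species:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Toy example with a lowered threshold so the short sequences are readable.
toy_species = ["sp1", "sp2"]
toy_alignments = [
    {"sp1": "ATGAAA", "sp2": "ATGAAG"},  # kept (6 bp)
    {"sp1": "ATG", "sp2": "ATG"},        # dropped (3 bp < 6)
]
print(concatenate_alignments(toy_alignments, toy_species, min_length=6))
```

Keeping the species order fixed across loci ensures that every column of the supergene still compares homologous positions.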

Gene family expansion/contraction

CAFE [186] was used to examine the expansion and contraction of gene families among the twelve species. The results of the orthologous gene identification were filtered by CAFE’s built-in script, and the global parameter λ was estimated by the maximum likelihood method. Comparing divergence size and species size calculated by CAFE could determine whether expansion had occurred. The divergence size indicates the ancestral gene family size for each node in the phylogenetic tree, and the species size indicates the gene number in the homologous gene family. When the divergence size is smaller than the species size, the gene family is expanding. Additionally, for each gene family, a conditional P-value was calculated, and gene families with P-values < 0.05 were considered to have significantly expanded or contracted.
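The decision rule described above can be written as a small Python sketch. It does not call CAFE; it only applies the stated comparison to values that CAFE would report. The class, field names and the example numbers are hypothetical and are not results from this study.

```python
from dataclasses import dataclass


@dataclass
class FamilyResult:
    """Per-family values as they might be read from a CAFE report (hypothetical layout)."""
    name: str
    divergence_size: int  # inferred ancestral gene family size at the parent node
    species_size: int     # gene count of the family in the focal species
    p_value: float        # conditional P-value of the family


def classify_family(fam: FamilyResult, alpha: float = 0.05) -> str:
    """Apply the rule from the text: significant change plus direction of the size difference."""
    if fam.p_value >= alpha:
        return "not significant"
    if fam.divergence_size < fam.species_size:
        return "expanded"
    if fam.divergence_size > fam.species_size:
        return "contracted"
    return "unchanged"


# Made-up numbers for illustration only.
examples = [
    FamilyResult("family_A", divergence_size=12, species_size=20, p_value=0.001),
    FamilyResult("family_B", divergence_size=15, species_size=9, p_value=0.010),
    FamilyResult("family_C", divergence_size=10, species_size=11, p_value=0.400),
]
for fam in examples:
    print(fam.name, classify_family(fam))
```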

Positive selection analyses

A branch-site model (parameters: Null hypothesis: model = 2, NSsites = 2, fix_omega = 1, omega = 1; alternative hypothesis: model = 2, NSsites = 2, fix_omega = 0, omega = 1) in PAML [187] was used to identify the genes with positively selected sites in the fall webworm genome using our tree topology as the guide tree. Then, likelihood ratio tests (LRTs) were performed to detect positive selection on the foreground branch. Only those genes with LRT P-values less than 0.05 were inferred as positively selected.
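The likelihood ratio test applied to the two codeml runs can be sketched as follows. The sketch assumes that the test statistic 2(lnL_alt - lnL_null) is compared with a chi-square distribution with one degree of freedom, a common convention for the branch-site test; the text does not state which reference distribution was used. The log-likelihood values are made up, and the function names are hypothetical.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    # For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P-value of the likelihood ratio test between nested null and alternative models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf_df1(statistic)


# Made-up log-likelihoods for one gene (null: omega fixed at 1; alternative: omega free).
p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-997.0)
print(p, p < 0.05)
```

A gene would be retained as positively selected when the resulting P-value is below 0.05, matching the threshold given in the text.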

Transcriptome analysis of different stages and tissues

RNA sequencing was performed on different developmental stages and tissues of H. cunea. The following developmental stages were selected for the transcriptome analyses: eggs, second instar larvae, fourth instar larvae, pupae, and male and female adults. The following tissues were used for the tissue transcriptome experiment: head, thorax, leg, abdomen, antenna, and female sexual glands. For each group, fifteen individuals were mixed for RNA extraction, and three biological replicates were produced for each sample. Total RNA was isolated from the homogenized samples using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s protocols. After extraction, total RNA was assessed with the NanoDrop 2000 (Thermo Fisher Scientific, Waltham, MA, USA) and the Agilent Bioanalyzer 2100 System (Agilent Technologies, CA, USA) to verify the integrity and quality of RNA.

After each sample was quantified, the libraries were built and sequenced on the Illumina HiSeq X Ten platform. After filtering, clean reads were mapped to the reference genome sequence obtained in this study with HISAT2 [188]. Only reads with a perfect match or one mismatch were retained for further analysis. Cufflinks was used to quantify the expression of each gene in fragments per kilobase of transcript per million fragments mapped (FPKM) [189]. For each sequenced library, the read counts were adjusted by the edgeR package [190] with one scaling normalization factor. Differentially expressed gene (DEG) analysis within the two sample groups (stages and tissues) was performed using the EBSeq R package [191], and the P values were then corrected for the false discovery rate (FDR) using the Benjamini-Hochberg (BH) procedure [192]; thresholds of FDR ≤ 0.01 and fold change (FC) ≥ 2 were applied to remove false positives.
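The quantities named in this paragraph can be made concrete with a minimal Python sketch of the FPKM formula, a generic Benjamini-Hochberg adjustment and the two thresholds. It is not the Cufflinks, edgeR or EBSeq implementation. Treating the fold-change threshold as symmetric (two-fold up or down) is an assumption, and all function names and input numbers are hypothetical.

```python
from typing import List


def fpkm(fragments: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_deg(fdr: float, fold_change: float, max_fdr: float = 0.01, min_fc: float = 2.0) -> bool:
    """Apply FDR <= 0.01 and FC >= 2; the symmetric treatment of FC is an assumption."""
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    return fdr <= max_fdr and max(fold_change, 1.0 / fold_change) >= min_fc


# Made-up inputs for illustration.
print(fpkm(fragments=500, transcript_length_bp=1000, total_mapped_fragments=10_000_000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(is_deg(fdr=0.005, fold_change=3.2))
```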

Metagenomic sequencing and analysis

To test whether symbiotic microbes facilitate environmental adaptation in H. cunea, detailed profiles of the gut microflora were obtained by metagenomic sequencing. Midgut samples were collected from ten last-instar larvae of H. cunea from the same wild population on their host plant (Quercus mongolica) and preserved in RNAlater. DNA was extracted from a mixture of the ten gut samples using a gut microbiota DNA extraction kit (QIAamp DNA Stool Mini Kit; Qiagen) and stored at −20 °C. A paired-end gut microbiota DNA library was built using the NEBNext DNA Library Prep Master Mix Set for Illumina (New England Biolabs, Ipswich, MA, USA). Sequencing was then performed on the Illumina HiSeq platform. The raw reads were checked and filtered as follows: 1) reads containing adapters were removed; 2) reads with low-quality bases (quality value ≤ 10) or N bases were removed; 3) host-derived reads were filtered out by eliminating fall webworm genome sequences, to give a cleaner bacterial dataset.
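The first two filtering steps can be sketched as a simple per-read check. The text does not state how many low-quality or N bases were tolerated per read, nor which adapter sequences were screened, so the `max_bad_fraction` parameter and the adapter string below are hypothetical placeholders. Host-read removal (step 3) requires alignment to the fall webworm genome and is not shown.

```python
def passes_filters(seq: str, quals, adapter: str,
                   low_q: int = 10, max_bad_fraction: float = 0.5) -> bool:
    """Return True if a read survives the adapter and quality filters.

    seq     -- read sequence
    quals   -- per-base Phred quality scores, same length as seq
    adapter -- adapter sequence to screen for (hypothetical placeholder)
    Assumption: a read is discarded when more than max_bad_fraction of its
    bases are N or have quality <= low_q; this threshold is illustrative.
    """
    # Step 1: discard reads containing the adapter sequence.
    if adapter and adapter in seq:
        return False
    # Step 2: count bases that are N or have quality value <= low_q.
    bad = sum(1 for base, q in zip(seq, quals) if base == "N" or q <= low_q)
    return bad / max(len(seq), 1) <= max_bad_fraction


if __name__ == "__main__":
    print(passes_filters("ACGTACGT", [30] * 8, adapter="GGGGGG"))
    print(passes_filters("NNNNNCGT", [30] * 8, adapter="GGGGGG"))
```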

Kraken [193] was used for taxonomic identification and relative abundance calculations. Reads were classified against the NCBI Reference Sequence Database (RefSeq), which includes high-quality bacterial, archaeal and viral data, and this step further filtered out nonbacterial genome sequences. The microbiota composition was visualized with Krona [194] and Python scripts.

De novo assembly was performed with IDBA-UD [195] (parameters: --mink 21, --maxk 101, --step 20, --pre_correction), yielding sequences longer than 500 bp. The assembly quality was assessed with QUAST [196]. MetaGeneMark [197] was used to perform ab initio gene prediction with the default settings. Prophage prediction was performed by BLAST (E-value: 1E-05) against a local database based on the ACLAME database. Transposable elements, including DNA transposons, long terminal repeats (LTRs), long interspersed elements (LINEs) and short interspersed elements (SINEs), were identified using RepeatMasker v 4.0.5 and RepeatProteinMasker [161]. A nonredundant dataset was generated with CD-HIT [198] with a minimum coverage cut-off of 0.9 for the shorter sequences. All genes in the nonredundant dataset were translated into amino acid sequences and aligned to the relevant databases (NR, COG, KEGG, Swiss-Prot, CAZy and ARDB) by BLASTP (E-value ≤ 1E-05). Blast2GO was used to obtain GO annotations, and HMMER v 3.0 [199] was used to annotate the sequences against the Pfam database.

RNA interference with silk fibroin genes

To study the web-producing mechanism in the fall webworm, silk-gland-related genes were identified by analyzing the H. cunea genome and the silk gland transcriptome produced in this study. Three genes encoding structural proteins, fibroin heavy chain (Fib-H), fibroin light chain (Fib-L) and protein 25 (P25), were silenced by RNA interference (RNAi) to examine their biological functions. RNAi was performed by injecting the corresponding gene-specific double-stranded RNAs (dsRNAs), and green fluorescent protein (GFP) dsRNA was used as a negative control. The dsRNAs for HcunFib-H, HcunFib-L, HcunP25 and GFP were synthesized using the MEGAscript RNAi Kit (Ambion, Austin, TX, USA) following the manufacturer’s procedure and purified by lithium chloride precipitation. After quantification with a NanoDrop 2000 (Thermo Fisher Scientific, Wilmington, DE, USA) and verification by 1% agarose gel electrophoresis, the four dsRNAs were stored at −80 °C before use. Newly molted third-instar larvae were then injected in the abdomen with 4 μg of the targeted dsRNA in 1 μL using a Nanoliter 2000 injector (World Precision Instruments, Sarasota, FL, USA). In total, 20 individuals were injected, divided into four plastic boxes (20 cm × 10 cm × 5 cm) and fed fresh mulberry leaves (6 g per box per day); of these, 15 individuals were used to observe the phenotype (N = 3), while 5 individuals were used for RT-qPCR validation (N = 5). The effect of RNAi was examined by RT-qPCR 4 days after injection; total RNA from each of the 5 insects was quantified separately, and 2 μg was used for reverse transcription (SuperScript™ III First-Strand Synthesis SuperMix), with β-actin employed as an internal control. RT-qPCR was performed on a StepOnePlus Real-Time PCR Detection System (Bio-Rad, Hercules, CA, USA) using TransStar Tip Top Green qPCR Supermix (TransGen Biotech, Beijing, China). The silk web was collected from each box within 10 days after injection.
Because the silk filaments were difficult to quantify, they were rolled into a tight ball, and the diameter of the silk ball was used to calculate the silk quantity. The RT-qPCR data were analyzed by the 2^(−ΔΔCT) method. The primers used in this study are listed in Table S14, and the efficiency of each primer pair was tested before the RT-qPCR experiments.
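For reference, the 2^(−ΔΔCT) calculation can be written out as a short sketch. Here the reference gene corresponds to β-actin and the control group to the dsGFP-injected larvae, as described above. The function name and the CT values in the example are hypothetical, and the formula assumes amplification efficiencies close to 100% for both primer pairs.

```python
def relative_expression(ct_target_treated: float, ct_ref_treated: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression of a target gene by the 2^(-ddCT) method.

    treated -- larvae injected with gene-specific dsRNA
    control -- larvae injected with dsGFP
    ref     -- internal control gene (beta-actin in the text)
    """
    # Normalize the target gene to the reference gene within each group.
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the treated group to the control group.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical CT values: the target amplifies two cycles later after RNAi.
    print(relative_expression(26.0, 18.0, 24.0, 18.0))
```

A value below 1 indicates reduced transcript abundance relative to the dsGFP control, which is the expected direction for a successful knockdown.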