Background & Summary

The largemouth bass, Micropterus salmoides (Perciformes, Centrarchidae), is native to North America and has been introduced to other parts of the world, including the Iberian Peninsula, Italy, Mexico and China, as either a game fish or a farmed fish1,2,3. It is now one of the ten most common aquatic species on every continent except Antarctica4,5, and has been listed among the top 100 invasive species6, with temperature and hydrologic changes as the main predictors of its distribution1,7. Although its main habitat is freshwater lakes and rivers, it also colonizes brackish waters, such as those of the Gulf of Mexico and the Atlantic coast of North America8. Largemouth bass was introduced into China from the US in 19832, and it has become one of the main aquaculture species in China owing to its fast growth2,9.

Whole-genome information is the basis for studying the biology of organisms, including their advantages during biological invasions and their adaptation to extreme environments such as hypoxia10,11,12, climate change13,14, temperature15,16 and salinity17,32,33.

Briefly, muscle tissue cells were fixed with formaldehyde to maintain the 3D structure of DNA in the cells, and were then digested with the restriction endonuclease Hind III. Biotin-labeled bases were then introduced during DNA end repair. DNA (4 µg) was fragmented with a Covaris S220 focused-ultrasonicator (Gene Company Limited, Hong Kong) and 300–700 bp fragments were recovered. The DNA fragments containing interaction information were captured with streptavidin immunomagnetic beads for library construction. Library concentration and insert size were determined using the Qubit 3.0 and LabChip GX platforms (PerkinElmer), respectively, and qPCR was used to estimate the effective concentration of the library. High-quality Hi-C libraries were sequenced on the Illumina NovaSeq 6000 sequencing platform, and the sequencing data were used for chromosome-level assembly34. The Burrows-Wheeler Aligner (BWA-MEM v. 0.7.10-r789) was used to align the clean paired-end reads to the assembled genome to obtain the uniquely mapped read pairs35, which were then processed using HiC-Pro36. The genome contigs, split into 50 kb segments and combined with the uniquely mapped Hi-C data, were clustered, ordered and oriented onto the pseudochromosomes using LACHESIS34 with the following parameters: CLUSTER_MIN_RE_SITES = 30; CLUSTER_MAX_LINK_DENSITY = 2; CLUSTER_NONINFORMATIVE_RATIO = 2; ORDER_MIN_N_RES_IN_TRUNK = 68; ORDER_MIN_N_RES_IN_SHREDS = 67. Finally, the chromosome assemblies were cut into bins of equal length (100 kb), and the interaction signals generated by the valid mapped read pairs between each pair of bins were visualized in a heat map.
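The final binning step can be illustrated with a minimal sketch. It assumes the valid read pairs on one pseudochromosome are already available as coordinate pairs; the function name `bin_contacts` and the toy input are hypothetical and are not part of the HiC-Pro or LACHESIS workflows used here.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as stated in the text


def bin_contacts(read_pairs, bin_size=BIN_SIZE):
    """Count read pairs per (bin_a, bin_b) cell on a single pseudochromosome.

    read_pairs: iterable of (pos1, pos2) 0-based coordinates (hypothetical input).
    Returns a Counter keyed by bin-index pairs with bin_a <= bin_b, i.e. the
    upper triangle of the symmetric contact matrix behind a heat map.
    """
    counts = Counter()
    for pos1, pos2 in read_pairs:
        a, b = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(a, b)] += 1
    return counts


# Toy example: four read pairs on one chromosome (illustrative values only).
pairs = [(10_000, 50_000), (120_000, 30_000), (150_000, 180_000), (250_000, 20_000)]
matrix = bin_contacts(pairs)
for cell in sorted(matrix):
    print(cell, matrix[cell])
```

In a well-ordered assembly, counts concentrate in cells near the diagonal (small distance between bin indices), which is the pattern a heat map of this kind is inspected for.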

In total, 277.88 million read pairs (77.53 Gb clean data; 94.06 × coverage of the genome) were generated from the Hi-C library (Table 1), of which 77.26% were uniquely mapped to the assembled genome. Of the uniquely mapped read pairs, 60.67% (130.26 million) were valid interaction pairs, which were used for the subsequent Hi-C assembly (Table S1). A total of 844.00 Mb (99.9%) of the assembled genome sequences were anchored on 23 chromosomes, and the order and direction of 827.39 Mb (98.03%) of these sequences could be determined. The detailed distribution of each chromosome sequence is shown in Table 2. The heat map of the Hi-C interaction bins is consistent with a genome assembly of excellent quality (Fig. 3). The final M. salmoides genome assembly was 844.88 Mb in size, with a contig N50 of 15.30 Mb and a scaffold N50 of 35.77 Mb (Table 1).
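As a reading aid, the sketch below shows how N50 is conventionally computed and re-derives the reported read-pair and anchoring figures from the numbers in the text. The function `n50` and the toy sequence lengths are illustrative assumptions; the real contig and scaffold lengths are not given here.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths (illustrative only, not the real contigs).
toy_n50 = n50([80, 70, 50, 40, 30, 20])

# Read-pair funnel, in millions of pairs, using the percentages in the text.
total_pairs_m = 277.88
unique_pairs_m = total_pairs_m * 0.7726   # uniquely mapped pairs
valid_pairs_m = unique_pairs_m * 0.6067   # valid interaction pairs

# Anchoring rates, in Mb, using the assembly sizes in the text.
anchored_pct = 844.00 / 844.88 * 100      # anchored on 23 chromosomes
ordered_pct = 827.39 / 844.00 * 100       # ordered and oriented

print(toy_n50)
print(round(valid_pairs_m, 2), round(anchored_pct, 2), round(ordered_pct, 2))
```

The recomputed valid-pair count agrees with the reported 130.26 million to within rounding of the published percentages.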

Table 2 The sequence distribution of each chromosome using Hi-C technology.
Fig. 3 Hi-C chromosome interaction heat map. Chr01 to Chr23 denote the 23 chromosomes. The abscissa and ordinate represent the order of the bins on the corresponding chromosome. The colour scale indicates the intensity of interaction from yellow (low) to red (high).

Repeat prediction

The repetitive elements of the M. salmoides genome were identified and annotated using RepeatModeler2, which incorporates RECON37 and RepeatScout38. The derived repetitive sequences were searched against curated libraries and the repetitive DNA element databases Repbase39, REXdb40 and Dfam41. LTR_retriever42 was applied to the output of LTRharvest43 and LTR_FINDER44 to identify LTR retrotransposons. The results were combined and deduplicated, and the repetitive elements were finalized with RepeatMasker45. About 38.19% of the M. salmoides genome consisted of repetitive sequences, composed mainly of class II transposable elements (Table 3).

Table 3 The repeat sequence statistics of the assembled genome.

Gene prediction and annotation

Gene structure prediction was based on three different strategies: ab initio-based, homolog-based, and unigene-based. Genscan46, Augustus v2.447, GlimmerHMM v3.0.448, GeneID v1.449 and SNAP (version 2006-07-28)50 were used to perform ab initio-based prediction. GeMoMa v1.3.151,52 was used for prediction based on homologous species. Hisat v2.0.453 and Stringtie v1.2.354 were used for assembly based on reference transcripts, and TransDecoder v2.0 and GeneMarkS-T v5.155 were used for gene prediction. PASA v2.0.256 was used to predict unigene sequences based on reference-free assembly of full-length transcriptome data. Finally, EVM v1.1.157 was used to integrate the prediction results obtained by the above three methods, and PASA v2.0.2 was used to refine the final gene models. A total of 26,370 protein-coding genes were predicted by integrating the ab initio, homology-based and RNA-seq predictions (Table S2), with an average gene length of 14,483 bp, exon length of 2,601 bp, coding sequence length of 1,724 bp and intron length of 11,882 bp (Table 4). Finally, 25,760 genes (97.69% of the total) were successfully annotated against the GO, KEGG, KOG, TrEMBL, and NR databases (Table S3).
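The relationship between the reported gene, exon and intron lengths can be sketched as follows. This is a minimal illustration under stated assumptions: the function `gene_structure` and the toy exon coordinates are hypothetical, exons are assumed to be non-overlapping, and coordinates are assumed to be 1-based and inclusive (GFF-style).

```python
def gene_structure(exons):
    """Return (gene_length, exon_length, intron_length) for one gene model.

    exons: list of (start, end) tuples, 1-based inclusive, non-overlapping.
    Gene length is the span from the first exon start to the last exon end;
    intron length is whatever part of that span is not covered by exons.
    """
    exons = sorted(exons)
    gene_len = exons[-1][1] - exons[0][0] + 1
    exon_len = sum(end - start + 1 for start, end in exons)
    intron_len = gene_len - exon_len
    return gene_len, exon_len, intron_len


# Toy gene model with three exons (illustrative coordinates only).
toy = gene_structure([(100, 200), (300, 450), (600, 700)])
print(toy)

# The reported averages follow the same identity: exon + intron = gene length.
avg_exon, avg_intron, avg_gene = 2_601, 11_882, 14_483
averages_consistent = (avg_exon + avg_intron == avg_gene)

# Share of predicted genes with a functional annotation, from the text.
annotated_pct = 25_760 / 26_370 * 100
print(averages_consistent, round(annotated_pct, 2))
```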

Table 4 The basic information statistics of the assembled genome.

Blastn searches of the M. salmoides genome against the Rfam database58 were used to identify microRNA and rRNA, and tRNAscan-SE59 was used to identify tRNA. A total of 2,639 non-coding RNAs were predicted, including 633 microRNAs (miRNA) of 84 families, 230 rRNA genes of 4 families and 1,830 tRNA genes of 25 families (Table S4). Pseudogenes were predicted as follows. The predicted protein sequences were used to search for homologous gene sequences (putative genes) through BLAT alignment60. GeneWise61 was then used to search for premature stop codons and frameshift mutations in these gene sequences to identify pseudogenes. In total, 986 pseudogenes were identified, with a total length of 5,885,501 bp and an average length of 5,969 bp (Table S4).

Data Records

The sequencing data (full-length transcriptome, Hi-C, Illumina and PacBio) have been deposited in the SRA (Sequence Read Archive) database as SRR1288657562, SRR1288657663, SRR1288657764, and SRR1288657865. The genome assembly was deposited in GenBank66. The genome assembly, gene CDS and exon data, and functional annotations were also stored in Figshare67.

Technical Validation

The assembly was evaluated using three criteria: the mapping of Illumina reads, core gene integrity, and BUSCO assessment. CEGMA v2.568 and BUSCO v 3.069 were used to evaluate the conserved core genes in the genome. Nearly all of the Illumina reads (99.54%) mapped to the assembled genome, including 97.78% of paired-end reads. A total of 445 of the 458 conserved eukaryotic core genes from the CEGMA database were found in the assembled genome (Table S5). Finally, 97.49% of the complete BUSCOs were included in the assembled genome (Table S5). In summary, this is a high-quality de novo reference genome assembly.