Chromosome-level genome assembly and annotation of the yellow grouper, Epinephelus awoara

Zhang, Weiwei; Yang, Yang; Hua, Sijie; Ruan, Qingxin; Li, Duo; Wang, Le; Wang, **; Wen, **n; Liu, **aochun; Meng, Zining

doi:10.1038/s41597-024-02989-8

Data Descriptor
Open access
Published: 31 January 2024

Volume 11, article number 151, (2024)

Scientific Data

Weiwei Zhang¹,
Yang Yang^1,2,3,
Sijie Hua¹,
Qingxin Ruan¹,
Duo Li¹,
Le Wang⁴,
** Wang⁵,
**n Wen⁶,
**aochun Liu^1,7 &
…
Zining Meng^1,7

Abstract

Epinephelus awoara, also known as the yellow grouper, is an economically important marine fish that has been bred artificially in China. However, the genetic structure and evolutionary history of the yellow grouper remain largely unknown. Here, we present a high-quality chromosome-level genome assembly of the yellow grouper generated using PacBio single-molecule real-time (SMRT) sequencing and high-throughput chromosome conformation capture (Hi-C) technologies. The 984.48 Mb chromosome-level genome was assembled with a contig N50 length of 39.77 Mb and a scaffold N50 length of 41.39 Mb. Approximately 99.76% of the assembled sequences were anchored into 24 pseudo-chromosomes with the assistance of Hi-C reads. Furthermore, approximately 41.17% of the genome was composed of repetitive elements. In total, 24,541 protein-coding genes were predicted, of which 22,509 (91.72%) were functionally annotated. The highly accurate, chromosome-level reference genome assembly and annotation are crucial to understanding the population genetic structure, adaptive evolution and speciation of the yellow grouper.

Background & Summary

Groupers belong to the subfamily Epinephelinae of the family Epinephelidae and mainly inhabit tropical and subtropical coral reefs or continental shelves, where they act as top predators that maintain the ecological balance of coral reef ecosystems¹. Groupers encompass over 16 genera and more than 160 species, of which approximately 47 species are currently cultivated for aquaculture², making them globally significant economic fish species. According to statistics from the Food and Agriculture Organization (FAO), the global aquaculture production of groupers in 2020 amounted to 226.2 thousand tonnes³. In China, according to the China Fisheries Statistical Yearbook 2023, the aquaculture production of groupers in 2022 reached 205.8 thousand tonnes, ranking fourth in terms of marine aquaculture production in China⁴. Groupers therefore hold significant ecological importance and commercial value. In recent years, several high-quality chromosome-level reference genomes of groupers have been assembled, including seven species of the Epinephelus genus^{5,6,7,8,9,10,11,12,13}, one species of the Plectropomus genus^14,15,16,17, one species of the Cromileptes genus¹⁸, and one species of the Cephalopholis genus³⁴. Finally, the Hi-C library was subjected to paired-end sequencing with 150 bp read lengths using the MGISEQ-2000 platform to capture the spatial interactions between chromosomal regions. As a result, 109.97 Gb of Hi-C read data were generated, with an average sequencing coverage of 111.71× (Table 1).

RNA sequencing (RNA-Seq) using short-read sequencing technology is a widely used method for transcriptome profiling³⁵, while emerging single-molecule, long-read RNA-Seq technologies have enabled new approaches to studying the transcriptome and its function³⁶. SMRT isoform sequencing (Iso-Seq) on the PacBio platform can generate full-length cDNA sequences³⁷. The read lengths achieved with these technologies (~15 kb) exceed the lengths of most transcripts. In this study, to obtain transcript evidence for annotating gene structures, we performed both RNA-Seq and Iso-Seq on total RNA. Total RNA was extracted by grinding tissue in TRIzol reagent (Tiangen) on dry ice and processed following the manufacturer's protocol. RNA integrity was assessed with the Agilent 2100 Bioanalyzer (Agilent Technologies) and agarose gel electrophoresis. RNA purity and concentration were determined with the Nanodrop (Thermo Fisher Scientific) and Qubit (Thermo Fisher Scientific). Then, equal amounts of RNA were pooled for sequencing. Finally, RNA-Seq and Iso-Seq were performed on the MGISEQ-2000 platform and the PacBio Sequel II platform, respectively. A total of 17.79 Gb of RNA-Seq data and 64.59 Gb of clean Iso-Seq data were generated (Table 1), which were then used for whole-genome protein-coding gene prediction.

Genome survey

A k-mer analysis was performed on the raw MGI paired-end reads prior to genome assembly to estimate genome size and heterozygosity. Briefly, 56.71 Gb of raw data were filtered with fastp v0.21.0³⁸ using the parameters “-n 0 -f 5 -F 5 -t 5 -T 5 -q 20”, and 52.65 Gb of clean data were retained (Table 1). The 17-mer frequency distribution of the quality-filtered clean reads was computed with the KMC program³⁹ using the parameters “-k17 -ci1 -cs1000000”. The genome size was estimated using FindGSE⁴⁰ and GenomeScope (v1.0.0)⁴¹ with default parameters. In total, 34,657,425,513 k-mers were counted, with a k-mer peak at a depth of 35 (Table 2). The genome size of the yellow grouper was estimated as k-mer number / k-mer depth = 990.21 Mb. The heterozygosity rate was estimated to be approximately 0.40% from the k-mer depth distribution (Table 2).
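The genome size estimate above can be reproduced with a few lines of Python. This is a minimal sketch rather than the FindGSE or GenomeScope model: it only applies the ratio stated in the text (k-mer number divided by k-mer peak depth). The function names, the `min_depth` cutoff and the toy histogram are illustrative assumptions that do not come from the study.

```python
def find_main_peak(histogram, min_depth=5):
    """Return the depth with the highest k-mer count at or above min_depth.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    min_depth is a hypothetical cutoff used to skip the low-depth bins,
    which are usually dominated by sequencing errors.
    """
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no histogram bins at or above min_depth")
    return max(candidates, key=candidates.get)


def estimate_genome_size(total_kmers, peak_depth):
    """Estimate genome size in bp as total k-mers / peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Toy histogram (invented values) to show how a peak depth is located.
toy_histogram = {1: 900, 2: 300, 3: 80, 20: 40, 34: 95, 35: 100, 36: 90, 70: 5}
toy_peak = find_main_peak(toy_histogram)

# Values reported in the text: 34,657,425,513 k-mers and a peak depth of 35.
size_bp = estimate_genome_size(34_657_425_513, 35)
size_mb = size_bp / 1e6
print(f"peak depth in toy histogram: {toy_peak}")
print(f"estimated genome size: {size_mb:.2f} Mb")
```

With the reported counts, the ratio gives the 990.21 Mb quoted in the text.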

Table 2 The result of k-mer analysis.


De novo assembly of the yellow grouper genome

The raw PacBio CCS reads were used for de novo genome assembly with hifiasm v0.19.4⁴² using default parameters. To further improve the accuracy of the assembly, the preliminary assembly was polished with short reads from the same individual using four iterative correction rounds of NextPolish (v1.2.4⁴³) with default parameters. To evaluate the accuracy of the assembly, all the Illumina paired-end reads were mapped to the assembled genome using BWA (Burrows-Wheeler Aligner, v0.7.12-r1039⁴⁴), and the mapping rate as well as the genome coverage of the sequencing reads were assessed using Minimap2 v r41⁴⁵ with the parameter “-x map-pb”. In addition, the base accuracy of the assembly was calculated with samtools v1.4⁴⁶ and Bcftools v1.8.0⁴⁷ using default parameters. To avoid including mitochondrial sequences in the assembly, the draft genome assembly was aligned against the NT library using blast v2.9⁴⁸, and the aligned sequences were removed. The resulting assembly consists of 64 contigs with a total length of 984.53 Mb and a contig N50 length of 40.27 Mb (Table 3).
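For readers unfamiliar with the N50 statistic reported above, the following sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the toy contig lengths are illustrative assumptions and are not taken from the yellow grouper assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with invented contig lengths (arbitrary units).
toy_contigs = [40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(f"N50 of toy contigs: {toy_n50}")
```

In the toy example the two longest contigs (40 and 30) are needed to reach half of the total length of 100, so the N50 is 30.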

Table 3 Assembly statistics of yellow grouper.


Pseudochromosome construction

In total, 109.97 Gb of clean paired-end reads were generated from the Hi-C libraries. First, low-quality sequences (quality scores < 20), adaptor sequences and sequences shorter than 30 bp were filtered out using fastp v0.21.0³⁸ with default parameters. Then, the clean paired-end reads were mapped to the draft assembly using bowtie2 v2.3.2⁴⁹ with the parameters “--end-to-end --very-sensitive -L 30” to obtain uniquely mapped paired-end reads. Valid interaction read pairs were identified from the uniquely mapped paired-end reads and retained by HiC-Pro v3.1.0⁵⁰ for further analysis; invalid read pairs, including dangling-end, self-cycle, re-ligation and dumped products, were filtered out. The scaffolds were then clustered, ordered and oriented onto chromosomes by Lachesis⁵¹ with the parameters CLUSTER_MIN_RE_SITES = 100, CLUSTER_MAX_LINK_DENSITY = 2.5, CLUSTER_NONINFORMATIVE_RATIO = 1.4, ORDER_MIN_N_RES_IN_TRUNK = 60, and ORDER_MIN_N_RES_IN_SHREDS = 60. Finally, placement and orientation errors exhibiting obvious discrete chromatin interaction patterns were manually adjusted. Following the scaffolding procedure, 974.86 Mb of sequence were successfully anchored to the 24 chromosomes with an integration efficiency of 99.02%, and the chromosome lengths ranged from 23.08 Mb to 48.78 Mb (Table 4). After Hi-C scaffolding, the 984.48 Mb chromosome-level genome of the yellow grouper was assembled, with a contig N50 length of 39.77 Mb and a scaffold N50 length of 41.39 Mb (Table 3). Moreover, we evaluated the Hi-C-based pseudo-chromosome construction. The 24 scaffolds are clearly distinguishable in the heatmap, and the interaction signal around the diagonal is strong (Fig. 2a), indicating the high quality of the pseudochromosome assembly.
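The integration efficiency quoted above is simply the anchored length divided by the total assembly length. The short sketch below recomputes it from the two figures given in this section; the function name is a hypothetical helper, not part of any scaffolding tool.

```python
def anchoring_rate(anchored_mb, assembly_mb):
    """Return the percentage of the assembly anchored to chromosomes."""
    if assembly_mb <= 0:
        raise ValueError("assembly_mb must be positive")
    return 100.0 * anchored_mb / assembly_mb


# Figures reported in this section: 974.86 Mb anchored out of 984.48 Mb.
rate = anchoring_rate(974.86, 984.48)
print(f"anchored: {rate:.2f}%")
```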

Table 4 Statistics of yellow grouper genome sequence length (chromosome level).


Repeat annotation

We first annotated tandem repeats, including simple sequence repeats (SSRs) and tandem repeat elements, using GMATA v2.2⁵² and Tandem Repeats Finder (TRF v4.07b⁵³) with default parameters. Then, transposable elements (TEs) in the yellow grouper genome were identified using a combination of ab initio and homology-based methods. Briefly, an ab initio repeat library was first predicted using MITE-hunter⁵⁴ with the parameters “-n 20 -P 0.2 -c 3” and RepeatModeler version open-2.0.4⁵⁵ with the parameter “-engine wublast”, in which LTR_FINDER⁵⁶, LTRharvest⁵⁷ and LTR_retriever⁵⁸ were run together to detect repeat sequences in the yellow grouper genome. The resulting library was then aligned to TEclass Repbase (http://www.girinst.org/repbase) to classify each repeat family using TEclass v2.1.3⁵⁹. To further identify repeats throughout the genome, RepeatMasker (open-4.1.4)⁶⁰ was applied to search for known and novel TEs by mapping sequences against the de novo repeat library and the Repbase TE library with the parameters “-nolow -no_is -gff -norna -engine abblast -lib lib”. Overlapping transposable elements belonging to the same repeat class were collated and combined. A total of 405.30 Mb of sequence, 41.17% of the yellow grouper genome, was identified as repetitive elements (Table 5 and Fig. 2b). We estimated that tandem repeats make up approximately 0.73% of the yellow grouper genome, including 0.18% SSRs and 0.56% other tandem repeats (Table 5 and Fig. 2b). Transposable elements accounted for approximately 35.68% (351.25 Mb) of the genome (Table 5 and Fig. 2b). Among these transposable elements, DNA transposons were the main type, occupying 20.08% (197.69 Mb) of the genome. Retroelements, including long interspersed nuclear elements (LINEs, 7.52%), long terminal repeats (LTRs, 4.24%), and short interspersed nuclear elements (SINEs, 1.02%), accounted for 12.78% of the genome (Table 5).
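The repeat fractions in this paragraph can be cross-checked against the 984.48 Mb assembly size reported for the chromosome-level genome. The sketch below is only an arithmetic consistency check on the published figures; the variable and function names are illustrative.

```python
GENOME_MB = 984.48  # chromosome-level assembly size reported in the text


def genome_percent(length_mb, genome_mb=GENOME_MB):
    """Return length_mb as a percentage of the genome size."""
    return 100.0 * length_mb / genome_mb


# Lengths (Mb) reported for each repeat category.
reported_mb = {
    "all repeats": 405.30,
    "transposable elements": 351.25,
    "DNA transposons": 197.69,
}
percentages = {name: round(genome_percent(mb), 2) for name, mb in reported_mb.items()}

# Retroelement classes are reported as percentages, so they are summed directly.
retroelement_pct = round(7.52 + 4.24 + 1.02, 2)  # LINEs + LTRs + SINEs

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
print(f"retroelements: {retroelement_pct:.2f}%")
```

The recomputed values agree with the percentages stated in the text (41.17%, 35.68%, 20.08% and 12.78%).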

Table 5 Repetitive elements sequence statistics of the assembled genome.


Gene model prediction and functional annotations

Protein-coding genes were predicted in the repeat-masked genome with three independent approaches: homology-based prediction, transcriptome-assisted annotation, and de novo prediction. For homology-based gene prediction, we used GeMoMa v1.6.1⁶¹ with default parameters to align the protein-coding sequences from E. fuscoguttatus (brown-marbled grouper⁹), E. moara (kelp grouper¹⁰), E. lanceolatus (giant grouper¹¹), Cromileptes altivelis (humpback grouper¹⁸), Plectropomus leopardus (leopard coral grouper¹⁶), Danio rerio (zebrafish, GCF_000002035.6⁶²), and Oryzias latipes (Japanese medaka, GCF_002234675.1) to the genome assembly and thereby obtain gene structure information. For transcriptome-based prediction, the filtered long-read Iso-Seq and short-read RNA-Seq data were aligned to the reference genome using STAR v2.7.3a⁶³; the transcripts were then assembled using Stringtie v1.3.4d⁶⁴, and open reading frames (ORFs) were predicted using PASA v2.3.3⁶⁵ to produce a training set. For de novo prediction, Augustus v3.3.1⁶⁶ was run with the training set using the parameters “--gff3=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg --allow_hinted_splicesites=gcag,atac --min_intron_len=30 --softmasking=1”. Finally, EVidenceModeler (EVM, v1.1⁶⁵) was used to produce an integrated gene set, from which genes containing TEs were removed using the TransposonPSI package (http://transposonpsi.sourceforge.net/) and miscoded genes were further filtered. Untranslated regions (UTRs) and alternative splicing regions were determined using PASA v2.3.3⁶⁵ based on the RNA-Seq assemblies. We retained the longest transcript for each locus, and regions outside the ORFs were designated UTRs.
Furthermore, we functionally annotated the predicted protein-coding genes by comparing them with public databases including SwissProt⁶⁷, the NCBI non-redundant protein database (NR), the Kyoto Encyclopedia of Genes and Genomes (KEGG)⁶⁸, Eukaryotic Orthologous Groups of proteins (KOG)⁶⁹, and Gene Ontology (GO)⁷⁰. Putative domains and GO terms of genes were identified using the InterProScan program with default parameters. For the other four databases, BLASTp (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to compare the EVidenceModeler-integrated protein sequences against these public protein databases with the parameters “-evalue 1e-5 -max_target_seqs 1”⁶⁵. Results from the five database searches were concatenated using EVidenceModeler v1.1⁶⁵.

A total of 24,541 protein-coding genes were predicted in the genome, with an average gene length of 20,681.6 bp and an average CDS length of 1,743.35 bp. On average, each gene contained 10.22 exons, with an average exon length of 170.5 bp and an average intron length of 2,052.99 bp (Table 6 and Fig. 2b). Further, 22,509 genes were functionally annotated, accounting for 91.72% of all predicted genes (Table 7 and Fig. 2f).
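As a rough plausibility check, the gene structure averages above can be compared with each other. The sketch below recomputes the annotation rate and then approximates the mean CDS and gene lengths from the mean exon count, exon length and intron length. Multiplying averages is only an approximation (a product of means is not the mean of products), so small deviations from the reported values are expected; all names are illustrative.

```python
# Figures reported in the text.
PREDICTED_GENES = 24_541
ANNOTATED_GENES = 22_509
MEAN_EXONS_PER_GENE = 10.22
MEAN_EXON_LEN_BP = 170.5
MEAN_INTRON_LEN_BP = 2_052.99
REPORTED_CDS_LEN_BP = 1_743.35
REPORTED_GENE_LEN_BP = 20_681.6

# Share of predicted genes with a functional annotation.
annotation_rate = 100.0 * ANNOTATED_GENES / PREDICTED_GENES

# Approximate lengths built from the averages: a gene with n exons has n - 1 introns.
approx_cds_len = MEAN_EXONS_PER_GENE * MEAN_EXON_LEN_BP
approx_gene_len = approx_cds_len + (MEAN_EXONS_PER_GENE - 1) * MEAN_INTRON_LEN_BP


def relative_difference(approx, reported):
    """Return |approx - reported| / reported."""
    return abs(approx - reported) / reported


print(f"annotation rate: {annotation_rate:.2f}%")
print(f"approximate CDS length: {approx_cds_len:.1f} bp")
print(f"approximate gene length: {approx_gene_len:.1f} bp")
```

Under this approximation both lengths fall within 1% of the reported averages, which suggests that the summary statistics are internally consistent.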

Table 6 Statistical results of gene structure prediction.


Table 7 Summary of gene annotation in the assembled genome.


Non-coding RNAs (ncRNAs) were annotated using two strategies: database searching and model-based prediction. Transfer RNAs (tRNAs) were predicted using tRNAscan-SE v2.0⁷¹ with the parameters “--thread 4 -E -I”. MicroRNA (miRNA), rRNA, small nuclear RNA, and small nucleolar RNA were detected using the “cmscan” subprogram of Infernal v1.1.2⁷² to search the Rfam database⁷³ with following parameter. The rRNAs and their subunits were predicted using RNAmmer v1.2⁷⁴ with the parameters “-S euk -m lsu,ssu,tsu -gff”. As a result, we annotated 1,295 rRNAs, 1,946 miRNAs, 276 regulatory RNAs and 2,391 tRNAs (Table 8 and Fig. 2b).

Table 8 Statistics of annotated non-coding RNAs.


Data Records

The raw sequence data, including the PacBio long-read data, MGI short-read genomic sequencing data, Hi-C data and transcriptomic sequences (RNA-Seq and Iso-Seq data), have been deposited in the Genome Sequence Archive (GSA⁷⁵) at the National Genomics Data Center⁷⁶ under the accession CRA013097⁷⁷. Additionally, the raw data have also been deposited at NCBI under the accession number SRP479893⁷⁸. The assembled genome sequences have been deposited in NCBI GenBank under the accession number GCA_035609425.1⁷⁹. The whole genome sequence data and the genome annotation files reported in this paper have been deposited in the Genome Warehouse at the National Genomics Data Center^76,80, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession number GWHEQBJ00000000⁸¹.

Technical Validation

Assembly completeness and accuracy were evaluated by multiple methods. First, the clean MGI short reads and the PacBio long reads (Table 1) were re-mapped onto the assembly using BWA v0.7.12-r1039⁴⁴ and minimap2⁴⁵, respectively. With the MGI WGS short reads and the PacBio CCS long reads, 98.87% and 97.66% of the assembly, respectively, had at least 20× coverage (Table 9), demonstrating a high level of assembly accuracy. Then, Merqury v1.3⁸² was used to assess genome quality, yielding a consensus quality value (QV) of 52.10 and a completeness of 91.84%, which indicates a high level of accuracy and completeness in the assembled genome (Fig. 2c). CEGMA v3⁸³ was employed to assess the accuracy and completeness of core genes within the assembled genome. A total of 243 core genes were assembled, accounting for 97.98% of the expected core genes. Among these, 194 were fully assembled, representing 78.23% completeness, indicating a relatively comprehensive representation of core genes in the assembled genome (Fig. 2d). Benchmarking Universal Single-Copy Orthologues (BUSCO) software v5.3.1⁸⁴ was also used to evaluate the completeness of the assembly with the parameters “-l actinopterygii_odb10 -g genome”. We identified 3589 complete BUSCOs (98.60%) out of the 3640 BUSCO groups, including 3536 complete and single-copy BUSCOs (97.14%) and 53 complete and duplicated BUSCOs (1.46%). The numbers of fragmented and missing BUSCOs were 16 (0.44%) and 35 (0.96%), respectively (Fig. 2e).
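Two of the quality metrics above lend themselves to a quick numerical illustration. The sketch below converts the consensus QV into a per-base error probability, assuming the usual Phred scaling (QV = -10 log10 of the error probability), and recomputes the BUSCO percentages from the reported counts. The function names are illustrative and the code does not call Merqury or BUSCO.

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)


def busco_percentages(counts):
    """Return each BUSCO category as a percentage of all searched groups."""
    total = counts["single_copy"] + counts["duplicated"] + counts["fragmented"] + counts["missing"]
    result = {name: round(100.0 * n / total, 2) for name, n in counts.items()}
    complete = counts["single_copy"] + counts["duplicated"]
    result["complete"] = round(100.0 * complete / total, 2)
    result["total_groups"] = total
    return result


# Consensus QV reported by Merqury in the text.
error_rate = qv_to_error_rate(52.10)

# BUSCO counts reported in the text (actinopterygii_odb10).
busco_counts = {"single_copy": 3536, "duplicated": 53, "fragmented": 16, "missing": 35}
busco = busco_percentages(busco_counts)

print(f"per-base error probability at QV 52.10: {error_rate:.2e}")
print(busco)
```

Under the Phred assumption, a QV of 52.10 corresponds to roughly six errors per million bases, and the four BUSCO categories add up to the 3640 groups stated in the text.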

Table 9 The alignment of short and long-read genome sequencing to the assembled genome.


Furthermore, the completeness of the gene annotation was evaluated using BUSCO v5.3.1⁸⁴ with the actinopterygii_odb10 database. The annotated genes covered 96.87% (3526) of the complete vertebrate core gene set, indicating the high reliability of the gene prediction results (Fig. 2g). RNA-Seq reads were mapped against the annotation using Stringtie v1.3.4d⁶⁴ with default parameters, achieving an overall mapping rate of 91.76%. Next, we compared the number of genes, gene length, coding DNA sequence (CDS) length, number of exons per gene, exon length, and intron length with those of other teleost fish species (Table 10).

Table 10 The comparison of gene models annotated from the yellow grouper genome with those from teleost fishes.


Genome collinearity analysis and visualizations were performed using the MCScan tool from jcvi v1.3.8⁸⁵, obtained from https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version). We illustrated the collinearity between the yellow grouper genome and other grouper species using collinearity plots. The yellow grouper genome shows strong collinearity with related species within its genus and with the humpback grouper (C. altivelis) from a distinct genus (Fig. 3a,b). However, compared with the leopard coral grouper (P. leopardus), which belongs to another genus, it exhibits more frequent chromosomal rearrangements (Fig. 3b).

Code availability

No custom code was used in this study. All bioinformatics tools, commands and pipelines used in data processing were executed following the manual and protocols provided by the respective software developers. The versions of the software used, along with their corresponding parameters, have been thoroughly described in the Methods section.

References

Sabetian, A. The Association of Physical and Environmental Factors with Abundance and Distribution Patterns of Groupers around Kolombangara Island, Solomon Islands. Environ. Biol. Fishes 68, 93–99, https://doi.org/10.1023/A:1026048115070 (2003).
Rimmer, M. A. & Glamuzina, B. A review of grouper (Family Serranidae: Subfamily Epinephelinae) aquaculture from a sustainability science perspective. Rev. Aquac. 11, 58–87, https://doi.org/10.1111/raq.12226 (2019).
FAO. The State of World Fisheries and Aquaculture 2022. Towards Blue Transformation. (Rome, FAO, 2022).
Fisheries Administration Bureau, M. o. A. China Fishery Statistics Yearbook (2023). (China Agriculture Press, 2023).
Cao, X. et al. Chromosome-Level Genome Assembly of the Speckled Blue Grouper (Epinephelus cyanopodus) Provides Insight into Its Adaptive Evolution. Biology 11, 1810, https://doi.org/10.3390/biology11121810 (2022).
Ge, H. et al. De novo assembly of a chromosome-level reference genome of red-spotted grouper (Epinephelus akaara) using nanopore sequencing and Hi-C. Mol. Ecol. Resour. 19, 1461–1469, https://doi.org/10.1111/1755-0998.13064 (2019).
Li, S. et al. Mechanisms of sex differentiation and sex reversal in hermaphrodite fish as revealed by the Epinephelus coioides genome. Mol. Ecol. Resour. 23, 920–932, https://doi.org/10.1111/1755-0998.13753 (2023).
Wang, L. et al. A chromosome-level genome assembly of the potato grouper (Epinephelus tukula). Genomics 114, 110473, https://doi.org/10.1016/j.ygeno.2022.110473 (2022).
Yang, Y. et al. Whole-genome sequencing of brown-marbled grouper (Epinephelus fuscoguttatus) provides insights into adaptive evolution and growth differences. Mol. Ecol. Resour. 22, 711–723, https://doi.org/10.1111/1755-0998.13494 (2022).
Zhou, Q., Gao, H., Xu, H., Lin, H. & Chen, S. A Chromosomal-scale Reference Genome of the Kelp Grouper Epinephelus moara. Mar Biotechnol 23, 12–16, https://doi.org/10.1007/s10126-020-10003-6 (2021).
Zhou, Q. et al. A chromosome-level genome assembly of the giant grouper (Epinephelus lanceolatus) provides insights into its innate immunity and rapid growth. Mol. Ecol. Resour. 19, 1322–1332, https://doi.org/10.1111/1755-0998.13048 (2019).
Wang, D. et al. Whole Genome Sequencing of the Giant Grouper (Epinephelus lanceolatus) and High-Throughput Screening of Putative Antimicrobial Peptide Genes. Mar. Drugs 17, 503, https://doi.org/10.3390/md17090503 (2019).
Yang, Y. et al. Assembly of Genome and Resequencing Provide Insights into Genetic Differentiation between Parents of Hulong Hybrid Grouper (Epinephelus fuscoguttatus ♀ × E. lanceolatus ♂). Int J Mol Sci. 24, 12007, https://doi.org/10.3390/ijms241512007 (2023).
Han, W. et al. Improved chromosomal-level genome assembly and re-annotation of leopard coral grouper. Sci. Data 10, 156, https://doi.org/10.1038/s41597-023-02051-z (2023).
Wang, Y. et al. Chromosome Genome Assembly of the Leopard Coral Grouper (Plectropomus leopardus) With Nanopore and Hi-C Sequencing Data. Front Genet. 11, https://doi.org/10.3389/fgene.2020.00876 (2020).
Yang, Y. et al. Whole-genome sequencing of leopard coral grouper (Plectropomus leopardus) and exploration of regulation mechanism of skin color and adaptive evolution. Zool. Res. 41, 328, https://doi.org/10.24272/j.issn.2095-8137.2020.038 (2020).
Zhou, Q. et al. De novo sequencing and chromosomal-scale genome assembly of leopard coral grouper, Plectropomus leopardus. Mol. Ecol. Resour. 20, 1403–1413, https://doi.org/10.1111/1755-0998.13207 (2020).
Yang, Y. et al. Chromosome Genome Assembly of Cromileptes altivelis Reveals Loss of Genome Fragment in Cromileptes Compared with Epinephelus Species. Genes 12, 1873, https://doi.org/10.3390/genes12121873 (2021).
Xie, Z. et al. Chromosome-Level Genome Assembly and Transcriptome Comparison Analysis of Cephalopholis sonnerati and Its Related Grouper Species. Biology 11, 1053, https://doi.org/10.3390/biology11071053 (2022).
Ma, K. Y., Craig, M. T., Choat, J. H. & van Herwerden, L. The historical biogeography of groupers: Clade diversification patterns and processes. Mol. Phylogenet. Evol. 100, 21–30, https://doi.org/10.1016/j.ympev.2016.02.012 (2016).
Zhang, W. et al. The genetic mechanism of body size variation in groupers: insights from phylotranscriptomics. Zool Res. https://doi.org/10.24272/j.issn.2095-8137.2023.222 (2024).
Craig, M. T., Sadovy de Mitcheson, Y. & Heemstra, P. C. Groupers of the World: A Field and Market Guide. (2011).
Liu, M. et al. Primary male development of two sequentially hermaphroditic groupers, Epinephelus akaara and Epinephelus awoara (Perciformes: Epinephelidae). J. Fish Biol. 88, 1598–1613, https://doi.org/10.1111/jfb.12936 (2016).
Li, Z. et al. The complete mitochondrial genome of the hybrid offspring Epinephelus awoara ♀ × Epinephelus tukula ♂. Mitochondrial DNA B Resour 5, 1025–1026, https://doi.org/10.1080/23802359.2020.1721356 (2020).
Chen, B. et al. Biology and hatchery of Epinephelus awoara. Hebei Fisheries 2, 29–31 (2006).
Wang, S. et al. Characterization of yellow grouper Epinephelus awoara (Serranidae) karyotype by chromosome bandings and fluorescence in situ hybridization. J. Fish Biol. 80, 866–875, https://doi.org/10.1111/j.1095-8649.2012.03230.x (2012).
Yang, K. et al. Genetic Structure and Demographic History of Yellow Grouper (Epinephelus awoara) from the Coast of Southeastern Mainland China, Inferred by Mitochondrial, Nuclear and Microsatellite DNA Markers. Diversity 14, 439, https://doi.org/10.3390/d14060439 (2022).
Zhao, L. et al. Twelve novel polymorphic microsatellite loci for the Yellow grouper (Epinephelus awoara) and cross-species amplifications. Curr. Biol. 10, 743–745, https://doi.org/10.1007/s10592-008-9635-9 (2009).
Qu, M. et al. Complete mitochondrial genome of yellow grouper Epinephelus awoara (Perciformes, Epinephelidae). Mitochondrial DNA 23, 432–434, https://doi.org/10.3109/19401736.2012.710217 (2012).
Gong, G. et al. A chromosome-level genome assembly of the darkbarbel catfish Pelteobagrus vachelli. Sci. Data 10, 598, https://doi.org/10.1038/s41597-023-02509-0 (2023).
Zhou, Z. et al. The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies. Sci. Data 6, 187, https://doi.org/10.1038/s41597-019-0195-2 (2019).
Yekefenhazi, D. et al. Chromosome-level genome assembly of Nibea coibor using PacBio HiFi reads and Hi-C technologies. Sci. Data 9, 670, https://doi.org/10.1038/s41597-022-01804-6 (2022).
Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133–138, https://doi.org/10.1126/science.1162986 (2009).
Rao, S. S. P. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159, 1665–1680, https://doi.org/10.1016/j.cell.2014.11.021 (2014).
Oikonomopoulos, S. et al. Methodologies for Transcript Profiling Using Long-Read Technologies. Front. Genet. 11, https://doi.org/10.3389/fgene.2020.00606 (2020).
Zhao, L. et al. Analysis of Transcriptome and Epitranscriptome in Plants Using PacBio Iso-Seq and Nanopore-Based Direct RNA Sequencing. Front. Genet. 10, 253, https://doi.org/10.3389/fgene.2019.00253 (2019).
Gonzalez-Garay, M. L. in Transcriptomics and Gene Regulation (ed Jiaqian, Wu) 141–160 (Springer Netherlands, 2016).
Chen, S. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Deorowicz, S. et al. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 1569–1576, https://doi.org/10.1093/bioinformatics/btv022 (2015).
Sun, H. et al. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557, https://doi.org/10.1093/bioinformatics/btx637 (2018).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204, https://doi.org/10.1093/bioinformatics/btx153 (2017).
Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Hu, J. et al. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595, https://doi.org/10.1093/bioinformatics/btp698 (2010).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110, https://doi.org/10.1093/bioinformatics/btw152 (2016).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Danecek, P. & McCarthy, S. A. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics 33, 2037–2039, https://doi.org/10.1093/bioinformatics/btx100 (2017).
Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238, https://doi.org/10.1186/1471-2105-13-238 (2012).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359, https://doi.org/10.1038/nmeth.1923 (2012).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259, https://doi.org/10.1186/s13059-015-0831-x (2015).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125, https://doi.org/10.1038/nbt.2727 (2013).
Wang, X. & Wang, L. GMATA: An Integrated Software Package for Genome-Scale SSR Mining, Marker Development and Viewing. Front. Plant Sci. 7, 1350, https://doi.org/10.3389/fpls.2016.01350 (2016).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199–e199, https://doi.org/10.1093/nar/gkq862 (2010).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268, https://doi.org/10.1093/nar/gkm286 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18, https://doi.org/10.1186/1471-2105-9-18 (2008).
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 176, 1410–1422, https://doi.org/10.1104/pp.17.01310 (2018).
Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330, https://doi.org/10.1093/bioinformatics/btp084 (2009).
Bedell, J. A., Korf, I. & Gish, W. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041, https://doi.org/10.1093/bioinformatics/16.11.1040 (2000).
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 44, e89–e89, https://doi.org/10.1093/nar/gkw092 (2016).
Zebrafish Genome Assembly GRCz11 Statistics, Genome Reference Consortium. https://www.ncbi.nlm.nih.gov/grc/zebrafish/data. (2018).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21, https://doi.org/10.1093/bioinformatics/bts635 (2013).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278, https://doi.org/10.1186/s13059-019-1910-1 (2019).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7, https://doi.org/10.1186/gb-2008-9-1-r7 (2008).
Stanke, M. et al. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644, https://doi.org/10.1093/bioinformatics/btn013 (2008).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27, 49–54, https://doi.org/10.1093/nar/27.1.49 (1999).
Ogata, H. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34, https://doi.org/10.1093/nar/27.1.29 (1999).
Galperin, M. Y. et al. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 43, D261–D269, https://doi.org/10.1093/nar/gku1223 (2015).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29, https://doi.org/10.1038/75556 (2000).
Chan, P. P. et al. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–D124, https://doi.org/10.1093/nar/gki081 (2005).
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108, https://doi.org/10.1093/nar/gkm160 (2007).
Chen, T. et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics Proteomics Bioinformatics 19, 578–583, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
Members, C. N. & Partners. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2023. Nucleic Acids Res. 51, D18–D28, https://doi.org/10.1093/nar/gkac1073 (2023).
NGDC Genome Sequence Archive https://bigd.big.ac.cn/gsa/browse/CRA013097 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP479893 (2023).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_035609425.1 (2024).
Chen, M. et al. Genome Warehouse: A Public Repository Housing Genome-scale Data. Genomics Proteomics Bioinformatics 19, 584–589, https://doi.org/10.1016/j.gpb.2021.04.001 (2021).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/82944/show (2023).
Rhie, A. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Parra, G. et al. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067, https://doi.org/10.1093/bioinformatics/btm071 (2007).
Manni, M. et al. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol. Biol. Evol. 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488, https://doi.org/10.1126/science.1153917 (2008).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (32273136, U22A20531), the China Agriculture Research System of MOF and MARA (CARS-47), the Science and Technology Planning Project of Guangdong Province (2023B1212060023) and the Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) (No. SML2023SP201). We also thank GrandOmics Technologies (Wuhan, China) for their invaluable technical support in this study.

Author information

Authors and Affiliations

State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
Weiwei Zhang, Yang Yang, Sijie Hua, Qingxin Ruan, Duo Li, Xiaochun Liu & Zining Meng
Key Laboratory of Tropical Marine Fish Germplasm Innovation and Utilization, Ministry of Agriculture and Rural Affairs, Sanya, 570000, China
Yang Yang
Hainan Engineering Research Center for Germplasm Innovation and Utilization, Sanya, 570000, China
Yang Yang
Molecular Population Genetics Group, Temasek Life Sciences Laboratory, National University of Singapore, Singapore City, 119077, Singapore
Le Wang
Area of Ecology and Biodiversity, School of Biological Sciences, University of Hong Kong, Hong Kong SAR, 999077, China
Xi Wang
School of Marine Biology and Fisheries, Hainan Aquaculture Breeding Engineering Research Center, Hainan Academician Team Innovation Center, Hainan University, Haikou, 570228, China
Xin Wen
Southern Laboratory of Ocean Science and Engineering (Zhuhai), Zhuhai, 519000, China
Xiaochun Liu & Zining Meng

Contributions

Z.M. and W.Z. conceived and designed the study. Z.M. and X.L. coordinated and supervised the whole study. W.Z. and Y.Y. conducted the genome assembly and bioinformatics analysis. S.H. and Q.R. prepared the samples and the figures. W.Z. drafted the manuscript. D.L., L.W., X.W. and X.W. participated in discussions and provided suggestions for manuscript improvement. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zining Meng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, W., Yang, Y., Hua, S. et al. Chromosome-level genome assembly and annotation of the yellow grouper, Epinephelus awoara. Sci Data 11, 151 (2024). https://doi.org/10.1038/s41597-024-02989-8

Received: 23 October 2023
Accepted: 18 January 2024
Published: 31 January 2024
DOI: https://doi.org/10.1038/s41597-024-02989-8
Springer Nature Limited