Chromosome-level genome assembly and annotation of eel goby (Odontamblyopus rebecca)

Lü, Zhenming; Yu, Ziwei; Luo, Wenkai; Liu, Tianwei; Wang, Yuzheng; Liu, Yantang; Liu, Jing; Liu, Bingjian; Gong, Li; Liu, Liqin; Li, Yongxin

doi:10.1038/s41597-024-02997-8


Data Descriptor
Open access
Published: 02 February 2024

Volume 11, article number 160, (2024)

Scientific Data

Yongxin Li (ORCID: orcid.org/0000-0002-9555-1387)


Abstract

Eel gobies fascinate researchers with their unique body structure, benthic lifestyle, and degenerated eyes. However, genome assembly and exploration of the genomic composition of eel gobies are still in their infancy, which has severely limited research progress on gobies. In this study, multi-platform sequencing data were generated and used to assemble and annotate the genome of O. rebecca at the chromosome level. The assembled genome size of O. rebecca is 918.57 Mbp, which is close to the genome size estimated from 17-mer analysis (903.03 Mbp). The scaffold N50 is 41.67 Mbp, and 23 chromosomes were assembled using Hi-C technology with an anchoring rate of 99.96%. Genome annotation indicates that repetitive sequences make up 53.29% of the genome, and 22,999 protein-coding genes were predicted, of which 21,855 have functional annotations. The chromosome-level genome of O. rebecca will not only provide important genomic resources for comparative genomic studies of gobies, but also expand our knowledge of the genetic origin of the unique features that have fascinated researchers for decades.


Background & Summary

Gobies have evolved many distinctive morphological features. For example, their pelvic fins have fused to form a suction cup for clinging to rocks so that they are not swept away by rapids¹. The eel goby (Odontamblyopus rebecca) (Fig. 1), which belongs to the genus Odontamblyopus (Gobiidae: Amblyopinae)², is an eel-like benthic burrowing fish that lives mainly in warm waters such as the South China Sea and the Indo-West Pacific³. A relatively recent addition to the gobies, named in 2003, O. rebecca has evolved even more fascinating features, including a unique eel-like body plan, particularly degenerated eyes, and a benthic lifestyle. In recent years, studies of O. rebecca have mainly focused on geographic distribution patterns^4,5 and phylogeny based on mitochondrial genome data⁶. However, the unique phenotypic characteristics of O. rebecca and their underlying molecular mechanisms cannot be fully understood by analyzing mitochondrial genome data alone. A high-quality genome assembly with an accurate annotation of the protein-coding genes is therefore particularly important as a basis for understanding the genetic mechanisms behind the unique phenotypes of O. rebecca.

In recent years, the rapid development of high-throughput sequencing technologies and the gradual reduction in their cost have made large-scale genome sequencing and assembly feasible in non-model taxa. Among them, next-generation sequencing (NGS) is highly accurate but is limited to short reads (typically 100 bp or 150 bp) and is thus not ideal for resolving repetitive sequences. Third-generation sequencing (TGS), in contrast, offers long reads (typically 20–30 Kb) but compromises on sequencing accuracy at the single-base level⁷. The prevailing genome-assembly strategy therefore combines the merits of both technologies by assembling the reference genome from TGS data and correcting assembly errors with NGS data. In combination with high-throughput chromosome conformation capture technology (Hi-C), the genome can be further assembled to the chromosome level⁸. This assembly strategy has been employed to address many important scientific problems in teleost fishes to date^9,10,11.

In this study, we used next-generation DNBSEQ short reads (MGI Tech Co., Ltd, Shenzhen, China), third-generation Nanopore long reads (Oxford Nanopore Technologies (ONT)), Hi-C and RNA-Seq sequencing data to assemble and annotate the O. rebecca genome. The assembled genome is 918.57 Mbp in size, with 23 pseudochromosomes anchored. The assembly was assessed using several metrics, including the scaffold N50 (41.67 Mbp), the BUSCO score (97.75%), and the mapping ratios of short reads (99.65%) and transcripts (99.82%), all of which indicate high contiguity and quality. In addition, 22,999 protein-coding genes were predicted, of which 21,855 genes were functionally annotated against public databases, supporting the reliability of the predictions. The assembled chromosome-level genome of O. rebecca will not only provide important genomic resources for phylogenetic and comparative genomic studies of eel gobies, but also expand our understanding of the possible genetic origin of their unique features, such as the eel-like body plan and the particularly degenerated eyes that have fascinated researchers for decades.

Methods

Sampling, library construction, and sequencing

The O. rebecca sample was collected from the intertidal zone of Zhangzhou, Fujian Province, China. Briefly, dissection was performed in a sterilized environment, and organs including muscle, liver, and intestine were sampled and snap-frozen in liquid nitrogen for nucleic acid extraction. All anatomical procedures complied with the relevant ethical regulations of the Institutional Animal Care and Use Committee of Zhejiang Ocean University, Zhejiang, China (Protocol Number: 2023082). Genomic DNA was extracted from muscle using a QIAGEN kit (QIAGEN, Cat. No. 13343). Total RNA was extracted from muscle, liver, and intestine using TRIzol reagent (Invitrogen, Carlsbad, CA, USA)¹². After extraction, the size and integrity of the DNA and RNA were evaluated using 1% agarose gel electrophoresis, and their concentration and purity were further analyzed using a Nanodrop 2000c ultraviolet spectrophotometer.

For genome assembly of O. rebecca, Nanopore sequencing libraries were first prepared with the SQK-LSK109 Ligation Sequencing Kit (Oxford Nanopore Technologies) following the manufacturer’s instructions. The prepared libraries were sequenced on R9.4.1 flow cells using a PromethION DNA sequencer (Oxford Nanopore Technologies) to generate the Nanopore long-read data. Second, short-insert (350–700 bp) paired-end libraries were constructed from the extracted genomic DNA using the MGIEasy FS DNA Library Prep Kit (BGI, Cat. No. 1000006988) and sequenced on the MGIDNB T7 platform to generate the DNBSEQ short-read data used to correct and evaluate the assembly. In addition, Hi-C libraries were constructed from the isolated genomic DNA, after fragmentation and purification with magnetic beads, to generate the Hi-C data used for chromosome-level assembly.

For genome annotation of O. rebecca, complementary DNA libraries were constructed from RNA isolated from muscle, liver, and intestine using the VAHTS Universal V6 RNA-seq Library Prep Kit for MGI (Vazyme, NRM604) according to the manufacturer’s instructions. Oligo dT magnetic beads were used to capture the mRNA, which was then fragmented with magnesium ions. The fragmented mRNA was reverse transcribed into short cDNA using random primers, end repair and A-tail addition were performed, and the libraries were also sequenced on the MGIDNB platform.

Quality control of raw sequencing data

All raw sequencing data generated in this study were filtered to remove adaptors, low-quality bases, and duplicate reads, using different strategies depending on the platform. For the DNBSEQ short reads, we used fastp software v0.23.2¹³ to remove adaptor sequences, low-quality reads, and short sequences with parameters set as “-l = 50, -w = 6”. We then checked the cleaned data using FastQC software v0.11.9¹⁴ and found high per-base quality scores, indicating the high quality of the sequencing data (Fig. 2). The Nanopore long reads were filtered using NanoFilt software v2.8.0¹⁵ with the parameter “-q = 7”. The Hi-C and RNA-seq data were filtered using the same method and parameter settings as the DNBSEQ short reads. After filtering, we obtained 48.21 Gbp of DNBSEQ short reads (Table 1), 84.64 Gbp of Nanopore long reads with an N50 length of 27.72 Kb (Table 2), and 146.02 Gbp of Hi-C sequencing data (Table 3). In addition, we obtained 41.10 Gbp of liver transcriptome data, 15.78 Gbp of muscle transcriptome data, and 6.62 Gbp of intestine transcriptome data (Table 4).
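The two thresholds quoted above (a minimum length of 50 for short reads and a minimum quality of 7 for long reads) can be illustrated with a minimal Python sketch. It is not the fastp or NanoFilt implementation: the function names are hypothetical, Phred+33 quality encoding is assumed, and averaging quality in error-probability space is an assumption about how a mean read quality is best computed.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged in error-probability space (Phred+33 assumed)."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_short_read(sequence, min_len=50):
    """Length filter corresponding to the reported minimum length of 50."""
    return len(sequence) >= min_len


def keep_long_read(quality_string, min_q=7.0):
    """Quality filter corresponding to the reported minimum quality of 7."""
    return mean_phred(quality_string) >= min_q
```

Averaging error probabilities rather than Phred scores means a few very poor bases pull the read quality down sharply: a read with one Q40 base and one Q2 base has a mean quality near 5, not 21.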

Table 1 Statistics of the genome sequencing data generated from MGIDNB T7 platform.


Table 2 Statistics of the sequencing reads generated from Nanopore platform.


Table 3 Statistics of the Hi-C sequencing data generated from MGIDNB T7 platform.


Table 4 Statistics of RNA-seq data generated from MGIDNB T7 platform.


Genome size estimation

DNBSEQ short reads were used to estimate the genome size based on k-mer analyses. To this end, k-mer frequencies were counted from all filtered high-quality DNBSEQ short reads using kmerfreq v1.0¹⁶ with the parameters “-k = 17, -l”. A k-mer size of 17 was selected because it has been shown to generate enough unique k-mers for a sound genome size estimate in genomes within the size range typical of Gobiidae^17,18,19. The genome size was estimated using the formula: genome size = TKN_17-mer/PKFD_17-mer, where TKN_17-mer is the total number of k-mers and PKFD_17-mer is the peak frequency depth of the 17-mer. The estimated genome size was then used to evaluate the subsequent genome assembly. The results revealed an estimated genome size of ~903.03 Mbp in O. rebecca. The k-mer distribution of the genome consists of three peaks (Fig. 3), which may correspond to the heterozygous, homozygous, and repeated k-mers, respectively, as is commonly observed in other teleost fishes^20,21.
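The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the kmerfreq implementation: the histogram values are invented toy numbers, the function name is hypothetical, and the `min_depth` cutoff used to skip low-depth error k-mers when locating the peak is an assumption rather than a parameter reported in this study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as TKN / PKFD from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers seen at that depth.
    """
    # TKN: total number of k-mers, i.e. every distinct k-mer counted once per occurrence
    total_kmers = sum(depth * count for depth, count in histogram.items())
    # PKFD: depth of the highest peak, ignoring depths below the assumed error cutoff
    candidates = [depth for depth in histogram if depth >= min_depth]
    peak_depth = max(candidates, key=lambda depth: histogram[depth])
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values, not data from this study)
toy_histogram = {1: 1000, 2: 200, 20: 50, 40: 400, 41: 390, 80: 20}
size, peak = estimate_genome_size(toy_histogram)
```

With a multi-peak distribution like the one in Fig. 3, the choice of which peak to treat as the main depth matters; this sketch simply takes the tallest peak above the cutoff.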

Genome assembly

Nanopore long reads have a higher error rate at the single-base level than DNBSEQ short reads. We therefore first performed error correction on the raw sequencing data, and the resulting clean Nanopore long reads were then assembled using NextDenovo software v2.4.0 (https://github.com/Nextomics/NextDenovo) with parameters set as “read_type = ont, read_cutoff = 1k, and pa_correction = 3”. In this step, the filtered Nanopore reads were split and compared with each other using Minimap2 software v2.9²² to find overlaps between reads and remove redundant overlaps, and a string graph algorithm was then applied to build the assembly. NextPolish software v1.4.1²³ was further employed to correct base errors (SNV/Indel) and improve the accuracy of the assembly using the DNBSEQ short reads, with parameters set as “sgs_options = -max_depth 100 -bwa, lgs_options = -min_read_len 1k -max_depth 100, lgs_minimap2_options = -x map-ont”. Redundant heterozygous contigs were identified and removed with Purge_haplotigs software v1.0.4²⁴, based on sequence similarity and the proportion of each contig's total length that was redundant. The preliminary assembly yielded a genome size of 918.80 Mbp with 191 contigs and a contig N50 of 24.75 Mbp (Table 5).

Hi-C sequencing data were then used for chromosome assembly with the 3D de novo assembly (3D-DNA) software v170123²⁵, with parameters set as “rounds = 0, stage = polish”. Juicer software²⁶ and JuiceBox software v1.11.08²⁷ were used for interaction map generation and error correction (Fig. 4). Finally, 23 chromosomes were obtained with a scaffold N50 of 41.67 Mbp (Table 5; Figs. 4, 5), and 99.96% of the contigs were anchored to chromosomes (Table 6). This chromosome number is consistent with that observed in the closely related species Boleophthalmus pectinirostris, Periophthalmus modestus (Gobiidae: Oxudercinae) and Taenioides sp. (Gobiidae: Amblyopinae). In addition, all 23 pseudochromosomes could be easily distinguished in the heatmap (Fig. 4), and the interaction signal around the diagonal was strong, indicating the high quality of this genome assembly.
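The contiguity and anchoring statistics reported above follow standard definitions, sketched here in Python. The function names are hypothetical, and computing the anchoring rate by sequence length (rather than by contig count) is an assumption, since the text does not state which was used.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def anchoring_rate(chromosome_lengths, unplaced_lengths):
    """Percentage of assembled bases placed on chromosomes (length-based, assumed)."""
    anchored = sum(chromosome_lengths)
    return 100 * anchored / (anchored + sum(unplaced_lengths))
```

For example, `n50([10, 8, 6, 4, 2])` is 8, because the two longest sequences together (18) already cover half of the total length (15).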

Table 5 Statistics of the assembled genome based on the Nanopore and Hi-C data.


Table 6 Summary of the chromosome assemblies for O. rebecca based on Hi-C data.


Genome evaluation

The contiguity of the genome assembly is reflected by the contig and scaffold N50 values reported above (contig: 24.75 Mbp; scaffold: 41.67 Mbp). Here, the quality of the genome assembly was further assessed using three additional analyses: BUSCO, short-read mapping ratio, and transcript mapping ratio. (1) For BUSCO analysis, the Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.3.1²⁸ software was used to search against the single-copy orthologous gene library of Actinopterygii (https://busco-data.ezlab.org/v5/data/lineages/actinopterygii_odb10.2021-02-19.tar.gz) to assess the integrity of coding regions in the assembled genome. Of the 3,640 core genes searched, 3,558 were complete (97.75%), comprising 3,517 single-copy genes (96.62%) and 41 multi-copy genes (1.13%), while 30 were fragmented (0.82%) and 52 were missing (1.43%) (Table 7). (2) For short-read mapping ratio analysis, the genome index was first built by the BWA-MEM software v0.7.17-r1188²⁹ using the parameters “-a bwtsw”. The DNBSEQ short reads were then mapped to the genome to assess the completeness of the assembly, and the mapping ratio was calculated with the flagstat function of SAMtools software. The total mapping rate of DNBSEQ short reads to the genome was 99.65%, the paired mapping rate was 99.64%, and the properly paired mapping rate was 94.74% (Table 8). (3) For transcript mapping ratio analysis, all the RNA-Seq reads (99.16 Mb) were first assembled into transcripts using StringTie software v1.3.5 (Linux_x86_64)³⁰, and the transcripts were then mapped to the genome using BLAT software v37x1³¹. The results showed that a total of 41,624 reads were mapped to the genome, with a mapping rate of 99.82% (Table 9). Taken together, these results indicate that we obtained a high-quality chromosome-level assembly of the O. rebecca genome.
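As a worked check, the BUSCO percentages can be recomputed from the four gene counts reported in the text. The function name below is hypothetical; only the counts come from this study.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Return the total number of BUSCO genes and each category as a percentage."""
    total = single + duplicated + fragmented + missing
    counts = {
        "complete": single + duplicated,  # complete = single-copy + multi-copy
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return total, {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts as reported for the O. rebecca assembly
total, pct = busco_percentages(single=3517, duplicated=41, fragmented=30, missing=52)
```

The complete, fragmented, and missing categories partition the 3,640 searched genes, so their percentages sum to 100%, while the single-copy and multi-copy figures are subdivisions of the complete category.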

Table 7 Results of the BUSCO assessment for genome assembly in O. rebecca.


Table 8 The mapping ratio of the short reads to the assembled genome of O. rebecca.


Table 9 The mapping ratio of transcripts to the assembled genome of O. rebecca.


Annotation of repetitive sequences

To annotate the repetitive elements in the O. rebecca genome, including tandem repeats and transposable elements (TEs), we integrated a homology prediction using the Repbase library³² (http://www.girinst.org/repbase) and a de novo prediction based on self-sequence alignment and repetitive sequence features. Tandem repeats were annotated using Tandem Repeat Finder software v4.09³³ with parameters set as “Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, MaxPeriod = 2000 -d -h”. TEs were predicted de novo at both the DNA and protein levels. At the DNA level, RepeatModeler software v1.0.11³⁴ (-database mydb -pa 10) and LTR-FINDER v1.0.7³⁵ (-w 2 -o 3 -t 1 -e 1 -m 2 -u -2 -D 20000 -d 1000 -L 3500 -l 100 -p 20 -g 50 -G 2 -T 4 -S 6.00 -M 0.00 -B 0.400 -b 0.400 -O 40 -F 0) were used to build a de novo repeat library. RepeatMasker software open-4.0.9³⁶ (http://repeatmasker.org) (-nolow -no_is -norna -parallel 2) was then run against the de novo library and Repbase (RepBase v.16.02) separately to identify homologous repeats. At the protein level, RepeatProteinMask v4.0.9 was used to search for TEs in its protein database. Finally, the annotation results for all repetitive sequences were merged into a final set. In total, 489.68 Mb of sequence was identified as repetitive (including TEs, satellites, simple repeats, others, and unknowns) in the O. rebecca genome, accounting for 53.29% of the genome size (Table 10). Of this, 297.90 Mb was annotated as transposable elements (TEs), accounting for 32.43% of the genome (Table 11). The four major types of TEs, namely DNA elements, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) (Fig. 5), account for 16.12% (148.10 Mbp), 8.49% (78.00 Mbp), 1.07% (9.82 Mbp), and 6.75% (61.98 Mbp) of the genome, respectively.
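When annotations from several tools are merged, overlapping calls need to be collapsed so that no base is counted twice. The Python sketch below shows one common way to do this by merging overlapping intervals. It is an assumption about how such a merge can be done, not the procedure used in this study; the function name and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def merged_repeat_length(annotations):
    """Non-redundant repeat length from (seq_id, start, end) records.

    Coordinates are assumed to be 1-based and inclusive, as in GFF files.
    """
    by_sequence = defaultdict(list)
    for seq_id, start, end in annotations:
        by_sequence[seq_id].append((start, end))

    total = 0
    for intervals in by_sequence.values():
        intervals.sort()
        current_start, current_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= current_end + 1:  # overlapping or directly adjacent
                current_end = max(current_end, end)
            else:
                total += current_end - current_start + 1
                current_start, current_end = start, end
        total += current_end - current_start + 1
    return total


# Toy calls from different tools (hypothetical coordinates)
toy_calls = [("chr1", 1, 100), ("chr1", 50, 150), ("chr1", 200, 250), ("chr2", 10, 20)]
repeat_bp = merged_repeat_length(toy_calls)
```

Dividing the merged length by the assembly size gives a repeat fraction of the kind reported in Table 10.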

Table 10 Statistics of the annotated repeat sequences in O. rebecca genome.


Table 11 Statistics of the repetitive elements in O. rebecca genome.


Prediction of protein-coding genes

To obtain a high-confidence gene set, a combination of three strategies (de novo prediction, homology-based prediction, and transcript-based prediction) was used to annotate the protein-coding genes. (1) For de novo prediction, GlimmerHMM software v3.0.4³⁷, Genscan software v1.0³⁸, and Augustus software v3.3.2³⁹ (-species = zebrafish, -uniqueGeneId = true, -noInFrameStop = true, -gff3 = on, -strand = both) were used. (2) For homology-based prediction, the predicted protein-coding gene sequences of related species, including Oryzias latipes (GCF_002234675.1), Boleophthalmus pectinirostris (GCF_026225935.1), Periophthalmus magnuspinnatus (GCF_009829125.3), Takifugu rubripes (GCF_901000725.2) and Danio rerio (GCF_000002035.6), were first downloaded from public databases. All sequences were then aligned to the O. rebecca genome using TBLASTN software v2.11.0+⁴⁰ with an e-value of 0.01. The TBLASTN results were further processed using exonerate software v2.2.0⁴¹ with parameters set as “model = protein2genome, showtargetgff = 1” to obtain the final homology-based prediction for each species. (3) For transcript-based prediction, StringTie software v1.3.5 was first used to assemble transcripts with parameters set as “-f 0.1 -m 200 -a 10 -c 2.5 -g 50 -M 1.0”. HISAT2 software v2.1.0⁴² was then used to map the RNA-seq data to the genome with parameters set as “-dta -summary-file -S -x -1 -2”. TransDecoder software v5.5.0 (https://github.com/TransDecoder/TransDecoder) was used to predict the coding region of each transcript with parameters set as “-retain_long_orfs_mode dynamic -retain_long_orfs_length 150 -T 500”. Finally, Maker2 software v2.31.10⁴³ was used to integrate the gene annotation results generated by the three methods into the final gene set, with parameters set as “-r local -o tmp -p 4”. A total of 22,999 protein-coding genes were predicted in the O. rebecca genome (Table 12). We checked the quality of the annotated genes by comparing them with several evolutionarily related species. The distributions of mRNA length, CDS length, exon length, and intron length in O. rebecca were similar to those in the closely related species (Fig. 6), possibly indicating that O. rebecca shares similar patterns of gene structure with the published genomes.
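The gene structure statistics compared in Fig. 6 can be derived from exon coordinates alone. The following Python sketch shows the arithmetic for a single transcript; the function name and the example coordinates are hypothetical, and 1-based inclusive coordinates (as in GFF3) are assumed.

```python
def exon_intron_lengths(exons):
    """Exon and intron lengths for one transcript.

    exons is a list of (start, end) pairs in 1-based inclusive coordinates.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron spans the bases strictly between two consecutive exons
    intron_lengths = [
        next_start - previous_end - 1
        for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return exon_lengths, intron_lengths


# Hypothetical three-exon transcript
exon_len, intron_len = exon_intron_lengths([(101, 200), (301, 450), (1001, 1100)])
```

Collecting these lengths over every predicted transcript gives per-species distributions that can be compared across genomes.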

Table 12 Functional annotation of the predicted protein-coding genes in O. rebecca genome.


Functional annotation of protein-coding genes

To evaluate the annotation quality and obtain biological function information for the predicted protein-coding gene set, we compared the predicted protein sequences with public protein databases, including InterPro⁴⁴ (2021) (https://www.ebi.ac.uk/interpro/), GO⁴⁵ (5.61–93.0) (http://geneontology.org/docs/go-annotations/), Kyoto Encyclopedia of Genes and Genomes (KEGG)⁴⁶ (3.0) (http://www.genome.jp/kegg/), SwissProt⁴⁷ (2021) (http://www.uniprot.org/), TrEMBL (2021) (http://www.uniprot.org/), TF (AnimalTFDB3.0), Pfam⁴⁸ (01.34.0) (http://pfam.xfam.org), NCBI Non-Redundant Protein Sequence Database (NR) (2021) (https://www.ncbi.nlm.nih.gov/refseq/about/non-redundantproteins/), and Eukaryotic Orthologous Groups of Proteins (KOG) (2003) (ftp://ftp.ncbi.nih.gov/pub/COG/KOG/kyva). Functional information was analyzed using BLAST software v2.31.10⁴⁹. A total of 21,855 genes could be annotated, accounting for 95.03% of the protein-coding genes, and only 1,144 genes could not be annotated, accounting for 4.97% of the protein-coding genes (Table 12), further suggesting that the assembly and annotation of the O. rebecca genome are reliable.
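The annotation rate can be illustrated with a short Python sketch. It assumes that a gene counts as annotated when it has a hit in at least one database, which the text does not state explicitly; the function name, gene identifiers, and toy hit sets are hypothetical.

```python
def annotation_summary(predicted_genes, hits_by_database):
    """Count genes with a hit in at least one database (assumed definition)."""
    predicted = set(predicted_genes)
    annotated = set().union(*hits_by_database.values()) & predicted
    n_annotated = len(annotated)
    n_unannotated = len(predicted) - n_annotated
    percent = round(100 * n_annotated / len(predicted), 2)
    return n_annotated, n_unannotated, percent


# Toy example with hypothetical gene identifiers
toy_genes = ["g1", "g2", "g3", "g4"]
toy_hits = {"SwissProt": {"g1", "g2"}, "KEGG": {"g2", "g3"}}
summary = annotation_summary(toy_genes, toy_hits)

# Worked check of the reported figures: 21,855 annotated of 22,999 predicted genes
reported_annotated_pct = round(100 * 21855 / 22999, 2)
reported_unannotated_pct = round(100 * 1144 / 22999, 2)
```

The reported counts are internally consistent: 21,855 annotated plus 1,144 unannotated genes equals the 22,999 predicted genes.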

Data Records

The genomic DNBSEQ short-insert sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR25064244⁵⁰. The genomic Nanopore sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR25064242⁵¹. The transcriptome sequencing data were deposited in the Sequence Read Archive at NCBI under accession numbers SRR25064238⁵², SRR25064239⁵³, SRR25064240⁵⁴, and SRR25064243⁵⁵. The Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession number SRR25064241⁵⁶. The final genome assembly was deposited in GenBank at NCBI with the accession number ASM3068695v1⁵⁷; the submitted GenBank assembly number is GCA_030686955.1, the BioProject number is PRJNA977196, and the BioSample ID is SAMN35453534. The annotation results for repetitive sequences, gene structure, and functional prediction were deposited in the Figshare database under the DOI https://doi.org/10.6084/m9.figshare.23689398⁵⁸.

Technical Validation

Genome evaluation

The quality of the O. rebecca genome assembly was evaluated using N50, BUSCO, short-read mapping ratio, and transcript mapping ratio analyses. The results showed that the assembly had good contiguity, a high percentage of complete and single-copy genes, and high mapping rates of short reads and transcripts, indicating a high-quality assembly.

Code availability

The software used in this study is in the public domain, with parameters clearly described in Methods. Where detailed parameters were not provided for the software, default parameters were used instead, as suggested by the developers. No custom script or code was used.

References

Forker, G. K., Schoenfuss, H. L., Blob, R. W. & Diamond, K. M. Bendy to the bone: Links between vertebral morphology and waterfall climbing in amphidromous gobioid fishes. J. Anat. 239, 747–754, https://doi.org/10.1111/joa.13449 (2021).
Murdy, E. O. & Shibukawa, K. A revision of the gobiid fish genus Odontamblyopus (Gobiidae: Amblyopinae). Ichthyol. Res. 48, 31–43, https://doi.org/10.1007/s10228-001-8114-9 (2001).
Murdy, E. O. & Shibukawa, K. Odontamblyopus rebecca, a new species of amblyopine goby from Vietnam with a key to known species of the genus (Gobiidae: Amblyopinae). Zootaxa 138, 1–6, https://doi.org/10.11646/zootaxa.138.1.1 (2003).
Lü, Z. M. Climate adaptation and drift shape the genomes of two eel-goby sister species endemic to contrasting latitude. Animals 13, 3240, https://doi.org/10.3390/ani13203240 (2023).
Tang, W. X. et al. Cryptic species and historical biogeography of eel gobies (Gobioidei: Odontamblyopus) along the Northwestern Pacific Coast. Zool. Sci. 27, 8–13, https://doi.org/10.2108/zsj.27.8 (2010).
Liu, Z. S. et al. Complete mitochondrial genome of three fish species (Perciformes: Amblyopinae): genome description and phylogenetic relationships. Pak. J. Zool. 49, 107–115, https://doi.org/10.17582/journal.pjz/2017.49.1.107 (2017).
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351, https://doi.org/10.1038/nrg.2016.49 (2016).
Belton, J. M. et al. Hi-C: A comprehensive technique to capture the conformation of genomes. Methods 58, 268–276, https://doi.org/10.1016/j.ymeth.2012.05.001 (2012).
Bi, X. P. et al. Tracing the genetic footprints of vertebrate landing in non-teleost ray-finned fishes. Cell 184, 1377–1391, https://doi.org/10.1016/j.cell.2021.01.046 (2021).
Lü, Z. M. et al. Large-scale sequencing of flatfish genomes provides insights into the polyphyletic origin of their specialized body plan. Nature Genet. 53, 742–751, https://doi.org/10.1038/s41588-021-00836-9 (2021).
Wang, K. et al. African lungfish genome sheds light on the vertebrate water-to-land transition. Cell 184, 1362–1376, https://doi.org/10.1016/j.cell.2021.01.047 (2021).
Rio, D. C., Ares, M. Jr., Hannon, G. J. & Nilsen, T. W. Purification of RNA using TRIzol (TRI reagent). Cold Spring Harb Protoc 6, pdb.prot5439, https://doi.org/10.1101/pdb.prot5439 (2010).
Chen, S. F., Zhou, Y. Q., Chen, Y. R. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, 884–890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Andrews, S. FastQC: A Quality Control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/ (2010).
De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669, https://doi.org/10.1093/bioinformatics/bty149 (2018).
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
You, X. X. et al. Mudskipper genomes provide insights into the terrestrial adaptation of amphibious fishes. Nat. Commun. 5, 5594, https://doi.org/10.1038/ncomms6594 (2014).
Bian, C. et al. Genomics comparisons of three chromosome-level mudskipper genome assemblies reveal molecular clues for water-to-land evolution and adaptation. J. Adv. Res. 21, S2090–1232, https://doi.org/10.1016/j.jare.2023.05.005 (2023).
Liu, Y. T. et al. Genome sequencing provides novel insights into mudflat burrowing adaptations in eel goby Taenioides sp. (Teleost: Amblyopinae). Int. J. Mol. Sci. 24, 12892, https://doi.org/10.3390/ijms241612892 (2023).
Cai, M. Y. et al. Chromosome assembly of Collichthys lucidus, a fish of Sciaenidae with a multiple sex chromosome system. Sci. Data 6, 132, https://doi.org/10.1038/s41597-019-0139-x (2019).
Zhang, K. et al. A chromosome-level reference genome assembly of the Reeve’s moray eel (Gymnothorax reevesii). Sci. Data 10, 501, https://doi.org/10.1038/s41597-023-02394-7 (2023).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Hu, J., Fan, J. P., Sun, Z. Y. & Liu, S. L. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460, https://doi.org/10.1186/s12859-018-2485-7 (2018).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Kent, W. J. BLAT - The BLAST-like alignment tool. Genome Res. 12, 656–664, https://doi.org/10.1101/gr.229202 (2002).
Jurka, J. et al. Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467, https://doi.org/10.1159/000084979 (2005).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268, https://doi.org/10.1093/nar/gkm286 (2007).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protocols. BioInf. 25, 4–10, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879, https://doi.org/10.1093/bioinformatics/bth315 (2004).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94, https://doi.org/10.1006/jmbi.1997.0951 (1997).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic. Acids. Res. 34, W435–W439, https://doi.org/10.1093/nar/gkl200 (2006).
Gertz, E. M., Yu, Y. K., Agarwala, R., Schäffer, A. A. & Altschul, S. F. Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biol. 4, 41, https://doi.org/10.1186/1741-7007-4-41 (2006).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Finn, R. D. et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199, https://doi.org/10.1093/nar/gkw1107 (2017).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29, https://doi.org/10.1038/75556 (2000).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30, https://doi.org/10.1093/nar/28.1.27 (2000).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48, https://doi.org/10.1093/nar/28.1.45 (2000).
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412–D419, https://doi.org/10.1093/nar/gkaa913 (2021).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410, https://doi.org/10.1016/s0022-2836(05)80360-2 (1990).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064244 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064242 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064238 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064239 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064240 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064243 (2023).
NCBI Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra/SRR25064241 (2023).
NCBI GenBank https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_030686955.1/ (2023).
Lü, Z. M. Chromosome-level genome assembly and annotation of eel goby, Odontamblyopus rebecca. figshare. Dataset. https://doi.org/10.6084/m9.figshare.23689398.v1 (2023).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC; grant nos. 41976121 and 42171069).

Author information

These authors contributed equally: Zhenming Lü, Ziwei Yu, Wenkai Luo.

Authors and Affiliations

National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
Zhenming Lü, Tianwei Liu, Yuzheng Wang, Yantang Liu, Jing Liu, Bingjian Liu, Li Gong & Liqin Liu
School of Ecology and Environment, Northwestern Polytechnical University, Xi’an, 710072, China
Ziwei Yu, Wenkai Luo & Yongxin Li

Contributions

Yongxin Li and Zhenming Lü conceived and designed the research. Tianwei Liu, Yuzheng Wang, and Yantang Liu collected the samples and extracted the genomic DNA. Zhenming Lü, Ziwei Yu, Wenkai Luo, Bingjian Liu, Li Gong, and Liqin Liu conducted the experiments and analyzed part of the data. Ziwei Yu, Wenkai Luo, and Zhenming Lü wrote the manuscript. All authors read, revised, and approved the final version of the manuscript.

Corresponding author

Correspondence to Yongxin Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lü, Z., Yu, Z., Luo, W. et al. Chromosome-level genome assembly and annotation of eel goby (Odontamblyopus rebecca). Sci Data 11, 160 (2024). https://doi.org/10.1038/s41597-024-02997-8

Received: 25 August 2023
Accepted: 25 January 2024
Published: 02 February 2024
DOI: https://doi.org/10.1038/s41597-024-02997-8
Springer Nature Limited
