Background

In shotgun metagenomic approaches, short read lengths (~100–250 bp) often yield fragmented metagenome-assembled genomes (MAGs) with uncertain levels of genomic completion after de novo assembly. These complications are primarily due to highly repetitive regions, high levels of sequence microdiversity, multiple copies of genes, and AT-rich or GC-rich regions [1]. Overcoming these limitations is paramount to understanding the role of microorganisms in natural processes and to analyzing their diversity in environmental and gut microbiomes.

The emergence of long-read (LR) sequencing technologies has renewed hopes of overcoming these limitations in genomic sequence recovery. Sequencing platforms from Oxford Nanopore and Pacific Biosciences (PacBio) can produce longer reads, although at the expense of a higher sequencing error rate and lower sequencing depth than Illumina short reads (SR) [2]. For instance, LR technologies yield median read lengths of 5 to 20 kbp and throughputs of 15 to 50 Gbp. Current sequencing chemistries yield observed modal read accuracies of 99.99%, 99.14%, and 99.9% for Illumina, Oxford Nanopore, and PacBio, respectively [3, 4]. Moreover, advances in technological and bioinformatic approaches are closing the gaps between short- and long-read sequencing applications, especially for recovering high-quality MAGs from the environment. Thus, LR shotgun metagenomics is poised to set new standards for MAG quality. For instance, current PacBio Sequel II technology offers circular consensus sequencing (CCS), which provides high-fidelity reads with a low error rate, although at a shorter read length than traditional long-read technology [3]. Additionally, better genome statistics (low number of contigs and high N50 values) [https://github.com/PacificBiosciences/pbmm2/; --preset HIFI -× 97 -N 1), which is a wrapper for minimap2 [28]. All MAGs were filtered using a quality score computed from the completion and contamination values obtained from CheckM [29] v1.1.3 ([completion%] - 5*[contamination%] >= 50). MAGs were de-replicated using dRep [30] v3.0.0 to assess the number of MAGs sharing > 99% ANI obtained from each sequencing platform. For statistical tests between pairs of MAGs, the Shapiro-Wilk normality test and Wilcoxon rank tests were performed in R v4.1.1.
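The MAG quality filter stated above ([completion%] - 5*[contamination%] >= 50) can be illustrated with a minimal sketch. The class, function names, and example values below are hypothetical and are not taken from the study; the completion and contamination inputs are assumed to be the percentages reported by CheckM.

```python
from typing import List, NamedTuple


class MagQuality(NamedTuple):
    name: str             # hypothetical MAG identifier
    completion: float     # completeness estimate, in percent
    contamination: float  # contamination estimate, in percent


def quality_score(completion: float, contamination: float) -> float:
    """Return completion% - 5 * contamination%."""
    return completion - 5.0 * contamination


def filter_mags(mags: List[MagQuality], min_score: float = 50.0) -> List[MagQuality]:
    """Keep MAGs whose quality score is at least min_score (inclusive)."""
    return [
        m for m in mags
        if quality_score(m.completion, m.contamination) >= min_score
    ]


if __name__ == "__main__":
    # Made-up values for illustration only.
    example = [
        MagQuality("mag_a", 95.0, 2.0),  # 95 - 10 = 85, kept
        MagQuality("mag_b", 70.0, 5.0),  # 70 - 25 = 45, dropped
        MagQuality("mag_c", 60.0, 2.0),  # 60 - 10 = 50, kept
    ]
    for mag in filter_mags(example):
        print(mag.name, quality_score(mag.completion, mag.contamination))
```

Weighting contamination five times more heavily than completion means that a MAG with 5% contamination needs at least 75% completion to pass.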

For Illumina metagenomes, MAG abundances were determined in two ways: as relative abundance (mapped reads/total reads) and as the truncated average sequencing depth (TAD) [31] divided by the total sequencing depth of microbial genomes (“genome equivalents”) as determined by MicrobeCensus [32] v1.1.1. The TAD was determined using BedGraph files that include zero-coverage positions (bedtools genomecov -bga) [33] and the “BedGraph.tad.rb” script (-r 0.8) from the enveomics collection [18]. Abundances for MAGs derived from LR were determined using the average (i.e., non-truncated) sequencing depth, computed as specified above, and normalized by the median sequencing depth of 16 single-copy marker genes predicted in unassembled long reads (rpl2, rpl3, rpl4, rpl5, rpl6, rpl14, rpl15, rpl16, rpl18, rpl22, rpl24, rps3, rps8, rps10, rps17, and rps19; see gene prediction and annotation below).
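The abundance calculations described above can be sketched as follows. This is a minimal illustration and not the enveomics implementation: all function names are hypothetical, and reading "-r 0.8" as keeping the central 80% of positions ranked by depth (dropping 10% at each end) is an assumption. Input lines are assumed to follow the four-column BedGraph layout (contig, start, end, depth).

```python
from statistics import mean, median
from typing import Iterable, List, Sequence


def depths_from_bedgraph(lines: Iterable[str]) -> List[float]:
    """Expand BedGraph intervals into one depth value per position."""
    depths: List[float] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        start, end, depth = int(fields[1]), int(fields[2]), float(fields[3])
        depths.extend([depth] * (end - start))
    return depths


def truncated_average_depth(depths: Sequence[float], keep_fraction: float = 0.8) -> float:
    """Mean depth over the central keep_fraction of positions ranked by depth."""
    if not depths:
        return 0.0
    ranked = sorted(depths)
    n = len(ranked)
    # Assumption: trim the same number of positions from each end.
    trim = round(n * (1.0 - keep_fraction) / 2.0)
    core = ranked[trim:n - trim]
    return mean(core) if core else 0.0


def sr_abundance(tad: float, genome_equivalents: float) -> float:
    """Short-read abundance: TAD divided by genome equivalents."""
    return tad / genome_equivalents


def lr_abundance(depths: Sequence[float], marker_depths: Sequence[float]) -> float:
    """Long-read abundance: non-truncated mean depth divided by the
    median depth of the single-copy marker genes."""
    return mean(depths) / median(marker_depths)


if __name__ == "__main__":
    # Made-up depths: one uncovered position and one extreme outlier.
    depths = [0, 4, 5, 5, 5, 5, 5, 5, 6, 100]
    print(truncated_average_depth(depths))  # outliers at both ends are trimmed
    print(mean(depths))                     # plain mean is pulled up by the outlier
```

Truncation makes the depth estimate less sensitive to highly conserved or multi-copy regions that recruit reads from other genomes, and to regions absent from part of the population.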

“Shared” MAGs (i.e., those detectable using both technologies) were defined as MAGs obtained from each technology at one specific sampling date that shared ≥ 99% ANI, as determined by fastANI [34] v1.32. Taxonomic classification of MAGs was performed using GTDB-tk [35] v1.7 and the GTDB [36] release r202. In GTDB-tk, MAGs are classified into species using a 95% ANI threshold.
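The ANI-based definition of shared MAGs can be sketched as a simple threshold on pairwise ANI values. The example below assumes the tab-delimited fastANI output layout (query, reference, ANI, mapped fragments, total fragments); the function name and file names are hypothetical.

```python
from typing import Iterable, List, Tuple


def shared_pairs(fastani_lines: Iterable[str], min_ani: float = 99.0) -> List[Tuple[str, str, float]]:
    """Return (query, reference, ANI) for pairs with ANI >= min_ani."""
    pairs: List[Tuple[str, str, float]] = []
    for line in fastani_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, reference, ani = fields[0], fields[1], float(fields[2])
        if query != reference and ani >= min_ani:
            pairs.append((query, reference, ani))
    return pairs


if __name__ == "__main__":
    # Made-up records: short-read MAGs as queries, long-read MAGs as references.
    records = [
        "sr_mag_01.fa\tlr_mag_07.fa\t99.62\t1450\t1502",
        "sr_mag_02.fa\tlr_mag_03.fa\t96.10\t980\t1210",
    ]
    for query, reference, ani in shared_pairs(records):
        print(query, reference, ani)
```

Lowering `min_ani` to 95.0 would instead group MAGs at the species level, matching the threshold used by GTDB-tk.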

Comparison of gene predictions in unassembled and assembled long-read metagenomes

Gene predictions in unassembled LR were performed using FragGeneScan [37] v1.31. However, we first compared different tools to ensure better gene predictions. For the March 10, 2020, LR sample, gene predictions were performed using Prodigal [38] v2.6.3 (meta option), MetaGeneMark [39] v3.38, and FragGeneScan [37] v1.31. For the last algorithm, we compared predictions using complete or short sequences (-w 0 or 1) and different sequencing error models (sanger_5 and sanger_10). All predicted sequences were compared against the TrEMBL protein sequence database (downloaded April 27, 2021) using DIAMOND [7). For the most part, the cross-mapping of SR and LR on unique MAG species resulted in low sequencing depth (median = 6.4 vs. 2.8) and breadth of coverage (median = 96 vs. 88.3%) for SR and LR technologies, respectively. Thus, uniquely detected species in each dataset are likely due to a combined effect of differences in GC content [46] and sequencing depth between technologies.
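For readers unfamiliar with the two mapping statistics reported above, the sketch below shows one way to compute breadth of coverage and mean sequencing depth from BedGraph-style intervals. It is an illustration under stated assumptions, not the procedure used in the study: the function name is hypothetical, and breadth is assumed to mean the percentage of positions covered by at least one read.

```python
from typing import Iterable, Tuple


def breadth_and_mean_depth(bedgraph_lines: Iterable[str]) -> Tuple[float, float]:
    """Return (breadth of coverage in percent, mean depth) for one genome.

    Each line is expected as: contig, start, end, depth (whitespace-separated),
    with zero-depth intervals included.
    """
    total = 0         # positions seen
    covered = 0       # positions with depth > 0
    depth_sum = 0.0   # sum of per-position depths
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        span = int(fields[2]) - int(fields[1])
        depth = float(fields[3])
        total += span
        depth_sum += depth * span
        if depth > 0:
            covered += span
    if total == 0:
        return 0.0, 0.0
    return 100.0 * covered / total, depth_sum / total


if __name__ == "__main__":
    # Made-up intervals: 20 uncovered positions, 80 positions at depth 5.
    lines = ["contig_1\t0\t20\t0", "contig_1\t20\t100\t5"]
    breadth, depth = breadth_and_mean_depth(lines)
    print(breadth, depth)
```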

Other considerations when choosing LR technologies

Currently, PacBio LR shotgun metagenomics costs more per Gbp than SR (~2.4 times more for our project; a further breakdown of costs is available in Table S1). The cost per Gbp of Nanopore is currently between those of Illumina and PacBio. Nanopore technologies offer affordability and the benefit of recovering longer reads, as well as the possibility of including short-read technologies for better recovery of high-quality MAGs [4, 10, 50]. Nonetheless, sequencing error and read lengths should be considered when selecting between LR technologies [4]. Despite the cost differences, the results presented here can guide researchers in deciding whether LR metagenomics would be beneficial over SR approaches.

The current state of algorithms and approaches for LR metagenomics is still limited compared to the large toolbox available for SR technologies. While the methodology used here reflects the most appropriate tools and algorithms available at the time, we recommend that future studies critically assess newer approaches [12] when using LR techniques. The dataset presented here can also serve as a reference for testing and comparing algorithms and approaches for shotgun LR metagenomic sequencing.

Conclusions

Our results highlight that switching from SR to LR metagenomic sequencing for microbial community analyses would still capture a similar taxonomic composition from population genomes while recovering higher-quality MAGs. Nonetheless, SR technologies offered more sequenced bases (about three times more base pairs on average) than LR sequencing on single runs. This higher sequencing effort also translated into a higher number of dereplicated MAGs compared to LR metagenomic samples (i.e., a higher diversity of population genomes). This observation is relevant when the goal is to recover low-abundance organisms. Our work indicates strongly decreased genome fragmentation and increased recovery of 16S rRNA genes in LR MAGs. These two features translate into better preservation of the order of genes in unassembled LR or LR-derived contigs and enable, for instance, the generation of 16S rRNA probes for fluorescence in situ hybridization for single-cell identification and quantification. Even though a high fraction of overlapping reads was detected between technologies, differences in GC content likely resulted in slight differences in the recovery and abundance of some population genomes.