Background

The human microbiota are complex systems that contribute to a large part of human physiological activities and diseases. Knowing the genomic sequences of the microbiota content allows us to study its functions. However, microbial genome sequences are difficult to obtain. While a few microbes can survive isolation and be cultured in vitro for sequencing, the remaining microbial content remains as “microbial dark matter”. Alternatively, there have been attempts to use computational means to reconstruct the microbial genomes from a mixture of short-reads sequenced from them. However, such metagenome assembly faces the difficulties of having repetitive sequences of both intra- and inter-species origin, horizontal gene transfers, and mobilization events [1], complicated by uneven abundance of microbes in the sample.

Current algorithms such as IDBA-UD [2], MEGAHIT [3], and MetaSPAdes [4] make use of read depth and fragment insert size constraint to unravel the repetitive sequences and estimate microbial abundance. However, their reliability is affected by the low continuity of short-read assembly. Long-read sequencing has been used to attempt to mitigate these problems, e.g., Nicholls et al. [5] and Sevim et al. [6]. In particular, Moss et al. [7] optimized the long-read library preparation protocol of nanopore sequencing and produced more complete bacterial genomes. However, the application of long-read sequencing in practical application remains costly (the “Discussion” section).

Alternative sequencing platforms that provide long-range sequence information for metagenomics are available in the form of Illumina Truseq Synthetic Long Reads (SLR) and linked-reads. SLR arranges long DNA fragments into 384 well plates, which are further amplified and pooled sequenced with sufficiently deep sequencing depth (~ 50X per fragment), thus allowing long fragments to be assembled individually [8, 9]. Linked-reads are short-reads where reads from the same fragment are marked with the same barcode. The 10x linked-read microfluidic system assigns long DNA fragments into around 1 million partitions, where each fragment is sequenced with a shallow depth (0.1X–0.4X). A method for linked-read metagenome assembly, Athena-meta [10], bridges the gaps between contigs by local assembly on co-barcoded reads and outperformed the methods for short-reads and SLR in assembling human gut and environmental microbiome.

There are four key parameters in linked-read sequencing which may impact metagenome assembly [11] (Fig. 1): (i) CR, average depth of short-reads per fragment; (ii) CF: average physical depth of the genome by long DNA fragments; (iii) NF/P, number of fragments per partition; (iv) Fragment length distribution, which is specified using two parameters, namely, μFLaverage unweighted DNA fragment length and WμFLlength-weighted average of DNA fragment length. Several of these parameters are interdependent. For example, a greater amount of input DNA increases both CF and NF/P and decrease CR; and the absolute values of CF and CR are set by how much total read coverage (C) is generated because CR × CF = C. In a previous study, we investigated the effects of these parameters on human diploid assembly [11].

Fig. 1
figure 1

Parameters of linked-read sequencing to be investigated

The present study evaluates these parameters with respect to their impact on metagenome assembly. We used three sets of linked-reads, one from simulation, one from a mock community, and another from a real human gut microbiome sample. The simulated data consists of twenty datasets (Table S1) generated by an improved LRTK-SIM [11] that enables to deal with microbial samples with uneven abundance for this study (the “Methods” section). The mock community (ATCC MSA-1003) is a pool of 20 strains with staggered abundance, while the human gut microbiome is from a healthy Chinese stool sample. Because of an absence of ground truth to evaluate human gut microbiome, we annotated contig bins as draft genomes and assigned them to the corresponding taxonomic classification (the “Methods” section).

Our results show that deeper C resulted in more assembled sequences and enabled better genomic coverage, but it was irrelevant to the assembly quality. C was not a dominating factor for contig continuity, which could be influenced more by genome characteristics. We further found CR to affect the number of draft genomes and that CF was associated with assembly quality. The μFL had marginal effect on assemblies, and lower NF/P values would lead to better assemblies by reducing the ambiguities of off-target reads. Compared to Illumina short-reads, 10x linked-reads significantly improved the metagenome assembly in both contig continuity and genome completeness.

Results

Three sets of linked-reads are used. The first is simulated from the MBARC-26 [6] community (Table S1 and S2), and the twenty simulated datasets are annotated as \( {C}_F^{-} \), \( {C}_R^{-} \),\( {\mu}_{FL}^{-} \), and \( {N}_{F/P}^{-} \)(where superscript “-” represents the actual values of corresponding parameters, Table S1). The second and third sets are sequenced from a mock community of 20 strains (one lane reads from Illumina XTen, 108.7 GB, Table S3) and a human gut microbiome (two lane reads from Illumina XTen, 208.97 GB; the “Methods” section and Supplementary Note) followed by reads subsampling to match the expected parameter values. The microbial complexity in the human gut microbiome was evaluated by aligning linked-reads to the reference sequences from human microbiome project [12] (Supplementary Note). To obtain the datasets of different CR and CF, we subsampled short-reads (MSCR-) and long DNA fragments (MSCF-) of the mock community (the “Methods” section), where value of subscript “-” represents the reciprocal of sequenced lanes—for example, MSCR4/MSCF4 means quarter lane reads were subsampled. Since the composition of the human gut microbiome is unknown, SCR- and SCF- (where subscript “-” represents the reciprocal of sequenced lanes) were generated by subsampling short-reads and barcodes instead. To avoid confusion, we used MSC1 and SCall to denote total one lane and two lanes linked-reads from the mock community and human gut microbiome, respectively.

According to microbial relative abundance, the microbes were classified into low- (Lsim), medium- (Msim), and high-abundance (Hsim) in the simulated data (Table S2); and classified into low- (Lmock), medium- (Mmock), high- (Hmock), and ultrahigh-abundance (UHmock) in the mock community (Table S3). The contigs from the simulation and mock community were evaluated using two reference-based metrics (total aligned length and genomic coverage) and two measures for contig continuity (contig NG50 and NGA50). For human gut microbiome data, we annotated the contig bins as draft genomes and classified them into high-, medium-, and low-quality [13] (the “Methods” section). The number and quality of annotated draft genomes and contig N50 were used to evaluate the assemblies.

The influence of total read depth C

C has little effect on both total aligned length and genomic coverage for Lsim and Hsim microbes in the simulated data. For Msim microbes, their abundance correlates positively with total aligned length and genomic coverage, indicating that a low abundance could reduce assembly completeness even when C is high (Fig. 2a, b, e, and f).

Fig. 2
figure 2

Trends of total assembly length, genomic coverage, contig NG50, and NGA50 by subsampling CR (ad), CF (eh), μFL (il), and NF/P (mp) in simulated data. Microbes in cyan, blue, and red represent Lsim, Msim, and Hsim species, respectively

Similarly, we fail to observe any clear trend between NG50 (or NGA50) and C. Two microbes with the deepest C, NC_014212 and NC_017095 (with the highest abundance), were assembled into fragmented contigs (Fig. 2c and g), suggesting that C was not a dominating factor for contig continuity; which is also seen in \( {C}_F^{333} \) and \( {C}_R^{0.77} \), which have deeper C (C = 120X) than the other configurations. They achieved the largest total aligned length and genomic coverage, but their contig NG50 and NGA50 fluctuated and were not always the best. For example, NC_019904 and NC_002737, which have the lowest abundance among Msim microbes, yielded the largest total aligned length in \( {C}_F^{333} \) (NC_019904, 1.29 Mb; NC_002737, 1.42 Mb; Fig. 2a) and \( {C}_R^{0.77} \) (NC_019904, 1.77 Mb, NC_002737, 1.49 Mb; Fig. 2e). \( {C}_R^{0.77} \) assembled fragmented contigs for both of the microbes on NG50 (NC_019904 < 500 bp, NC_002737 = 64.14 kb; Fig. 2g) and NGA50 (NC_019904 < 500 bp, NC_002737 = 62.47 kb; Fig. 2h). Although \( {C}_F^{333} \) produced better NG50 on NC_002737 (NG50, 2.04 Mb, Fig. 2c), misassemblies were dispersed in its contigs (NGA50, 109.43 kb; Fig. 2d).

The results for the mock community are consistent with those from the simulated data. The total aligned length and genomic coverage were fairly stable for Lmock, Hmock, and UHmock microbes regardless of the value of C (Fig. 3a, b, e, and f). For Mmock microbes, the contigs from MSCF8/MSCR8 covered the reference genomes poorly due to the insufficient read depth. A quarter lane reads (MSCF4/MSCR4) appeared to suffice for the read depth, achieving around full genomic coverage for all Mmock microbes, except for ATCC_33323, which required a quarter lane reads more. No consistent trend could be observed for NG50 and NGA50; a quarter lane reads was necessary to generate contigs with non-zero NGA50 for Mmock microbes.

Fig. 3
figure 3

Trends of total assembly length, genomic coverage, contig NG50, and NGA50 by subsampling CR (ad) and CF (e–h) for the mock community. Microbes in cyan, blue, red, and black represent Lmock, Mmock and Hmock, and UHmock species, respectively

In the results with human gut microbiome, deeper C extends the assembly length but has no impact on the assembly quality. After binning contigs and classifying the bins into draft genomes (the “Methods” section), SCall produced the largest number of bins (148) and the longest assembly length (399.41 Mb). These statistics were reduced along subsampling reads progressively (Table 1). The proportions of bins annotated as draft genomes were reduced by increasing C (SCR8 77.78%, SCR4 69.39%, SCR2 63.75%, SCR1 65.38%, SCall 54.05%; SCF8 68.75%, SCF4 60.94%, SCF2 59.55%, SCF1 58.16%, SCall 54.05%). C negatively correlates with bin average contamination (SCall 14.40%, SCR2 10.46%, SCR4 9.08%, SCR8 8.94%; Table S4 and Figure S1). We annotated the draft genomes as genus or species (> 60% confidence) based on their k-mer similarities with known microbial genomes (the “Methods” section). Most of the taxonomical classifications were observed by at least two parameter configurations, although some were unique to only one (Figure S2). Considering the qualities of annotated draft genomes, C demonstrated a positive correlation with the number of medium- and low-quality bins (Table 1); SCR4 has the most high-quality bins and the largest average bin completeness (73.3%) compared to the other configurations (Table 1 and Table S4). The N50s of high-quality bins are significantly greater than medium- (p value = 0.01) and low-quality (p value = 5.3E−9) bins, suggesting that bin quality (determined by completeness and contamination) is highly correlated with contig continuities (Fig. 4a–c). Interestingly, high-quality bins required read coverage of at least 50X (SCall = 85.81X; SCF1 = 132.71X; SCF2 = 111.11X; SCF4 = 96.43X; SCF8 = 64.67X; SCR1 = 75.66X; SCR2 = 151.55X; SCR4 = 63.42X; SCR8 = 53.08X), suggesting that the low abundance microbes were not assembled into high-quality genomes. Nevertheless, the contigs with extremely high depth may come from repetitive sequences and reduce the qualities of bins they belong to (Fig. 4d–f; C (high) = 81.4X; C (medium) = 140.1X; C (low) = 1636.5X).

Table 1 Summary of the assemblies for subsampled linked-reads from human gut microbiome and Illumina short-reads
Fig. 4
figure 4

Contig N50 and read depth for high-, medium-, and low-quality bins

The tradeoffs between CR and CF

There are tradeoffs between CR and CF in maintaining the same C. Because the product of PCR amplification per partition can generate around 500 Mb short-reads, loading DNA with greater density (deeper CF) results in more fragments per partition and fewer reads sequenced for each fragment (shallower CR). For Msim and Hsim species in the simulated data, we found that increasing CR is more effective than increasing CF when C is around 10X, and they are comparably effective when C is beyond 30X (Fig. 2a, b, e, and f). As a rule, deep CR is more pressing to reconstruct DNA fragment if C is low. For the examples of \( {C}_F^{28} \) and \( {C}_R^{0.064} \) (C = 10x), \( {C}_F^{28} \) (CR = 0.36X) was significantly better than \( {C}_R^{0.064} \) (CR = 0.064X) in total aligned length (\( {C}_F^{28}:{C}_R^{0.064} \) = 2.17 Mb: < 500 bp) and genomic coverage (\( {C}_F^{28}:{C}_R^{0.064} \) = 62.93%: < 1%) for NC_012982. \( {C}_F^{28} \) generated more continuous contigs than \( {C}_R^{0.064} \) for the five Hsim species, NC_018068, NC_014364, NC_017033, NC021184, and NC_019792, (Fig. 2 c, d, g and h). In the mock community, MSCF- and MSCR- produced comparable assemblies when C was kept constant.

In human gut microbiome, SCF- generated more assembled sequences than SCR-(SCF1:SCR1 = 305.24 Mb:290.49 MB; SCF2:SCR2 = 244.55 Mb:225.73 Mb; SCF4:SCR4 = 188.90 Mb:159.40 Mb; SCF8:SCR8 = 152.65 Mb:115.96 Mb, Table 1), but had higher average bin contamination (SCF1:SCR1 = 14.04%:12.10%; SCF2:SCR2 = 14.80%:10.46%; SCF4:SCR4 = 12.66%:9.08%; SCF8:SCR8 = 12.17%:8.95%) and worse contig N50 (SCF1:SCR1 = 137.66 kb:168.67 kb; SCF2:SCR2 = 127.58 kb:151.49 kb; SCF4:SCR4 = 136.40 kb:181.46 kb; SCF8:SCR8 = 115.29 kb:118.0 kb). These observations suggest that deeper CR would result in more assembled sequences, while deeper CF would help in improving assembly quality.

DNA fragment length and metagenome assembly

DNA long fragment information is critical for linked-read assembly, as it can help in spanning the gaps between contig breaks that are due to genome variations and repetitive sequences. On the other hand, it may lead to the loss of barcode specificity in disentangling short tandem repeats if the fragments are exceedingly long.

In practice, it is difficult to extract very long DNA fragments from metagenomic sample; even on the gentlest DNA extractions, the mean fragment length (μFL) is usually at most 10 to 20 kb. Our simulated data of μFL from 5 to 100 kb showed that the assembly was not sensitive to μFL. In some special cases, extremely long DNA fragments could improve the assemblies of Msim microbes with high repeat rates. For example, \( {\mu}_{FL}^{100} \) (μFL = 100 kb) improved the contigs NG50 (3.17 Mb, Fig. 2k) and NGA50 (2.71 Mb, Fig. 2l) of NC_018014, which was the one with the highest repeat rate (18.3%).

Barcode specificity is important in microbial deconvolution

For human genome sequencing, each partition contains ten fragments (NF/P = 10) on average [14]. NF/P is supposed to be larger (NF/P = 40) for metagenomic sequencing due to the limited fragment size (WμFL = 11.15 kb) and relatively small microbial genome size (Table S5). Large NF/P also increases the difficulties in recognizing the fragments that short-reads belong to. The assembly on NF/P1, the smallest NF/P (NF/P = 10) in simulation, had much better NG50 and NGA50 for most of the Hsim and Msim microbes (14 out of 18, the remaining 4 microbes are comparable, Fig. 2 o and p). Small NF/P also failed to assemble Lsim microbes (Fig. 2 m and n).

Assembly on Illumina short-reads and PacBio CCS long-reads

Illumina short-read sequencing is a mainstream technology for metagenomic sequencing, but its quality for metagenome assembly is unsatisfactory due to the lack of long-range connectivity. We downloaded the short-read data of the mock community from the Sequence Read Archive [15] (the “Methods” section) and performed an assembly. The assembly on linked-reads (total aligned length 52.04 Mb; genomic coverage 77.20%) is much better than that on short-reads (total aligned length 38.13 Mb; genomic coverage 56.69%, NG50 and NGA50, see Figure S3). For human gut microbiome, the assembly from 8.8 Gb short-reads showed a comparable number of bins (53) and total assembly length (145.50 Mb vs. 159.40 Mb for SCR4, Table 1). However, the short-read assembly generated no bins with high-quality because it had known issue to detect rRNAs and tRNAs [16, 17] (Table 1). SCF8, with the worst N50 in linked-read assembly, was also 4.49 times (115.29 kb vs. 25.69 kb) greater than Illumina short-reads. The average bin contamination rate of 17.39% for the assembly from short-reads was also much worse than linked-reads (Table S6).

In mock community, we further compared linked-reads to PacBio CCS, which have both extreme long (N50 = 9.08Kb) and highly accurate (> 99% base accuracy) reads. The total aligned length and genomic coverage were comparable between CCS reads (54.04 Mb and 78.68%) and MSC1 (52.02 Mb and 77.18%), but CCS reads improved the contig continuity substantially (Figure S4).

Comparison to human genome parameter statistics

10x linked-read sequencing was originally developed for human genome assembly, so we compared the parameter distributions between human genome and human gut microbiome. Because no reference genome was available for human gut microbiome, we collected the sequences of all the non-redundant high-quality bins from SCF- and SCR- datasets as “pseudo” reference genomes and reconstructed 15,994,284 long fragments (> 2 kb). For human gut microbiome, CR was comparable (CR 0.30X vs. 0.32X, Table S5), and CF was 6.26 times larger than human genome (NA24385, CF 595.85X vs. 95.20X, Table S5); also, the DNA fragments were obviously much shorter (μFL 7.91 kb vs. 28.06 kb; WμFL 11.15 kb vs. 44.53 kb, Table S5, Figure S5 and S6).

Discussion

Human microbiota provide rich information to understand microbial activities impacting human health and disease. Projects such as HMP (Human Microbiome Project) [12] and MetaHIT (Metagenomics of the Human Intestinal Tract) [18] have been proposed to collect microbiomes from diverse places of human body and aimed to understand their compositions and functions. De novo metagenome assembly on short-reads is commonly used to assemble microbial genomes from a mixture of culture-free microbes. Although it has been widely applied to assemble thousands of bacterial genomes [19, 20], there are four difficulties that remain: (1) assembly for low-abundance microbes; (2) repetitive sequences assembly such as 16S, 23S rRNA; (3) assembly of regions with genetic variation; (4) strain level assembly based on haplotype phasing. Besides metaSPAdes used in the current study, IDBA-UD [2] and MEGAHIT [3] were also tested and achieved comparable results with metaSPAdes. They all showed much worse assembly than linked-reads (Table S6).

Long-read sequencing has the potential to assemble more complete genomes and is believed to dominate the field in the future. However, linked-reads are still worth to be considered as a transitional technology. First, both PacBio and Oxford Nanopore are several times more costly than 10x linked-reads (especially for library preparation). Second, high base error rate of long-reads lacks strength for haplotype phasing and strain level assembly. Third, clinical samples benefit from the small amount of input DNA required by linked-read sequencing. A previous study also observed some high-quality bins generated by linked-reads missed in long-reads assembly [7].

In this study, we comprehensively investigated the four parameters of linked-read sequencing on metagenome assembly, which could be fine-tuned in either library preparation or short-read sequencing. Read depth C and microbial abundance are the two most important parameters to determine genome coverage and the number of bins annotated as draft genomes. Low-abundance microbes were almost impossible to be assembled by any of the technologies; the assemblies of medium-abundance microbes were substantially improved by deep C, and they were fairly stable for high-abundance ones.

According to our observation, C should be chosen from 120X to 400X to optimize the assembly quality. There is a tradeoff between CF and CR, where deep CF can generate more high-quality bins and CR controls total assembly length. Large μFL enables DNA fragments spanning distant contigs, but it is unnecessary to produce extremely long fragments for microbial genomes. The repetitive sequences spread in microbial genomes are usually short (e.g.,16S: ~ 1.5 kb, 23S: ~ 2.9 kb), which could be resolved by assembling the co-barcoded reads with small NF/P.

Athena-meta includes four steps: (1) generate “seed” contigs using short-reads without barcodes; (2) link contigs into scaffold graph using aligned paired-end reads; (3) local assembly by recruiting co-barcoded reads that spanning both “seed” contigs; (4) pool and assemble the locally assembled sequences and “seed” contigs. We can link and interpret our observations with the corresponding strategies in Athena-meta. C is critical to construct “seed” contigs, as high-quality seed contigs are the prerequisite for local assembly using co-barcoded reads. CR and CF impact reconstruction of long DNA fragments, and the probability of two distant contigs spanned by the same fragment, respectively. Small NF/P can reduce off-target reads and make local-assembly more efficiently. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient CR but a smaller amount of input DNA.

Methods

Simulate linked-reads for microbes with uneven abundance

LRTK-SIM [11] was initially built for human diploid assembly by simulating 10x linked-reads. In this study, we extended it to allow genomes with uneven depth to reflect different microbial abundance (Figure S7). We downloaded the reference genomes (denoted as M) of 23 bacterial and 3 archaeal strains from MBARC-26 [21] and categorized them into Lsim (Molarity < 10−15), Msim (10−15 < Molarity < 10−14) and Hsim (Molarity > 10−14) (Table S1). The molarity was normalized to sum to 1 as microbial relative abundance (\( {\sum}_{i=1}^{26}{A}_i=1 \)), and CF for microbe i (CFi) was calculated as CFi = CF × Ai × 26 (CF was predefined). The total fragment length for microbe i (Mi) was CFi× Li, where Li was genome size of Mi. The estimated input nucleotides were calculated as \( {\sum}_{i=1}^M{A}_i\times {L}_i\times 26 \). We simulated a wide range of CF (from 28X to 333X), CR (from 0.064X to 0.77X), μFL (from 5 to 100 kb), and NF/P (from 10 to 160) to investigate their impact on metagenome assembly (Table S1).

DNA extraction, library preparation, and sequencing

For mock community, DNA from ATCC 20 strain staggered mix genomic material (ATCC-MSA1003) was extracted without size-selection. For human gut microbiome from stool sample, we extracted the DNA using Qiagen QiAaMP Stool Mini Kit and removed the DNA fragments below 5 kb. After that, the molecular weight of isolated DNA was assayed by pulsed-field electrophoresis. For 10x Chromium library preparation, 1 ng of isolated high molecular weight DNA was denatured according to the manufacturer recommendations, added to the reaction master mix and mixed with gel bead and emulsification oil to generate droplets within a Chromium Genome chip. The rest part of library preparation was done following the manufacturer protocol (Chromium Genome v1, PN-120229).

The two libraries were sequenced by Illumina XTen with 2 × 150 bps paired-end reads, respectively. The DNA of human gut microbiome was also prepared for standard Illumina XTen short-read sequencing.

DNA long fragment reconstruction and linked-read subsampling

Long Ranger v2.2.1 [22] was used to correct barcode base errors, calculate PCR duplication rate, and perform barcode-aware linked-read alignment. BWA-MEM v0.7.17 [Metagenome assembly

For linked-read assembly, the linked-reads without barcodes were first assembled into seed contigs by metaSPAdes v3.11.1 [4] with default parameters and aligned to contigs by BWA-MEM v0.7.17. Athena-meta v1.3 was applied for local assembly by collecting co-barcoded reads shared by two “seed” contigs in scaffold graph (Figure S8). For mock community, the Illumina short-reads (SRR8359173) and PacBio CCS reads (SRR9202034 and SRR9328980) were assembled by metaSPAdes v3.12.0 and Canu v2.0 [24], respectively. The command lines were included in the Supplementary Note.

Assembly evaluation

We implemented a pipeline (Figure S9) to compare different metagenome assemblies by integrating off-the-shelf software and in-house scripts. First, MaxBin v2.2.4 [25] grouped contigs (longer than 1 kb) into bins, and their completeness and contaminations were assessed by CheckM v1.0.12 [26]. Quast v5.0.0 [27] calculated basic statistics such as contig N50, NG50, NGA50, total aligned length, and genomic coverage; Aragorn v1.2.38 [28] and Barrnap (https://github.com/tseemann/barrnap) were used to infer tRNA and rRNA (5S, 16S, and 23S), respectively; Kraken v0.10.6 [29] annotated taxonomic classification of bins based on its built-in database MiniKrakenDB. The bin abundance was calculated by \( \frac{\mathrm{size}\left(\mathrm{bin}\right)\mathrm{dp}\left(\mathrm{bin}\right)}{\mathrm{len}\left(\mathrm{read}\right)\mathrm{sum}\left(\mathrm{read}\right)} \). For each bin, size(bin) is its total nucleotides, dp(bin) denotes its read depth, len(read) is short-read length, and sum(read) is total number of aligned short-reads. Bins were recognized as draft genomes if they were classified as high-quality (completeness > 90%, contamination < 5%, presence of the 5S, 16S, 23S rRNAs, and at least 18 tRNA), medium-quality (completeness ≥ 50% and contamination < 10%), and low-quality (completeness < 50% and contamination < 10%). The command lines were included in the Supplementary Note.

Conclusion

In this study, we comprehensively investigated four parameters of linked-read sequencing on metagenome assembly and compared with Illumina short-reads and PacBio CCS reads. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient CR but a smaller amount of input DNA.