Abstract
Uncultivated Bacteria and Archaea account for the vast majority of species on Earth, but obtaining their genomes directly from the environment, using shotgun sequencing, has only become possible recently. To realize the hope of capturing Earth’s microbial genetic complement and to facilitate the investigation of the functional roles of specific lineages in a given ecosystem, technologies that accelerate the recovery of high-quality genomes are necessary. We present a series of analysis steps and data products for the extraction of high-quality metagenome-assembled genomes (MAGs) from microbiomes using the U.S. Department of Energy Systems Biology Knowledgebase (KBase) platform (http://www.kbase.us/). Overall, these steps take about a day to obtain extracted genomes when starting from smaller environmental shotgun read libraries, or up to about a week from larger libraries. In KBase, the process is end-to-end, allowing a user to go from the initial sequencing reads all the way through to MAGs, which can then be analyzed with other KBase capabilities such as phylogenetic placement, functional assignment, metabolic modeling, pangenome functional profiling, RNA-Seq and others. While portions of such capabilities are available individually from other resources, the combination of the intuitive usability, data interoperability and integration of tools in a freely available computational resource makes KBase a powerful platform for obtaining MAGs from microbiomes. While this workflow offers tools for each of the key steps in the genome extraction process, it also provides a scaffold that can be easily extended with additional MAG recovery and analysis tools, via the KBase software development kit (SDK).
Data availability
The analyses and data discussed are available via the ‘dynamic’ KBase Narratives https://narrative.kbase.us/narrative/33233 (Compost) and https://narrative.kbase.us/narrative/62384 (Moab Desert Crust). Additionally, ‘static’ HTML Narratives have been published on KBase (https://docs.kbase.us/getting-started/narrative/share#publishing-a-static-narrative) from each of these dynamic Narratives. They are available at https://kbase.us/n/33233/628/ (Compost, ref. 78, https://doi.org/10.25982/33233.606/1831502) and https://kbase.us/n/62384/334/ (Moab Desert Crust, ref. 79, https://doi.org/10.25982/62384.253/1831503). All input and derived data objects can be exported in standard formats from the Narratives by clicking on the given object and then on the download arrow in the data panel in the upper left of the dynamic Narrative, as described at https://docs.kbase.us/data/upload-download-guide/downloads.
Code availability
All KBase code is open source under the Massachusetts Institute of Technology license and available from GitHub at https://github.com/kbase and https://github.com/kbaseapps. All externally developed software run in KBase is also open source by policy and available from the respective repositories, typically GitHub, GitLab, Bitbucket or SourceForge (‘Code versions’ section).
Change history
30 November 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41596-022-00794-4
References
Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Tully, B. J., Graham, E. D. & Heidelberg, J. F. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5, 170203 (2018).
Stewart, R. D. et al. Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat. Commun. 9, 870 (2018).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography and lifestyle. Cell 176, 649–662 (2019).
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509, https://doi.org/10.1038/s41587-020-0718-6 (2021).
Gilbert, J. A., Jansson, J. K. & Knight, R. The Earth Microbiome project: successes and aspirations. BMC Biol 12, 69 (2014).
Saheb Kashaf, S., Almeida, A., Segre, J. A. & Finn, R. D. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat. Protoc. 16, 2520–2541 (2021).
Chong, J., Liu, P., Zhou, G. & Xia, J. Using MicrobiomeAnalyst for comprehensive statistical, functional, and meta-analysis of microbiome data. Nat. Protoc. 15, 799–821 (2020).
Arkin, A. P. et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat. Biotechnol. 36, 566–569 (2018).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 49, D10–D17 (2021).
Kluyver, T. et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. in Positioning and Power in Academic Publishing: Players, Agents and Agendas (eds Loizides, F. & Schmidt, B.) 87–90 (2016).
Banfield, J. Development of a Knowledgebase to Integrate, Analyze, Distribute, and Visualize Microbial Community Systems Biology Data. (2015). Report number: DOE-UCB-4918, OSTI ID: 1167269.
Chen, I.-M. A. et al. IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Res 47, D666–D677 (2019).
Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 44, W3–W10 (2016).
Devisetty, U. K., Kennedy, K., Sarando, P., Merchant, N. & Lyons, E. Bringing your tools to CyVerse discovery environment using Docker. F1000Res. 5, 1442 (2016).
Wang, L., Lu, Z., Van Buren, P. & Ware, D. SciApps: a bioinformatics workflow platform powered by XSEDE and CyVerse. in Proceedings of the Practice and Experience on Advanced Research Computing 1–5 (Association for Computing Machinery, 2018).
Eren, A. M. et al. Community-led, integrated, reproducible multi-omics with anvi’o. Nat. Microbiol. 6, 3–6 (2021).
Wattam, A. R. et al. Improvements to PATRIC, the all-bacterial bioinformatics database and analysis resource center. Nucleic Acids Res 45, D535–D542 (2017).
Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2020).
Wu, Y.-W. et al. Ionic liquids impact the bioenergy feedstock-degrading microbiome and transcription of enzymes relevant to polysaccharide hydrolysis. mSystems 1, e00120–16 (2016).
Rajeev, L. et al. Dynamic cyanobacterial response to hydration and dehydration in a desert biological soil crust. ISME J 7, 2178–2191 (2013).
Foster, I. Globus Online: accelerating and democratizing science through cloud-based services. IEEE Internet Comput 15, 70–73 (2011).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res 27, 824–834 (2017).
Zhang, H. et al. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res 46, W95–W101 (2018).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2019).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinforma 10, 421 (2009).
Nordberg, H. et al. The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acids Res 42, D26–D31 (2014).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).
Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).
Freitas, T. A. K., Li, P.-E., Scholz, M. B. & Chain, P. S. G. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Nucleic Acids Res 43, e69 (2015).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019).
Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 2014 (2019).
Youngblut, N. D. & Ley, R. E. Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets. PeerJ 9, e12198 (2021).
Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinform 12, 385 (2011).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol 22, 178 (2021).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Sieber, C. M. K. et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, 836–843 (2018).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015).
Delcher, A. L., Salzberg, S. L. & Phillippy, A. M. Using MUMmer to identify similar regions in large sequence sets. Curr. Protoc. Bioinform. Chapter 10, Unit 10.3 (2003).
Darling, A. C. E., Mau, B., Blattner, F. R. & Perna, N. T. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14, 1394–1403 (2004).
Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res 50, D785–D794 (2022).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Brettin, T. et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci. Rep. 5, 8365 (2015).
Overbeek, R. et al. The SEED and the rapid annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 42, D206–D214 (2014).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform 11, 119 (2010).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Rinke, C. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 6, 946–959 (2021).
Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res 46, D851–D860 (2018).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res 48, 8883–8900 (2020).
Galperin, M. Y., Makarova, K. S., Wolf, Y. I. & Koonin, E. V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 43, D261–D269 (2015).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res 47, D427–D432 (2019).
Haft, D. H. et al. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res 41, D387–D395 (2013).
Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).
Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M. & Henrissat, B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42, D490–D495 (2014).
Chivian, D., Dehal, P. S., Keller, K. & Arkin, A. P. MetaMicrobesOnline: phylogenomic analysis of microbial communities. Nucleic Acids Res 41, D648–D654 (2013).
Karaoz, U. & Brodie, E. L. microTrait: a toolset for a trait-based representation of microbial genomes. Front. Bioinform. https://doi.org/10.3389/fbinf.2022.918853 (2022).
Wood-Charlson, E. M. et al. The National Microbiome Data Collaborative: enabling microbiome science. Nat. Rev. Microbiol. 18, 313–314 (2020).
Hofmeyr, S. et al. Terabase-scale metagenome coassembly with MetaHipMer. Sci. Rep. 10, 10689 (2020).
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722–736 (2017).
Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937–944 (2019).
Chen, L.-X. et al. Accurate and complete genomes from metagenomes. Genome Res 30, 315–333 (2020).
Lui, L. M., Nielsen, T. N. & Arkin, A. P. A method for achieving complete microbial genomes and improving bins from metagenomics data. PLoS Comput Biol 17, e1008972 (2021).
Miller, C. S., Baker, B. J., Thomas, B. C., Singer, S. W. & Banfield, J. F. EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol 12, R44 (2011).
Chivian, D. et al. Genome extraction from shotgun metagenome sequence data. KBase n/33233/628 https://doi.org/10.25982/33233.606/1831502 (2022).
Chivian, D. et al. Moab desert crust – sample 4E. KBase n/62384/334 https://doi.org/10.25982/62384.253/1831503 (2022).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Matsen, F. A., Kodner, R. B. & Armbrust, E. V. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinform 11, 538 (2010).
Benson, D. A. et al. GenBank. Nucleic Acids Res 46, D41–D47 (2018).
Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
Teiling, C. BaseSpace: simplifying metagenomic analysis. in 26th European Congress of Clinical Microbiology and Infectious Diseases https://doi.org/10.26226/morressier.56d5ba2ed462b80296c9509d (2016).
Reich, M. et al. The GenePattern notebook environment. Cell Syst 5, 149–151.e1 (2017).
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Karp, P. D. et al. A comparison of microbial genome web portals. Front. Microbiol. 10, 208 (2019).
Yue, Y. et al. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinform 21, 334 (2020).
Nelson, W. C., Tully, B. J. & Mobberley, J. M. Biases in genome reconstruction from metagenomic data. PeerJ 8, e10119 (2020).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11, 2864–2868 (2017).
Li, L., Stoeckert, C. J. Jr & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178–2189 (2003).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–1797 (2004).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).
Kumari, S. et al. A KBase case study on genome-wide transcriptomics and plant primary metabolism in response to drought stress in sorghum. Curr. Plant Biol. 28, 100229 (2021).
Seaver, S. M. D. et al. The ModelSEED biochemistry database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbes. Nucleic Acids Res 49, D575–D588 (2021).
Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
Acknowledgements
The authors thank S. Singer for the use of the Compost sequence data and T. Northen for the Desert Crust sequence data. We thank U. Karaoz and E.L. Brodie for the use of the MicroTrait HMMs. We thank K. Wrighton, M. Shaffer and M. Borton for the use of their DRAM app and P. Chain, M. Flynn and C. Lo for the use of their GOTTCHA2 app. We thank D. Parks and G. Tyson for the use of CheckM and P.-A. Chaumeil, D. Parks, A. J. Mussig and P. Hugenholtz for the use of GTDB-Tk. KBase especially thanks all primary developers whose tools have been wrapped as apps in KBase; please make sure to cite their primary publications if you use any of those apps. KBase greatly appreciates funding by the Genomic Science program within the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award nos. DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725 and DE-AC02-98CH10886.
Author information
Contributions
D.C., P.S.D. and A.P.A. conceived the workflow. D.C., P.S.D., R.S.C., E.W.C. and S.P.J. designed the workflow. D.C., S.P.J., P.S.D., G.A.P., W.J.R., T.G., R.S.C., M.L., Q.Z., M.W.S. and R.S. wrote the KBase Genome Extraction and related apps and developed the KBase platform. D.C. built the Narratives. D.C. and M.C. wrote the Narrative tutorial. D.C., E.W.C., S.P.J. and A.P.A. wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Protocols thanks Ami S. Bhatt and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Key references using this protocol
Romero Victorica, M. et al. Sci. Rep. 10, 3864 (2020): https://doi.org/10.1038/s41598-020-60850-5
Buongiorno, J. et al. PLoS One 15, e0234839 (2020): https://doi.org/10.1371/journal.pone.0234839
Quoc, B. N. et al. Water Res. 198, 117119 (2021): https://doi.org/10.1016/j.watres.2021.117119
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chivian, D., Jungbluth, S.P., Dehal, P.S. et al. Metagenome-assembled genome extraction and analysis from microbiomes using KBase. Nat Protoc 18, 208–238 (2023). https://doi.org/10.1038/s41596-022-00747-x
DOI: https://doi.org/10.1038/s41596-022-00747-x