Unraveling the functional dark matter through global metagenomics

Georgios A. Pavlopoulos, Fotis A. Baltoumas, Sirui Liu, Oguz Selvitopi, Antonio Pedro Camargo, Stephen Nayfach, Ariful Azad, Simon Roux, Lee Call, Natalia N. Ivanova, I. Min Chen, David Paez-Espino, Evangelos Karatzas, Silvia G. Acinas, Nathan Ahlgren, Graeme Attwood, Petr Baldrian, Timothy Berry, Jennifer M. Bhatnagar, Devaki Bhaya, Kay D. Bidle, Jeffrey L. Blanchard, Eric S. Boyd, Jennifer L. Bowen, Jeff Bowman, Susan H. Brawley, Eoin L. Brodie, Andreas Brune, Donald A. Bryant, Alison Buchan, Hinsby Cadillo-Quiroz, Barbara J. Campbell, Ricardo Cavicchioli, Peter F. Chuckran, Maureen Coleman, Sean Crowe, Daniel R. Colman, Cameron R. Currie, Jeff Dangl, Nathalie Delherbe, Vincent J. Denef, Paul Dijkstra, Daniel D. Distel, Emiley Eloe-Fadrosh, Kirsten Fisher, Christopher Francis, Aaron Garoutte, Amelie Gaudin, Lena Gerwick, Filipa Godoy-Vitorino, Peter Guerra, Jiarong Guo, Mussie Y. Habteselassie, Steven J. Hallam, Roland Hatzenpichler, Ute Hentschel, Matthias Hess, Ann M. Hirsch, Laura A. Hug, Jenni Hultman, Dana E. Hunt, Marcel Huntemann, William P. Inskeep, Timothy Y. James, Janet Jansson, Eric R. Johnston, Marina Kalyuzhnaya, Charlene N. Kelly, Robert M. Kelly, Jonathan L. Klassen, Klaus Nüsslein, Joel E. Kostka, Steven Lindow, Erik Lilleskov, Mackenzie Lynes, Rachel Mackelprang, Francis M. Martin, Olivia U. Mason, R. Michael McKay, Katherine McMahon, David A. Mead, Monica Medina, Laura K. Meredith, Thomas Mock, William W. Mohn, Mary Ann Moran, Alison Murray, Josh D. Neufeld, Rebecca Neumann, Jeanette M. Norton, Laila P. Partida-Martinez, Nicole Pietrasiak, Dale Pelletier, T. B.K. Reddy, Brandi Kiel Reese, Nicholas J. Reichart, Rebecca Reiss, Mak A. Saito, Daniel P. Schachtman, Rekha Seshadri, Ashley Shade, David Sherman, Rachel Simister, Holly Simon, James Stegen, Ramunas Stepanauskas, Matthew Sullivan, Dawn Y. Sumner, Hanno Teeling, Kimberlee Thamatrakoln, Kathleen Treseder, Susannah Tringe, Parag Vaishampayan, David L. Valentine, Nicholas B. Waldo, Mark P. Waldrop, David A. Walsh, David M. Ward, Michael Wilkins, Thea Whitman, Jamie Woolet, Tanja Woyke, Ioannis Iliopoulos, Konstantinos Konstantinidis, James M. Tiedje, Jennifer Pett-Ridge, David Baker, Axel Visel, Christos A. Ouzounis, Sergey Ovchinnikov, Aydin Buluç, Nikos C. Kyrpides

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
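
The filtering and clustering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: connected components over a similarity graph (via union-find) stand in here for the massively parallel graph-based clustering, and the function name `cluster_unmatched` and its inputs (`proteins`, `reference_hits`, `similar_pairs`) are hypothetical. Only the two thresholds, longer than 35 amino acids and more than 100 members, come from the abstract.

```python
from collections import defaultdict

MIN_LENGTH = 35         # abstract: sequences longer than 35 amino acids
MIN_CLUSTER_SIZE = 100  # abstract: clusters with more than 100 members


def cluster_unmatched(proteins, reference_hits, similar_pairs,
                      min_length=MIN_LENGTH,
                      min_cluster_size=MIN_CLUSTER_SIZE):
    """Group proteins without reference hits into clusters.

    proteins:       dict mapping protein id -> amino-acid sequence (assumed input)
    reference_hits: set of ids with similarity to a reference genome or
                    known family (assumed to be computed elsewhere)
    similar_pairs:  iterable of (id, id) pairs judged similar (assumed input)

    Connected components are a simple stand-in for the clustering step;
    the real method is not specified in the abstract.
    """
    # Step 1: keep sequences that are long enough and have no reference hit.
    kept = {pid for pid, seq in proteins.items()
            if len(seq) > min_length and pid not in reference_hits}

    # Step 2: union-find over the similarity graph restricted to kept ids.
    parent = {pid: pid for pid in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        if a in kept and b in kept:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    # Step 3: collect components and keep those above the size threshold.
    clusters = defaultdict(set)
    for pid in kept:
        clusters[find(pid)].add(pid)
    return [members for members in clusters.values()
            if len(members) > min_cluster_size]
```

With the thresholds lowered for a toy input, for example `min_length=3` and `min_cluster_size=1`, the function returns only those groups of unmatched proteins that are linked by at least one similarity pair.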

Original language: English
Pages (from-to): 594-602
Number of pages: 9
Journal: Nature
Volume: 622
Issue number: 7983
State: Published - Oct 19 2023

Funding

We thank H. Maughan for reading the paper; and all of the colleagues who contributed to the many facets of metagenomics, from sample collection to sequencing and annotation that made this work possible. The list of the JGI Proposal Award DOIs is available in Supplementary Table 13. This work used resources of the National Energy Research Scientific Computing Center (NERSC), supported by the Office of Science of the US Department of Energy (DOE). Additional computations were performed with the use of the Greek Research and Technology Network (GRNET) Aris High Processing Computing (HPC) infrastructure (project code: PR009008-BOLOGNA). This work was supported in part by the US DOE Joint Genome Institute (DE-AC02–05CH11231, in part), a DOE Office of Science User Facility; the Applied Mathematics program of the DOE Office of Advanced Scientific Computing Research (DE-AC02–05CH11231, in part), Office of Science of the US DOE; Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US DOE Office of Science and the National Nuclear Security Administration; DOE grant DE-SC0022098. G.A.P., F.A.B. and E.K. were supported by Fondation Santé and the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the ‘First Call for H.F.R.I. Research Projects to support faculty members and researchers and the procurement of high-cost research equipment grant’ (grant ID HFRI-17-1855-BOLOGNA). G.A.P. also acknowledges the Marie Skłodowska-Curie Individual Fellowships (MSCA-IF-EF-CAR, grant ID 838018, H2020-MSCA-IF-2018) and ‘The Greek Research Infrastructure for Personalized Medicine (pMedGR)’ (MIS 5002802), which is implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program ‘Competitiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). C.A.O. and I.I. 
acknowledge support by the project Elixir-GR (MIS 5002780), implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program Competitiveness, Entrepreneurship and Innovation (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). S.O. and S.L. are supported by NIH grant DP5OD026389 and the Moore–Simons Project on the Origin of the Eukaryotic Cell, Simons Foundation 735929LPI (https://doi.org/10.46714/735929LPI). J.P.-R. was supported by the US DOE Genomic Sciences Program, award SCW1632; and work conducted at the LLNL was conducted under the auspices of the US DOE under Contract DE-AC52-07NA27344. Work from the consortium was supported by NSF grants OIA-1826734, DEB-1441717 and OCE-1232982; NSF 1921429; CONACyT grants A1-S-9889 and CB-2010-01-151007; US DOE, Office of Science, Office of Biological and Environmental Research (BER), Great Lakes Bioenergy Research Center (DOE BER DE-SC0018409 and DE-FC02-07ER64494); NSF grant OCE-082546, US DOE, Office of Science, Facilities Integrating Collaborations for User Science (FICUS) program, Office of Workforce Development for Teachers and Scientists, Office of Science Graduate Student Research (SCGSR) program; New Zealand Foundation for Research, Science and Technology grant CO1X0306 and National Science Foundation grant 1745341; NSF Division of Chemical, Bioengineering, Environmental and Transport Systems grants 1438092 and 1643486; NSF OCE-1559179, NSF OCE-1537951, NSF OCE-1459200, Gordon & Betty Moore Foundation Investigator Award 3789; the G. 
Unger Vetlesen and Ambrose Monell Foundations; the Natural Sciences and Engineering Research Council of Canada; Genome Canada and Genome British Columbia; the PR-INBRE BiRC program (NIH/NIGMS- award number P20 GM103475); Great Lakes Bioenergy Research Center, US DOE, Office of Science, Office of Biological and Environmental Research under award numbers DE-SC0018409 and DE-FC02-07ER64494; the Agriculture and Food Research Initiative, competitive grant 2009-447 35319-05186 from the US Department of Agriculture, National Institute of Food and Agriculture; Sol Leshin Foundation and the Shanbrom Family Fund; Towards Sustainability Foundation, Cornell Sigma Xi, NSERC PGS-D, NSF-BREAD (IOS-0965336), Cornell Biogeochemistry Program, Cornell Crop and Soil Science Department, USDA-NIFA Carbon Cycle (2014-6700322069) and the Cornell Atkinson Center for a Sustainable Future; Office of Science (BER), US DOE (DE-SC0014395); NSF grant OCE 0424602; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; Australian Research Council: DP150100244; NSF-OPP 1641019; NSF 1754756; NSF 1442231; NSF award OCE-173723; USDA National Institute of Food and Agriculture Foundational Program (award 2017-67019-26396); USDA NIFA award 2011-67019-30178; BER grant DE-SC0014395; National Science Foundation grant DEB-1927155; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; River Corridor Scientific Focus Area (SFA) project at Pacific Northwest National Laboratory (PNNL); grant NNX16AJ62G from NASA Exobiology; NASA Exobiology awards 80NSSC19K1633 and NNX17AK85G; NSF award DEB-1146149; US NSF (DEB 1912525); US DOE Office of Biological and Environmental Research (DE-SC0020382); NSF EAR-1820658; DE-FG02-94ER20137 from the Photosynthetic Systems Program, Division of Chemical Sciences, Geosciences and Biosciences (CSGB), Office of Basic Energy Sciences of the US DOE; Max 
Planck Society and the BioEnergy Science Center (BESC), a US DOE Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science; and US DOE, Office of Science, Biological and Environmental Research, as part of the Plant Microbe Interfaces Scientific Focus Area at Oak Ridge National Laboratory.

Funders: Funder number
BioEnergy Science Center
Cornell Biogeochemistry Program
Cornell Crop and Soil Science Department
Cornell Sigma Xi
DOE Office of Advanced Scientific Computing Research
DOE Office of Science user facility
Fondation Santé
G. Unger Vetlesen and Ambrose Monell Foundations
Greek Research and Technology Network: PR009008-BOLOGNA
MSCA-IF-EF-CAR: H2020-MSCA-IF-2018, MIS 5002802, ID 838018
NSF-BREAD: IOS-0965336
NSF-OPP: NSF 1442231, 1641019, 1754756, OCE-173723
Office of Basic Energy Sciences of the US DOE
Office of Biological and Environmental Research, Environmental System Science
Office of Science Graduate Student Research
DOE Office of Science: 17-SC-20-SC
Operational Program Competitiveness, Entrepreneurship and Innovation
Operational Program ‘Competitiveness, Entrepreneurship and Innovation’: NSRF 2014-2020
River Corridor Scientific Focus Area
SCGSR
Shanbrom Family Fund
Simons Foundation 735929LPI: SCW1632, NSF 1921429, OIA-1826734, DEB-1441717, DE-AC52-07NA27344, OCE-1232982
Sol Leshin Foundation
USDA National Institute of Food and Agriculture Foundational Program: 2017-67019-26396
National Science Foundation: 1745341
National Institutes of Health: DP5OD026389
U.S. Department of Energy
National Institute of General Medical Sciences: 2009-447 35319-05186, DE-SC0018409, P20 GM103475
National Aeronautics and Space Administration: DEB-1146149, DEB 1912525, NNX17AK85G, 80NSSC19K1633
Division of Chemical, Bioengineering, Environmental, and Transport Systems: 1643486, OCE-1559179, 1438092, OCE-1459200, OCE-1537951
U.S. Department of Agriculture
Gordon and Betty Moore Foundation
National Institute of Food and Agriculture: 2011-67019-30178, DEB-1927155, 2014-6700322069
Office of Science
National Nuclear Security Administration: DE-SC0022098
Biological and Environmental Research: DE-SC0020382, DE-FG02-94ER20137, EAR-1820658
Workforce Development for Teachers and Scientists
Oak Ridge National Laboratory
Genome Canada
H2020 Marie Skłodowska-Curie Actions
Pacific Northwest National Laboratory: NNX16AJ62G
Foundation for Research, Science and Technology: CO1X0306
David R. Atkinson Center for a Sustainable Future, Cornell University: OCE 0424602, DE-SC0014395
Chemical Sciences, Geosciences, and Biosciences Division
Great Lakes Bioenergy Research Center: OCE-082546, BER DE-SC0018409, DE-FC02-07ER64494
Joint Genome Institute: DE-AC02–05CH11231
Towards Sustainability Foundation
Natural Sciences and Engineering Research Council of Canada
Genome British Columbia
European Commission
Australian Research Council: DP150100244
Consejo Nacional de Ciencia y Tecnología: CB-2010-01-151007, A1-S-9889
Max-Planck-Gesellschaft
European Social Fund
National Forestry and Grassland Administration
European Regional Development Fund: MIS 5002780
Hellenic Foundation for Research and Innovation: ID HFRI-17-1855-BOLOGNA
