Unraveling the Functional Dark Matter through Global Metagenomics PDF
Document Details
2023
Georgios A. Pavlopoulos
Tags
Related
- Chapter 3: Bacterial Metagenomics PDF
- Quality Control in Metagenomics Data PDF
- Lecture 13: Fungi, Metagenomics and the Human Microbiome PDF Summer 2024
- Metagenomics: A Novel Tool For Livestock & Poultry Improvement (PDF)
- Introduction To Environmental DNA Applications of Metagenomics PDF
- Big Data Biology And Biological Databases (BT2053) PDF
Summary
This research article explores the functional diversity of microbial communities through global metagenomic analysis. It details an approach for generating reference-free protein families, analyzing 26,931 metagenomes, and identifies over 1.17 billion protein sequences. The study highlights the vast untapped functional space, also known as the functional dark matter.
Full Transcript
Article Unraveling the functional dark matter through global metagenomics https://doi.org/10.1038/s41586-023-06583-7 Georgios A. Pavlopoulos1,2,3 ✉, Fotis A. Baltoumas1, Sirui Liu4, Oguz Selvitopi5, Antonio Pedr...
Article Unraveling the functional dark matter through global metagenomics https://doi.org/10.1038/s41586-023-06583-7 Georgios A. Pavlopoulos1,2,3 ✉, Fotis A. Baltoumas1, Sirui Liu4, Oguz Selvitopi5, Antonio Pedro Camargo2, Stephen Nayfach2, Ariful Azad6, Simon Roux2, Lee Call2, Received: 18 March 2022 Natalia N. Ivanova2, I. Min Chen2, David Paez-Espino2, Evangelos Karatzas1, Accepted: 30 August 2023 Novel Metagenome Protein Families Consortium*, Ioannis Iliopoulos7, Konstantinos Konstantinidis8, James M. Tiedje9, Jennifer Pett-Ridge10, David Baker11,12,13, Published online: 11 October 2023 Axel Visel2, Christos A. Ouzounis2,14,15, Sergey Ovchinnikov4, Aydin Buluç5,16 & Open access Nikos C. Kyrpides2 ✉ Check for updates Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities1,2. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database3. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter. Metagenome shotgun sequencing has become the method of choice for contigs/scaffolds can provide invaluable insights into the presence of studying and classifying microorganisms from various biomes1. With previously undescribed organisms and their genetic makeup. Recent the latest advances in whole-genome sequencing technologies and technological advancements in assembly and binning tools5 have led to the constant improvements in quality and cost efficiency, large-scale a significant increase in the assembled fraction of the average metage- sequencing has become increasingly easier, faster and more affordable. nome, coupled with a parallel exponential increase in the number of This has led to a considerable increase in metagenomic sequencing metagenome-assembled genomes (MAGs). Data management and data over the past few years, therefore making them an indispensable comparative analysis systems supporting this type of data include resource for investigating the microbial dark matter2. Integrated Microbial Genomes & Microbiomes (IMG/M)6 and MGnify7. To elucidate the genetic composition of a metagenomic sample, two However, both approaches share the same major limitation with major approaches are typically used, each with distinct advantages and respect to gene functional annotation, which relies on predicting func- disadvantages. In the first, sequencing reads are accurately mapped to a tion by homology searching against reference protein databases, such known, annotated set of reference genome sequences to provide a quick as COG8, Pfam3 and KEGG Orthology9. As a result, any genes predicted overview of the presence of known organisms, genes and potential in assembled metagenomic data that do not map to reference protein functions. MG-RAST4 is one system that excels in this type of analysis. families are typically ignored and dropped from subsequent compara- In the second approach, massive de novo assembly of the reads into tive analysis. To eliminate this reliance on reference datasets and to 1 Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece. 2DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 3Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece. 4John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA. 5Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 6Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, USA. 7Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece. 8 School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA. 9Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA. 10Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA. 11Department of Biochemistry, University of Washington, Seattle, WA, USA. 12Institute for Protein Design, University of Washington, Seattle, WA, USA. 13Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA. 14Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece. 15Biological Computation & Computational Biology Group, Artificial Intelligence & Information Analysis Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece. 16Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA. *A list of authors and their affiliations appears at the end of the paper. ✉e-mail: [email protected]; [email protected] 594 | Nature | Vol 622 | 19 October 2023 a 71,312,220 p1 p2 p3 pn proteins p1 Reference >70% identity, >80% length HipMCL p2 genomes 94,672,003 102,491 proteins p3 pn 5,313,956,680 Protein clusters edges Similarity matrix 570,198,677 p1 p2 p3 pn proteins p1 Metagenomes HipMCL Metatranscriptomes 8,364,611,943 1,171,974,849 p2 26,931 proteins proteins Removing hits to p3 >70% identity isolates and Pfam >80% length 5,196,499,560 pn edges Protein clusters Similarity matrix b c 70,000 d 75,000 400,000 Metagenomes 65,000 Metagenomes 70,000 Metagenomes 57,377 Reference genomes 60,000 Reference genomes 65,000 Reference genomes 350,000 60,000 55,000 50,000 55,000 45,631 300,000 Number of clusters Number of clusters Number of clusters 45,000 50,000 250,000 45,000 33,970 40,000 34,383 40,000 200,000 35,000 35,000 30,000 21,113 150,000 30,000 25,000 16,614 25,000 100,000 20,000 20,000 10,236 10,248 9,672 15,000 8,459 15,000 8,515 6,640 5,295 5,604 5,219 50,000 4,343 10,000 10,000 3,249 2,084 1,101 2,230 2,001 1,271 1,451 1,255 1,173 5,000 5,000 817 534 623 383 266 171 592 354 135 0 41 18 10 0 2 0 2 0 10 100 15 150 20 200 25 250 30 300 40 00 50 00 60 600 70 700 80 800 1– 00 >1 0 00 51 50 1– 0 15 150 1– 0 30 300 40 400 1– 0 60 600 1– 0 0 1, –1 0 1, –1 0 1– 00 0 0 00 00 00 00 00 00 00 10 10 20 20 50 50 70 –70 80 80 1 0 1 0 20 4 5 90 1–9 ,0 1– 90 1–9 00 ,0 10 ,1 ,0 ,0 ,0 ,0 ,0 ,0 1– 1– 0– 1– 1– 1– 1– 1– 1– 1– 1, – 1– 1– 1– 1, 1 10 20 30 40 50 60 Number of samples Number of sequences Number of samples Fig. 1 | Sequence clustering overview. a, Clustering proteins from the visualization and comparison of cluster components per cluster for the reference genome (blue) and ED (red) datasets. b, Rarefaction curves of protein number of sequences (c) and the number of genome or ED samples (d). clusters for reference genome (blue) and ED (red) datasets. c,d, Bar chart estimate the breadth of unexplored functional diversity, referred to as any protein sequence with hits to Pfam8 or to any sequences from the the functional dark matter, an all-versus-all metagenomic comparison reference genome set. The final non-redundant catalogue representing is required. Such a task requires considerable computational resources, the unexplored metagenomic protein space consisted of 1,171,974,849 yet reaching such levels of scalability remains technically challenging. protein sequences (14% of the total). Although some excellent efforts to address this issue have been recently reported10–12, metagenomes have not yet been comprehensively sur- veyed to uncover the functional dark matter. Novel protein families Here we present a scalable computational approach for identifying We next clustered the 1.1 billion ED proteins using a graph-based and characterizing functional dark matter found in metagenomes. First, approach. For comparative purposes, we followed the same approach we identified the novel protein space present in 26,931 metagenomic for the 94 million proteins from reference genomes. First, an datasets from IMG/M, after removing all genes with matches to the all-versus-all similarity matrix was built for each of the two gene cata- IMG database of over 100,000 reference genomes or Pfam. We next logues (that is, proteins from reference genomes and those from the clustered the remaining sequences into protein families and explored ED) by calculating all significant pairwise sequence similarities. Each of their taxonomic and biome distributions and, where possible, predicted the two graphs was then analysed to identify sequence-similarity-based their tertiary (three-dimensional (3D)) structures. protein clusters. For this purpose, we used HipMCL13, a massively paral- lel implementation of the original MCL algorithm14 that was previously developed for this scale of data and that can run on distributed-memory The novel protein sequence space computers. The whole process from data retrieval to cluster genera- We initially collected all protein sequences (longer than 35 amino acid tion is shown in Fig. 1a. residues) from all public reference genomes and assembled metagen- Although most clusters with at least 50 members (and possibly even omes and metatranscriptomes hosted in the IMG/M platform6. In total, those with at least 25 members) probably represent potentially func- we extracted all protein sequences from 89,412 bacterial, 9,202 viral, tionally important clusters, we restricted the subsequent analysis to 3,073 archaeal and 804 eukaryal genomes, resulting in a final dataset of the larger families with at least 100 members to focus on higher-quality 94,672,003 sequences. The reference genomes included in this study data as well as better candidates for predicting structures (Table 1). In consisted solely of isolate genomes, not MAGs or single-amplified total, we identified 106,198 families with at least 100 members that will genomes. Similarly, for unbinned metagenomes, we extracted all be referred to as novel metagenome protein families (NMPFs) (Table 1 predicted protein sequences from scaffolds of at least 500 bp and (right column)). For comparison, we identified 92,909 protein clusters with lengths of at least 35 amino acids from 26,931 datasets (20,759 in the corresponding set of protein clusters with at least 100 members metagenomes and 6,172 metatranscriptomes), hereafter referred to as from reference genomes. By directly comparing the two clustered sets the environmental dataset (ED). This resulted in a non-redundant set of (reference versus ED protein clusters), we observed an increase in the 8,364,611,943 predicted proteins or protein fragments. To identify the ED protein clusters by greater than 14-fold for clusters with at least 3 functional dark-matter component of this dataset, we first discarded members, greater than 3-fold for clusters with at least 25 members, Nature | Vol 622 | 19 October 2023 | 595 Article Table 1 | HipMCL clustering of proteins from reference genomes and metagenomes and their corresponding clusters of different protein family sizes (cumulative) Environmental dataset Proteins for clustering 570,198,677 Cluster size ≥3 members ≥25 members ≥50 members ≥75 members ≥100 members Clusters (NMPFs) 64,149,288 1,501,861 428,910 200,075 106,198 Datasets 24,477 23,208 21,447 20,274 19,326 Scaffolds 349,547,957 71,910,494 39,593,021 27,041,114 17,280,119 Proteins 400,241,252 77,892,505 42,280,078 28,621,670 19,986,348 Percentage of proteins 70.19 13.66 7.41 5.02 3.5 Reference genomes Proteins for clustering 71,312,320 Cluster size ≥3 members ≥25 members ≥50 members ≥75 members ≥100 members Clusters 4,572,038 415,591 197,965 128,324 92,909 Datasets 80,896 77,611 76,294 75,145 74,134 Proteins 64,427,269 38,509,539 31,100,303 26,906,094 23,860,313 Percentage of proteins 90.34 54.00 43.61 37.73 33.46 around a 2-fold increase for clusters with at least 50 and 75 members classification. For the same reason, we observed a notable overlap as well as an increase for clusters with at least 100 members (Table 1). between plants and freshwater NMPFs as well as between soil, fresh- Although the metagenome sequence space is intrinsically more water and plant NMPFs. Conversely, only 37% of freshwater and 46% of fragmented compared with the reference genomes (Supplementary marine NMPFs were shared with each other. Even fewer protein families Methods and Supplementary Fig. 1), and a higher percentage of genes were shared between ecosystem types such as human, non-human would be erroneous or incomplete (which is also one of the reasons mammals and host-associated. On the other hand, a rather substantial we decided to focus all further analysis on the larger clusters), these overlap in NMPFs between human and engineered environments was results also suggest that much of protein sequence space remains to observed (Fig. 2). This is not surprising, considering that engineered be explored. This is also supported by rarefaction curves generated environments largely contain samples from human-waste-related eco- from the ≥100-member clusters (Fig. 1b). These curves show that, as systems (such as solid waste and wastewater). Similarly, an overlap more samples become available, the cluster number increases linearly exists between freshwater and engineered environments, as well as for reference genomes but exponentially, without reaching a plateau, between freshwater and host-associated types (human, non-human for metagenomes. mammals and other host-associated), as shown in Extended Data Fig. 1. These overlaps could be indicative of phenomena such as faecal con- tamination of freshwater environments, leading to the co-occurrence of Biome distribution the same NMPFs—and, therefore, the same microbial communities—in To determine the biome distribution landscape of the NMPFs, the corre- different ecosystem types. sponding metadata were collected for each sample from IMG/M6 using The percentage of ecosystem-specific NMPFs varied significantly the GOLD database15 ecosystem classification scheme16 (Supplementary across each of the eight ecosystem types, with the highest percent- Table 1). The biome distribution of the NMPFs is shown in Fig. 2a,b and age observed for host-associated (non-human mammals) (85.6%) and Extended Data Fig. 1. Here the three main GOLD ecosystems (environ- host-associated (other) samples (79.2%), followed by marine (48.4%) mental, host-associated and engineered) are further divided into eight and then soil (14.2%) samples (Fig. 2c). This is explained by the unique more specific ecosystem types: freshwater, marine, soil, plants, human, characteristics of the environments contained in these ecosystem non-human mammals, other host-associated and engineered. Examin- categories, for example, oceanic environments of marine samples, ing the network topology, we observed minimum gene sharing within and even more so in the case of the host-associated category, which each NMPF across the three broad ecosystems, in accordance with contains a diverse array of microbiome hosts with significant biologi- recent observations of protein families from 13,174 metagenomes17, cal differences (for example, arthropods and annelids) (Extended Data with the exception of soil/plant associations (see below). However, Fig. 4). In contrast to marine samples, freshwater samples had a very 7,692 NMPFs (7%) were found to have members across all of the eight small percentage of ecosystem-type-specific families, mostly due to ecosystem types. The properties of the top NMPFs distributed across a large number of wetland and sediment samples with strong associa- all ecosystem types are shown in Supplementary Table 2, while the tions with soil, as did the plant/rhizosphere-related samples with soil properties of the top NMPFs of each distinct ecosystem type are shown samples (Fig. 2a). in Supplementary Tables 3 and 4. In addition to the analysis presented Finally, to investigate the ecosystem distribution of NMPFs, the above, each ecosystem was further divided into subcategories for finer ecosystem prevalence of the most-abundant NMPFs of each eco- analysis (Extended Data Figs. 2–5 and Supplementary Fig. 2). system type was evaluated. The prevalence of each NMPF in an The largest number of NMPFs was shared between soil and plant ecosystem (for example, freshwater) was calculated as the number environments (62% of the soil and 96% of plant-associated families), as of family ecosystem-associated datasets over the total number of would be expected due to the strong overlap of the sampling in these ecosystem-associated datasets in the study (Supplementary Table 3). ecosystems (that is, most of the plant samples are from the rhizosphere) Despite the existence of NMPFs strongly associated with a particu- (Fig. 2a and Extended Data Fig. 3). This was followed by NMPFs shared lar ecosystem type (for example, >80% of NMPFs), their prevalence between soil and freshwater, which could be primarily due to the assign- in the overall datasets associated with said ecosystems was rather ment of wetland and sediment samples under the freshwater ecosystem low, with most NMPFs distributed across 5–20% of the samples 596 | Nature | Vol 622 | 19 October 2023 a b c 40,000 47,410 Freshwater 3,052 35,000 Environmental Marine 37,919 Freshwater 30,000 Marine 18,372 25,000 Intersection size Non-human 64,327 mammals Soil 20,000 9,121 Plants Engineered 15,000 41,480 Soil Plants Host-associated 1,386 (others) Human 10,000 1,753 Host-associated Human 5,000 184 0 554 Non-human Non-human mammals 474 mammals Human 4,624 Engineered Host-associated (others) 3,661 Host-associated (others) Marine 2,052 Engineered Total Plants 219 Ecosystem specific Freshwater 00 00 0 00 00 00 00 ,0 ,0 ,0 ,0 ,0 ,0 Soil 50 60 10 20 30 40 NMPFs Fig. 2 | Ecosystem analysis of NMPFs. a, UpSet plot representation of b, Network representation of the protein clusters and their ecosystems. protein clusters overlapping across the eight ecosystem types. The various Eight ecosystem types were applied according to the GOLD ecosystem intersections among different categories are represented by the chart at the classification, represented by central, coloured nodes (hubs), whereas the bottom, with each category shown as a dot and intersecting categories grey peripheral nodes represent the protein clusters. The edges represent the connected by straight lines. The sizes of the intersection sets are represented protein cluster–ecosystem associations. c, The distribution of total versus by the vertical bar chart. Intersection sets of 15 NMPFs or higher are shown. ecosystem-type-specific NMPFs across the eight different ecosystem types. associated with each ecosystem type. The only exception was the Subsequently, we evaluated whether any of the NMPF proteins non-human-mammal-associated NMPFs, for which prevalence reached (and their corresponding families) were found in any of the recently up to around 45% of the total non-human mammalian datasets. identified MAGs from the Genomes from Earth’s Microbiomes (GEM) catalogue20. Specifically, we examined whether any of the scaffolds containing genes of the NMPFs were binned in any of the 52,515 MAGs Taxonomic distribution of the GEM catalogue. This revealed that only 17,953 genes, coming Taxonomic assignment of NMPFs was performed on the basis of the from 7,937 NMPFs (7.4% of total) (Fig. 3b,c), were found within the available taxonomy information of the corresponding scaffolds in IMG, GEM catalogue, of which the vast majority (93%) was from uncultured for each member of the clusters18. In cases in which no such annota- species. For those families that were present in two or more MAGs, tion was available, we used a combination of additional approaches we noticed a strong narrow taxonomic distribution, with more than to computationally infer the taxonomy of the scaffolds (Methods). Of two-thirds being restricted to a single species or genus, and only a the total 17,280,119 IMG/M scaffolds containing the NMPF members, very small number found across multiple families, classes or phyla 8,049,154 were classified as Bacteria, 382,761 as Archaea, 1,184,393 as (Fig. 3d). NMPFs were found to be statistically enriched in several Eukaryota and 1,406,588 as viruses, leaving 6,257,223 as unclassified. phyla common in soil environments (for example, Gemmatimonadota, The taxonomic distribution of the NMPFs, on the basis of their Acidobacteriota, Crenarchaeota and Myxococcota) and statistically corresponding scaffold taxonomic assignment, is shown in Fig. 3a depleted from several phyla found in humans and other host- and Extended Data Fig. 6. The majority of protein families included associated environments (Firmicutes, Proteobacteria and Bacteroi- sequences with multiple taxonomic assignments (such as bacterial dota; Fig. 3e). Taken together, these results reveal that a significant and unclassified, or bacterial and viral). The largest category consisted fraction of functional diversity remains taxonomically orphan des of families with bacterial/unclassified sequences, followed by viruses/ pite improvements in sequencing throughout and large-scale MAG unclassified and bacteria/viruses. A much smaller group of families was reconstructions. assigned to Eukarya and even fewer families to Archaea. Finally, 7,253 clusters had no taxonomic information at all. As no reliable de novo eukaryotic gene predictor exists for unbinned Metadata distribution metagenomes19, a lot of sequences may come from eukaryotic scaffolds We next examined the geographical distribution of NMPFs (Extended that may contain translation errors (such as mistranslated introns). Data Fig. 7). A very small number of families (1,372; 1.3%) was found to However, analysing the contents of these clusters (Supplementary have limited geographical distribution (within 1 km), and this number Methods) showed that their majority include proteins from Bacteria only moderately increased (4,330; 4%) when we allowed for a maxi- and Archaea alongside Eukarya, with very few NMPFs containing only mum distance of 1,000 km. Most of these families were found in plant, eukaryotic sequences (Supplementary Data 6). Moreover, more than soil and freshwater ecosystems. A very small number of these families half of these clusters are validated by metatranscriptomic data. These included members found in marine ecosystems or human samples two observations supported the quality of the eukaryotic-containing as expected from the higher microbial dispersal in these ecosystems NMPFs. (Extended Data Fig. 7f,g). Nature | Vol 622 | 19 October 2023 | 597 Article a 45,000 43,682 40,000 35,000 Intersection size 30,000 25,000 20,000 15,000 7,253 6,719 15,835 5,398 14,221 10,000 2,360 2,612 1,272 5,000 953 884 858 842 756 742 414 390 220 196 119 36 34 24 16 75 0 8 6 1 1 4,777 Archaea 8,946 Eukaryota 24,452 Viruses 73,884 Bacteria 81,092 Unclassified b 106,198 clusters e Campylobacterota (1) *** Firmicutes C (6) P value P value Firmicutes (30) *** (enriched) (depleted) Marinisomatota (26) ** 0.700) prediction. The Similarly, family F021307 was found within a probable chloroplast pTM-score integrates both the predicted confidence per position and ribosomal protein operon in 67% of encoding scaffolds. In total, 7,625 the predicted alignment error (pAE) for every pair of positions, indicat- NMPFs were found to have greater than 50% co-occurrence with specific ing the confidence of domain–domain orientations. Pfams, while 585 families had greater than 90% co-occurrence with a On the basis of structural clustering, these high-confidence predic- Pfam family (Supplementary Data 1). These associations can also be tions represented 4,361 unique structures. To examine the novelty or used to predict a functional role for NMPFs; a few examples are given functions of these structures, we compared them to experimentally in Extended Data Figs. 8 and 9, in which the gene neighbourhoods of determined structures from SCOP-Extended (SCOPe)26 and assemblies selected NMPFs are presented as association networks, combined with from the Protein Data Bank (PDB)27. In total, 3,808 structures (12,253 functional annotation from COG. NMPFs) had a significant structural overlap with at least 1 SCOPe domain (TM-score > 0.5). Of these, 2,718 (7,769 NMPFs) had a non-trivial hit, indicating that 62.3% of high-quality predictions had some similarity Structural distribution to at least one SCOPe domain or PDB assembly. Recent breakthroughs in protein structure prediction22 have enabled These novel assignments, based on structural similarity, can now fast and accurate structural characterization of protein sequences. be used for functional prediction of the corresponding sequences. A Metagenomic sequences have been shown to represent a particu- few examples are shown in Fig. 4c. For example, family F034396 had larly rich source for the discovery of novel structures23,24. Here we ran no hits to the PDB using HHsearch (top hit of e-value = 12), yet a strong AlphaFold222 on NMPFs with at least 16 diverse sequences, or where hit to the PDB using a structural search of the SCOPe domain d3cmba1 TrRosetta25 predicted a well-structured protein (Methods). The results (TM-score = 0.69), with the function of acetoacetate decarboxylase. are summarized in Fig. 4a. Out of the 81,345 NMPFs that met the above Other examples with no HHsearch hits (e-value > 10), yet strong struc- criteria, 80,585 3D models were predicted, with 13,096 NMPFs having a tural hits included F010804-d1z45a1 (TM-score = 0.73, galactose Nature | Vol 622 | 19 October 2023 | 599 Article mutarotase) and F097565-d1xkra_ (TM-score = 0.73, chemotaxis). We NMPFs (60%) comprised proteins encoded by genes identified in both stress that these cases should be treated as informed predictions that metagenomes and metatranscriptomes, indicating that most of those require experimental validation, as the same fold does not always cor- genes are actively expressed, further supporting the validity of those respond to the same function. A full list is provided in Supplementary clusters. The clustering quality was also supported by the observation Data 2. However, some validation and additional functional annotation that 92% of clusters had members spanning 50 samples or more, while can be performed by combining these novel assignments with other 50% of the clusters were from proteins distributed across 100 samples NMPF metadata, such as gene co-occurrence. A few examples are given or more (Fig. 2d). in Extended Data Fig. 8. The identification of 7.5% of the NMPFs on the recently reconstructed To confirm that the remaining 553 proteins with no SCOPe hit were MAGs of the GEM catalogue indicates that, as we continue to access novel folds, a more thorough search was performed against all PDB bio- the genetic content of uncultured microbial diversity, an increasing logical assemblies, including all possible chain permutations. In total, number of taxonomically orphan novel protein families will become 345 models had a hit to at least one PDB entry, of which 305 represented taxonomically assigned, an important step towards their functional additional novel assignments. The remaining 208 were processed for and ecological characterization. further filtering, removing predictions of which 50% of the structure There are several limitations underlying the metagenomic data and matched a SCOPe domain. Finally, 162 folds and/or domain–domain methodology used in this study. One limiting factor to consider is the orientations from 223 NMPFs were identified as novel (Fig. 4b). A com- short size (shorter than 5 kb) of the majority of scaffolds used in this plete list of these folds is provided in Supplementary Data 3. study. However, note that, due to the required alignment coverage of Although the absence of any significant structural homology pre- at least 80%, potentially truncated sequences have to be sufficiently cludes the reliable functional annotation for these novel folds, some complete to cluster with full-length sequences (defined as located in hints towards their potential function can be gleaned from their asso- the middle of longer scaffolds). This requirement has largely precluded ciated metadata. Characteristic examples are given in Extended Data the enrichment of NMPFs with fragmented proteins. However, even Fig. 9, showcasing the gene neighbourhood and ecosystem metadata in the case of NMPFs with a high percentage of these suspect sequences, of three NMPFs with novel structural folds. the clusters are found to produce stable 3D models (often with high structural quality, as evidenced by pLDDT and pTM scores), many of which have structural homologues to SCOPe domains. As a result, Discussion families containing such sequences could potentially represent pro- Arguably, the best approach for estimating and exploring microbial tein fragments or protein domains that form parts of multi-domain functional diversity is through systematically cataloguing and exhaus- sequences, or components in multimeric complexes. An additional tively characterizing sequence-diversity space. Over the past three potential limitation is the inclusion of eukaryotic sequences in the decades, genome sequencing of hundreds of thousands of cultured sequence dataset, which may introduce errors in the analysis. Yet, as microbial strains has enabled unprecedented growth and characteriza- shown (taxonomic distribution; Supplementary Methods), the contri- tion of this sequence space, revealing that sequencing efforts targeted butions from eukaryotic scaffolds are relatively small, and the majority to maximize phylogenetic diversity can lead to further discoveries and of the associated NMPFs also contain data from metatranscriptomes growth of currently known protein family diversity28. Although the and/or prokaryotic taxa in sequence alignments, supporting their exploration of the corresponding sequence-encoded functional diver- validity. However, until reliable eukaryotic gene predictors for metage- sity is lagging substantially29, the explosion in the number of identified nomes become available, eukaryotic, as well as unclassified NMPFs and novel protein families has, to a great extent, been accompanied by an sequences should be handled with care. increase in targeted functional characterization of some of those fami- Overall, as more metagenomic data become available, an increasing lies, particularly in areas of important biotechnological applications diversity of sequences will be incorporated into NMPFs, which will then such as the discovery of new CRISPR–Cas genes and systems30. The enable the generation of a much higher number of high-confidence advent of metagenomics has further fuelled the rush to discover new structures, and therefore further increase the numbers of assignments enzymatic activities by unearthing a hidden treasure trove of untapped to known structures as well as uncover novel folds. The identification sequence information. Yet, aside from generating for-the-first-time of NMPFs opens new paths for structural genomics and challenges for important habitat-specific environmental gene catalogues31, most fold recognition and the exploration of microbial dark matter. explorations of exponentially growing metagenomic sequence space have focused on expanding the diversity and characterization of previ- ously known protein families32. Online content To alleviate this limitation and pioneer global insights into the extent Any methods, additional references, Nature Portfolio reporting summa- of novel sequence space and, by effect, functional diversity across ries, source data, extended data, supplementary information, acknowl- the realm of sequenced biomes, we have amassed close to 27,000 edgements, peer review information; details of author contributions publicly available assembled metagenomic and metatranscriptomic and competing interests; and statements of data and code availability datasets. From these datasets, we generated the NMPF catalogue, are available at https://doi.org/10.1038/s41586-023-06583-7. consisting of 106,198 metagenome protein families of 100 members or more with no sequence similarity to genes from reference microbial 1. New, F. N. & Brito, I. L. What is metagenomics teaching us, and what is missed? Annu. Rev. Microbiol. 74, 117–135 (2020). genomes or Pfam entries. Although these families represented a mere 2. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. duplication over the number of families generated from more than Nature 499, 431–437 (2013). 100,000 reference genomes integrated into IMG from all domains 3. Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021). of life, far greater increases were observed in the families containing 4. Meyer, F. et al. MG-RAST version 4—lessons learned from a decade of low-budget more than 25, 50 or 75 members, strongly suggesting that extensive ultra-high-throughput metagenome analysis. Brief. Bioinform. 20, 1151–1159 (2019). sequence and functional diversity remains untapped. We anticipate 5. Ayling, M., Clark, M. D. & Leggett, R. M. New approaches for metagenome assembly with short reads. Brief. Bioinform. 21, 584–594 (2020). that this diversity in unexplored microbial protein space will continue 6. Chen, I.-M. A. et al. The IMG/M data management and analysis system v.6.0: new tools to increase over the next several years as more novel environmental and advanced capabilities. Nucleic Acids Res. 49, D751–D763 (2021). samples are sequenced. 7. Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2019). Although a much smaller number of metatranscriptomes was avail- 8. Galperin, M. Y. et al. COG database update: focus on microbial diversity, model organisms, able for this analysis (4,739; 17.6%), we observed that the majority of and widespread pathogens. Nucleic Acids Res. 49, D274–D281 (2021). 600 | Nature | Vol 622 | 19 October 2023 9. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference Marina Kalyuzhnaya42, Charlene N. Kelly63, Robert M. Kelly64, Jonathan L. Klassen65, resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016). Klaus Nüsslein66, Joel E. Kostka67, Steven Lindow68, Erik Lilleskov69, Mackenzie Lynes53, 10. Vanni, C. et al. AGNOSTOS-DB: a resource to unlock the uncharted regions of the Rachel Mackelprang70, Francis M. Martin71, Olivia U. Mason72, R. Michael McKay73, coding sequence space. Preprint at bioRxiv https://doi.org/10.1101/2021.06.07.447314 Katherine McMahon74, David A. Mead75, Monica Medina76, Laura K. Meredith77,78, (2021). Thomas Mock79, William W. Mohn80, Mary Ann Moran81, Alison Murray82, Josh D. Neufeld57, 11. Rodríguez del Río, Á. et al. Functional and evolutionary significance of unknown genes Rebecca Neumann83, Jeanette M. Norton84, Laila P. Partida-Martinez85, Nicole Pietrasiak86, from uncultivated taxa. Preprint at bioRxiv https://doi.org/10.1101/2022.01.26.477801 Dale Pelletier62, T. B. K. Reddy2, Brandi Kiel Reese87, Nicholas J. Reichart53, Rebecca Reiss88, (2022). Mak A. Saito89, Daniel P. Schachtman90, Rekha Seshadri2, Ashley Shade91, 12. Modha, S., Robertson, D. L., Hughes, J. & Orton, R. J. Quantifying and cataloguing David Sherman92,93, Rachel Simister80, Holly Simon94,95, James Stegen96, unknown sequences within human microbiomes. mSystems https://doi.org/10.1128/ Ramunas Stepanauskas97, Matthew Sullivan98, Dawn Y. Sumner99, Hanno Teeling100, msystems.01468-21 (2022). Kimberlee Thamatrakoln24, Kathleen Treseder101, Susannah Tringe2, Parag Vaishampayan102, 13. Azad, A., Pavlopoulos, G. A., Ouzounis, C. A., Kyrpides, N. C. & Buluç, A. HipMCL: David L. Valentine103, Nicholas B. Waldo83, Mark P. Waldrop104, David A. Walsh105, a high-performance parallel implementation of the Markov clustering algorithm for David M. Ward60, Michael Wilkins106, Thea Whitman21, Jamie Woolet107 & Tanja Woyke2 large-scale networks. Nucleic Acids Res. 46, e33 (2018). 14. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). 17 Department of Marine Biology and Oceanography, Institut de Ciències del Mar, Barcelona, 15. Mukherjee, S. et al. Genomes OnLine Database (GOLD) v.8: overview and updates. Spain. 18Biology Department, Clark University, Worcester, MA, USA. 19AgResearch, Grasslands Nucleic Acids Res. 49, D723–D733 (2021). Research Centre, Palmerston North, New Zealand. 20Laboratory of Environmental Microbiology, 16. Ivanova, N. et al. A call for standardized classification of metagenome projects. Environ. Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic. Microbiol. 12, 1803–1805 (2010). 21 Department of Soil Science, University of Wisconsin-Madison, Madison, WI, USA. 22Department 17. Coelho, L. P. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 of Biology, Boston University, Boston, MA, USA. 23Carnegie Institution for Science, Stanford, (2022). CA, USA. 24Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, 18. Clum, A. et al. DOE JGI Metagenome Workflow. mSystems 6, e00804-20 (2021). NJ, USA. 25Department of Biology, University of Massachusetts, Amherst, MA, USA. 26Department 19. Baltoumas, F. A. et al. Exploring microbial functional biodiversity at the protein family of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA. 27Marine level-From metagenomic sequence reads to annotated protein clusters. Front. Bioinform. Science Center, Department of Marine and Environmental Sciences, Northeastern University, 3, 1157956 (2023). Nahant, MA, USA. 28Integrative Oceanography Division, Scripps Institution of Oceanography, 20. Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 UC San Diego, La Jolla, CA, USA. 29School of Marine Sciences, University of Maine, Orono, ME, (2021). USA. 30Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, 21. Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene CA, USA. 31Research Group Insect Microbiology and Symbiosis, Max Planck Institute for clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896–2901 (1999). Terrestrial Microbiology, Marburg, Germany. 32Department of Biochemistry and Molecular 22. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, Biology, The Pennsylvania State University, University Park, PA, USA. 33Department of 583–589 (2021). Microbiology, The University of Tennessee, Knoxville, Knoxville, TN, USA. 34School of Life 23. Ovchinnikov, S. et al. Protein structure determination using metagenome sequence data. Sciences, Arizona State University, Tempe, AZ, USA. 35Department of Biological Sciences, Science 355, 294–298 (2017). Clemson University, Clemson, SC, USA. 36School of Biotechnology and Biomolecular 24. Hou, Q. et al. Using metagenomic data to boost protein structure prediction and Sciences, UNSW Sydney, Sydney, New South Wales, Australia. 37Center for Ecosystem discovery. Comput. Struct. Biotechnol. J. 20, 434–442 (2022). Science and Society, Northern Arizona University, Flagstaff, AZ, USA. 38Department of the 25. Yang, J. et al. Improved protein structure prediction using predicted interresidue Geophysical Sciences, University of Chicago, Chicago, IL, USA. 39Department of Microbiology orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020). and Immunology, University of British Columbia, Vancouver, British Columbia, Canada. 26. Chandonia, J.-M. et al. SCOPe: improvements to the structural classification of 40 Department of Biology, San Diego State University, San Diego, CA, USA. 41Department of proteins—extended database to facilitate variant interpretation and machine learning. Bacteriology, University of Wisconsin-Madison, Madison, WI, USA. 42Department of Biology, Nucleic Acids Res. 50, D553–D559 (2022). University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. 43Department of Ecology and 27. Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA. 44Ocean Genome Legacy, Nat. Struct. Biol. 10, 980 (2003). Marine Science Center, Northeastern University, Nahant, MA, USA. 45Department of Biological 28. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand Sciences, California State University, Los Angeles, CA, USA. 46Department of Earth System coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017). Science, Stanford University, Stanford, CA, USA. 47Department of Plant Sciences, University 29. Roberts, R. J. et al. COMBREX: a project to accelerate the functional annotation of of California, Davis, Davis, CA, USA. 48Center for Marine Biotechnology and Biomedicine, prokaryotic genomes. Nucleic Acids Res. 39, D11–D14 (2011). Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA. 30. Koonin, E. V. & Makarova, K. S. Evolutionary plasticity and functional versatility of CRISPR 49 Department of Microbiology and Medical Zoology, School of Medicine, University of systems. PLoS Biol. 20, e3001481 (2022). Puerto Rico, San Juan, PR, USA. 50Lynker, Albuquerque, NM, USA. 51Department of Crops 31. Qin, J.