Members of an ecosystem seldomly live independently from each other; instead, they develop local interactions and form inter-member organizations to influence higher-level patterns and functions of the ecosystem [42]. In macro-ecology, an important form of inter-member organization is called “guild,” a term initially coined to represent “a group of species that exploit the same class of environmental resources in a similar way” [43]. Later, guilds became synonymous with “functional groups,” and members of the same guild also encompass those who perform similar functions within a community [43]. In this context of “guild” definitions, members of a guild tend to exhibit co-abundance patterns by thriving or declining together without regard to their taxonomic positions whenever resources become available or depleted.
Like macro-ecosystems, e.g., a rain forest [44], the human gut microbiome is a complex ecosystem, and patterns and functions of a complex ecosystem emerge from localized interactions among its individual components and inter-member organizations [42, 45]. Grouping bacterial members into potential guilds and studying individual organisms’ guild-level organization could help us understand the gut microbiota’s structural and functional relationships and highlight the individual components that are key to system processes and functions. We see a value in translating the concept of guild from macro-ecology to the study of microorganisms because guild focuses on interactions among individual members of an ecosystem without regard to their taxonomic positions. In addition, grouping bacterial members into guilds acknowledges the possibility of different species possessing potential molecular mechanisms that effectively “bind” them together to perform a specific functional role in the gut ecosystem. An example of a possible molecular mechanism is how one bacterium could produce a metabolic product used by another as a nutrient in a syntrophic relationship. Such a molecular mechanism could lead to these bacteria consistently increasing or decreasing in abundance and working together to fulfill certain metabolic functions or be more ecologically competitive. We propose that members of the same bacterial guild [46] are likely to work together to exert ecological functions and show consistent co-abundant behavior, which may be derived from exploiting the same resources in a similar way or possessing certain interactive molecular mechanisms.
The concept of guild has been used in various macro-ecological system studies, e.g., to explore the effects of environments on bird communities [47] or the assemblage of fish [48]. However, it is important to note that the definition of guild membership is sometimes arbitrary in macro-ecology because it is impossible to study all species living in an ecosystem at once, nor it is possible to monitor every member of a bird or fish community over an extensive period to capture their co-abundance patterns. In contrast, in the world of gut microbiota, one could capture data on every detectable member of the gut ecosystem, using high-throughput sequencing and along sizable spatial and temporal gradients, thus making it possible to statistically describe co-abundance patterns among all detectable members. This gives microbiome research the advantage to redefining guilds in the context of microbial ecosystems in a data-driven manner. The first step of partitioning potential guilds is to analyze bacterial abundance variations quantitatively. Whenever abundance data of genome markers (16S rRNA genes) or genomes are available, one could identify potential guilds by clustering co-abundance groups (CAGs) based on microorganisms’ co-variation of abundance, i.e., all members of the same potential guild (CAG) are positively correlated in the context of abundance changes. Consequently, clustering bacterial genomes into CAGs based on their abundance correlations could be the most parsimonious first step to study guild-level organization in microbiome research. The second step of studying guild-level organization is to investigate the relationships between guilds and host phenotypes. One effective strategy is to correlate guild abundance changes with variations of host bio-clinical parameters. The abundance of guilds can be represented by the sum of its members’ abundance since guild members have a concerted response to changes in the gut environment, e.g., increased availability of carbohydrates. Such correlation analysis can identify guilds into three categories: positively associated (potentially pathogenic), negatively associated (potentially beneficial), or have no association (potentially neutral) with disease phenotypes [46]. In this manner, guild-based analysis can potentially recapitulate the ecological interaction network of key members of the gut microbiota [49] and their relationship with host phenotypes.
Guild-based analysis can be an ecologically relevant aggregation method for identifying bacterial functional groups while reducing dimensionality and sparsity in data analysis. The concept of co-abundance group (CAG) was first introduced to the human microbiome field by Claesson et al. in 2012 [50]. Claesson and colleagues clustered correlated genera into six CAGs and investigated the transition of the gut microbiota, comparing between healthy community-dwelling subjects and frail long-term care residents. However, they clustered taxonomic groups instead of unique bacterial genomes and did not explore potential relationships between CAGs and host phenotypes. Since then, the concept of co-abundance group or co-abundance network has been applied in several dozens of microbiota studies [51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73]. Most of these studies [51,52,53,54,55,56,57,58,59,60,61,62,63,64], including Claesson et al. [50], first collapsed unique bacterial genomes into taxonomic units and then clustered taxonomic units into CAGs. The remaining publications [65,66,67,68,69,70,71,72,73] grouped unique genomes, operational taxonomic units (OTUs), or amplicon sequence variants (ASVs) into CAGs. However, these studies [65,66,67,68,69,70,71,72,73] did not treat CAGs as functional units nor directly investigated potential relationships between CAGs and host phenotypes. Though the concept of co-abundance analysis is not entirely new, our proposed strategy is different from previous work in two important aspects. First of all, we emphasize the importance of identifying CAGs at the highest resolution level possible (e.g., genome or strain in whole shotgun metagenome, OTU or ASV in 16S rRNA gene sequencing data), instead of analyzing co-abundant relationships at genus or higher taxon levels [50, 74]. Secondly, we treat CAGs as units with ecological functions (guilds) and directly explore their relationship with host phenotypes in down-stream analyses.
In practice, we suggest choosing suitable correlation methods based on specific dataset structures, such as the workflow proposed by Weisis et al. [75]. How to compare and select the best correlation methods for different microbiome datasets is beyond the scope of this article. However, in the next two sections of this opinion article, we present two guild-based analysis examples using two different correlation methods to demonstrate the possibility of integrating various correlation methods into guild-based analysis. Specifically, we recommend a dichotomic and tree-based group identification method for partitioning potential guilds without presetting the number of groups. In brief, the correlation coefficients between two genomes, two OTUs, or two ASVs could then be converted into distance metrics (1- correlation coefficient) and clustered using the Ward clustering algorithm [76]. From the top of the clustering tree, permutational multivariate analysis of variance (PERMANOVA test) is used to sequentially determine whether any of the two clades are significantly different [50]. For example, if the clustering tree has two big clades (A and B) and each one has two small clades (C and D belonging to A, and E and F belonging to B), PERMANOVA will be used to test the two larger clades first. If they are significantly different, one can test the two smaller clades under each big clade separately. If there is no significant difference between the smaller clades (C vs. D, E vs. F), the clustering will stop here and conclude that there are two potential guilds (A and B). The method we described above only serves as one example of clustering stop. Many conditions can be adopted to optimize the clustering results. One essential issue to note is that a minimum of 25 samples is needed for a robust co-abundance network analysis in microbiota studies [49]. Based on in silico stimulation work, Berry and Widder proposed several recommendations for applying co-abundance network analysis in microbiota studies [49]. A larger sample size (> 25 samples) will increase sensitivity for identifying more robust co-occurrence events. Lastly, we acknowledge that several other algorithms, such as autoencoding neural nets [77], and algorithms based on Singular Value Decomposition (SVD) [78], could also be used to reduce the dimensionality. However, whether these mathematical methods reflect the biological reality of the microbiome community remains elusive. These methods often reconstruct the dataset variables into new analysis features, such as encoder layers or principal components, making it more challenging to interpret the original variables (in this case, the individual bacterial strains). And finally, various traditional (e.g., model-based) and modern clustering (e.g., spectral graph theory based) algorithm [79] are also worth testing for the identification of guilds in future works.
Guild-based analysis for metagenomic datasets
We first applied the concept of guild in a clinical trial to understand the role of gut microbiota in body weight regulation of obese children [46]. Genetically obese children with Prader-Willi syndrome (PWS) lost a significant amount of weight by consuming a diet enriched with non-digestible carbohydrates for 90 days. Fecal samples were collected at four time points (days 0, 30, 60, and 90) to track the associated gut microbiota changes [46] and deep metagenomic sequencing was performed on a total of 109 time series fecal samples (including two time point samples from a group of obese children with no genetic cause within the same intervention). 76.0 ± 18.0 million (mean ± s.d.) high-quality reads were obtained from each sample using the Illumina HiSeq platform. After de novo assembly with IDBA-UD [80], gene prediction with MetaGeneMark [81], and gene de-redundancy with CD-HIT [82], we generated a non-redundant gut microbial gene catalog containing ~ 2 million genes. The gene abundance matrix had a sparsity at 79% with more than 2 million dimensions (variables) for only 109 samples.
Instead of searching the existing database for closest neighbors, we binned non-redundant genes into draft genomes based on the fact that the abundances of two genes located on the same genomic DNA molecule will highly correlate with each other across multiple samples [83]. A total of ~ 28,000 draft genomes (including both low-quality and high-quality ones) were binned and identified using a “canopy-based” algorithm [83]. Among all draft genomes, 376 had more than 700 genes and were identified as distinct bacterial genomes [83]. We further selected 161 genomes for subsequent analysis because they were shared by more than 20% of the samples and considered the prevalent gut bacteria. These selected genomes could be considered the predominant members of the gut microbiota because, together, they accounted for more than 60% of the total metagenomic sequences. At this point, we reduced the dataset to 161 variables with a sparsity of 52%. Then we calculated bootstrapped Spearman correlation coefficient to determine the associations between these 161 genomes. After converting the correlations into correlation distance, we clustered the 161 genomes into 18 guilds using the Ward clustering algorithm [76]. At this point, the guild abundance matrix was further reduced to a sparsity of 16%. Here we showed that a significant reduction of matrix dimensionality and sparsity is possible when we moved from genes to genomes and genomes to guilds (Fig. 1).
Our PWS study explored the relationship between these 18 identified guilds and disease phenotypes: three guilds showed negative correlations while nine guilds showed positive correlations with at least one disease phenotype. This result suggests that the former three guilds could potentially be beneficial, while the other nine guilds may be pathogenic and detrimental. The remaining six guilds had no correlations with any host disease phenotypes, indicating that they might be commensals. In addition, after reducing the data dimensionality, the number of variables was smaller than the sample size, allowing us to apply conventional/classical statistical models for modeling and predictions. Using a linear mixed-effect model trained with the day 0 and day 30 datasets, we showed that the guilds at day 60 had a predicting capacity for the anthropometric markers at day 90 (R2 = 0.64, P = 0.0002 for body mass index; R2 = 0.43, P = 0.0054 for hip circumference; R2 = 0.55, P = 0.0010 for waist circumference).
Interestingly, in our PWS study, none of the guilds we identified was taxonomically homogeneous. The nine genomes in guild #13 were from four different phyla (Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria; Fig. 2a). Meanwhile, bacteria that belonged to the same taxon (e.g., species) were assigned to different guilds that responded differently to the intervention. For example, five high-quality draft genomes of Eubacterium eligens were found in three different guilds: one positively responded, one negatively responded, and the third did not respond to the high-fiber intervention (Fig. 2b). Specifically, from day 0 to day 30, one E. eligens strain dramatically increased in abundance, while the other four showed a sharp or steady decrease (Fig. 3a). When these strains were collapsed at the species level, the abundance change of these five strains would have been added together and represented by the black line in Fig. 3b. Such a process would have overlooked the different response patterns of these E. eligens strains to the intervention, leading us to a spurious conclusion that the abundance of these E. eligens strains consistently decreased over the course of the intervention (Fig. 3b). This observation indicates that taxa are functionally heterogenous and do not serve as a coherent functional unit for correlation analysis with host phenotypes even at the species level. This result again echoes the limitations of taxon-based analysis, suggesting that bacterial taxa (species to phyla) are functionally heterogeneous units. By contrast, guild-based analysis groups the E. eligens strains into three guilds and accurately captures all three types of response patterns to the intervention (Fig. 3b).
Guild-based analysis for 16S rRNA gene data
Guild-based analysis is also applicable for microbiome studies based on 16S rRNA gene sequencing. Conventionally, after quality control, reads are clustered into OTUs with 97% similarity using UPARSE [84] or other OTU picking methods. Recent advancement in the field is to perform denoising, which models and corrects sequencing amplicon errors, obtaining amplicon sequences variants (ASVs) with single-nucleotide resolution [85]. Studies can then identify guilds based on abundance correlations between OTUs/ASVs. Using a 16S rRNA gene V4 sequencing dataset from a study on green tea polyphenols, we found that the gut microbiome responded to polyphenols as guilds, and some guilds were associated with the polyphenols’ effect on lowering blood glucose in db/db mice [86]. In a series of studies on calorie restriction and gut microbiota, we applied the guild-based analysis to the 16S rRNA gene datasets. We identified a Lactobacillus murinus-dominated guild, which showed a positive correlation with lifespan and metabolic health and demonstrated a robust capacity to alleviate side-effects induced by a common chemotherapeutic agent [87,88,89].
In a study of patients with polycystic ovary syndrome (PCOS), 16S rRNA gene V3-V4 sequencing revealed the gut microbiome difference between PCOS patients and non-obese controls [90]. Out of 567 OTUs identified in total, 225 OTUs were shared by at least 20% of the samples and further clustered into 23 different guilds based on their co-abundance correlations calculated using the SparCC approach [91]. The abundance of these 23 guilds accounted for 95.53 ± 5.42% of the total sequences. The abundance of nine guilds showed a significant difference between patients and controls. When correlated with 26 host clinical parameters, three guilds (guilds #1, #4, and #7) showed positive correlations with disease parameters, and five (guilds #10, #11, #12, #13, and #18) were negatively correlated [90].
We performed taxon-based analysis at the genus level using the same OTU profile to demonstrate the difference between guild-based and taxon-based methods. Among the 567 OTUs, 266 OTUs were annotated into 96 genera [90]; 301 OTUs had no taxonomy annotation and were then excluded from subsequent analysis. We further narrowed the dataset to the 64 genera shared by more than 20% of the samples, a cut-off that was consistent with the guild-based analysis described above. These 64 genera, corresponding to 223 OTUs, accounted for 81.77 ± 7.5% of the total sequences. Fourteen genera showed a significant difference between the groups. Six genera had negative correlations with disease phenotypes, and four had positive correlations (Fig. 4a).
In this PCOS study, the taxon-based analysis excluded nearly 20% of raw sequencing data, an indication that a substantial part of the sequencing data was novel and had no close neighbor at the genus level in the reference database. Although using a similar number of OTUs as the taxon-based method, guild-based analysis kept novel sequencing data intact and did not restrict the analysis dataset to OTUs known in a reference database. Furthermore, guild-based analysis and taxon-based analysis of this PCOS dataset showed critical differences between their results on the potential role of specific OTUs in human health and disease phenotypes. First, there were discrepancies between results on Bacteroides from these two analysis strategies. A total of 13 prevalent OTUs were classified as Bacteroides. In taxon-based analysis, genus-level results showed positive correlations between Bacteroides genus and disease phenotypes (Fig. 4a), giving the impression that all OTUs in this genus may play a detrimental role in host health. Guild-based analysis clustered these 13 Bacteroides OTUs into seven different guilds. Specifically, Bacteroides OTU4 (classified into guild#1) showed a positive correlation with disease phenotype and were potentially detrimental, while Bacteroides OTU7 and Bacteroides OTU63 (both classified into guild#18) showed a negative correlation and were potentially beneficial (Fig. 4b). This result suggests that the assumed positive correlations between all OTUs in the genus Bacteroides and disease phenotypes, derived from taxon-based analysis results, could mislead our understanding of the roles of members of Bacteroides in PCOS. Secondly, when OTUs were collapsed at the genus level, the genus Alistipes did not correlate with host clinical parameters. In contrast, using guild-based analysis, we classified all eight prevalent Alistipes OTUs into five different guilds. Interestingly, Alistipes OTU130 was classified into guild#12, a guild that showed a negative correlation with host clinical parameters, while Alistipes OTU200 was classified into guild#4, a guild that was positively correlated with disease phenotypes (Fig. 4c). This finding suggests that different members of the same genus may affect the host in an opposite manner, some (as part of a guild) are positively associated with the disease phenotype, while others are negatively associated. When all the OTUs are collapsed into the same genus and used as a single variable in the analysis, OTUs with opposite relationships with the same disease phenotype could cancel each other out, resulting in a genus-level result of no correlation with the disease phenotype. Thirdly, only one OTU was annotated as Akkermansia at the genus level in the PCOS dataset. In the taxon-based analysis, this Akkermansia OTU80 had significantly different abundance among the groups and showed negative correlations with the host phenotype. In the guild-based analysis, the Akkermansia OTU80 was clustered into a guild, which also had significantly different abundance among the groups and showed negative correlations with the host phenotype. In addition to identifying Akkermansia as a potentially beneficial bacterium to host health, guild-based analysis revealed that Akkermansia OTU80 was co-abundant with one Clostridium IV OTU236 and another 8 OTUs without annotation at the genus level (unclassified OTU318, OTU272, OTU281, OTU397, OTU303, OTU300, OTU398, and OTU328). These findings suggest that the beneficial role of this Akkermansia OTU may require interactions with other bacteria [92]. Thus, guild-based analysis provides a complete picture of the ecological interactions between OTUs relevant to host health.