Introduction

In Brazil, pastures cover ~ 154 million hectares, of which about 65% exhibit signs of intermediate to severe degradation1. Pasture degradation reduces the capacity of the land to produce biomass to support animals and to maintain ecosystem productivity. In tropical soils, degradation is associated with overgrazing and reduced soil fertility2. Consequently, when native forests are converted into pastures, it becomes necessary to improve soil conditions by increasing soil pH and exchangeable bases while reducing aluminum content and potential acidity3. At the same time, the conversion of native forests into pastures reduces soil organic carbon content, a factor that contributes to the degradation process4,5. Overgrazing, in which pastures carry more animals than they can sustain, drastically diminishes soil cover and accelerates desertification2. The result is a significant reduction in soil carbon and nitrogen stocks, which strongly affects bacterial communities and disrupts ecosystem balance3,4,5. In contrast, improved soil management, such as that promoted by integrated agricultural systems, can provide additional benefits, including animal welfare and adaptation as well as mitigation of climate change6.

Bacterial communities are essential to nutrient cycling in the soil, and consequently to proper ecosystem functioning6. While numerous publications have documented the soil bacterial communities in subhumid tropical pastures3,4,5,7,8, the relationship between soil fertility and the soil bacteriome in these pastures is not yet fully understood. Indeed, previous studies have mostly focused on specific drivers, such as individual nutrients and soil organic carbon4,5,7,8,9. For instance, Bastida et al.8 reported that soil microbial diversity and biomass ratios are highest in arid environments with low carbon content, while Costa et al.7 revealed a strong association between the quality of pastures in a sub-humid tropical region and the participation of organic carbon fractions and microbial biomass. These studies also reported that proper pasture management practices can significantly enhance microbial diversity and complexity, with such pastures exhibiting no distinction from adjacent preserved forests.

Although knowledge about the soil microbiome in tropical regions has advanced in recent years3,4,6,7,8, especially regarding the occurrence of predominant taxonomic groups3,8, knowledge of bacterial composition, structure, and ecological interactions with soil fertility is still limited7. These gaps need to be addressed, including the interactions that modulate the bacteriomes of pasture soils and their fertility9. Moreover, further studies focused on changes in microbial structure and their correlation with enzymatic processes are necessary for the development of new indicators of ecosystem health and sustainability.

Figure 1

Grouping of pastures through principal components and K-Means clustering algorithm based on soil chemical attributes and leaf nitrogen content. (a) Biplot of the principal component analysis (PCA) highlighting the three most important clusters according to the K-Means grouping; (b) Contribution of the main soil variables to the variance explained by the two main axes of the PCA; (c) New PCA biplot highlighting the difference in soil fertility levels between the two most contrasting clusters. CEC cation exchange capacity, Ngrass leaf-N, TOC total organic carbon, V% base saturation. Created in the R environment (v.4.3.1).

The K-means analysis subdivided the samples into groups of 15 (I), 10 (II), and 11 (III). Groups I and II were used in the downstream analyses, as they corresponded to the soils with the most significant contrasts among the studied variables, primarily pH and cation saturation (Table S1). Isolating clusters I and II increased the explanatory power of the PCA (68.4% of variability) and revealed a clear distinction in soil fertility (Fig. 1c). Most environmental variables favorable to soil fertility (pH, CEC, TOC, and all the cations from the saturation calculation) showed significant (p < 0.05) and positive correlations with the first dimension of the PCA, and the opposite was observed for Al3+ and H + Al (Table S3). Cluster I remained strongly associated with chemical attributes that positively influence soil fertility (high fertility, HF), while cluster II was characterized by more acidic and poorer soils (lower V% and CEC), with a significant predominance of Al3+ (low fertility, LF) (Fig. 1c). These results suggested that comparing the LF and HF clusters offered the greatest probability of identifying possible impacts of changes in pasture soil fertility on the structure, diversity, composition, and interactions of microbial communities.

Statistical tests confirmed significantly higher means of pH, Ca2+, Mg2+, TOC, CEC, and V% in HF soil than in LF soil (Table S1, Fig. S3). The opposite was observed for Al3+, whose concentration in LF soil was twice the HF average. The greatest contrast was in base saturation, where HF soil had an average of 68.1 ± 8.1% while in LF soil the V% was only 31.4 ± 19.8%.

Associations between soil attributes and bacterial diversity

All alpha-diversity indices were higher in HF soil (Wilcoxon, p-value < 0.05), and the Shannon and Simpson indices showed results similar to their respective diversity values recalculated as effective numbers of ASVs (Fig. 2a, Table S4). Considering the effective diversity, HF soil had on average about 111 and 107 more ASVs than LF soil, based on the Shannon and Simpson indices, respectively (Table S4).

Figure 2

Alpha-diversity metrics and their associations with the chemical attributes of HF and LF pasture soils. (a) Comparisons of diversity indices by Wilcoxon signed-rank statistics; (b) Monotonic associations between alpha-diversity and chemical variables through Pearson correlation coefficients. Associations marked with one asterisk (*) or more were considered significant. CEC cation exchange capacity, Effec. Effective number of ASVs based on the Shannon and Simpson indices, Ngrass leaf nitrogen in the aerial part of the pastures, TOC total organic carbon, V% base saturation in the soil. Created in the R environment (v.4.3.1).

Base saturation (V%) was the most prominent variable in terms of positive contribution to alpha-diversity indices, followed by pH and Mg2+ concentration (Fig. 2b). Although the Shannon and Simpson diversity indices were positively affected by TOC, their respective effective numbers were not. Additionally, the effective number of ASVs based on Simpson showed a strongly positive correlation with Ca2+, unlike the Simpson index itself, revealing significant contrasts between the two calculations. Cation exchange capacity (CEC) had a significant positive influence on both Simpson-based indices. In contrast, the concentrations of labile-P, exchangeable Al, H + Al, and leaf-N showed negative correlations with all observed indices. Among these, P stood out, showing a significant negative influence on the total number of ASVs (richness) and on the Shannon-based indices. In addition, richness and Simpson’s diversity were significantly reduced by the contents of leaf-N and exchangeable Al, respectively.

According to the canonical correlation analysis (CCA) using generalized UniFrac distance (Fig. 3a), the structure of microbial communities showed a segregation pattern similar to that observed in the PCA of chemical attributes (Fig. 1c), explaining an even greater proportion of the total variation (70.5%). There was a strong positive participation of pH, Ca2+, Mg2+, V%, and TOC in HF soil, highlighting the abundance of the bacterial phyla Candidatus Dependentiae, Nitrospirae, Candidatus Patescibacteria, and Tenericutes in this niche, as well as Candidatus division WS4 and Abditibacteriota (formerly Candidatus FBP). In this model, the positive association of available P and Al3+ levels with LF soil also became clearer, where the phylum Candidatus Eremiobacterota (formerly Candidatus WPS-2) stood out and, less intensely, Candidatus Rokubacteria. Although some samples within each cluster were collected up to 200 km apart (Fig. 3b), geographic distance did not generate greater variability in beta-diversity (distance-decay) than the soil fertility patterns resulting from the K-Means clustering (Fig. 3c).

Figure 3

Beta-Diversity analysis of microbial communities in pastures with high (HF) and low fertility (LF) soils. Biplot with canonical correlation analysis (CCA) based on generalized UniFrac distance highlighting significant environmental variables (a), according to the Mantel test (p < 0.05), and the main responsive phyla. The average similarity between samples, calculated by Bray–Curtis dissimilarity, was associated with geographic distances (b) and edaphic distances (c). The phyla with significant correlations with one or more variables were also analyzed (d). CEC cation exchange capacity, Ngrass leaf-N, TOC total organic carbon, V% base saturation. Created in the R environment (v.4.3.1).

Taxonomic composition and enrichment

Six phyla were positively correlated with several beneficial chemical attributes of the soil, mainly pH (in H2O and CaCl2), V%, K+, Ca2+, Mg2+, CEC, and TOC (Fig. 3d). Ranked by the number of positive and significant associations, the phyla Bacteroidetes and Proteobacteria stood out, with abundances highly significantly correlated with pH, V%, Ca2+, Mg2+, CEC, and TOC content. The phylum Chloroflexi came third, correlating positively and significantly with all these variables except Ca2+. Lastly, the abundances of the phyla Nitrospirae, Verrucomicrobia, and Candidatus Rokubacteria were positively and significantly associated with V%, primarily driven by Ca2+. Nitrospirae and Candidatus Rokubacteria were the only phyla to show positive and significant correlations with K+, and only Nitrospirae showed the same for P and Ngrass (pasture leaf N). Other phyla were predominantly associated with less fertile soils, notably Candidatus Eremiobacterota (WPS-2), followed by Acidobacteria and Firmicutes, which were negatively and significantly associated with pH, CEC, TOC, and Mg2+ contents.

The abundance analyses identified differences between HF and LF soils in terms of enrichment of the main taxonomic ranks (Fig. 4). Initially, it was observed that 48.4% of the ASVs were shared between both clusters (Fig. 4a). However, HF soil presented a larger set of unique ASVs (31.9%) than LF soil (19.7%). Overall, the Actinobacteria phylum was the most abundant, representing 37.6% of all sequences (Fig. 4b). Next were the Proteobacteria (22.2%), Acidobacteria (10.4%), Firmicutes (10%), and Chloroflexi (5.2%), while the others did not exceed 5% relative abundance. At the class level, Thermoleophilia (17.5%), Actinobacteria (15.9%), Alphaproteobacteria (13.3%), Bacilli (9.7%), Acidobacteria (6.2%), Gammaproteobacteria (5.3%), and Verrucomicrobiae (4.4%) predominated, with the others not exceeding 4% relative abundance (Fig. 4c). Comparing the two clusters, the Acidobacteriia class showed the greatest variation, with a relative proportion about four times higher in LF.

Figure 4

Relative composition and differential abundance of the main bacterial taxonomic ranks found in pastures with high (HF) and low (LF) soil fertility. (a) Venn diagram showing the percentage of ASVs unique to each niche and shared between both; (b) relative abundance of the ten most abundant bacterial phyla; (c) relative abundance of the 12 most abundant classes; (d) Differential abundance analysis based on the taxon importance estimator (phyla and classes) in the decision tree branched by the Random-Forest algorithm (Mean Decrease Gini). Created in the R environment (v.4.3.1).

The differential abundance analysis through Random-Forest confirmed that HF soil was significantly enriched in Bacteroidetes, Rokubacteria, Chloroflexi, Proteobacteria, and Nitrospirae (Fig. 4d), while Acidobacteria and Candidatus Eremiobacterota were more represented in LF soil. Most of the significantly enriched classes were also concentrated in HF soil, notably Gammaproteobacteria, Deltaproteobacteria, Acidimicrobiia, and Bacteroidia, which are among the 12 most abundant classes. Another thirteen classes, each with a relative abundance of less than 1%, were also enriched in HF soil. Acidobacteriia was the class highlighted in LF soil, where two less common groups also emerged, the Ktedonobacteria class and uncultured Chloroflexi (AD3) sequences.

Species co-occurrence in ecological interaction networks

Although HF soil stood out significantly in most richness and diversity parameters, the co-occurrence study showed that the network of significant interactions (SparCC > 0.6, p < 0.01) was more complex in LF soil (C = 22.6) than in HF soil (C = 4.6), where complexity is the ratio between the number of edges and the number of nodes (ASVs) in the network (C = edges/nodes) (Fig. 5a). The range of interaction degrees (number of edges at each node) mirrored this result: the maximum number of connections established by an ASV was 37 edges in HF soil, while in LF soil it reached 154. The average degree of the network in LF soil (46.54) was also higher than in HF soil (9.25), indicating a predominance of a few modules characterized by nodes with a higher number of connections (M1, M2, and M3). HF soil, in turn, presented numerous other ecological sub-systems operating in the network in addition to these main modules. The centralization and density of the network in LF soil were also higher (Table S5). Despite its lower complexity, the HF network presented the higher share of positive connections (66% versus 58% in LF soil), suggesting a system in which most bacteria co-occur in an integrative manner. In addition, the network in HF soil presented a higher diameter, average path length, heterogeneity, and, more slightly, clustering coefficient (Table S5). Overall, the clustering coefficient values (> 0.4) for HF and LF soils suggest a high propensity for module formation in the network (Table S5).
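As a hedged illustration of the network statistics quoted above, the following minimal sketch computes the complexity ratio (C = edges/nodes), the average degree, and the maximum degree from an undirected edge list. The function name `network_summary` and the toy ASV labels are hypothetical and are not taken from the study. For an undirected network the average degree equals 2 × edges/nodes, which is in line with the reported average degrees being roughly twice the C values.

```python
from collections import Counter


def network_summary(edges):
    """Summarize an undirected network given as (node_a, node_b) pairs.

    Only nodes that take part in at least one edge are counted,
    mirroring a graph in which disconnected nodes are hidden.
    """
    edges = list(edges)
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n_nodes = len(degree)
    n_edges = len(edges)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "complexity": n_edges / n_nodes,          # C = edges / nodes
        "average_degree": 2 * n_edges / n_nodes,  # each edge touches two nodes
        "max_degree": max(degree.values()),
    }


# Toy data (hypothetical), not values from the study.
toy_edges = [("asv1", "asv2"), ("asv1", "asv3"), ("asv1", "asv4"), ("asv2", "asv3")]
summary = network_summary(toy_edges)
print(summary)
```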

Figure 5

Microbial co-occurrence analyses in pastures on fertile (HF) and poor (LF) soil highlighting highly correlated groups (ASVs) through the SPIEC-EASI association measure (SparCC > 0.6, p < 0.01). (a) Networks, where modules are differentiated by colors and the degrees (number of connections) are directly proportional to the node diameter; (b) Abundance of connections at the bacterial phylum level. Values around circles represent the number of edges connected between phyla or within the same phylum. (c) Eigengene analysis of modules that showed significant association with at least one environmental variable. CEC cation exchange capacity, Ngrass leaf-N, TOC total organic carbon, V% base saturation. Created in the R environment (v.4.3.1).

Regarding the taxonomic composition of the networks (Fig. 5b), most of the connections in both systems were established by ASVs assigned to the most abundant phyla, namely Actinobacteria (HF ~ 350; LF ~ 1800) and Proteobacteria (HF ~ 80; LF ~ 850). Beyond these taxa, the compositions of the networks differed more markedly. In this sense, a higher relative number of Chloroflexi connections was observed in HF soil (~ 60) and of Acidobacteria connections in LF soil (~ 500). The Gemmatimonadetes phylum, the tenth most abundant phylum overall (Fig. 4b), only appeared in the HF network (~ 20 interactions).

The eigengene analysis of the network modules, i.e., the correlation between the first principal component (PCA) of each module and the environmental data, showed significant effects of the chemical attributes of the soils on the main modules of each network (Fig. 5c). The HF system presented a greater number of significantly affected components, notably the negative effect of V% and of Ca2+ and Na+ concentrations on the main and most abundant modules (M1 and M2). On the other hand, soil K availability seems to favor the ASVs that make up these modules in HF soil. In LF soil, module M1 correlated negatively with cation exchange capacity (CEC) and with Mg2+ and total organic carbon (TOC) concentrations, while M2 correlated positively with Na+. TOC also inhibited module M3, while available P content stimulated it. In both M1 and M2 modules, Actinobacteria dominated the taxonomic composition (avg. 54.6%) (Table S6). Proteobacteria was the next most common (avg. 18.2%), except in LF soil M2 (12.7%), where Firmicutes (24%) and Acidobacteria (16.5%) took precedence. In summary, these phyla are keystones in biological interactions in nutrient-poor pasture soils, along with others that participated less in the network (Chloroflexi, Planctomycetes, and Verrucomicrobia). Together, these three auxiliary phyla represented an average of 7% of the ASVs in modules M1 and M2 (Table S6).

Functional prediction

The results of the functional prediction showed more associations between predicted processes and the edaphic parameters most favorable to soil fertility (Fig. 6a). V% stood out, being positively associated with pathways related to the nitrogen cycle, notably aerobic ammonia oxidation followed by aerobic nitrite oxidation, nitrate reduction, and nitrate denitrification, as well as iron respiration, anoxic photoautotrophy, and oxidizing photoheterotrophy. Similar associations were observed for pH (general), Ca2+, Mg2+, and CEC. In general, the greatest number of ASVs was associated with aerobic chemoheterotrophy (57.3%), followed by dark hydrogen oxidation (12.3%), cellulolysis (5.6%), and aerobic ammonia oxidation (3.5%) (Fig. 6b). Among all these processes, linear discriminant analysis (LDA) made it possible to distinguish the enrichment of seven ASVs between clusters (Fig. 6c). In HF soil, a significant number of ASVs were attributed to the processes of aerobic ammonia oxidation, nitrate reduction, and degradation of aromatic compounds. In LF soil, associations with cellulolysis, ureolysis, methanotrophy, and methanol oxidation stood out. In conclusion, the results showed that HF soil had a greater enrichment of ASVs associated with FAProTax functions than LF soil, suggesting that soil fertilization favored the richness of bacterial genes related to soil element cycling (Fig. 6d).

Figure 6

Metagenomic prediction analysis based on the abundance of 16S rRNA genes associated with functional profiles from the FAProTax database. (a) correlations between environmental variables and predicted functional profiles; (b) relative frequency of the most abundant functional processes among clusters; (c) differential abundance of processes significantly distinct between HF and LF according to Mean Decrease Gini (MDG); (d) Richness of predicted functions depending on sample size. CEC cation exchange capacity, Ngrass leaf-N, TOC total organic carbon, V% base saturation. Created in the R environment (v.4.3.1).

In addition to these results, the molecular functions predicted in this study, based on the annotation of the 16S rRNA gene, showed a high degree of association with the annotation based on Shotgun metagenomic sequencing using the eggNOG database (ρ = 0.93, p < 0.001). In this case, we can conclude that, under the strict conditions of this study, it was possible to infer functional profiles with good accuracy using data from the V3-V4 region of the 16S (Fig. S4).

Discussion

This study provides information, of value to researchers studying microbial communities in pastures, on how soil fertility affects the structure of these communities in a sub-humid tropical zone of Northeast Brazil. The use of both supervised (Random Forest) and unsupervised (K-Means) machine learning methods was relevant to identify patterns and to differentiate keystone bacterial species in response to distinct soil fertility levels. The findings suggest that fertile soils exhibited higher diversity (Fig. 2a) and a predominance of important bacterial phyla, mainly Proteobacteria, Nitrospirae, Chloroflexi, and Bacteroidetes, while poor soils favored Acidobacteria (Fig. 4d). Moreover, fertile soils showed fewer significant interactions (Fig. 5a), but a greater number of independent interactive modules, where correlations between microorganisms were predominantly positive (Table S5). Previous studies have reported that increased soil fertility positively affected the diversity3,7,8, functions9,

Chemical analyses of soil and plant material

The chemical analyses to determine soil fertility were carried out according to the protocols in the EMBRAPA soil analysis manual40, testing soil pH in water (1:2.5 v:v) and in CaCl2 (1:2.5 v:v), as well as the main soluble macronutrients (P, K+, Ca2+, and Mg2+), calculations of cation exchange capacity (CEC), base saturation (V%), and levels of Na+, Al3+, and H + Al (potential acidity).

Total organic carbon (TOC) was measured using the method described by Yeomans and Bremner41, which is based on the reduction of dichromate (Cr2O72−) by organic carbon compounds. In a digestion tube, 0.1 g of each soil sample was weighed. Then 5 mL of 0.167 mol L−1 K2Cr2O7 and 10 mL of concentrated H2SO4 were added. The tubes were placed in a digester block and kept at 170 °C for 30 min. After digestion, the samples, at room temperature, were transferred to Erlenmeyer flasks and 5 mL of H3PO4 was added to allow clear visualization of the titration turning point. Three drops of 1% diphenylamine indicator were then added and the remaining Cr2O72− was determined by titration with 0.4 mol L−1 ammonium ferrous sulfate [(NH4)2Fe(SO4)2·6H2O]. The TOC contents were calculated according to the recommendations and mathematical equation described by Cantarella et al.42.

Leaf nitrogen was measured using an adapted sulfuric digestion method43. The digestion solution was prepared by adding the following to a 1000 mL beaker, in sequence: 175 mL H2O, 3.6 g Na2SeO3, 21.39 g Na2SO4, 4.0 g CuSO4·5H2O, and finally 200 mL of H2SO4. Plant samples (100 mg) were ground, sieved (2 mm), and digested with 7 mL of this solution. The digester block temperature was raised by 50 °C every 30 min until reaching 350 °C and held until the solution became colorless or slightly greenish. Digestion tubes were attached to a nitrogen distiller and slowly filled with 18 mol L−1 NaOH until the solution turned greenish brown. A conical flask with 10 mL of boric acid indicator solution [20 g H3BO3; 1000 mL H2O; 15 mL of a 0.1% alcoholic solution of C21H14Br4O5S; and 6 mL of a 0.1% alcoholic solution of C15H15N3O2] was placed at the distiller outlet, and distillation continued until the volume doubled and the solution turned slightly greenish. After distillation, the solution was titrated with 0.02 mol L−1 H2SO4 until the indicator turned from green to blue. The volume (V) used was recorded in mL and the nitrogen percentage (%N) was calculated43 as %N = 0.28 V.

Genomic DNA extraction from soil and preparation of 16S rRNA libraries

Genomic DNA was extracted from a small sample of soil (0.4 g) using the DNeasy® PowerSoil® Kit (QIAGEN Inc., Valencia, CA, USA). Following the manufacturer’s instructions, the concentration and quality of the purified DNA were evaluated using a NanoDrop® 2000 spectrophotometer from Thermo Fisher Scientific Inc. (Waltham, MA, USA). Next, the three highest-quality repeats from four quadrants were selected to prepare our amplicon libraries, resulting in a total of 36 samples.

Sequencing libraries were constructed by amplifying the V3-V4 variable region of the 16S rRNA gene using Bakt_341F (5′-CCT ACG GGN GGC WGC AG-3′) and Bakt_805R (5′-GAC TAC HVG GGT ATC TAA TCC-3′) primers44. In conjunction with the primers, the 16S rRNA gene amplicon sequencing library was generated using Herculase II Fusion DNA Polymerase (© Agilent Technologies, Inc., Santa Clara, CA, USA) and the Nextera XT v2 Index Kit (© Illumina, Inc., San Diego, CA, USA), following the manufacturer’s guidelines at Macrogen in Seoul, South Korea. Sequencing was performed on an Illumina® MiSeq® using a v3 flow cell. A library concentration of 3 pM was loaded, with a 30% spike-in of the Illumina® PhiX control DNA library, following the manufacturer’s guidelines. The binary base call (BCL) files, which are the raw data files generated by Illumina sequencers, were converted into sequence data in FASTQ format using the bcl2fastq v2.20 software (© Illumina). The sequences were then demultiplexed and the barcodes were removed.

Processing of raw genetic data

A total of 2,674,738 raw sequence pairs (forward and reverse) obtained through Illumina MiSeq sequencing were analyzed using the ‘DADA2’ pipeline45 (version 1.16) in R46 (v. 4.2.3) in conjunction with RStudio47 (2023.03.0 Build 386). The FIGARO tool48 was used to optimize the truncation length parameters passed to the “filterAndTrim” R function (290 bases for forward reads and 260 bases for reverse reads). According to this tool, forward and reverse reads with more than 2 and 5 expected errors (maxEE), respectively, were discarded. Next, reads were truncated at the first instance of a quality score (truncQ) less than or equal to two. Error rates of the sequences were calculated using the “learnErrors” function, a machine learning-based algorithm. Amplicon sequence variants (ASVs) were inferred using the “dada” function, and paired reads were merged by passing the outputs of the previous functions to “mergePairs”. Chimeric sequences were identified using the “removeBimeraDenovo” function, and taxonomic assignments were given to the remaining sequences based on the Silva SSU 138 (modified) database49, using the “IdTaxa” algorithm from the ‘DECIPHER’ v 2.20 R library50, which is considered to have better classification performance than the standard set by the naive Bayesian classifier51. The data processing resulted in 854,980 high-quality sequences, allowing for the identification of 13,470 amplicon sequence variants (ASVs) when combining the information from the 36 composite soil samples. The paired and chimera-free sequences, along with their respective BioSample assignments, were deposited in the NCBI repository under project code PRJNA753707 (https://www.ncbi.nlm.nih.gov/).
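The quality-filtering criteria described above can be illustrated with a minimal sketch, assuming the usual definition of expected errors as the sum of per-base error probabilities, EE = Σ 10^(−Q/10), where Q is the Phred quality score. The helper names below are hypothetical and are not part of DADA2; the thresholds (maxEE of 2 and 5, truncQ of 2) are those reported in the text, and the quality scores are toy values.

```python
def truncate_at_low_quality(phred_scores, trunc_q=2):
    """Cut a read at the first base whose quality is <= trunc_q."""
    for position, quality in enumerate(phred_scores):
        if quality <= trunc_q:
            return phred_scores[:position]
    return phred_scores


def expected_errors(phred_scores):
    """Sum of per-base error probabilities, EE = sum(10 ** (-Q / 10))."""
    return sum(10 ** (-quality / 10) for quality in phred_scores)


def passes_max_ee(phred_scores, max_ee):
    """A read is kept when its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee


# Toy read (hypothetical): 30 bases at Q10 carry about 3 expected errors.
toy_read = [10] * 30
kept_as_forward = passes_max_ee(toy_read, max_ee=2)  # forward threshold
kept_as_reverse = passes_max_ee(toy_read, max_ee=5)  # reverse threshold
trimmed = truncate_at_low_quality([35, 34, 30, 2, 28])
print(expected_errors(toy_read), kept_as_forward, kept_as_reverse, trimmed)
```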

Statistical analysis

All data analyses were computed using resources developed for the R language46 (v. 4.2.3). The soils of the six municipalities were classified with K-Means to identify groups with contrasting fertility levels, using the Hartigan-Wong algorithm (R ‘stats’) adjusted to return a cluster center for each input point. Principal component analysis was performed (R ‘vegan’) to assist in the choice of clusters, allowing the identification of the contribution of soil chemical attributes and foliar nitrogen on the main dimensions of the multivariate model. The most characteristic variables of each dimension were pointed out by factor analysis (R ‘FactoMineR’), according to the method published by Husson et al.52.

After defining the soil property clusters by K-means, soils from the intermediate cluster were not considered, as they might introduce undesirable noise for the purpose of this study, which was to contrast the impact of two extremes on bacterial communities. After the construction of the HF soil (high fertility, 15 samples) and LF soil (low fertility, 10 samples) clusters, the chemical attributes were compared by the t-test and Wilcoxon signed-rank test (R ‘agricolae’), comparing the means and contrasts between the dispersions of the pairs, respectively, both at a 5% significance level. All probabilities were adjusted using the Benjamini and Hochberg method, which controls the False Discovery Rate (FDR). Variables expressed as percentages (y%) were transformed using the function sin−1[√(y%/100)] × 180/π. These transformations are recommended for controlling error rates in biological data, resulting in acceptable residual analysis versus fit plots and producing p-values like the original data53.
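For illustration only, the two numerical steps mentioned above can be sketched as follows: the angular (arcsine square-root) transformation of a percentage, expressed in degrees, and the Benjamini and Hochberg step-up adjustment of a set of p-values. The function names and the toy p-values are hypothetical; the study itself used R, so this sketch is not the authors' code.

```python
import math


def arcsine_sqrt_degrees(percent):
    """Angular transformation: asin(sqrt(y% / 100)) expressed in degrees."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the original order (step-up method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy inputs (hypothetical), not values from the study.
angle_for_half = arcsine_sqrt_degrees(50)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(angle_for_half, adjusted)
```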

The study focused on abundant bacterial communities, retaining ASVs with a relative abundance greater than 0.01%. Canonical correlation analysis (CCA) was used to identify and measure associations between the genetic data and the environmental variables (Fig. 3), testing the significance of chemical attributes through the Mantel test with Pearson correlations (R ‘vegan’). The same R library was used to estimate alpha-diversity metrics. Among these, the Shannon and Simpson indices were converted into effective or equivalent species numbers, also known as Hill numbers, which express the number of equally abundant species necessary to produce the observed diversity value. To calculate differential abundance between taxonomic ranks, the Random Forest algorithm with the Kruskal–Wallis rank sum test was used, based on White et al.54. In this case, the differences in the mean importance of each taxon in the decision tree (MDG, Mean Decrease Gini) were calculated; MDG is a forest-wide weighted average of the decrease in the Gini impurity metric between the parent and daughter nodes that a taxon splits.
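The conversion of diversity indices into effective numbers can be illustrated with a short sketch. It assumes the standard conversions, exp(H) for the Shannon index and 1/(1 − D) for the Gini-Simpson form of the Simpson index (D = 1 − Σp²); the text does not state which exact formulas were applied, so these are assumptions, and the function names and toy counts are hypothetical.

```python
import math


def proportions(counts):
    """Relative abundances of the non-zero counts."""
    total = sum(counts)
    return [count / total for count in counts if count > 0]


def shannon_index(counts):
    """Shannon entropy H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in proportions(counts))


def gini_simpson_index(counts):
    """Gini-Simpson index D = 1 - sum(p^2)."""
    return 1 - sum(p * p for p in proportions(counts))


def effective_number_shannon(counts):
    """Hill number of order 1: exp(H)."""
    return math.exp(shannon_index(counts))


def effective_number_simpson(counts):
    """Hill number of order 2: 1 / (1 - D) = 1 / sum(p^2)."""
    return 1 / (1 - gini_simpson_index(counts))


# Toy community (hypothetical): four equally abundant ASVs.
even_community = [25, 25, 25, 25]
print(effective_number_shannon(even_community), effective_number_simpson(even_community))
```

For a perfectly even community both effective numbers equal the number of ASVs, which is the sense in which they count "equally abundant species".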

Microbial co-occurrence patterns were analyzed using the SparCC association measure through the SPIEC-EASI approach (SParse Inverse Covariance Estimation for Ecological Association Inference) of the R SpiecEasi package55. The data were normalized based on the method of the R NetCoMi package, a technique suitable for identifying groups of highly correlated species56. To do so, the ten closest samples (Bray–Curtis dissimilarity) within each of the two clusters were selected, and ASVs with a relative frequency lower than 0.1% were discarded. The network graphs were constructed using Gephi software57 (v. 0.10.1), where disconnected nodes and edges with weights lower than 0.6 or p-value greater than 0.01 were hidden. In these approaches, the nodes (ASVs) were classified into modules to analyze the connectivity of sub-communities that made up the network. Module eigengene analyses were also performed, which is the association of the first principal component of each detected module with environmental factors. In these analyses, all probabilities were also adjusted by the Benjamini and Hochberg method.
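The edge-filtering rule stated above (hide edges with weight below 0.6 or p-value above 0.01, then hide disconnected nodes) can be sketched as follows. Because the networks contain both positive and negative connections, the sketch assumes that the weight threshold applies to the absolute value of the correlation; this is an assumption, not something stated in the text. The function names and the toy edges are hypothetical.

```python
def filter_edges(edges, min_weight=0.6, max_p=0.01):
    """Keep edges given as (node_a, node_b, weight, p_value) tuples.

    Assumption: the weight threshold is applied to |weight| so that
    strong negative associations are retained as well.
    """
    return [e for e in edges if abs(e[2]) >= min_weight and e[3] <= max_p]


def connected_nodes(edges):
    """Nodes that still take part in at least one retained edge."""
    return {node for a, b, _, _ in edges for node in (a, b)}


def positive_share(edges):
    """Fraction of retained edges with a positive weight."""
    return sum(1 for e in edges if e[2] > 0) / len(edges)


# Toy edge table (hypothetical), not data from the study.
toy_edges = [
    ("a", "b", 0.80, 0.001),
    ("a", "c", -0.70, 0.005),
    ("b", "d", 0.40, 0.001),  # weight too low, hidden
    ("c", "d", 0.90, 0.050),  # p-value too high, hidden
    ("e", "f", 0.65, 0.010),
]
kept = filter_edges(toy_edges)
nodes = connected_nodes(kept)
print(len(kept), sorted(nodes), positive_share(kept))
```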

The functional prediction analysis was based on the association of 16S rRNA sequences with the collection of prokaryotic functional profiles deposited in the FAProTax database58. The predicted processes were subjected to correlation and differential abundance tests by the effect size method of linear discriminant analysis (LEfSe) and the MDG measure through the R ‘microeco’ library, based on Segata et al.59.

To validate the results of the 16S-based metagenomic prediction, three random genomic DNA samples were submitted to Shotgun metagenomic sequencing using the Illumina NovaSeq PE (150 bp), following the manufacturer’s guidelines at Novogene Inc. in Sacramento, CA, USA. The resulting FASTQ reads were filtered by quality score using Trimmomatic (v. 0.39). The high-quality reads were then assembled into contigs using MEGAHIT (v. 1.2.9), a single-node assembler for NGS reads. Following assembly, protein-coding genes were predicted using Prodigal (v. 2.60), and the predicted genes were submitted for functional annotation using eggNOG-mapper (v. 2.1.12, http://eggnog-mapper.embl.de/). Subsequently, KEGG Orthology (KO; https://www.genome.jp) pathways were examined, transformed into relative frequencies, and correlated with the corresponding functional processes predicted from the 16S gene via amplicon sequencing (Shotgun vs FAProTax). Shotgun sequencing data were used in this work only for this validation and not for in-depth functional studies, which were beyond the primary objective of the study.
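The final comparison step, in which counts are converted to relative frequencies and the two functional profiles are correlated, can be sketched as below. The text reports the correlation as ρ but does not name the coefficient; the sketch assumes a Spearman rank correlation, computed as the Pearson correlation of average ranks. All function names and counts are hypothetical.

```python
import math


def relative_frequencies(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy profiles (hypothetical) for four functional processes.
predicted_16s = relative_frequencies([120, 30, 15, 5])
shotgun = relative_frequencies([900, 150, 400, 50])
rho = spearman(predicted_16s, shotgun)
print(rho)
```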

All heatmaps used in the composition of Figs. 2, 3, 5, and 6 were constructed with the ‘heatmaply’ library (v. 1.5.0). In this case, the correlations and hypothesis tests between variables of two data matrices (rows and columns of the heatmaps) were computed using the “cor.test” function from the ‘stats’ library (v. 4.3.1). All other graphs were constructed using resources provided by the ‘ggplot2’ package60. The composition of the images within the figures was accomplished using the “grid.arrange” and “arrangeGrob” functions from the ‘gridExtra’ library (v. 2.3).