Background

Tuberculosis is a serious global health problem caused by Mycobacterium tuberculosis (M. tuberculosis), a pathogen that lives and thrives inside human cells [1]. It is a highly contagious and often fatal disease that affects millions of people worldwide, placing a significant burden on public health systems and societies. Despite this burden, the factors that contribute to tuberculosis transmission are still poorly understood. Developing a better understanding of M. tuberculosis transmission is therefore critical for guiding effective tuberculosis control strategies and reducing the disease's burden on society.

Bacterial two-component systems (TCSs) are among the most important sensing mechanisms, responding to a diverse range of ligands, including ions, gases, and metabolites. In pathogenic bacteria, TCSs play a crucial role in promoting pathogenesis by regulating bacterial gene expression in response to hostile host environments or metabolic stresses [2, 37] (Additional file 2: Tables S1-S2). The maximum likelihood phylogenetic tree was constructed with IQ-TREE (version 1.6.12), using the JC nucleotide substitution model with a gamma model of rate heterogeneity and 100 bootstrap replicates [38]. Mycobacterium canettii CIPT140010059 was treated as an outlier. The resulting phylogenetic tree was visualized with iTOL (https://itol.embl.de/) (Fig. 3, Additional file 1: Figs. S1-S7).

Fig. 3

Phylogenetic tree analysis of lineage 2.2. (A) Phylogenetic tree of lineage 2.2.1. (B) Phylogenetic tree of lineage 2.2.2

Propagation analysis

Cluster analysis was used to investigate the influence of two-component system gene mutations on the transmission of M. tuberculosis [39]. Following a previous study [40], we defined transmission clusters using a threshold of fewer than 25 SNPs. We chose this threshold because our isolates were widely dispersed in location and time (1991–2019) and because our collection was probably missing several intermediary isolates and cases (Additional file 2: Tables S1-S2). Following a published classification of transmission clusters [14], we also divided clusters by size into large (above the 75th percentile), medium (between the 25th and 75th percentiles), and small (below the 25th percentile). To better understand the global distribution patterns and transmission dynamics of M. tuberculosis strains, we classified the strains into cross-country and within-country clusters. We further categorized them into cross-regional and within-regional clusters based on geographic location, using the United Nations standard regions (UN M.49).
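The clustering and size classification described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline. It assumes that pairwise SNP distances are already available as a dictionary, that isolates are linked transitively (single linkage) whenever a pair differs by fewer than 25 SNPs, that a cluster needs at least two isolates, and that percentiles are taken over the observed cluster sizes. The function names `snp_clusters` and `size_category` are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def snp_clusters(pairwise_snps, threshold=25):
    """Group isolates into clusters by single linkage on SNP distance.

    pairwise_snps maps (isolate_a, isolate_b) -> SNP distance.
    Two isolates are linked when their distance is strictly below threshold.
    Assumption: only groups with at least two isolates count as clusters.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in pairwise_snps.items():
        root_a, root_b = find(a), find(b)
        if dist < threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for isolate in list(parent):
        groups[find(isolate)].add(isolate)
    return [members for members in groups.values() if len(members) >= 2]


def size_category(size, all_sizes):
    """Label a cluster as large, medium, or small by size percentiles."""
    q1, _, q3 = quantiles(all_sizes, n=4, method="inclusive")
    if size > q3:
        return "large"
    if size < q1:
        return "small"
    return "medium"


# Toy distances (invented for illustration only).
toy = {("A", "B"): 10, ("B", "C"): 24, ("A", "C"): 40,
       ("C", "D"): 25, ("D", "E"): 3, ("F", "A"): 100}
clusters = snp_clusters(toy)
```

In this toy input, A and C end up in the same cluster through B even though they differ by 40 SNPs, which is the intended effect of single linkage when intermediary isolates are sampled. Whether the original study used single linkage or another linkage rule is not stated in this section.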

Acquisition of two-component system genes

A total of 45 two-component system genes were identified from NCBI and a literature search [2, 7, 41]. Python was used to detect mutations in these TCS genes (Additional file 2: Table S3).
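The detection script itself is not described here. One plausible approach is sketched below, assuming that variants have already been called against a reference genome and that each gene is given as an inclusive coordinate range. The gene names, coordinates, and variants are toy placeholders, and `mutations_by_gene` is a hypothetical helper rather than the authors' code.

```python
def mutations_by_gene(variants, gene_coords):
    """Assign called variants to genes by genomic position.

    variants: iterable of (position, ref_allele, alt_allele).
    gene_coords: dict mapping gene name -> (start, end), both inclusive.
    Returns a dict mapping gene name -> list of labels such as "G150A".
    """
    hits = {gene: [] for gene in gene_coords}
    for pos, ref, alt in variants:
        for gene, (start, end) in gene_coords.items():
            if start <= pos <= end:
                hits[gene].append(f"{ref}{pos}{alt}")
    return hits


# Toy placeholders, not real M. tuberculosis coordinates.
toy_genes = {"geneA": (100, 200), "geneB": (500, 650)}
toy_variants = [(150, "G", "A"), (420, "C", "T"), (600, "T", "C")]

per_gene = mutations_by_gene(toy_variants, toy_genes)

# One binary feature per gene: 1 if the isolate carries any mutation in it.
features = {gene: int(bool(muts)) for gene, muts in per_gene.items()}
```

A per-gene binary vector like `features` is one simple way to turn mutation calls into model inputs. The study may instead have encoded individual mutations as separate features.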

Modeling and statistical analysis

Prediction models, including a gradient boosting decision tree and a random forest, were built with the Scikit-learn Python package. We randomly divided all samples into training and test sets at a ratio of 7:3. Each model was evaluated using Kappa, sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and area under the curve (AUC) [42]. After each model was fitted, we evaluated the importance of its input variables. To improve the precision of risk factor prediction, we used the importance score to assess the influence of each input feature, and took the intersection of the top-ranked features from both models as the important features [43, 44]. We then fitted a generalized linear mixed model using the statsmodels Python package (statsmodels.api) to further analyze the important features and obtain the final influencing factors. All statistical analyses were performed using SPSS 26.0. All statistical tests were two-tailed, and P values less than 0.05 were considered statistically significant.
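For reference, most of the evaluation metrics listed above follow directly from a binary confusion matrix, and the feature intersection step reduces to a set operation. The sketch below uses plain Python and the standard definitions. The counts and importance scores are invented for illustration, the function names are hypothetical, and the cutoff `k` for "top-ranked" is an assumption because the section does not state one. AUC is omitted because it requires predicted scores, not just counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "plr": sensitivity / (1 - specificity),
        "nlr": (1 - sensitivity) / specificity,
        "kappa": (accuracy - expected) / (1 - expected),
    }


def shared_top_features(importance_a, importance_b, k):
    """Features ranked in the top k by importance in both models."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return sorted(top(importance_a) & top(importance_b))


# Invented example values.
metrics = binary_metrics(tp=40, fp=20, tn=30, fn=10)
rf_importance = {"f1": 0.50, "f2": 0.30, "f3": 0.20}
gbdt_importance = {"f1": 0.40, "f3": 0.35, "f2": 0.25}
shared = shared_top_features(rf_importance, gbdt_importance, k=2)
```

With Scikit-learn, the importance dictionaries would typically be built from each fitted model's `feature_importances_` attribute paired with the feature names.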