Introduction

Microbiology explores the structure, function, interrelationships, and mechanisms within communities of microorganisms and their interactions with their immediate environments or hosts. Microscopy has been the key technique for the identification of microbes, complemented by culture techniques to elucidate their physiology, genetic constructs, metabolism, and pathogenicity. However, these procedures are time-consuming and labor-intensive. The incorporation of advanced techniques such as high-throughput sequencing and next-generation sequencing in the field of microbiology has produced a wealth of genomic data. This accumulation of data from various domains of microbial genomics has enabled the development of new diagnostic and genotyping tools, the deciphering of microbial genetic diversity, and the identification of virulence and resistance mechanisms. Additionally, in silico methods assist in gathering genetic information that can be used to identify therapeutic targets, investigate host–pathogen interactions, and establish mechanisms of antibiotic resistance and virulence.

Therefore, the optimal analysis and interpretation of these large, intricate datasets is the next challenge in realizing these promising advances. This task is beyond manual human analysis, carries a high risk of error, and calls for advanced computational techniques that can detect meaningful patterns in large volumes of data. Artificial intelligence can fill these gaps with techniques such as machine learning, which recognizes meaningful patterns in structured data through supervised and unsupervised learning methods.

Bioinformatics, an application of information technology, helps in the processing and analysis of the data generated in biological research and experiments by applying computer-based algorithms. It supports DNA barcoding, the modelling of disease outbreak patterns, and the design of new biological products. In proteomics, it facilitates the study of protein structures and the identification of protein–protein interaction sites (Rao et al. 2014). In metabolomics, bioinformatics makes it possible to study the dynamics of cells and cellular interactions (Kushwaha et al. 2017). Bioinformatics has not only helped in genome sequencing and gene allocation but has also helped to draw phylogenetic relationships and detect the transcription factor-binding sites of genes. Microarray data analysis is likewise made possible by bioinformatics tools. Biological data are growing exponentially due to the availability of low-cost sequencing technologies. The enormous amount of data generated has led to the development of databases of nucleic acid sequences, protein sequences, and their structures. For example, Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for genome sequences, and the Protein Data Bank for protein structures are established primary databases. Various software and tools that could be helpful in microbiological studies are summarized in Table 1.

Table 1 Useful software and tools for microbial studies

In silico approaches for microbial genomics

Metagenomics applies advanced genomic techniques to study microbial communities directly in their natural environments, without laboratory cultivation or isolation of individual species (DeLong 2002; Riesenfeld et al. 2004a, b; Handelsman 2004; Rodriguez-Valera 2004; Streit and Schmitz 2004; Edwards and Rohwer 2005). The culture-independent retrieval of 16S rRNA genes that underlies this approach was established by Pace and colleagues (Olsen et al. 1986). Hugenholtz reported in 2002 that, up to that time, 99% of microbial species had not been cultivated because of the limitations of culture methods; metagenomic approaches revolutionized microbiology by eliminating the need for clonal isolates (Hugenholtz 2002; Rappe and Giovannoni 2003; Singh and Porwal 2021).

Metagenomic assembly facilitates gene prediction and annotation and is therefore considered a significant step when studying the functional constitution and size of microbiomes (Van der Walt et al. 2017).

Various techniques have emerged to facilitate microbial identification studies. DNA pyrosequencing, also known as sequencing by synthesis, was developed in the mid-1990s (Ronaghi et al. 1996). The major limitation of this method is its inability to read long stretches of DNA sequence (reads hardly exceed 100–200 base pairs with first- and second-generation pyrosequencing chemistries) (Joseph et al. 2009).

With advances in sequencing technology, next-generation sequencing (NGS) has emerged as a rapid and reliable method for the identification of bacterial pathogens. NGS has evolved into a molecular microscope, expanding its applications into every field of microbial research (Buermans and den Dunnen 2014). The application of NGS in the microbial world includes both wet lab and bioinformatics tools/computational methods (Fig. 1) (Ghannam and Techtmann 2021). The first step is the molecular profiling of the microbial community, which comprises sample collection (from the patient or environment), nucleic acid extraction, and library preparation. Wet lab methods can introduce several biases (Hazen et al. 2013). After sequencing, the primary analysis is performed using bioinformatics tools. Several studies have addressed the processing of sequencing reads, including methods for binning marker genes into operational taxonomic units (OTUs) that are representative of biologically meaningful categories (Edgar 2010). Liu et al. (2021) described the stepwise analysis methods used for high-throughput analysis of the microbiome. The collected samples are first diluted and then distributed in a 96-well microtiter plate. The wells are then subjected to amplicon sequencing, and candidates are selected. The candidates are further subjected to full-length 16S rDNA Sanger sequencing (Fig. 2).

Fig. 1

Schematic representation of the next-generation sequencing approach used for the investigation of microbial communities through a pipeline that comprises collection of samples, nucleic acid extraction from hosts or the environment and preparation of libraries for sequencing (figure recreated in Biorender.com)

Fig. 2

High-throughput sequencing (HTS) methods for different levels of microbiome analysis and their advantages. At the molecular level, microbiome studies are divided into three types: microbe, DNA, and mRNA. The corresponding research techniques include culturome, amplicon, metagenome, metavirome, and metatranscriptome analyses. The corresponding advantages of the various HTS methods are also shown

Peker et al. compared three methods for NGS data analysis in terms of speed and diagnostic accuracy: de novo assembly followed by the Basic Local Alignment Search Tool (BLAST), operational taxonomic unit (OTU) clustering, and mapping to an in-house developed database (16S–23S rRNA encoding region). They used patient samples directly to perform NGS of the 16S and 23S rRNA encoding regions for reliable identification of pathogens. NGS data analysis is tedious and laborious, and a database covering the complete 16S–23S rRNA encoding region is not available. The study recommended de novo assembly followed by BLAST as the best approach: it showed the shortest turnaround time (2 h 5 min), which is 2 h less than OTU clustering and 4.5 h less than mapping, with a sensitivity of 80% (Peker et al. 2019). Additionally, in silico research has made possible the study of protein–DNA interactions, protein–protein interactions, docking between proteins and phyto/biochemicals for drug design, and modelling of the three-dimensional structure of proteins (Qiu et al. 2020; Bryant et al. 2022; Baig et al. 2016; Ali et al. 2021; Fatoki et al. 2021).

Machine learning for metagenomic data analysis

With the evolution of technology and machine learning (ML) models, metagenomics has become a popular field of bioinformatics. More capable models can now be created to address the problems of DNA sequencing and genome classification. As technology has become more sophisticated, new and more precise DNA sequencing techniques have been developed, aided by the enhanced computational power of modern computers. As a result, much larger quantities of data can now be processed and used to train more complex machine learning models, which was previously not feasible owing to several limitations. The advantage of ML is that it can exploit the full depth of the data generated during microbiome studies and build predictive models of outcomes from microbial community data (Ghannam and Techtmann 2021). ML approaches take several forms, including unsupervised, semisupervised, reinforcement, and supervised learning (Kumar et al. 2018; Saxena et al. 2019; Sathya and Abraham 2013; Zitnik et al. 2019) (Fig. 2). Models that learn from a labeled training set fall under supervised learning (Stoter et al. 2019). Statistical classification and regression analysis are common supervised learning tasks (Kumar et al. 2011). Clustering, a form of unsupervised learning, can be implemented with k-means, which determines a centroid for each cluster and reduces the error by iterative reassignment until the grouping stabilizes (Omer et al. 2014).
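The k-means procedure mentioned above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the general algorithm (Lloyd's iteration), not the implementation used in any of the cited studies; the function name `kmeans`, the `seed` parameter, and the two-dimensional toy points (which could stand for the relative abundances of two taxa) are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical choice: random initial centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: each point is assigned to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (illustrative only): two well-separated groups of samples.
samples = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(samples, k=2)
```

In practice the result depends on the initial centroids, so the procedure is usually repeated from several random starts and the solution with the lowest within-cluster error is kept.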

The progression of ML has led to the use of this technique in various fields of research (Chen et al. 2016; Li et al. 2016; Zou et al. 2016; Ding et al. 2017; Feng et al. 2017; Yu et al. 2017; Zeng et al. 2017; Pan et al. 2018; Liu et al. 2018; He et al. 2019; Kumar et al. 2021; Zhang et al. 2019). Exemplary applications include drug repurposing (Yu et al. 2016, 2017), discovery of new antibiotics (Steele et al. 2009), identification of novel biocatalysts, personalized medicine (Virgin and Todd 2011; Pires et al. 2020a, b; Villasana et al. 2020), identification of disease-related microRNAs (Chen and Huang 2017; Zhao et al. 2018), identification of disease-related noncoding RNAs (Chen and Yan 2013; Hu et al. 2017, 2018), and bioremediation of agricultural, industrial, and domestic wastes (Mani and Kumar 2014; Pires et al. 2020a, b). Oudah and Henschel defined the four key stages of ML algorithm development (Oudah and Henschel 2018). The first and critical stage is the extraction of features (Liu et al. 2015), followed by the derivation of OTUs by clustering. Then, the significant features that enhance precision and efficiency are selected, and the final step is training, in which a training dataset is used to fit the algorithm. After that, a test set is used to evaluate the model.
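The stages described above can be illustrated with a small, self-contained sketch. It assumes that OTU clustering has already produced a count table (samples by OTUs) and that there are exactly two classes; the nearest-centroid classifier, the mean-difference feature score, and all function names are hypothetical simplifications chosen for brevity, not the procedure of Oudah and Henschel.

```python
import random

def relative_abundance(counts):
    """Feature extraction: convert raw OTU counts into proportions."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def select_features(rows, labels, top_n):
    """Feature selection: keep the OTUs whose class means differ most (two classes assumed)."""
    first, second = sorted(set(labels))

    def class_mean(j, cls):
        values = [r[j] for r, y in zip(rows, labels) if y == cls]
        return sum(values) / len(values)

    scores = [abs(class_mean(j, first) - class_mean(j, second))
              for j in range(len(rows[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_n]

def train_centroids(rows, labels):
    """Training: fit a nearest-centroid classifier (one mean vector per class)."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        model[cls] = [sum(col) / len(members) for col in zip(*members)]
    return model

def predict(model, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda cls: sum((a - b) ** 2 for a, b in zip(row, model[cls])))

def run_pipeline(count_table, labels, top_n=2, test_fraction=0.25, seed=0):
    rows = [relative_abundance(c) for c in count_table]
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    # Features are selected on the training samples only, to avoid leaking test data.
    keep = select_features(train_rows, train_labels, top_n)

    def project(row):
        return [row[j] for j in keep]

    model = train_centroids([project(r) for r in train_rows], train_labels)
    # Evaluation: accuracy on the held-out test samples.
    hits = sum(predict(model, project(rows[i])) == labels[i] for i in test_idx)
    return hits / n_test
```

Selecting features on the training portion only is a deliberate design choice: if the test samples influenced feature selection, the reported test accuracy would be optimistically biased.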

Machine learning for disease prediction and classification

The normal microflora residing in the gut play vital roles in human health. Disturbances in intestinal microorganisms may contribute to inflammatory diseases of the intestine (Chen et al. 2017a, b, c) and to conditions such as colorectal cancer, tumors, diabetes, ulcerative colitis, and obesity. Consequently, it becomes essential to understand the relationships between microbes and disease in order to support better clinical prognostic tests and the development of new drugs (Yu et al. 2015, 2016; Shi et al. 2016; Su et al. 2018; Fan et al. 2019; Arango-Argoty et al. 2018; Steiner et al. 2020).

For the analysis of microbiome–host interactions in the context of disease, Fan et al. (2019) proposed an approach that combines several data sources of the human microbiome–host disease consortium with HeteSim scores. They first constructed a heterogeneous network and then weighted microbe–disease pairs with the standardized HeteSim measure. This was followed by the integration of the HeteSim scores of the microbe–disease–disease path with those of the microbe–microbe–disease path and, finally, calculation of the corresponding scores of probable microbe–disease associations.

Amgarten et al. (2018) proposed a new tool, MARVEL, for the prediction of double-stranded DNA bacteriophage sequences in metagenomic data. MARVEL uses a random forest (RF) approach trained on a large dataset containing 1247 phage genomes and 1029 bacterial genomes and evaluated on a test dataset consisting of 335 bacterial and 177 phage genomes. Six features were proposed for the identification of phages, and RF was then used for feature selection. Finally, the three most informative features were retained (Grazziotin et al. 2017).

Over the last few years, many studies have explored the role of microbiome communities in the prediction of diseases. For example, Poore et al. used whole-genome and whole-transcriptome sequencing data of 33 types of cancer from The Cancer Genome Atlas (TCGA) to examine the potential of microbial signatures as cancer predictors using gradient boosting ML models (Poore et al. 2020). The ML models successfully discriminated between different cancer types and between cancer and normal tissues, suggesting that the microbiome is specific to each cancer type and cancer stage. The authors concluded that the proposed model could serve as a potential tool in microbiome-based cancer diagnosis. A similar study investigated the role of the vaginal microbial community, based on bacterial signatures, in the prediction of cervical intraepithelial neoplasia (CIN) using a random forest model (Lee et al. 2020). Sequencing data of the V3 region of 16S rRNA from vaginal swabs of 66 subjects were investigated for their taxonomic composition. A set of 33 bacterial species was identified as marker communities differentiating between the CIN1 and CIN2 groups, with an area under the curve (AUC) of 0.952. This finding supports the potential of the RF model in the prediction of CIN staging and of the vaginal microbiome (VM) as a biomarker.
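For readers unfamiliar with the AUC metric reported above, the following sketch computes it from classifier scores using its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the example labels and scores are illustrative assumptions; they are not data from the cited studies.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: one positive (0.4) is ranked below one negative (0.5).
example_auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
```

An AUC of 1.0 corresponds to perfect separation of the two groups and 0.5 to random ranking, which is why a value such as 0.952 indicates strong discrimination.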

Cai et al. (2019) investigated the underlying mechanism of pathogenesis in human diseases using genomics with the help of in silico applications. They used a novel ML-based approach and identified two genes, OTOF and SOCS1, that contribute to the pathogenesis mechanism of rhinovirus (Xu et al. 2019). The expression levels of these two genes could potentially determine whether an individual is infected. Besides showing the significance of these two genes in rhinovirus pathogenesis, this study also demonstrated the effectiveness of in silico applications in studying pathogenesis mechanisms. Wang et al. proposed a spectral rotation method based on the triplet periodicity property to solve planted motif finding problems (Wang et al. 2019). With the proposed method, genes carrying several substitutions can be detected against randomly generated background sequences. The results of an experiment based on the genomic dataset of Saccharomyces cerevisiae showed that genes could be visually distinguished. The authors suggested that genes having approximately 50% mutations could be easily identified in background sequences.

Several studies have explored viral genomics with the help of in silico approaches. Remita et al. developed a machine learning-based virus classification tool called CASTOR and used different datasets of hepatitis B virus (HBV), human papillomaviruses (HPV), and HIV-1 as testing datasets (Remita et al. 2017). The model imitates the restriction fragment length polymorphism (RFLP) technique in silico and simulates the fragmentation of genomic material by different restriction endonucleases. The authors reported 99% positive cases for HPV alpha species, 99% for HBV genotyping, and 98% for HIV-1 M subtyping. They concluded that this model is well suited to accurate large-scale virus studies owing to its generality and robustness (Lebatteux et al. 2019). Ren et al. proposed VirFinder, a novel k-mer-based tool for the identification of viral sequences in metagenomic data (Ren et al. 2017). This model identifies viral sequences based on the differences in k-mer signatures of viruses and hosts. The model was trained on sequences of host and viral genomes that were sequenced before January 1, 2014, and evaluated on sequences obtained after January 1, 2014. When compared to the existing gene-based virus classification tool VirSorter (Roux et al. 2015), the proposed model had better true positive rates (TPRs), and it also worked comparatively better for small viral contigs. The authors concluded that the proposed model is an effective tool to improve viral sequence identification, especially for viral metagenomic data.
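The two kinds of sequence features discussed above, restriction fragment patterns and k-mer signatures, can be illustrated with a short sketch. It is not the code of CASTOR or VirFinder: the function names, the choice of k, and the toy sequences are assumptions made for the example. The EcoRI recognition site (GAATTC, cut after the first G) is used only as a familiar enzyme.

```python
from collections import Counter
from itertools import product

def digest(seq, site, cut_offset):
    """In silico restriction digest: cut at every occurrence of `site`.

    cut_offset is the number of bases of the site left on the upstream fragment.
    Returns the fragment lengths, which serve as an RFLP-like feature vector.
    """
    seq = seq.upper()
    lengths, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        lengths.append(pos + cut_offset - start)
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    lengths.append(len(seq) - start)  # fragment after the last cut
    return lengths

def kmer_profile(seq, k=3):
    """Normalized k-mer frequencies over the alphabet ACGT (a simple k-mer signature)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with ambiguous bases are ignored
    return {m: (counts[m] / total if total else 0.0) for m in kmers}

# Toy sequence with two EcoRI sites (illustrative only).
fragments = digest("AAAGAATTCCCCGAATTCTT", "GAATTC", 1)
profile = kmer_profile("ACGTACGTTTGA", k=2)
```

Either feature vector could then be passed to a classifier; fixed-length k-mer profiles are convenient because every sequence, whatever its length, maps to a vector of 4^k values.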

With their multilayered architecture, deep neural networks have been shown to be a promising approach for the analysis of feature-rich, high-dimensional omics data. Various studies have developed deep learning-based computational models for the analysis of complex genomic and metagenomic datasets. Arango-Argoty et al. proposed DeepARG networks to analyze metagenomic datasets and predict antibiotic resistance genes (ARGs) (Arango-Argoty et al. 2018). The network comprises two models, DeepARG-SS for short-read sequences and DeepARG-LS for full-length sequences. The models were trained using 30 ARG categories and showed high precision (> 0.97) and recall (> 0.90) when evaluated on different databases (Berglund et al. 2017; Lakin et al. 2017). On the basis of these results, the authors concluded that DeepARG facilitates the identification of a wide range of ARGs.
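Evaluation metrics such as recall (also called the true positive rate or sensitivity) and precision, which are commonly reported for classifiers of this kind, can be computed from predictions as in the sketch below. The function name `precision_recall` and the example labels are hypothetical and do not reproduce any result from the cited work.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall (true positive rate) = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predictions: 1 = resistance gene, 0 = not a resistance gene.
precision, recall = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```

Reporting both values matters for gene annotation tasks: high precision limits false resistance calls, whereas high recall limits the number of true resistance genes that are missed.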

Quang et al. developed a model named deleterious annotation of genetic variants using neural networks (DANN), based on a deep neural network, to annotate the pathogenicity of coding and noncoding genetic variants while also capturing nonlinear relationships among the features (Quang et al. 2022).