1 Introduction

The genetic code is a set of rules that defines how the four-letter code of DNA is translated into the 20-letter code of protein [1, 2]. Being employed in all organisms, the genetic code has many conserved and universal features, including successive triplets without overlap** [3], degeneracy for mutation tolerance [4] and codon usage bias that is meaningful for the regulation of gene expression [5,6,7,8,9,10]. Studies of codon and base usage can facilitate our understanding of the origin and evolution of genetic code. The choice of nucleotide at a specific codon position in coding genes has received much attention. For example, researchers have found strong C base preference at the second position of the second codons in the cell envelope-related genes [11], and clarified that the physical origins of codon positions strongly influence cotranslational protein folding [12]. For the broadest studies are of the third codon position. Codon degeneracy mainly manifests at this position and among the synonymous codons, the one matching the most abundant tRNA usually has the highest frequency [13]. This coupling pattern has been thought to benefit translation efficiency [14]. Recent genomic-scale expression data analyses have demonstrated the global effect of synonymous codon usage bias on transcription and translation efficiency [15,16,17,18,19]. Synonymous codon usage can also regulate protein folding type [20,21,22,23] and mutation at the synonymous site may cause intolerance and disease in human [24]. In addition, codon position specific nucleotide bias has been employed by some computational tools as an important feature for the identification of functional genes [25,26,27,28].

Although the nucleotide bias of the three codon positions has been widely used in gene identification, few studies have connected the codon position-specific nucleotide pattern with gene function distribution, particularly in the second codon position. In 2002, we observed an interesting pattern of base usage of coding genes in Vibrio cholerae. All coding genes could be divided into two unequal clusters according to the relative base frequencies of A and T at the second codon position, and the coding genes in the two clusters exhibited significant difference in protein functions [29]. We hypothesized that this pattern might appear widely in the prokaryotic domain and that it could be connected with gene function. Hence, a larger scale analysis was needed to validate our speculation.

In this paper, a systematic analysis of thousands of genomes across three domains revealed that the clustering phenomenon according to the base frequencies of A and T at the second codon position is almost universal and is especially remarkable for prokaryotes. Further study on 1483 prokaryotes with COG (clusters of orthologous groups) annotation [30, 31] shows that 99.33% of genomes have a significant difference (p < 0.01, by Chi-squared test) in the functional distributions of genes between the two unequal clusters. Furthermore, the overrepresent categories in the two clusters were consistent among species and prokaryotic phyla. Here, we revealed a consistent of A2–T2 associated clustering pattern and consistent functional influence in prokaryotes.

2 Materials and Detailed Methods

2.1 Genome Data Collection

The prokaryotic genomic data used in this study were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/). Three files, i.e., ‘all.ffn.tar.gz’, ‘all.faa.tar.gz’ and ‘all.ptt.tar.gz’ were retrieved (on March 26, 2017), and contained FASTA files for nucleotide coding regions, FASTA for amino acids, and protein tables of all the prokaryotes available. After removing the sequences for plasmids and fragments from every genome, we obtained 2764 prokaryotic genomes with the sequences of protein coding genes (164 archaea and 2600 bacteria) for further analysis. The 1035 eukaryotic genomic data were downloaded from Ensembl (https://asia.ensembl.org/downloads.html), the details are shown in Table 1. Altogether, the gene sequences and annotation information of 3799 genomes from three domains of life were collected (Table 1).

Table 1 The protein coding genes of 3799 genomes collected for analysis

2.2 K-Means Algorithm

K-means is a statistical method for partitioning observations in a data set into a given number of clusters (K). In this study, the K-means was used to divide the coding genes in a genome into a specific number of clusters (e.g., K = 2) based on the relative base frequencies of A and T at the second codon position. We wanted to determine whether there was a consistent clustering pattern based on A2–T2 in the currently available genomes.

There are many basic clustering techniques, which can be classified into five categories: partitioning methods, density-based methods, grid-based methods, hierarchical methods and model-based methods [32]. The K-means algorithm is a partitioning method. It can give a definite number of final clusters and furthermore is highly efficient to implement. Here we need to cluster the genes of thousands of prokaryotic genomes hence we choose this method.

2.3 Silhouette Coefficient Analysis

As an unsupervised machine learning algorithm, K-means can be used to group the protein coding genes in a genome into K clusters (K = 2–9 in our case). We further adopted silhouette coefficient analysis to determine the optimal number of clusters into which the genes could be divided. Briefly, the silhouette coefficient was used to quantify the separation between the resulting clusters by measuring how close each gene in one cluster is to genes in the neighboring clusters. In this study, we used the silhouette coefficient to evaluate the performance of K in the range from 2 to 9 for all genomes.

2.4 Clusters of Orthologous Groups (COGs) Function Annotation

Comparison of proteins encoded in numbers of complete genomes from many major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allows the delineation of many clusters of orthologous groups (COGs) [31]. There are 26 function categories (Table 2) in this framework and each category is denoted with a specific letter. Among all the prokaryotic genomes analyzed, 1483 had complete COG annotation.

Table 2 The 26 function categories could be classed into four super-categories

2.5 Chi-Squared Test

We further used the Chi-squared (χ2) test to evaluate whether there was a significant difference in the COG function categories in the two unequal clusters obtained by K-means clustering. For each genome, the χ2 test involved a 2 × 26 Chi-square table. The first row contained the number of genes in each COG function category in the large cluster, and the second row contained the number of genes in each COG function category in the small cluster. The differences were significant for 1475 out of the 1483 genomes (p < 0.05) and highly significant for 1473 genomes (p < 0.01).

2.6 Measurement of the Difference in Specific Functional Category in the Two Clusters

$$\begin{gathered} F\left( {P_{{{\text{small}}}} } \right) = \frac{{N\left( {P_{{{\text{small}}}} } \right)}}{{N\left( {{\text{small}}} \right)}} \hfill \\ F\left( {P_{{{\text{large}}}} } \right) = \frac{{N\left( {P_{{{\text{large}}}} } \right)}}{{N\left( {{\text{large}}} \right)}}, \hfill \\ \end{gathered}$$
(1)
$$\begin{array}{*{20}c} {{\text{P}}\;{\text{category}}} & {{\text{smaller}}} & {{\text{larger}}} \\ {F\left( {P_{{{\text{small}}}} } \right) > F\left( {P_{{{\text{large}}}} } \right)} & 1 & 0 \\ {F\left( {P_{{{\text{small}}}} } \right) < F\left( {P_{{{\text{large}}}} } \right)} & 0 & 1 \\ {F\left( {P_{{{\text{small}}}} } \right) = F\left( {P_{{{\text{large}}}} } \right)} & 0 & 0 \\ \end{array} ,$$
(2)

Equations (1), (2): (1) N(Psmall): the number of coding genes belonging to the P functional category in the small cluster. N(small): the total number of coding genes in the small cluster. F(Psmall): the ratio of N(Psmall) to N(small) and so for F(Plarge). (2) Significance is determined by whether the difference is beyond 5% of the lower proportion value.

2.7 Bacteria Taxonomy

TaxonKit (https://github.com/shenwei356/taxonkit) [29].

Then, a systematic survey of f(A2) and f(T2) was conducted across the three domains of life (164 archaea, 2600 bacteria and 1035 eukaryotes), and it was shown that the protein coding genes in a genome could be divided into two unequal clusters. The smaller cluster has a larger f(T2) and smaller f(A2), and the larger cluster has a relatively smaller f(T2) and larger f(A2). Such a pattern was evident and almost universal in archaea and bacteria. This pattern could also be observed in eukaryotes but was usually nonsignificant (Fig. 1).

Fig. 1
figure 1

Coding genes are divided into two unequal clusters by the base frequencies of A and T at the second position of codons. The scatter plots of 12 representative genomes from three domains with f(A2) as the x axis and f(T2) as the y axis both ranging from 0 to 0.7. The clustering phenomenon in archaea and bacteria is significant: a small cluster with much higher f(T2) and a large cluster with similar f(T2) and f(A2). This phenomenon was not significant in eukaryotes

3.2 Silhouette Coefficient to Measure the Optimal Cluster Number

To check whether the genes in the genomes could be significantly divided into two clusters based on the f(A2) and f(T2), we adopted the quantitative method of K-means clustering to divide the genes in each genome into different groups. Then, the silhouette coefficient was used to measure how many clusters the genes could be divided into with the maximum intercluster distance and minimum intracluster distance. We surveyed 2764 prokaryotic genomes and 1035 eukaryotic genomes. We found that 98.17% (161/164) of the archaeal genomes and 97.42% (2533/2600) of bacterial genomes could be optimally divided into two clusters (Fig. 2; Table S1).

Fig. 2
figure 2

The best choice of cluster number is 2. A Taking all three domains as a whole, 78.63% of genomes had an optimal K of 2. B General distribution of quantitative optimal K values indicates that two clusters are the best choice for prokaryotes and some of eukaryotes (Table S1)

Although the tendency of gene clustering could still be observed for eukaryotes, particularly higher eukaryotes (multicellular organisms), the best clustering number varied widely. Of all the eukaryotes examined, only 28.31% had an optimal clustering number of two, while optimal cluster numbers of three and four were found for 23.29% and 46.28% of the genomes, respectively (Fig. 2B). This phenomenon might be associated with the complexity of multicellular eukaryotes, which have many more genes in their genomes, with gene functions closely related to factors such as cell type and transcription regulation and also requiring more elaborate cell structures and intricate metabolic networks [34]. For simplicity, we focused on the prokaryotic genomes in the following analyses.

3.3 Biased Functional Distribution of Genes in the Two Clusters

We further checked the function of the genes in the two clusters divided by f(A2) and f(T2). Of the 2764 prokaryotic genomes, COG (cluster of orthologous groups) annotations could be retrieved for 1483. When the Chi-squared test was applied to the 26 COG functional categories, a significant difference (p < 0.01) was detected for the functions of genes included in the two clusters for 99.33% of the genomes examined (Table 3; Table S2).

Table 3 The Chi-squared test results of 1483 genomes on the protein function difference in the two unequal clusters

3.4 Differences in the Distribution of Protein Function Categories in the Two Unequal Clusters

To see a clearer pattern of the functional difference in genes in the two clusters, E. coli and M. jannaschii were chosen as a representative of bacteria and archaea, respectively (Table S3). Since the larger cluster contained much more genes than the smaller cluster, we calculated the proportion of genes belonging to every functional category in each cluster through Eq. (1). By comparing the gene function distributions in the two clusters, we found that gene related to inorganic ion transport and metabolism (COG category code P) were much more prevalent in the smaller cluster than in the larger cluster. In contrast, genes related to translation, ribosomal structure and biogenesis (J), transcription (K) and replication, recombination and repair (L) were more likely to be included in the larger cluster (Fig. 3A). This pattern seemed to be universal as it was observed in all the genomes checked.

Fig. 3
figure 3

The distribution and difference in COG functional categories in the two unequal clusters of 1483 genomes. A In two representative genomes in prokaryotes, P-related genes prevailed in the small cluster, while J-, K- and L-related genes were observed at a higher proportion in the large cluster. B Cumulative overrepresented genome numbers of 26 functional categories. The 26 functional categories are listed clockwise, beginning with A and ending with S, according to super-category: information genes, cellular processes and signaling, metabolism and poorly characterized genes. C Overrepresentation in large or small clusters for each functional category at the phylum level

Then we used Eq. (2) to examine whether a given category is overrepresented in the larger (A2 preference) or smaller (T2 preference) cluster. Using the P category as an example, for a certain genome, if F(Psmall) > F(Plarge) and the difference was greater than 5%, for P, we assigned a value of 1 to the smaller cluster and 0 to the larger cluster. In contrast, if F(Psmall) < F(Plarge) and the difference was greater than 5%, the smaller cluster was assigned a value of 0, while the larger cluster was assigned a value of 1. If the difference between F(Psmall) and F(Plarge) was less than the 5% threshold, we defined both clusters as 0, denoting that no significant difference existed in the P category between the smaller and the larger clusters. Using this method, we were able to determine how many genomes exhibited overrepresentation of each functional category in the two clusters. Among the 26 COG categories, P, U, V, S, and G were overrepresented in the smaller cluster for 98.65% (1463/1483), 95.75% (1420/1483), 92.92% (1378/1483), 90.42% (1341/1483), and 84.49% (1253/1483) genomes, respectively. Comparatively, J, K, L, F, H, T, and Q were overrepresented in the larger cluster for 98.92% (1467/1483), 98.52% (1461/1483), 98.79% (1465/1483), 95.95% (1423/1483), 87.93% (1304/1483), 87.59% (1299/1483) and 81.66% (1211/1483) genomes, respectively (Fig. 3B; Table S4).

Since prokaryotes are classified into different phyla, we checked whether all the phyla had consistent preference of functional category. The 13 prokaryotic phyla, each containing more than 10 genomes (Table S4), were extracted for further lineage analysis. Each functional category was calculated for each genome through Eq. (2). Then we calculated the cumulative number of the overrepresented categories in each phylum. Proteobacteria, for example, contained 627 genomes, 528 of which had smaller cluster as 1 and 97 of which had larger cluster as 1 for C category [2 genomes had similar F(Csmall) and F(Clarge)]. Hence the overrepresented ratios in the small and large clusters in Proteobacteria were 528/625 (84.48%) and 97/625 (15.52%), respectively. If the number of genomes was zero in some GOG functional categories, the ratio of both the smaller and the larger clusters was defined as zero. In such cases, the sum of both the smaller and the larger cluster ratios would be 0. However, for most phyla, the sum of the ratios was 1.

In Fig. 3C, we focus on those categories with different colors in the left panel and right panel because two clusters have distinct ratios. For the left panel of the larger cluster, if one functional category has a consistent color that means all phyla have consistent preference and similar for the right panel of the smaller cluster. Following this rule, we found that J, K, L, F, H, and Q were consistently overrepresented in the larger cluster of all phyla, while P, U, and S were in the smaller cluster of all phyla. Therefore, at the species level (Fig. 3A), prokaryote level (Fig. 3B) and phylum level (Fig. 3C), a consistent preference of functional category was observed, i.e., J, K and L were significantly overrepresented in the larger cluster, whereas P showed the opposite pattern. Several other categories exhibited bias to a lesser degree.

4 Discussions

Researchers have observed several clustering patterns in sequenced genomes. Médigue et al., analyzed codon usage in 780 E. coli genes [35]. Using factorial correspondence analysis, they illustrated that these genes could be divided into three classes. The first two classes are associated with expression level and the third is associated with mobility characteristics [35]. A similar pattern was found in the protein coding genes of Bacillus subtilis [36]. Since then, numerous studies have confirmed codon usage associated patterns in various prokaryotic genomes [37,38,39]. Ma and Chen defined the most deviated codon position (MDCP) and found that basing on MDCP, the CDSs of a genome can be classified into two clusters: typical and atypical [40]. Genes can also be divided into two separate clusters based on strand associated nucleotide bias [41,42,43]. All these clustering patterns are based on the distribution bias of codon usage, that is, nucleotide frequencies at three codon positions. Comparatively, here our clustering pattern is associated only with the second codon position and appears similarly in almost all prokaryotes.

On the other hand, codon usage has been used to cluster coding sequences of Arabidopsis thaliana genes in order to improve gene prediction [44]. Amino acid composition has been combined with machine-learning method to predict protein functional families and achieved accuracy of 69.1–96.1% [45]. Although DNA sequence could be also extracted as features to prediction protein function [46], in most cases features of amino acid frequency are adopted [47]. Here, we illustrated one example of direct link between nucleotide frequency and protein function categories. Therefore, our work would help to understand why protein function could be predicted from gene sequence. We hope future researches could get highly reliable prediction of protein function from DNA sequence and we think their used features would mainly associate with the second codon position, particularly A2 versus T2 frequency and in that sense our result would be well validated.

5 Conclusion

By a systematic analysis of the base frequencies at the second codon position across the three domains of life, we found that the protein coding genes of prokaryotes can be divided into two unequal clusters based on f(A2) and f(T2). Further analysis showed significant difference in the proportions of genes belonging to certain COG categories in the two clusters. P-related genes were more prevalent in the smaller cluster, while J-, K- and L-related genes were more likely to be included in the larger cluster. Lineage analysis revealed that the bias was basically consistent among different phyla. Hence, this work demonstrates an almost universal clustering pattern by the frequency of T2 versus A2 and its basically consistent influence on functional category distribution among prokaryotic phyla. These findings can help us understand why coding potentiality and functional category assignment could be theoretically predicted from gene sequences.