CAIcal: A combined set of tools to assess codon usage adaptation

Puigbò, Pere; Bravo, Ignacio G; Garcia-Vallve, Santiago

doi:10.1186/1745-6150-3-38

Research
Open access
Published: 16 September 2008

Volume 3, article number 38, (2008)

Biology Direct

Pere Puigbò^1,2,
Ignacio G Bravo^3 &
Santiago Garcia-Vallve^1

Abstract

Background

The Codon Adaptation Index (CAI) was first developed to measure the synonymous codon usage bias for a DNA or RNA sequence. The CAI quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set.

Results

We describe here CAIcal, a web-server available at http://genomes.urv.es/CAIcal that includes a complete set of utilities related to the CAI. The server provides several important features, such as the calculation and graphical representation of the CAI along either an individual sequence or a protein multiple sequence alignment translated to DNA. The automated calculation of the CAI and its expected value is also included as one of the CAIcal tools. The software can also be downloaded free of charge as a standalone application for local use.

Conclusion

The CAIcal server provides a complete set of tools to assess codon usage adaptation and to help in genome annotation.

Reviewers

This article was reviewed by Purificación López-García, Dan Graur, Rob Knight and Shamil Sunyaev.

Background

Ever since a relatively large number of DNA sequences became publicly available in databases, several statistical analyses addressing DNA composition have been performed. One of the parameters that first interested scientists was codon usage [1]. It was soon discovered that considerable heterogeneity in codon usage exists between genes within species and that the degree of codon bias is positively correlated with gene expression [2, 3]. To quantify the degree of bias in the codon usage of genes, several parameters or indices have been developed. The Codon Adaptation Index (CAI), developed by Sharp and Li [4], rapidly became one of the most widely used indices. The CAI is a measure of the synonymous codon usage bias for a DNA or RNA sequence and quantifies codon usage similarities between a gene and a reference set. The index ranges from 0 to 1, taking the value 1 if a gene always uses the most frequently used synonymous codons in the reference set. The CAI has been used for estimation of gene expressivity and for prediction of highly expressed genes [5–9]; for giving an approximate indication of the likely success of heterologous gene expression [7]; for detecting dominating synonymous codon usage bias in genomes [3]; for acquiring new knowledge about species lifestyle [3, 10]; and for studying cases of horizontally transferred genes [11, 12].
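As an illustration of the index described above, the following minimal sketch computes a CAI as the geometric mean of the relative adaptiveness values of the codons in a gene, where the relative adaptiveness of a codon is its count in the reference set divided by the count of the most used synonymous codon. It is not the CAIcal code: the function names, the toy reference set, and the decision to skip codons that are absent from the reference set are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code in the compact NCBI layout (bases ordered T, C, A, G).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(p) for p in product("TCAG", repeat=3)), AMINO_ACIDS))


def split_codons(seq):
    """Split a coding sequence into complete codons (RNA is read as DNA)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_counts):
    """w(codon) = count / count of the most used synonymous codon."""
    best = Counter()
    for codon, count in reference_counts.items():
        aa = CODON_TABLE.get(codon)
        if aa is not None:
            best[aa] = max(best[aa], count)
    return {
        codon: count / best[CODON_TABLE[codon]]
        for codon, count in reference_counts.items()
        if codon in CODON_TABLE and count > 0
    }


def cai(seq, weights):
    """Geometric mean of w over the informative codons of seq."""
    logs = []
    for codon in split_codons(seq):
        aa = CODON_TABLE.get(codon)
        # Stops, ambiguous codons, Met and Trp carry no synonymous choice.
        if aa is None or aa in "*MW":
            continue
        # Assumption of this sketch: codons unseen in the reference are skipped.
        if codon in weights:
            logs.append(math.log(weights[codon]))
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): Leu prefers CTG, Lys prefers AAA.
reference = Counter(split_codons("CTG" * 3 + "CTA" + "AAA" * 2 + "AAG"))
weights = relative_adaptiveness(reference)
print(cai("CTGAAA", weights), cai("CTAAAG", weights))
```

A gene built only from the preferred codons of the toy reference scores 1, whereas a gene that uses the rarer synonyms scores lower, which is the behaviour described in the text.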

Results and discussion

The most important contribution that we aim to provide with our server is to tie together several features that already existed but were scattered throughout the Internet, together with some new features related to CAI calculation and analysis, and to implement them in a single, easy-to-use web site.

Description of the CAIcal server

The CAIcal web-server, freely available at http://genomes.urv.es/CAIcal, calculates the CAI for a group of sequences using different reference sets and includes a complete set of tools related to codon usage adaptation, e.g. the representation of the CAI along a sequence or multialignment and the estimation of an expected CAI value (eCAI). CAI is calculated following the original method proposed by Sharp and Li [4] but using the recent computer implementation proposed by Xia [13].

[...] [23]. The function of E4 is not completely understood and its annotation is not very rigorous [14]. The mature E4 protein appears after splicing, with the donor site situated some codons downstream from the start codon of the E1 gene, and the acceptor site situated close to the middle of the E2 gene [24, 25]. The fact that most of E4 overlaps with E2, that the mature E1^E4 protein contains a few amino acids from E1 and that the splice sites are not strictly conserved makes it difficult to determine the true E4 sequence in silico. The PV E4 genes available in the databases are therefore very different in length and similarity. Although the genomes of many PVs have been sequenced, information about the expression of their genes or cDNA sequences is only available for a few of them. One of these is HPV1. In this case, the annotation of the HPV1 E4 gene is confirmed by mRNA data [26]. However, the E4 gene from HPV63, a PV that is phylogenetically related to HPV1 [19, 20, 27], is longer than the E4 gene from HPV1. The difference between the two sequences is 96 nucleotides located at the 5' end of HPV63 E4. We can use the CAIcal server to show that the codon usage of these 96 nucleotides at the beginning of HPV63 E4 is very different from that of the rest of the E4 sequence, measured as the CAI value calculated with the human codon usage as reference (figure 2). This suggests that the acceptor splice site of HPV63 E4 is not well annotated and that the true E4 nested within E2 probably starts downstream from the annotated position.
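The comparison described above, between the CAI of a leading stretch and the CAI of the remainder of a gene, can be sketched as follows. This is an illustrative sketch only: the helper names, the window size, and the toy weights are assumptions, and the relative adaptiveness values (`weights`) are taken as given rather than derived from the human codon usage table used in the paper.

```python
import math


def split_codons(seq):
    """Split a coding sequence into complete codons (RNA is read as DNA)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def cai(codons, weights):
    """Geometric mean of the weights of the codons found in `weights`."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def split_cai(seq, weights, prefix_nt):
    """CAI of the first `prefix_nt` nucleotides and of the remainder."""
    if prefix_nt % 3:
        raise ValueError("prefix_nt must be a multiple of 3 to stay in frame")
    codons = split_codons(seq)
    k = prefix_nt // 3
    return cai(codons[:k], weights), cai(codons[k:], weights)


def windowed_cai(seq, weights, window=30, step=1):
    """(first codon index, CAI) for each sliding window of `window` codons."""
    codons = split_codons(seq)
    return [
        (start, cai(codons[start:start + window], weights))
        for start in range(0, len(codons) - window + 1, step)
    ]


# Hypothetical weights and sequence, chosen only to make the contrast visible.
weights = {"CTG": 1.0, "CTA": 0.25, "AAA": 1.0, "AAG": 0.5}
seq = "CTAAAG" * 2 + "CTGAAA" * 3
head, tail = split_cai(seq, weights, prefix_nt=12)
print(head, tail)
print(windowed_cai(seq, weights, window=2, step=2))
```

For the case discussed in the text, `prefix_nt` would be 96. A marked drop or rise of the windowed value at the boundary is the kind of discontinuity that the graphical representation is meant to expose.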

Conclusion

The CAIcal server provides a complete set of tools to assess codon usage adaptation and helps to annotate genomic discontinuities such as the donor splicing site of the E4 ORF of papillomaviruses.

Reviewers' comments

Reviewer's report 1: Purificación López-García, CNRS, Université Paris-Sud

This article describes a series of tools for the automatic calculation of the codon adaptation index (CAI) and related measurements from input and reference data that have been implemented in a web-based server http://genomes.urv.es/CAIcal. CAI values are useful for a variety of purposes going from genomic annotation and gene expression analyses to the detection of potential horizontal gene transfer events. Although, as pointed out by the authors, a number of freely available facilities providing the calculation of CAI exist already, this new set of tools offers the possibility to obtain some additional estimates. These include the calculation of expected CAIs from randomly generated sequences with the GC content and amino acid composition of the input sequences that can be compared then with the observed CAIs, as well as measurements of the weight of each codon and their graphical representation. An example of the possible utility of these CAI measurements to test and validate annotations is provided. I find that this group of tools accessible online will be useful to the scientific community. I hope that this web-based server will benefit and get improved with the progressive input and suggestions of a wide variety of users.

Reviewer's report 2: Dan Graur, Department of Biology and Biochemistry, University of Houston

A very simple and straightforward tool for dealing with codon usage. I have no other comments.

Reviewer's report 3: Rob Knight, University of Colorado

In this manuscript, Puigbo et al. describe their CAIcal web server. CAI, the Codon Adaptation Index, is an important concept relating codon usage to gene expression. Although several software tools online already calculate CAI, CAIcal appears to offer a unique combination of functionality that is not easily duplicated using other tools.

However, the tool in its current form would appear to be a relatively minor advance over existing tools, and I would strongly encourage the authors to consider an extensive overhaul of the software and the manuscript before publication. However, I think the present work contains the seeds of a useful contribution to the field and to the literature, and definitely encourage the authors to persevere, perhaps thinking more carefully about the target audience of the software and the paper.

More attention needs to be paid to the specific contribution of this work if it is to be published as an independent piece of software. No feature of this tool really appears to be unique, e.g. the plots of CAI along a gene and codon-by-codon are also in Codon Analyser (as the authors note), many tools allow calculation of CAI against a reference set, etc.

Authors' response: As we acknowledge in the manuscript, a number of tools that address different calculations around the CAI are available elsewhere. We consider, however, that one of the strengths of the CAIcal server is that it gathers pre-existing and new features into a single, easy-to-use web site, as you also note in your review: "CAIcal appears to offer a unique combination of functionality that is not easily duplicated using other tools". As an example, after the CAI value of a group of sequences has been calculated, the user can easily (with a single click of the mouse) estimate an expected CAI value to discern whether the differences in CAI are statistically significant or merely artifacts. The graphical representation of the CAI value along each sequence can also be easily visualised. In addition, we want to point out the usability of the server, by which we mean the ease with which people can employ a particular tool. Several of the existing tools that allow calculation of the CAI are not web-servers; others require some kind of installation or execution; and some provide easy calculations that lack flexibility. Finally, the server allows the CAI value to be represented along a protein multialignment back-translated to DNA, a feature currently not available elsewhere.

Similarly, the calculations of the expected CAI values are delegated to another tool, E-CAI, that the authors have previously published, but this is not very clear from the description in the paper. If the sole contribution is to tie together several pre-existing features into a single web site, the authors need to make the case much more clearly that this combination will be of use to end users in a way that the individual pre-existing tools are not.

Authors' response: We have added a new sentence in the paper clarifying this point.

I think the source code of the standalone version needs a substantial overhaul before publication. It is full of large, error-prone tables of redundant information about genetic codes, for example, which should be dynamically calculated from a compact, standardized and easily verified source (e.g. the NCBI genetic code tables), is essentially without useful comments, mixes presentation and logic, and has many other indicators of poor coding style (for example, it looks as though several separate applications have simply been pasted together).

Authors' response: Although the main aim of our work was to provide a web-based server for CAI analysis, this was a fair criticism. The source code needed an extensive revision of style and lacked useful comments that could guide the experienced user. We have largely rewritten it, and it now incorporates numerous comments about the functionality of each part. The result is the local version 1.3. The source code of the standalone application follows a descendent algorithm rather than consisting of several separate applications pasted together. For the sake of clarity, we have included a file with a detailed description of the CAIcal functions (this file is available from the web site in the FAQs section, http://genomes.urv.es/CAIcal/FAQs.html). The standalone application now includes new functions related to genetic codes in order to avoid potentially error-prone tables. Although, again, you are right that the coding style could still be improved, the program works well.
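The reviewer's suggestion of deriving genetic-code information from a compact, easily verified source can be illustrated with the 64-character amino acid strings used in the NCBI genetic code tables, where codons are enumerated with the bases in the order T, C, A, G. The sketch below is not the CAIcal implementation (which is written in Perl and PHP); the function names are hypothetical, and only translation tables 1 and 2 are included.

```python
from collections import defaultdict
from itertools import product

# Codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ... GGG.
CODONS = ["".join(p) for p in product("TCAG", repeat=3)]

# One 64-character string per translation table, as in the NCBI tables.
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # standard
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",  # vertebrate mitochondrial
}


def codon_table(table_id=1):
    """Map each codon to its amino acid ('*' marks a stop codon)."""
    return dict(zip(CODONS, GENETIC_CODES[table_id]))


def synonymous_codons(table_id=1):
    """Group codons by the amino acid they encode."""
    groups = defaultdict(list)
    for codon, aa in codon_table(table_id).items():
        groups[aa].append(codon)
    return dict(groups)


print(codon_table(1)["TGA"], codon_table(2)["TGA"])
print(synonymous_codons(1)["L"])
```

With this layout, supporting another genetic code only requires adding one string, and the synonymous codon families needed for CAI calculations are derived rather than typed by hand.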

Although I appreciate that the authors have made the effort to produce and distribute a standalone version, the code unfortunately does not inspire confidence in the web site either in this case. Test cases, e.g. using Perl's built-in unit testing framework, would definitely be a useful addition to verify that the calculations are correct.

Authors' response: This was an interesting suggestion that we have addressed. To verify that the calculations are correct, we show that the results of the two independent programs (the standalone version written in Perl and the web-server written in PHP) are the same. In addition, we have compared our results with the results using other existing programs and the results are not significantly different. A file with some tests we made is available from the web site in the FAQs section http://genomes.urv.es/CAIcal/FAQs.html.

The utility of the Monte Carlo approach is also somewhat unclear to me, as it appears that the expected CAI could be calculated analytically, along with confidence intervals, using the multinomial distribution. It is possible that this is not feasible for numerical reasons, but some justification of the approach would be useful.

Authors' response: The expected CAI is calculated analytically from the CAI values of 500 randomly generated sequences with the same G+C content and amino acid composition as the query sequences. However, the Monte Carlo approach is used to generate the random sequences, not to calculate the expected CAI. In this sense, please see also Question 15 at the FAQs section of the server http://genomes.urv.es/CAIcal/FAQs.html.
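The idea in this response, generating random sequences that preserve the amino acid composition and G+C content of the query and then summarising their CAI values, can be sketched as follows. Several choices here are assumptions for illustration and are not taken from the E-CAI or CAIcal code: synonymous codons are drawn with probabilities proportional to the product of base frequencies at the target G+C content (so the G+C content is matched only approximately), and the summary returned is the mean together with an empirical upper percentile of the simulated values.

```python
import math
import random
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(p) for p in product("TCAG", repeat=3)), AMINO_ACIDS))
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)


def cai(codons, weights):
    """Geometric mean of the weights of the codons found in `weights`."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def codon_probability(codon, gc):
    """Probability of a codon if each base is drawn independently at G+C = gc."""
    p = 1.0
    for base in codon:
        p *= gc / 2 if base in "GC" else (1 - gc) / 2
    return p


def random_codons(protein, gc, rng):
    """Random back-translation that keeps the amino acid composition."""
    codons = []
    for aa in protein.upper():
        options = SYNONYMS.get(aa)
        if not options:  # skip symbols that are not in the genetic code
            continue
        probs = [codon_probability(c, gc) for c in options]
        codons.append(rng.choices(options, weights=probs, k=1)[0])
    return codons


def expected_cai(protein, gc, weights, n=500, quantile=0.95, seed=0):
    """Mean and empirical upper percentile of the CAI of n random sequences."""
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must lie strictly between 0 and 1")
    rng = random.Random(seed)
    values = sorted(cai(random_codons(protein, gc, rng), weights) for _ in range(n))
    return sum(values) / n, values[min(n - 1, int(quantile * n))]


# Hypothetical weights for Leu and Lys codons only.
weights = {"CTG": 1.0, "CTC": 0.2, "CTT": 0.2, "CTA": 0.1, "TTA": 0.2, "TTG": 0.3,
           "AAA": 1.0, "AAG": 0.3}
print(expected_cai("LK" * 20, gc=0.5, weights=weights))
```

An observed CAI can then be compared with the upper value: a gene whose CAI does not exceed it shows no more adaptation than expected from its amino acid and base composition alone.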

I did not find the example especially compelling, but this is a relatively minor criticism and I understand that it is likely that the authors would want to publish any especially interesting results separately from the description of the tool itself. However, it might be interesting to try to reproduce a well-known conclusion from existing work to show how much easier it is with this workflow than with pre-existing tools. There are many examples in the literature as CAI is such a widely-used technique.

The manuscript and the web site need substantial attention to the quality of the English. I have not corrected minor wording and grammatical errors in this version of the manuscript, but if the authors plan to publish this manuscript regardless of the above comments, I would definitely recommend careful attention to detail, and also removing formatting errors such as the text "Sub-heading for this section" on page 3. Overall, I think this is a good first attempt and could ultimately be revised into a useful contribution that is more suitable for publication.

Authors' response: After receiving your comments and those of the three additional referees, we have decided to rewrite the code, to revise the manuscript and to publish it. We would like to thank you again for your comments. We agree that some changes were necessary in the manuscript and in the source code of the standalone version, and these have been made accordingly; we think that no further overhaul of the software is necessary. We are glad to acknowledge that the code is easier to read after introducing the comments you suggested. Additional changes in the manuscript include a second revision of the quality of the English, following the recommendations of the NIH Fellows Editorial Board, and some clarifications. We sincerely consider that we have addressed the criticisms you raised about the previous version of the manuscript.

Reviewer's report 3 (second revision): Rob Knight, University of Colorado

The revised versions of the manuscript and software are significantly improved.

Reviewer's report 4: Shamil Sunyaev, Harvard Medical School

This manuscript presents a new online tool to compute codon adaptation index (CAI). Although there are several CAI calculators available online, this new server includes several additional features such as computation of expected CAI and visualization of changes in the CAI along the sequence. The authors also present an analysis of papillomavirus as an example of the server utility. In sum, the manuscript does not report any significant novel scientific findings but presents a tool potentially useful for the research community.

References

1. Grantham R, Gautier C, Gouy M, Mercier R, Pave A: Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 1980, 8 (1): r49-r62. 10.1093/nar/8.1.197-c.
2. Gouy M, Gautier C: Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 1982, 10 (22): 7055-7074. 10.1093/nar/10.22.7055.
3. Carbone A, Zinovyev A, Kepes F: Codon adaptation index as a measure of dominating codon bias. Bioinformatics. 2003, 19 (16): 2005-2015. 10.1093/bioinformatics/btg272.
4. Sharp PM, Li WH: The codon Adaptation Index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15 (3): 1281-1295. 10.1093/nar/15.3.1281.
5. Wu G, Culley DE, Zhang W: Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism. Microbiology. 2005, 151 (Pt 7): 2175-2187. 10.1099/mic.0.27833-0.
6. Wu G, Nie L, Zhang W: Predicted highly expressed genes in Nocardia farcinica and the implication for its primary metabolism and nocardial virulence. Antonie Van Leeuwenhoek. 2006, 89 (1): 135-146. 10.1007/s10482-005-9016-z.
7. Puigbo P, Guzman E, Romeu A, Garcia-Vallve S: OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res. 2007, 35: W126-W131. 10.1093/nar/gkm219.
8. Ramazzotti M, Brilli M, Fani R, Manao G, Degl'Innocenti D: The CAI Analyser Package: inferring gene expressivity from raw genomic data. In Silico Biol. 2007, 7 (4-5): 507-526.
9. Puigbo P, Romeu A, Garcia-Vallve S: HEG-DB: a database of predicted highly expressed genes in prokaryotic complete genomes under translational selection. Nucleic Acids Res. 2008, 36: D524-D527. 10.1093/nar/gkm831.
10. Willenbrock H, Friis C, Juncker AS, Ussery DW: An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol. 2006, 7 (12): R114. 10.1186/gb-2006-7-12-r114.
11. Garcia-Vallve S, Palau J, Romeu A: Horizontal gene transfer in glycosyl hydrolases inferred from codon usage in Escherichia coli and Bacillus subtilis. Mol Biol Evol. 1999, 16 (9): 1125-1134.
12. Garcia-Vallve S, Guzman E, Montero MA, Romeu A: HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res. 2003, 31 (1): 187-189. 10.1093/nar/gkg004.
13. Xia X: An Improved Implementation of Codon Adaptation Index. Evolutionary Bioinformatics. 2007, 3: 53-58.
14. Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000, 28 (1): 292. 10.1093/nar/28.1.292.
15. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
16. Grote A, Hiller K, Scheer M, Munch R, Nortemann B, Hempel DC, Jahn D: JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Res. 2005, 33 (Web Server issue): W526-W531. 10.1093/nar/gki376.
17. Wright F: The 'effective number of codons' used in a gene. Gene. 1990, 87 (1): 23-29. 10.1016/0378-1119(90)90491-9.
18. Puigbo P, Bravo IG, Garcia-Vallve S: E-CAI: a novel server to estimate an expected value of Codon Adaptation Index (eCAI). BMC Bioinformatics. 2008, 9: 65. 10.1186/1471-2105-9-65.
19. Garcia-Vallve S, Alonso A, Bravo IG: Papillomaviruses: different genes have different histories. Trends Microbiol. 2005, 13 (11): 514-521. 10.1016/j.tim.2005.09.003.
20. Garcia-Vallve S, Iglesias-Rozas JR, Alonso A, Bravo IG: Different papillomaviruses have different repertoires of transcription factor binding sites: convergence and divergence in the upstream regulatory region. BMC Evol Biol. 2006, 6: 20. 10.1186/1471-2148-6-20.
21. Bravo IG, Muller M: Codon usage in papillomavirus genes: practical and functional aspects. Papillomavirus Report. 2005, 16: 63-72. 10.1179/095741905X24996.
22. Zhao KN, Liu WJ, Frazer IH: Codon usage bias and A+T content variation in human papillomavirus genomes. Virus Res. 2003, 98 (2): 95-104. 10.1016/j.virusres.2003.08.019.
23. Hughes AL, Hughes MAK: Patterns of nucleotide difference in overlapping and non-overlapping reading frames of papillomavirus genomes. Virus Research. 2005, 113: 81-88. 10.1016/j.virusres.2005.03.030.
24. Peh WL, Brandsma JL, Christensen ND, Cladel NM, Wu X, Doorbar J: The viral E4 protein is required for the completion of the cottontail rabbit papillomavirus productive cycle in vivo. J Virol. 2004, 78 (4): 2142-2151. 10.1128/JVI.78.4.2142-2151.2004.
25. Doorbar J: The papillomavirus life cycle. J Clin Virol. 2005, 32 (Suppl 1): S7-15. 10.1016/j.jcv.2004.12.006.
26. Palermo-Dilts DA, Broker TR, Chow LT: Human papillomavirus type 1 produces redundant as well as polycistronic mRNAs in plantar warts. J Virol. 1990, 64 (6): 3144-3149.
27. Gottschling M, Stamatakis A, Nindl I, Stockfleth E, Alonso A, Bravo IG: Multiple evolutionary mechanisms drive papillomavirus diversification. Mol Biol Evol. 2007, 24 (5): 1242-1258. 10.1093/molbev/msm039.

Acknowledgements

This work was supported in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. IGB is the recipient of a professorship supported by the Volkswagen Stiftung in the program Evolutionary Biology. We thank Kevin Costello of the Language Service of the Rovira i Virgili University and the NIH Fellows Editorial Board for their help with writing the manuscript. We also thank Agnes Hotz-Wagenblatt from the HUSAR Bioinformatics Laboratory at Deutsches Krebsforschungszentrum and Obdulia Rabal from the "Centro Nacional de Investigaciones Oncológicas" for testing the server.

Author information

Authors and Affiliations

Department of Biochemistry and Biotechnology, Rovira i Virgili University (URV), Campus Sescelades, c/Marcelli Domingo s/n, 43007, Tarragona, Spain
Pere Puigbò & Santiago Garcia-Vallve
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20894, USA
Pere Puigbò
Experimental Molecular Evolution, Institute for Evolution and Biodiversity, University of Muenster, Germany
Ignacio G Bravo

Corresponding author

Correspondence to Pere Puigbò.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PP designed the server, carried out the programming and drafted the manuscript. IGB participated in the design of the server, prepared the example, and helped draft the manuscript. SG-V conceived and designed the server, coordinated the project and drafted the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Puigbò, P., Bravo, I.G. & Garcia-Vallve, S. CAIcal: A combined set of tools to assess codon usage adaptation. Biol Direct 3, 38 (2008). https://doi.org/10.1186/1745-6150-3-38

Received: 04 September 2008
Accepted: 16 September 2008
Published: 16 September 2008
DOI: https://doi.org/10.1186/1745-6150-3-38
