Background

Ever since a relatively high number of DNA sequences were publicly available in databases, several statistical analyses addressing DNA composition have been performed. One of the parameters that first interested the scientist was codon usage [1]. It was soon discovered that a considerable heterogeneity in the codon usage exists between genes within species and that the degree of codon bias is positively correlated with gene expression [2, 3]. To quantify the degree of bias in the codon usage of genes, several parameters or indices have been worked out. The Codon Adaptation Index (CAI) developed by Sharp and Li [4], rapidly became one of the most used indices. The CAI is a measure of the synonymous codon usage bias for a DNA or RNA sequence and quantifies codon usage similarities between a gene and a reference set. The index ranges from 0 to 1, being 1 if a gene always uses the most frequently used synonymous codons in the reference set. The CAI has been used for estimation of gene expressivity and for prediction of highly expressed genes [59]; for giving an approximate indication of the likely success of heterologous gene expression [7]; for detecting dominating synonymous codon usage bias in genomes [3]; for acquiring new knowledge about species lifestyle [3, 10]; and for studying cases of horizontally transferred genes [11, 12].

Results and discussion

The most important contribution that we aim to provide with our server is to tie together several features, previously existing but disseminated throughout the Internet, and some new features related to CAI calculation and analysis, and to implement them into a single and easy-to-use web site.

Description of the CAIcal server

The CAIcal web-server, freely available at http://genomes.urv.es/CAIcal, calculates the CAI for a group of sequences using different reference sets and includes a complete set of tools related with codon usage adaptation, e.g. the representation of the CAI along a sequence or multialignment and the estimation of an expected CAI value (eCAI). CAI is calculated following the original method proposed by Sharp and Li [4] but using the recent computer implementation proposed by ** andnon-overlap** reading frames of papillomavirus genomes. Virus Research. 2005, 113: 81-88. 10.1016/j.virusres.2005.03.030." href="/article/10.1186/1745-6150-3-38#ref-CR23" id="ref-link-section-d157409505e583">23]. The function of E4 is not completely understood and its annotation is not very rigorous [14]. The mature E4 protein appears after splicing, with the donor site situated some codons downstream from the start codon of the E1 gene, and the acceptor site situated close to the middle of the E2 gene [24, 25]. The fact that most of E4 overlaps with E2, that the mature E1^E4 protein contains a few amino acids from E1 and that the splice sites are not strictly conserved, makes it difficult to determine the true E4 sequence in silico. The E4 PVs genes available in the databases are therefore very different in length and similarity. Although the genomes of many PVs have been sequenced, information about the expression of their genes or cDNA sequences is only available for a few of them. One of these is HPV1. In this case, the annotation of the HPV1 E4 gene is confirmed by mRNA data [26]. However, the E4 gene from HPV63, a PV that is phylogenetically related to HPV1 [19, 20, 27], is longer than the E4 gene from HPV1. The difference is between both sequences is 96 nucleotides located at 5' end of HPV63 E4. We can use the CAIcal server to show that the codon usage of these 96 nucleotides at the beginning of HPV63 E4 is very different from that of the rest of the E4 sequence, measured as the CAI value calculated with the human codon usage as reference (figure 2). This suggests that the acceptor splice site of HPV63 E4 is not well annotated and that the true E4 nested within E2 probably starts downstream from the annotated position.

Figure 2
figure 2

Representation of the CAI, calculated using the human mean codon usage as a reference set, in the DNA sequence that encodes HPV63 E2 and E4. Part A represents the reading frame that encodes E2. Part B represents the same HPV63 genome fragment that encodes E2, but in the reading frame +1, which contains E4. The grey line in B represents the fragment of HPV63 E4 homologous to the closely related HPV1 E4. The black line in B represents the stretch also annotated as HPV63 E4, but which lacks homology with HPV1 E4. Note that the initial E4 region from HPV63, which is not homologous to the HPV1 gene, has an extremely low CAI, which suggests a wrong annotation for the E4 gene in HPV63. This figure was obtained using the output of the calculation of CAI along a sequence of the CAIcal server, with a window length of 11 and a window step of 5.

Conclusion

The CAIcal server provides a complete set of tools to assess codon usage adaptation and helps to annotate genomic discontinuities such as the donor splicing site of the E4 ORF of papilomaviruses.

Reviewers' comments

Reviewer's report 1: Purificación López-García, CNRS, Université Paris-Sud

This article describes a series of tools for the automatic calculation of the codon adaptation index (CAI) and related measurements from input and reference data that have been implemented in a web-based server http://genomes.urv.es/CAIcal. CAI values are useful for a variety of purposes going from genomic annotation and gene expression analyses to the detection of potential horizontal gene transfer events. Although, as pointed out by the authors, a number of freely available facilities providing the calculation of CAI exist already, this new set of tools offers the possibility to obtain some additional estimates. These include the calculation of expected CAIs from randomly generated sequences with the GC content and amino acid composition of the input sequences that can be compared then with the observed CAIs, as well as measurements of the weight of each codon and their graphical representation. An example of the possible utility of these CAI measurements to test and validate annotations is provided. I find that this group of tools accessible online will be useful to the scientific community. I hope that this web-based server will benefit and get improved with the progressive input and suggestions of a wide variety of users.

Reviewer's report 2: Dan Graur, Department of Biology and Biochemistry, University of Houston

A very simple and straightforward tool for dealing with codon usage. I have no other comments.

Reviewer's report 3: Rob Knight, University of Colorado

In this manuscript, Puigbo et al. describe their CAIcal web server. CAI, the Codon Adaptation Index, is an important concept relating codon usage to gene expression. Although several software tools online already calculate CAI, CAIcal appears to offer a unique combination of functionality that is not easily duplicated using other tools.

However, the tool in its current form would appear to be a relatively minor advance over existing tools, and I would strongly encourage the authors to consider an extensive overhaul of the software and the manuscript before publication. However, I think the present work contains the seeds of a useful contribution to the field and to the literature, and definitely encourage the authors to persevere, perhaps thinking more carefully about the target audience of the software and the paper.

More attention needs to be paid to the specific contribution of this work if it is to be published as an independent piece of software. No feature of this tool really appears to be unique, e.g. the plots of CAI along a gene and codon-by-codon are also in Codon Analyser (as the authors note), many tools allow calculation of CAI against a reference set, etc.

Authors' response: As we acknowledge in the manuscript, a number of tools are available elsewhere addressing different calculations around CAI. We consider however, that one of the strengths of the CAIcal server is to gather together pre-existing and new features into a single and easy-to-use web site, as you also note in your revision "CAIcal appears to offer a unique combination of functionality that is not easily duplicated using other tools". As an example, after the CAI value of a group of sequences has been calculated, the user can easily (with only a click of the mouse) estimate an expected CAI value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts. The graphical representation of the CAI value along each sequence can also be easily visualised. In addition, we also want to point out the usability of the server, used to denote here the ease with which people can employ a particular tool. Thus, several of the existing tools that allow calculation of CAI are not web-servers; other require some kind of installation or execution; and some of them provide easy calculations that lack in flexibility. Finally, the server allows to represent the CAI value along a protein multialignment back-translated to DNA, a feature currently not available elsewhere.

Similarly, the calculations of the expected CAI values are delegated to another tool, E-CAI, that the authors have previously published, but this is not very clear from the description in the paper. If the sole contribution is to tie together several pre-existing features into a single web site, the authors need to make the case much more clearly that this combination will be of use to end users in a way that the individual pre-existing tools are not.

Authors' response: We have added a new sentence in the paper clarifying this point.

I think the source code of the standalone version needs a substantial overhaul before publication. It is full of large, error-prone tables of redundant information about genetic codes, for example, which should be dynamically calculated from a compact, standardized and easily verified source (e.g. the NCBI genetic code tables), is essentially without useful comments, mixes presentation and logic, and has many other indicators of poor coding style (for example, it looks as though several separate applications have simply been pasted together).

Authors' response: Although the main aim of our work was to provide a web-based server for CAI analysis, this was a fair criticism. The source code needed an extensive revision of style and lacked useful comments that could guide the experienced user. We have largely rewritten it and it incorporates now numerous comments about the functionality of each different part. Thus, we have developed the local version 1.3. The source code in the standalone application follows a descendent algorithm rather than several separate applications have simply been pasted together. For the sake of clarity, we have included a file with a detailed description of the CAIcal functions (this file is available from the web site in the FAQs sectionhttp://genomes.urv.es/CAIcal/FAQs.html. The standalone application includes now new functions related with genetic codes to avoid putative error-prone in tables. Though, again, you are right and the coding style could still be improved, the program works well.

Although I appreciate that the authors have made the effort to produce and distribute a standalone version, the code unfortunately does not inspire confidence in the web site either in this case. Test cases, e.g. using Perl's built-in unit testing framework, would definitely be a useful addition to verify that the calculations are correct.

Authors' response: This was an interesting suggestion that we have addressed. To verify that the calculations are correct, we show that the results of the two independent programs (the standalone version written in Perl and the web-server written in PHP) are the same. In addition, we have compared our results with the results using other existing programs and the results are not significantly different. A file with some tests we made is available from the web site in the FAQs section http://genomes.urv.es/CAIcal/FAQs.html.

The utility of the Monte Carlo approach is also somewhat unclear to me, as it appears that the expected CAI could be calculated analytically, along with confidence intervals, using the multinomial distribution. It is possible that this is not feasible for numerical reasons, but some justification of the approach would be useful.

Authors' response: The expected CAI is calculated analytically from the CAI values of 500 randomly generated sequences with the same G+C content and amino acid composition as the query sequences. However, the Monte Carlo approach is used to generate the random sequences, not to calculate the expected CAI. In this sense, please see also Question 15 at the FAQs section of the server http://genomes.urv.es/CAIcal/FAQs.html.

I did not find the example especially compelling, but this is a relatively minor criticism and I understand that it is likely that the authors would want to publish any especially interesting results separately from the description of the tool itself. However, it might be interesting to try to reproduce a well-known conclusion from existing work to show how much easier it is with this workflow than with pre-existing tools. There are many examples in the literature as CAI is such a widely-used technique.

The manuscript and the web site need substantial attention to the quality of the English. I have not corrected minor wording and grammatical errors in this version of the manuscript, but if the authors plan to publish this manuscript regardless of the above comments, I would definitely recommend careful attention to detail, and also removing formatting errors such as the text "Sub-heading for this section" on page 3. Overall, I think this is a good first attempt and could ultimately be revised into a useful contribution that is more suitable for publication.

Authors' response: After receiving your comments and the comments of the three additional referees, we have decided to rewrite the code, to revise the manuscript and to publish it. We would like to thank you again for your comments. We think that it is not necessary any further overhaul of the software, as we agree that some changes were necessary in the manuscript and in the source code of the standalone version, and have accordingly been performed. We are glad to acknowledge that the code is easier to read after introducing the comments you suggested. Additional changes in the manuscript include also a second revision of the quality of the English following the recommendations by the NIH Fellows Editorial Board, and some clarifications. We sincerely consider that we have addressed the criticism you raised to the previous version of the manuscript.

Reviewer's report 3 (second revision): Rob Knight, University of Colorado

The revised versions of the manuscript and software are significantly improved.

Reviewer's report 4: Shamil Sunyaev, Harvard Medical School

This manuscript presents a new online tool to compute codon adaptation index (CAI). Although there are several CAI calculators available online, this new server includes several additional features such as computation of expected CAI and visualization of changes in the CAI along the sequence. The authors also present an analysis of papilomavirus as an example of the server utility. In sum, the manuscript does not report any significant novel scientific findings but presents a tool potentially useful for the research community.