Abstract
The comparative study of homologous proteins can provide abundant information about the functional and structural constraints on protein evolution. For example, an amino acid substitution that is deleterious may become permissive in the presence of another substitution at a second site of the protein. A popular approach for detecting coevolving residues is to look for correlated substitution events on branches of the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning method (Bayesian graphical models) implemented in the open-source phylogenetic software package HyPhy, http://hyphy.org, for extracting a network of coevolving residues from a sequence alignment.
Notes
- 1.
The scripts in this chapter were tested with HyPhy version 2.220170201beta and release 2.2.7. HyPhy is a large and complex software package that is constantly undergoing development by a small team of researchers and programmers, and some of the more specialized features such as BGMs may temporarily break as newer versions are released. If you compiled HyPhy from source, make sure that you are using a single-threaded (HYPHYSP) or multiprocessing-enabled (HYPHYMP) build and not a message passing interface (MPI)-enabled (HYPHYMPI) build; at the time of writing, there were residual issues in the source code related to MPI processing. If you encounter any other problems, please submit an issue at https://github.com/veg/hyphy/issues and we will attend to it as soon as possible.
- 2.
For this type of analysis, we prefer using maximum likelihood (ML) methods to reconstruct trees. If it is not feasible to use ML methods because the alignment contains too many sequences and/or the sequences are too long, we suggest using the approximate ML program FastTree 2 [37], which can be orders of magnitude faster than the standard ML programs. Neighbor-joining (NJ) methods also scale favorably with larger alignments, but tend to be less accurate for reconstructing branch lengths. While there are NJ and ML tree reconstruction methods implemented in HyPhy, they are not as efficient as these specialized programs and we do not recommend using them for larger data sets.
- 3.
A bootstrap support value is an empirical measure of confidence in a specific clade given the data. Most phylogeny reconstruction programs should have an option to omit these values. If you already have a Newick tree file and you just need to remove the support values, you can use the following UNIX command: sed -E 's/\)[0-9.]+:/):/g' [input] > [output]
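The same substitution can be sketched in Python with the re module. This is a minimal sketch, assuming that support values appear as plain numbers directly after a closing parenthesis and before the colon of the branch length; the function name strip_support and the example tree are hypothetical.

```python
import re

def strip_support(newick):
    """Remove numeric support values that directly follow a closing parenthesis."""
    # ")0.95:" becomes "):"; tip labels and branch lengths are left untouched
    return re.sub(r"\)[0-9.]+:", "):", newick)

# hypothetical tree with support values 0.95 and 87 on the two internal branches
tree = "((A:0.1,B:0.2)0.95:0.3,(C:0.1,D:0.2)87:0.4);"
cleaned = strip_support(tree)
print(cleaned)
```

Because the pattern requires a closing parenthesis before the number, branch lengths (which follow a colon) are not affected.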
- 4.
From this point onward, we assume that you are using the command-line interface. Unfortunately, this script may not work properly with the GUI because of how HyPhy handles file paths. Even on the command line, this is not straightforward. For example, we used the following invocation in the macOS Terminal:
HYPHYMP BASEPATH=/usr/local/lib/hyphy/ `pwd`/fit_codon_model.bf
If you want to take advantage of a multi-core CPU, you can add the argument CPU=[number of cores] immediately after HYPHYMP. Note that not all steps in this analysis are able to utilize multiple threads.
- 5.
If you want to examine this scaling factor, you can find it in the serialized likelihood function generated by this script by searching for the parameter name scalingB.
- 6.
If you’re using an operating system with a desktop environment, it’s often easier to drag the icon representing your file into the terminal window instead of typing out the corresponding path. This works when running HyPhy on the command line, but you need to use backspace to remove the space that is automatically appended to the end of the path. HyPhy won’t be able to locate the file otherwise.
- 7.
Prior to version 2.3.4, the text in HyPhy implies that these options allow rates to vary among branches, not sites: “…branch lengths come from a user-chosen distribution.” We have revised this help text as of version 2.3.4 to indicate that the distributions are used to model rate variation across sites, not branches.
- 8.
A standard codon model is described by a 61-by-61 transition rate matrix and a single parameter R that corresponds to the ratio of non-synonymous and synonymous substitution rates. The model assumes that the system moves from one codon to another by single nucleotide substitutions; codon substitutions that require more than one nucleotide change are not allowed.
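The single-nucleotide restriction can be illustrated with a short Python sketch. It assumes the universal genetic code (three stop codons, leaving 61 sense codons) and only checks which entries of the rate matrix are allowed to be non-zero; it does not compute any rates. The names SENSE_CODONS and is_allowed are hypothetical and are not part of HyPhy.

```python
from itertools import product

# Assumption: universal genetic code, whose three stop codons are excluded
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]

def n_differences(codon1, codon2):
    """Count the nucleotide positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon1, codon2))

def is_allowed(codon1, codon2):
    """A substitution is allowed only if exactly one nucleotide changes."""
    return n_differences(codon1, codon2) == 1

# number of off-diagonal entries of the 61-by-61 matrix that may be non-zero
n_allowed = sum(is_allowed(x, y) for x in SENSE_CODONS for y in SENSE_CODONS)
print(len(SENSE_CODONS), n_allowed)
```

For example, is_allowed("AAA", "AAG") is true, whereas is_allowed("AAA", "AGG") is false because two positions differ.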
- 9.
Some phylogeny reconstruction programs truncate sequence labels and cause an error at this stage; for example, neither RAxML nor FastTree 2 will read sequence labels beyond a whitespace character. A quick fix in this situation is to replace all whitespace characters with underscores in a text editor or with sed.
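The whitespace replacement can also be sketched in Python. This is a minimal sketch, assuming the alignment is in FASTA format with labels on lines that start with ">"; the function name sanitize_fasta_labels is hypothetical. It would need to be applied before reconstructing the tree, so that the tree and the alignment carry the same labels.

```python
import re

def sanitize_fasta_labels(fasta_text):
    """Replace runs of whitespace in FASTA labels with single underscores."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            label = line[1:].strip()
            line = ">" + re.sub(r"\s+", "_", label)
        out.append(line)  # sequence lines are passed through unchanged
    return "\n".join(out) + "\n"

# hypothetical two-record alignment
example = ">seq 1 HCV\nACGT\n>seq2\nACGA\n"
print(sanitize_fasta_labels(example), end="")
```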
- 10.
By convention, we use the file extension .lf and keep the same basename as the codon data file. This makes it easier to track files that belong to the same workflow.
- 11.
NEXUS is a widespread format with known issues with standardization and usability, and has been implemented in diverse and often incompatible ways by multiple programs.
- 12.
We have previously found this list output to be a more convenient format for debugging the script. It’s usually a good idea to manually compare entries in this list against your sequence alignment to make sure that things make sense.
- 13.
Most phylogenetic tree reconstruction methods, such as maximum likelihood or neighbor-joining, will output an unrooted tree. For an unrooted tree, the labels will be generated for the deepest internal node.
- 14.
For example, you can customize on a node-by-node basis the number of “parental” nodes on which a given node can be conditionally dependent. You can also load a serialized BGM from an XML Bayesian Interchange format file and use this model to simulate additional data sets. For more details, please refer to the file bayesgraph.ibf and the batch file tests/hbltests/BayesianGraphicalModels/TestBGM.bf in the HyPhy source code distribution.
- 15.
As a general rule of thumb, we try not to build a BGM that has many more nodes than observations. The number of substitutions provides a meaningful criterion for reducing the dimensionality of our data.
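As an illustration of this criterion, the following Python sketch drops sites with too few substitutions. It assumes a binary matrix in which rows are branches of the tree, columns are codon sites, and 1 marks a substitution mapped to that branch; the function name filter_sites, the site labels, and the threshold are hypothetical and do not correspond to options in HyPhy.

```python
def filter_sites(matrix, site_labels, min_subs):
    """Keep only the columns (sites) with at least min_subs substitutions."""
    counts = [sum(column) for column in zip(*matrix)]
    keep = [j for j, n in enumerate(counts) if n >= min_subs]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [site_labels[j] for j in keep]

# hypothetical data: 4 branches (rows) by 3 sites (columns)
matrix = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
filtered, kept = filter_sites(matrix, ["site1", "site2", "site3"], min_subs=2)
print(kept)
```

Raising the threshold reduces the number of nodes in the network, which in turn keeps the number of nodes closer to the number of observations (branches).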
- 16.
This is where the ability to customize the analysis implemented in the bayesgraph.bf script can be very useful. If you have prior information that a subset of codon sites are involved in a large number of interactions, the computational complexity of increasing the number of parents can be greatly reduced by modifying this parameter for only these sites.
- 17.
In an MCMC run, we observe autocorrelation when successive samples take parameter values that are very close together in the parameter space and are therefore unrepresentative of the true underlying posterior distribution. We try to decrease autocorrelation so that the MCMC sample provides a more precise estimate of the posterior distribution. One way to accomplish this is by down-sampling (thinning) the chain to every n-th step.
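The effect of down-sampling can be sketched in Python. The example below is an illustration only: it simulates a strongly autocorrelated series as a stand-in for an MCMC trace (it is not output from HyPhy), and the function names thin and lag1_autocorrelation are hypothetical.

```python
import random

def thin(chain, n):
    """Keep every n-th step of the chain, starting from the first."""
    return chain[::n]

def lag1_autocorrelation(x):
    """Sample correlation between consecutive values of a series."""
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

# simulated stand-in for a trace: each value depends strongly on the previous one
rng = random.Random(1)
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + rng.gauss(0, 1))

thinned = thin(chain, 20)
print(lag1_autocorrelation(chain), lag1_autocorrelation(thinned))
```

The thinned series is shorter, but its consecutive values are much less correlated than those of the original series.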
- 18.
We have provided most of the data files in this example on our GitHub repository at https://github.com/PoonLab/comet-prot/tree/master/data.
- 19.
To generate an amino acid sequence from the column labels, we used the regular expression "[0-9]+,*" to replace all instances with an empty string. In Python, this can be achieved with the re module: seq = re.sub('[0-9]+,*', '', header.strip()), where header is a string variable containing the first line of the CSV file.
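A self-contained version of this step might look as follows. The header string is a hypothetical example that assumes each column label consists of a one-letter amino acid followed by its position; in practice the header would be read from the first line of your CSV file.

```python
import re

# hypothetical first line of the CSV file: residue letter followed by position
header = "M1,S2,T3,N4,P5\n"

# remove every run of digits together with any commas that follow it
seq = re.sub("[0-9]+,*", "", header.strip())
print(seq)
```

If your column labels follow a different convention, the regular expression will need to be adjusted accordingly.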
- 20.
This can be accomplished with the following R commands:
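library(coda)
# read the trace file of each chain (the files have no header row)
chain1 <- read.csv("chain1.trace.csv", header=FALSE)
chain2 <- read.csv("chain2.trace.csv", header=FALSE)
# combine the first column of each trace into a single mcmc.list object
chains <- mcmc.list(mcmc(chain1$V1), mcmc(chain2$V1))
# Gelman-Rubin diagnostic, without discarding the first half of each chain
gelman.diag(chains, autoburnin=FALSE)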
where the file names may be different for your run.
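To show what this diagnostic measures, here is a minimal Python sketch of the basic potential scale reduction factor of Gelman and Rubin [reference list]. It assumes chains of equal length and omits the additional corrections that coda applies, so its values will not match gelman.diag exactly; the function name gelman_rubin and the toy chains are hypothetical.

```python
from statistics import mean, variance

def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)   # mean within-chain variance
    between = n * variance(chain_means)          # between-chain variance
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return (pooled / within) ** 0.5

# toy chains: the first pair agrees, the second pair sits in different regions
agree = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
disagree = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
print(gelman_rubin(agree), gelman_rubin(disagree))
```

Values close to 1 suggest that the chains are sampling the same distribution, whereas values well above 1 indicate that they have not converged.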
References
Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14(8):1955–1963
Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681–692
Horner DS, Pirovano W, Pesole G (2007) Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform 9(1):46–56
Taylor WR, Hamilton RS, Sadowski MI (2013) Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 23(3):473–479
Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30(11):1072–1080
De Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249
Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Bioinf 18(4):309–317
Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180
Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4(2):45–61
Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6(9):e1000923
Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312(5770):111–114
Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 91(1):98–102
Olmea O, Rost B, Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293(5):1221–1239
Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17(1):164–178
Tillier ER, Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755
Martin L, Gloor GB, Dunn S, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124
Gouveia-Oliveira R, Pedersen AG (2007) Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2(1):12
Fernandes AD, Gloor GB (2010) Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9):1135–1139
Jeong CS, Kim D (2012) Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 25(11):705–713
Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125(1):1–15
Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358
Wollenberg KR, Atchley WR (2000) Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci 97(7):3288–3291
Gloor GB, Martin LC, Wahl LM, Dunn SD (2005) Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 44(19):7156–7165
Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287(1):187–198
Tufféry P, Darlu P (2000) Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol 17(11):1753–1759
Poon AFY, Lewis FI, Pond SLK, Frost SDW (2007) An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol 3(11):e231
Talavera D, Lovell SC, Whelan S (2015) Covariation is a poor measure of molecular coevolution. Mol Biol Evol 32(9):2456–2468
Fodor AA, Aldrich RW (2004) Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct Funct Bioinf 56(2):211–221
Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
Friedman N, Koller D (2003) Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1–2):95–125
Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21(5):676–679
Delport W, Poon AFY, Frost SDW, Kosakovsky Pond SL (2010) Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 26(19):2455–2457
Poon AFY, Lewis FI, Frost SDW, Kosakovsky Pond SL (2008) Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics 24(17):1949–1950
Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321
Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3):e9490
Holmes S (2003) Bootstrapping phylogenetic trees: theory and methods. Stat Sci 18:241–255
Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11(5):715–724
Yang Z (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10(6):1396–1401
Felsenstein J, Churchill GA (1996) A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13(1):93–104
Swofford D, Begle DP (1993) PAUP: Phylogenetic analysis using parsimony, Version 3.1, March 1993. Center for Biodiversity, Illinois Natural History Survey
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526
Posada D (2003) Using MODELTEST and PAUP* to select a model of nucleotide substitution. Curr Protoc Bioinformatics 6–5. https://doi.org/10.1002/0471250953.bi0605s00
Maddison DR, Swofford DL, Maddison WP (1997) NEXUS: an extensible file format for systematic information. Syst Biol 46(4):590–621
Joy JB, Liang RH, McCloskey RM, Nguyen T, Poon AFY (2016) Ancestral reconstruction. PLoS Comput Biol 12(7):e1004763
Nielsen R (2002) Mapping mutations on phylogenies. Syst Biol 51(5):729–739
Pupko T, Pe'er I, Shamir R, Graur D (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol 17(6):890–896
Ellson J, Gansner E, Koutsofios L, North SC, Woodhull G (2001) Graphviz—open source graph drawing tools. In: International symposium on graph drawing. Springer, Berlin, pp 483–484
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Bastian M, Heymann S, Jacomy M et al (2009) Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the third international ICWSM conference, vol 8, pp 361–362
Simmonds P (2004) Genetic diversity and evolution of hepatitis C virus–15 years on. J Gen Virol 85(11):3173–3188
Blach S, Zeuzem S, Manns M, Altraif I, Duberg AS, Muljono DH, Waked I, Alavian SM, Lee MH, Negro F et al (2017) Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. Lancet Gastroenterol Hepatol 2(3):161–176
Campo D, Dimitrova Z, Mitchell RJ, Lara J, Khudyakov Y (2008) Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci 105(28):9685–9690
Aurora R, Donlin MJ, Cannon NA, Tavis JE (2009) Genome-wide hepatitis C virus amino acid covariance networks can predict response to antiviral therapy in humans. J Clin Invest 119(1):225–236
McCloskey RM, Liang RH, Joy JB, Krajden M, Montaner JS, Harrigan PR, Poon AF (2014) Global origin and transmission of hepatitis C virus nonstructural protein 3 Q80K polymorphism. J Infect Dis 211(8):1288–1295
Poveda E, Wyles DL, Mena Á, Pedreira JD, Castro-Iglesias Á, Cachay E (2014) Update on hepatitis C virus resistance to direct-acting antiviral agents. Antivir Res 108:181–191
Combet C, Garnier N, Charavay C, Grando D, Crisan D, Lopez J, Dehne-Garcia A, Geourjon C, Bettler E, Hulo C et al (2006) euHCVdb: the European hepatitis C virus database. Nucleic Acids Res 35(Suppl_1):D363–D366
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780
Larsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30(22):3276–3278
Darriba D, Taboada GL, Doallo R, Posada D (2012) jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9(8):772
Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696–704
Yu G, Smith DK, Zhu H, Guan Y, Lam TTY (2017) ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol 8(1):28–36
Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11
Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472
Ranjith-Kumar C, Kao CC (2006) Biochemical activities of the HCV NS5B RNA-dependent RNA polymerase. In: Tan S (ed) Hepatitis C viruses: genomes and molecular biology. Horizon Bioscience, Norfolk, pp 293–310
Hong Z, Cameron CE, Walker MP, Castro C, Yao N, Lau JY, Zhong W (2001) A novel mechanism to ensure terminal initiation by hepatitis C virus NS5B polymerase. Virology 285(1):6–11
Acknowledgements
This study was supported in part by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-131), and by grants from the Canadian Institutes of Health Research (PJT-153391 and BOP-149562). AFYP was supported by a CIHR New Investigator Award (FRN-130609).
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Avino, M., Poon, A.F.Y. (2019). Detecting Amino Acid Coevolution with Bayesian Graphical Models. In: Sikosek, T. (eds) Computational Methods in Protein Evolution. Methods in Molecular Biology, vol 1851. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8736-8_6
DOI: https://doi.org/10.1007/978-1-4939-8736-8_6
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8735-1
Online ISBN: 978-1-4939-8736-8
eBook Packages: Springer Protocols