A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families

Wells, Jonathan N.; Marsh, Joseph A.

doi:10.1007/978-1-4939-8736-8_13

Part of the book series: Methods in Molecular Biology (MIMB, volume 1851)

Abstract

Reconstructing evolutionary relationships in repeat proteins is notoriously difficult due to the high degree of sequence divergence that typically occurs between duplicated repeats. This is complicated further by the fact that proteins with a large number of similar repeats are more likely to produce significant local sequence alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence alignments are sometimes impossible to achieve in cases where insertion or translocation events disrupt the order of repeats in one of the sequences being aligned. Combined, these attributes make traditional phylogenetic methods for studying protein families unreliable for repeat proteins, due to the dependence of such methods on accurate sequence alignment.

We present here a practical solution to this problem, combining graph clustering with the open-source software package HH-suite, which enables highly sensitive detection of sequence relationships. Multiple rounds of homology searches, carried out by aligning profile hidden Markov models, generate large sets of related proteins. The relationships between proteins in these sets are then represented as graphs, and subsequent clustering with the Markov cluster algorithm enables robust detection of repeat protein subfamilies.
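The clustering step can be illustrated with a minimal Python sketch of the Markov cluster (MCL) procedure: alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation) on a weighted protein similarity graph. This is not the chapter's implementation; the function name `markov_cluster`, its parameters, and the toy edge weights are assumptions made purely for illustration, and in practice the weights would be derived from the profile HMM search results.

```python
import numpy as np


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering of a weighted, undirected graph.

    adjacency: square matrix of non-negative edge weights (hypothetical
    similarity scores between proteins). Returns a sorted list of clusters,
    each a sorted list of node indices.
    """
    m = np.asarray(adjacency, dtype=float)
    m = m + np.eye(m.shape[0])            # self-loops keep the walk aperiodic
    m = m / m.sum(axis=0, keepdims=True)  # make every column sum to 1

    for _ in range(max_iter):
        prev = m
        m = m @ m                             # expansion: two-step random walk
        m = np.power(m, inflation)            # inflation: favour strong flows
        m = m / m.sum(axis=0, keepdims=True)  # renormalise the columns
        if np.allclose(m, prev, atol=tol):
            break

    # Rows that still carry flow belong to attractor nodes; the columns with
    # non-zero entries in such a row form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(np.flatnonzero(row > 1e-6).tolist())
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical example: two tightly linked triples of proteins (0-2 and 3-5)
# joined by a single weak similarity between proteins 2 and 3.
example_graph = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
example_clusters = markov_cluster(example_graph)
```

In this toy graph the weak link between the two triples is suppressed by inflation, so each triple is reported as its own cluster. Raising `inflation` generally produces finer-grained clusters, which is the main parameter to explore when looking for subfamilies.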

Author information

Authors and Affiliations

MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
Jonathan N. Wells & Joseph A. Marsh

Corresponding author

Correspondence to Jonathan N. Wells.

Editor information

Editors and Affiliations

GlaxoSmithKline, Cellzome – a GSK company, Meyerhofstrasse 1, Heidelberg, Baden-Württemberg, Germany
Tobias Sikosek

About this protocol

Cite this protocol

Wells, J.N., Marsh, J.A. (2019). A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families. In: Sikosek, T. (eds) Computational Methods in Protein Evolution. Methods in Molecular Biology, vol 1851. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8736-8_13

DOI: https://doi.org/10.1007/978-1-4939-8736-8_13
Published: 27 September 2018
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8735-1
Online ISBN: 978-1-4939-8736-8
eBook Packages: Springer Protocols
