Progress on Constructing Phylogenetic Networks for Languages

Warnow, Tandy; Evans, Steven N.; Nakhleh, Luay

doi:10.1007/978-3-031-48959-4_3

Abstract

In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model (hereafter, the WERN 2006 model) of multi-state linguistic character evolution that allowed for homoplasy and borrowing. They proved that if there is no borrowing between languages and homoplastic states are known in advance, then the phylogenetic tree of a set of languages is statistically identifiable under this model, and they presented statistically consistent methods for estimating these phylogenetic trees. However, they left open the question of whether a phylogenetic network—which would explicitly model borrowing between languages that are in contact—can be estimated under the model of character evolution. Here, we establish that under some mild additional constraints on the WERN 2006 model, the phylogenetic network topology is statistically identifiable and we present algorithms to infer the phylogenetic network. We discuss the ramifications for linguistic phylogenetic network estimation in practice and suggest directions for future research.

References

Atkinson, Q., Nicholls, G., Welch, D., & Gray, R. (2005). From words to dates. Water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society, 103, 193–219.
Barbançon, F., Evans, S. N., Nakhleh, L., Ringe, D., & Warnow, T. (2013). An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica, 30, 143–170.
Boc, A., Di Sciullo, A. M., & Makarenkov, V. (2010). Classification of the Indo-European languages using a phylogenetic network approach. In H. Locarek-Junge & C. Weihs (Eds.), Classification as a tool for research (pp. 647–655). Springer.
Calude, A. S., & Verkerk, A. (2016). The typology and diachrony of higher numerals in Indo-European. A phylogenetic comparative study. Journal of Language Evolution, 1, 91–108.
Cao, Z., Zhu, J., & Nakhleh, L. (2019). Empirical performance of tree-based inference of phylogenetic networks. In K. T. Huber & D. Gusfield (Eds.), 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), vol. 143 of Leibniz International Proceedings in Informatics (LIPIcs), 21:1–21:13. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
Chang, W., Hall, D., Cathcart, C., & Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91, 194–244.
Choy, C., Jansson, J., Sadakane, K., & Sung, W.-K. (2004). Computing the maximum agreement of phylogenetic networks. Electronic Notes in Theoretical Computer Science, 91, 134–147.
Dunn, M., Terrill, A., Reesink, G., Foley, R. A., & Levinson, S. C. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309(5743), 2072–2075.
Francis, A. R., & Steel, M. (2015). Which phylogenetic networks are merely trees with additional arcs? Systematic Biology, 64, 768–777.
Gambette, P., Berry, V., & Paul, C. (2012). Quartets and unrooted phylogenetic networks. Journal of Bioinformatics and Computational Biology, 10(04), 1250004.
Goldstein, D. (2020). Indo-European phylogenetics with R. A tutorial introduction. Indo-European Linguistics, 8, 110–180.
Goldstein, D. (2022). Correlated grammaticalization. The rise of articles in Indo-European. Diachronica, 39, 658–706.
Gray, R. D., Drummond, A. J., & Greenhill, S. J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science, 323(5913), 479–483.
Gusfield, D. (2014). ReCombinatorics. The algorithmics of ancestral recombination graphs and explicit phylogenetic networks. MIT Press.
Gusfield, D., Eddhu, S., & Langley, C. (2004). Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2, 173–213.
Haak, W., Lazaridis, I., Patterson, N., Rohland, N., Mallick, S., Llamas, B., Brandt, G., Nordenfelt, S., Harney, E., Stewardson, K., et al. (2015). Massive migration from the steppe was a source for Indo-European languages in Europe. Nature, 522(7555), 207–211.
Jacques, G., & List, J.-M. (2019). Save the trees. Why we need tree models in linguistic reconstruction (and when we should apply them). Journal of Historical Linguistics, 9, 128–167.
Keijsper, J. C. M., & Pendavingh, R. A. (2014). Reconstructing a phylogenetic level-1 network from quartets. Bulletin of Mathematical Biology, 76, 2517–2541.
Kelly, L. J., & Nicholls, G. K. (2017). Lateral transfer in stochastic Dollo models. The Annals of Applied Statistics, 11, 1146–1168.
McMahon, A., & McMahon, R. (2006). Why linguists don’t do dates. Evidence from Indo-European and Australian languages. In Forster & Renfrew, 2006, 153–160.
Mirarab, S., Reaz, R., Bayzid, M. S., Zimmermann, T., Swenson, M. S., & Warnow, T. (2014). ASTRAL. Genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17), i541–i548.
Nakhleh, L., Ringe, D., & Warnow, T. (2005). Perfect phylogenetic networks. A new methodology for reconstructing the evolutionary history of natural languages. Language, 382–420.
Nelson-Sathi, S., List, J.-M., Geisler, H., Fangerau, H., Gray, R. D., Martin, W., & Dagan, T. (2011). Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society B: Biological Sciences, 278(1713), 1794–1803.
Nichols, J., & Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2, 760–820.
Reaz, R., Bayzid, M. S., & Rahman, M. S. (2014). Accurate phylogenetic tree reconstruction from quartets. A heuristic approach. PLoS ONE, 9(8), e104008.
Rexová, K., Frynta, D., & Zrzavý, J. (2003). Cladistic analysis of languages. Indo-European classification based on lexicostatistical data. Cladistics, 19, 120–127.
Ringe, D., Warnow, T., & Taylor, A. (2002). Indo-European and computational cladistics. Transactions of the Philological Society, 100, 59–129.
Skelton, C. M. (2015). Borrowing, character weighting, and preliminary cluster analysis in a phylogenetic analysis of the ancient Greek dialects. Indo-European Linguistics, 3, 84–117.
Snir, S., & Rao, S. (2012). Quartet MaxCut. A fast algorithm for amalgamating quartet trees. Molecular Phylogenetics and Evolution, 62, 1–8.
Warnow, T. (2017). Computational phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press.
Warnow, T., Evans, S. N., Ringe, D., & Nakhleh, L. (2006). A stochastic model of language evolution that incorporates homoplasy and borrowing. In Forster & Renfrew, 2006, 75–90.
Warnow, T., Evans, S. N., & Nakhleh, L. (2023). Progress on constructing phylogenetic networks for languages. arXiv preprint arXiv:2306.06298.

Acknowledgments

The authors wish to thank Don Ringe for the inspiration we experienced in working with him over these decades. TW also thanks the Grainger Foundation for support of this research.

Author information

Authors and Affiliations

University of Illinois, Urbana-Champaign, Urbana, IL, USA
Tandy Warnow
University of California at Berkeley, Berkeley, CA, USA
Steven N. Evans
Rice University, Houston, TX, USA
Luay Nakhleh

Corresponding author

Correspondence to Tandy Warnow.

Editor information

Editors and Affiliations

Department of English, Virginia Tech, Blacksburg, VA, USA
Joseph F. Eska
Ludwig-Maximilians-Universität München, Munich, Germany
Olav Hackstein
Collegium Heliodori, Adam Mickiewicz University in Poznań, Poznań, Poland
Ronald I. Kim
World Languages, Minot State University, Minot, ND, USA
Jean-François Mondon

Appendix

We restate and then sketch proofs for Theorems 1–3.

Theorem 1

Let N be a rooted phylogenetic network, let characters evolve down N under the WERN 2023 model, and let Q be the output of QTC. Then every quartet tree in Q is in Q(N_r). Furthermore, as the number of characters increases, with probability converging to 1, every quartet tree in Q(N_r) will appear in Q. Thus, QTC is a statistically consistent estimator of Q(N_r).

Proof

We begin by showing that every quartet tree placed in Q is also in Q(N_r). Recall that quartet tree uv|xy is included in Q if and only if some character α is found such that α(u) = α(v) ≠ α(x) = α(y) and the states α(u), α(x) are non-homoplastic. This character evolves down some tree T contained inside the network. Moreover, since the states exhibited at u, v, x, y are non-homoplastic, there is a path in T connecting u and v and another path connecting x and y and these two paths do not share any vertices. Hence, the quartet tree uv|xy is in Q(N_r).
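The inclusion rule recalled in this paragraph can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the authors' implementation of QTC: the function name `qtc`, the dictionary encoding of characters, and the per-character sets of homoplastic states (assumed known in advance) are all hypothetical choices made for the example.

```python
from itertools import combinations


def qtc(characters, homoplastic=None):
    """Collect quartet trees uv|xy witnessed by some character.

    characters  : list of dicts mapping leaf name -> state (hypothetical encoding)
    homoplastic : list of sets, one per character, holding the states assumed
                  to be homoplastic (defaults to "no homoplastic states")
    A quartet tree uv|xy is stored as a frozenset of two frozensets,
    so that uv|xy and xy|uv compare equal.
    """
    if homoplastic is None:
        homoplastic = [set() for _ in characters]
    quartets = set()
    for alpha, bad in zip(characters, homoplastic):
        # Group the leaves by state, keeping only non-homoplastic states.
        groups = {}
        for leaf, state in alpha.items():
            if state not in bad:
                groups.setdefault(state, []).append(leaf)
        # Two distinct non-homoplastic states, each shown by at least two
        # leaves, witness alpha(u) = alpha(v) != alpha(x) = alpha(y).
        for s, t in combinations(groups, 2):
            for u, v in combinations(sorted(groups[s]), 2):
                for x, y in combinations(sorted(groups[t]), 2):
                    quartets.add(frozenset({frozenset({u, v}),
                                            frozenset({x, y})}))
    return quartets


chars = [{"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}]
print(qtc(chars))          # one quartet tree: ab|cd
print(qtc(chars, [{2}]))   # state 2 treated as homoplastic: no quartet trees
```

Grouping leaves by state first keeps the work per character proportional to the number of witnessed quartets rather than to all four-leaf subsets.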

We now show that in the limit, every quartet tree in Q(N_r) is also in Q. Let ab|cd be a quartet tree in Q(N_r). Hence, there is a rooted tree T contained in N that induces this quartet tree (when T is considered as an unrooted tree). With positive probability, a character will evolve down T. Without loss of generality, assume a and b are siblings in the rooted version of T, so that their least common ancestor, lca_T(a, b), lies strictly below the root of the tree T.

Since a and b are siblings, there is an edge e above lca_T(a, b) within T. It follows that the probability that a random character evolves down T, selecting a non-homoplastic state at the root, and then changing on e, but on no other edge in T, is strictly positive. Note that for any such character α, we have α(a) = α(b) and α(c) = α(d), where α(a) and α(c) are different and both are non-homoplastic states. In such a case, Q will include quartet tree ab|cd. Thus, in the limit as the number of characters increases, with probability converging to 1, Q will contain every quartet tree in Q(N_r).
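The limiting step can be written out as a short bound. The sketch below assumes, for illustration, that the k input characters evolve independently and identically; the symbols k, p_q (the probability that a single character witnesses quartet tree q) and p_min (the smallest such probability) are notation introduced here, not taken from the chapter.

```latex
\[
  \Pr[\, q \notin Q \,] = (1 - p_q)^{k},
  \qquad
  \Pr[\, Q(N_r) \not\subseteq Q \,]
  \;\le\; \sum_{q \in Q(N_r)} (1 - p_q)^{k}
  \;\le\; |Q(N_r)| \, (1 - p_{\min})^{k}
  \;\xrightarrow[k \to \infty]{}\; 0 .
\]
```

Because Q(N_r) is finite and every p_q is strictly positive, p_min is strictly positive, so the union bound on the right decays geometrically in k.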

Since in the limit Q ⊆ Q(N_r) and Q(N_r) ⊆ Q, it follows that Q = Q(N_r) with probability converging to 1.

Theorem 2

The QBTE (Quartet-Based Topology Estimation) method is statistically consistent for estimating the unrooted topology of the network N under the WERN 2023 model when the rooted network N is a level-1 network where all cycles have length at least six; furthermore, QBTE runs in polynomial time.

Proof

By Theorem 1, as the number of characters increases, the output of QTC equals Q(N_r) with probability converging to 1. Due to space constraints, we direct the reader to Warnow et al. (2023), https://arxiv.org/abs/2306.06298, for the rest of the proof.

Theorem 3

Let N be the true unrooted level-1 network and let C₀ denote the set of homoplasy-free phonological characters that exhibit both 0 and 1 at the leaves of N. Rooting N on any edge returned by Root-Network will produce a rooted network on which all characters in C₀ can evolve without homoplasy, and the edge containing the true location of the root will be in the output returned by Root-Network. Furthermore, when given the unrooted topology of the true phylogenetic network as input, Root-Network is a statistically consistent estimator of the root location under the assumption that the probability of homoplasy-free phonological characters is positive.

Proof

We sketch the proof due to space constraints. It is straightforward to verify that an edge is colored red for a character α if and only if subdividing the edge and labeling the introduced node by 0 for α makes α homoplastic on every tree contained within the network. Furthermore, it is not hard to see that if we root the network on any edge that remains green throughout Root-Network, then all characters in C₀ will be homoplasy-free. As a result, the first part of the theorem is established.

For the second part of the theorem, if the probability of homoplasy-free phonological characters is positive, then with probability converging to 1, for every edge in the true network there is a character α that changes on that edge but on no other edge; hence, α will be non-constant and homoplasy-free. Let e₁ and e₂ be the two edges incident to the root, and suppose the input set of characters contains homoplasy-free characters α₁ and α₂ that change on e₁ and e₂, respectively. These two characters will mark as red every edge below e₁ and e₂. In the unrooted topology for N, the root is suppressed and edges e₁ and e₂ are merged into a single edge, e. Hence, when Root-Network is applied to the unrooted topology for N, if characters α₁ and α₂ are in the input, then the only edge that is not colored red will be the edge e containing the suppressed root. In conclusion, since the probability of homoplasy-free phonological characters is strictly positive, as the number of such characters increases, the probability that every edge other than the root edge is red converges to 1. Thus, Root-Network will leave only the single edge containing the suppressed root green, establishing that it is statistically consistent for locating the root in the network.
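The coloring argument can be illustrated on the simplest case, a network with no cycles (that is, a tree). The sketch below is not the Root-Network procedure itself, which operates on level-1 networks and is specified in the full paper. It assumes that 0 is the ancestral state and uses our reading of the red rule specialised to trees: placing a node in state 0 on an edge forces homoplasy exactly when leaves in state 1 appear on both sides of that edge. The function name `green_edges` and the edge-list encoding are hypothetical.

```python
def green_edges(edges, characters):
    """Return the edges of an unrooted tree that no character colors red.

    edges      : list of (u, v) pairs describing an unrooted tree
    characters : list of dicts mapping leaf name -> 0 or 1, each assumed
                 to be homoplasy-free with ancestral state 0
    Tree-only illustration: an edge is treated as red for a character when
    leaves in state 1 occur on both sides of it.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    green = []
    for u, v in edges:
        # Depth-first search from u that never crosses the edge {u, v}.
        seen, stack = {u, v}, [u]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Leaves (degree-1 nodes) on u's side of the removed edge.
        u_side = {n for n in seen if n != v and len(adj[n]) == 1}

        red = False
        for alpha in characters:
            ones = {leaf for leaf, state in alpha.items() if state == 1}
            if ones & u_side and ones - u_side:
                red = True
                break
        if not red:
            green.append((u, v))
    return green


# Unrooted version of ((a,b),(c,d)): the suppressed root lies on edge p-q.
tree = [("p", "q"), ("p", "a"), ("p", "b"), ("q", "c"), ("q", "d")]
alpha1 = {"a": 1, "b": 1, "c": 0, "d": 0}  # changes on e1, above a and b
alpha2 = {"a": 0, "b": 0, "c": 1, "d": 1}  # changes on e2, above c and d
print(green_edges(tree, [alpha1, alpha2]))  # [('p', 'q')]
```

In this toy example the two characters that change on the root's incident edges color every other edge red, which mirrors the role of α₁ and α₂ in the proof.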

About this chapter

Cite this chapter

Warnow, T., Evans, S.N., Nakhleh, L. (2024). Progress on Constructing Phylogenetic Networks for Languages. In: Eska, J.F., Hackstein, O., Kim, R.I., Mondon, JF. (eds) The Method Works. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-48959-4_3

DOI: https://doi.org/10.1007/978-3-031-48959-4_3
Published: 04 July 2024
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-48958-7
Online ISBN: 978-3-031-48959-4
eBook Packages: Social Sciences (R0)
