Abstract
In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model (hereafter, the WERN 2006 model) of multi-state linguistic character evolution that allowed for homoplasy and borrowing. They proved that if there is no borrowing between languages and homoplastic states are known in advance, then the phylogenetic tree of a set of languages is statistically identifiable under this model, and they presented statistically consistent methods for estimating these phylogenetic trees. However, they left open the question of whether a phylogenetic network—which would explicitly model borrowing between languages that are in contact—can be estimated under the model of character evolution. Here, we establish that under some mild additional constraints on the WERN 2006 model, the phylogenetic network topology is statistically identifiable and we present algorithms to infer the phylogenetic network. We discuss the ramifications for linguistic phylogenetic network estimation in practice and suggest directions for future research.
References
Atkinson, Q., Nicholls, G., Welch, D., & Gray, R. (2005). From words to dates. Water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society, 103, 193–219.
Barbançon, F., Evans, S. N., Nakhleh, L., Ringe, D., & Warnow, T. (2013). An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica, 30, 143–170.
Boc, A., Di Sciullo, A. M., & Makarenkov, V. (2010). Classification of the Indo-European languages using a phylogenetic network approach. In H. Locarek-Junge & C. Weihs (Eds.), Classification as a tool for research (pp. 647–655). Springer.
Calude, A. S., & Verkerk, A. (2016). The typology and diachrony of higher numerals in Indo-European. A phylogenetic comparative study. Journal of Language Evolution, 1, 91–108.
Cao, Z., Zhu, J., & Nakhleh, L. (2019). Empirical performance of tree-based inference of phylogenetic networks. In K. T. Huber & D. Gusfield (Eds.), 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), vol. 143 of Leibniz International Proceedings in Informatics (LIPIcs), 21:1–21:13. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
Chang, W., Hall, D., Cathcart, C., & Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91, 194–244.
Choy, C., Jansson, J., Sadakane, K., & Sung, W.-K. (2004). Computing the maximum agreement of phylogenetic networks. Electronic Notes in Theoretical Computer Science, 91, 134–147.
Dunn, M., Terrill, A., Reesink, G., Foley, R. A., & Levinson, S. C. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309(5743), 2072–2075.
Francis, A. R., & Steel, M. (2015). Which phylogenetic networks are merely trees with additional arcs? Systematic Biology, 64, 768–777.
Gambette, P., Berry, V., & Paul, C. (2012). Quartets and unrooted phylogenetic networks. Journal of Bioinformatics and Computational Biology, 10(04), 1250004.
Goldstein, D. (2020). Indo-European phylogenetics with R. A tutorial introduction. Indo-European Linguistics, 8, 110–180.
Goldstein, D. (2022). Correlated grammaticalization. The rise of articles in Indo-European. Diachronica, 39, 658–706.
Gray, R. D., Drummond, A. J., & Greenhill, S. J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science, 323(5913), 479–483.
Gusfield, D. (2014). ReCombinatorics. The algorithmics of ancestral recombination graphs and explicit phylogenetic networks. MIT Press.
Gusfield, D., Eddhu, S., & Langley, C. (2004). Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2, 173–213.
Haak, W., Lazaridis, I., Patterson, N., Rohland, N., Mallick, S., Llamas, B., Brandt, G., Nordenfelt, S., Harney, E., Stewardson, K., et al. (2015). Massive migration from the steppe was a source for Indo-European languages in Europe. Nature, 522(7555), 207–211.
Jacques, G., & List, J.-M. (2019). Save the trees. Why we need tree models in linguistic reconstruction (and when we should apply them). Journal of Historical Linguistics, 9, 128–167.
Keijsper, J. C. M., & Pendavingh, R. A. (2014). Reconstructing a phylogenetic level-1 network from quartets. Bulletin of Mathematical Biology, 76, 2517–2541.
Kelly, L. J., & Nicholls, G. K. (2017). Lateral transfer in stochastic Dollo models. The Annals of Applied Statistics, 11, 1146–1168.
McMahon, A., & McMahon, R. (2006). Why linguists don’t do dates. Evidence from Indo-European and Australian languages. In Forster & Renfrew, 2006, 153–160.
Mirarab, S., Reaz, R., Bayzid, M. S., Zimmermann, T., Swenson, M. S., & Warnow, T. (2014). ASTRAL. Genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17), i541–i548.
Nakhleh, L., Ringe, D., & Warnow, T. (2005). Perfect phylogenetic networks. A new methodology for reconstructing the evolutionary history of natural languages. Language, 382–420.
Nelson-Sathi, S., List, J.-M., Geisler, H., Fangerau, H., Gray, R. D., Martin, W., & Dagan, T. (2011). Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society B: Biological Sciences, 278(1713), 1794–1803.
Nichols, J., & Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2, 760–820.
Reaz, R., Bayzid, M. S., & Rahman, M. S. (2014). Accurate phylogenetic tree reconstruction from quartets. A heuristic approach. PLoS ONE, 9(8), e104008.
Rexová, K., Frynta, D., & Zrzavý, J. (2003). Cladistic analysis of languages. Indo-European classification based on lexicostatistical data. Cladistics, 19, 120–127.
Ringe, D., Warnow, T., & Taylor, A. (2002). Indo-European and computational cladistics. Transactions of the Philological Society, 100, 59–129.
Skelton, C. M. (2015). Borrowing, character weighting, and preliminary cluster analysis in a phylogenetic analysis of the ancient Greek dialects. Indo-European Linguistics, 3, 84–117.
Snir, S., & Rao, S. (2012). Quartet MaxCut. A fast algorithm for amalgamating quartet trees. Molecular Phylogenetics and Evolution, 62, 1–8.
Warnow, T. (2017). Computational phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press.
Warnow, T., Evans, S. N., Ringe, D., & Nakhleh, L. (2006). A stochastic model of language evolution that incorporates homoplasy and borrowing. In Forster & Renfrew, 2006, 75–90.
Warnow, T., Evans, S. N., & Nakhleh, L. (2023). Progress on constructing phylogenetic networks for languages. arXiv preprint arXiv:2306.06298.
Acknowledgments
The authors wish to thank Don Ringe for the inspiration we experienced in working with him over these decades. TW also thanks the Grainger Foundation for support of this research.
Appendix
We restate and then sketch proofs for Theorems 1–3.
Theorem 1
Let N be a rooted phylogenetic network, let characters evolve down N under the WERN 2023 model, and let Q be the output of QTC. Then every quartet tree in Q is in Q(Nr). Furthermore, as the number of characters increases, every quartet tree in Q(Nr) appears in Q with probability converging to 1. Thus, QTC is a statistically consistent estimator of Q(Nr).
Proof
We begin by showing that every quartet tree placed in Q is also in Q(Nr). Recall that quartet tree uv|xy is included in Q if and only if some character α is found such that α(u) = α(v) ≠ α(x) = α(y) and the states α(u), α(x) are non-homoplastic. This character evolves down some tree T contained inside the network. Moreover, since the states exhibited at u, v, x, y are non-homoplastic, there is a path in T connecting u and v and another path connecting x and y and these two paths do not share any vertices. Hence, the quartet tree uv|xy is in Q(Nr).
We now show that in the limit, every quartet tree in Q(Nr) is also in Q. Let ab|cd be a quartet tree in Q(Nr). Hence, there is a rooted tree T contained in N that induces this quartet tree (when T is considered as an unrooted tree). With positive probability, a character will evolve down T. Without loss of generality, assume a and b are siblings in the rooted version of T, so that their least common ancestor, lcaT(a, b), lies strictly below the root of the tree T.
Since lcaT(a, b) lies strictly below the root, there is an edge e above lcaT(a, b) within T. It follows that the probability that a random character evolves down T, selecting a non-homoplastic state at the root, and then changes to another non-homoplastic state on e, but on no other edge in T, is strictly positive. Note that for any such character α, we have α(a) = α(b) and α(c) = α(d), where α(a) and α(c) are different and both are non-homoplastic states. In such a case, Q will include quartet tree ab|cd. Thus, in the limit as the number of characters increases, with probability converging to 1, Q will contain every quartet tree in Q(Nr).
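The limiting step can be written out as a short bound. This is only a sketch: it assumes that the k input characters evolve independently and identically, and the symbols k and p_q (a lower bound on the probability that a single character certifies the quartet tree q in the way just described) are notation introduced here for illustration. Because Q(Nr) is finite, a union bound suffices.

```latex
\[
\Pr[\, q \notin Q \,] \le (1 - p_q)^k ,
\qquad
\Pr[\, Q(N_r) \not\subseteq Q \,]
  \le \sum_{q \in Q(N_r)} (1 - p_q)^k
  \longrightarrow 0 \quad (k \to \infty),
\]
% each p_q > 0 by the argument above, so every term of the finite sum vanishes
```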
Since in the limit Q ⊆ Q(Nr) and Q(Nr) ⊆ Q, it follows that Q = Q(Nr) with probability converging to 1.
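The inclusion rule used in the proof (add uv|xy whenever some character gives u and v one non-homoplastic state and x and y a different non-homoplastic state) can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' implementation: the function names, the dictionary encoding of characters, and the assumption that the homoplastic states of each character are supplied as an input set are all hypothetical choices made here.

```python
from collections import defaultdict
from itertools import combinations


def quartet(u, v, x, y):
    """Order-free encoding of the quartet tree uv|xy (hypothetical helper)."""
    return frozenset({frozenset({u, v}), frozenset({x, y})})


def qtc_quartets(characters, homoplastic_states):
    """Collect quartet trees certified by at least one character.

    characters: list of dicts mapping language name -> state.
    homoplastic_states: list of sets, one per character, holding the
        states assumed to be known in advance as homoplastic.
    """
    found = set()
    for states, bad in zip(characters, homoplastic_states):
        # Group languages by state, skipping homoplastic states.
        groups = defaultdict(list)
        for language, state in states.items():
            if state not in bad:
                groups[state].append(language)
        # Two languages from one state plus two from a different state.
        for s1, s2 in combinations(groups, 2):
            for u, v in combinations(sorted(groups[s1]), 2):
                for x, y in combinations(sorted(groups[s2]), 2):
                    found.add(quartet(u, v, x, y))
    return found
```

For example, a single character with states {a: 1, b: 1, c: 2, d: 2, e: 3} and no homoplastic states contributes only ab|cd, and contributes nothing if state 2 is marked homoplastic.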
Theorem 2
The QBTE (Quartet-Based Topology Estimation) method is statistically consistent for estimating the unrooted topology of the network N under the WERN 2023 model when the rooted network N is a level-1 network where all cycles have length at least six; furthermore, QBTE runs in polynomial time.
Proof
By Theorem 1, we have shown that as the number of characters increases, we can construct Q(Nr). Due to space constraints, we direct the reader to Warnow et al. (2023) https://arxiv.org/abs/2306.06298 for the rest of the proof.
Theorem 3
Let N be the true unrooted level-1 network and let C0 denote the set of homoplasy-free phonological characters that exhibit both 0 and 1 at the leaves of N. Rooting N on any edge returned by Root-Network will produce a rooted network on which all characters in C0 can evolve without homoplasy, and the edge containing the true location of the root will be in the output returned by Root-Network. Furthermore, when given the unrooted topology of the true phylogenetic network as input, Root-Network is a statistically consistent estimator of the root location under the assumption that the probability of homoplasy-free phonological characters is positive.
Proof
We sketch the proof due to space constraints. It is straightforward to verify that an edge is colored red for a character α if and only if subdividing the edge and labeling the introduced node by 0 for α makes α homoplastic on every tree contained within the network. Furthermore, it is not hard to see that if we root the network on any edge that remains green throughout Root-Network, then all characters in C0 will be homoplasy-free. As a result, the first part of the theorem is established.
For the second part of the theorem, if the probability of homoplasy-free phonological characters is positive, then with probability converging to 1, for every edge in the true network, there is a character α that changes on that edge but on no other edge; hence, α will be non-constant and homoplasy-free. Let e1 and e2 be the two edges incident to the root, and suppose the input set of characters contains homoplasy-free characters α1 and α2 that change on e1 and e2, respectively. Then these two characters will mark as red every edge below e1 and e2. In the unrooted topology for N, the root is suppressed and edges e1 and e2 are merged into a single edge, e. Hence, when Root-Network is applied to the unrooted topology for N, if characters α1 and α2 are in the input, then the only edge that is not colored red will be the edge e containing the suppressed root. In conclusion, since the probability of homoplasy-free phonological characters is strictly positive, as the number of characters increases, the probability that every edge other than e is red converges to 1. Thus, Root-Network will leave only the edge containing the suppressed root green, establishing that it is statistically consistent for locating the root in the network.
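The coloring argument can be illustrated in the simplest special case, where the network has no reticulations and is therefore an unrooted tree. The sketch below is not Root-Network itself: it handles trees only, it assumes every input character is binary and already homoplasy-free on the tree (this is not checked), and the function names and edge-list encoding are hypothetical. Under those assumptions, placing a root labeled 0 on an edge forces homoplasy exactly when both sides of that edge contain a leaf in state 1, so such edges are treated as red.

```python
from collections import defaultdict


def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without crossing to `blocked`."""
    seen, stack, leaves = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:  # degree-1 nodes are leaves
            leaves.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return leaves


def green_edges(edges, characters):
    """Edges of an unrooted tree never colored red by any character.

    edges: list of (u, v) pairs; characters: list of dicts mapping
    leaf name -> 0 or 1 (tree-only sketch, see assumptions above).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    green = []
    for u, v in edges:
        side_u = leaves_on_side(adj, u, v)
        side_v = leaves_on_side(adj, v, u)
        # Red: state 1 occurs on both sides, so a root in state 0
        # placed on this edge would force state 1 to arise twice.
        red = any(
            any(ch[x] == 1 for x in side_u) and any(ch[x] == 1 for x in side_v)
            for ch in characters
        )
        if not red:
            green.append((u, v))
    return green
```

On the four-leaf tree with cherries {a, b} and {c, d}, one character that is 1 exactly on a and b and another that is 1 exactly on c and d leave only the internal edge green, mirroring the roles of α1 and α2 in the proof.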
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Warnow, T., Evans, S.N., Nakhleh, L. (2024). Progress on Constructing Phylogenetic Networks for Languages. In: Eska, J.F., Hackstein, O., Kim, R.I., Mondon, JF. (eds) The Method Works. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-48959-4_3
DOI: https://doi.org/10.1007/978-3-031-48959-4_3
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-48958-7
Online ISBN: 978-3-031-48959-4
eBook Packages: Social Sciences, Social Sciences (R0)