Progress on Constructing Phylogenetic Networks for Languages

The Method Works

Abstract

In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model (hereafter, the WERN 2006 model) of multi-state linguistic character evolution that allowed for homoplasy and borrowing. They proved that if there is no borrowing between languages and homoplastic states are known in advance, then the phylogenetic tree of a set of languages is statistically identifiable under this model, and they presented statistically consistent methods for estimating these phylogenetic trees. However, they left open the question of whether a phylogenetic network—which would explicitly model borrowing between languages that are in contact—can be estimated under the model of character evolution. Here, we establish that under some mild additional constraints on the WERN 2006 model, the phylogenetic network topology is statistically identifiable and we present algorithms to infer the phylogenetic network. We discuss the ramifications for linguistic phylogenetic network estimation in practice and suggest directions for future research.

References

  • Atkinson, Q., Nicholls, G., Welch, D., & Gray, R. (2005). From words to dates. Water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society, 103, 193–219.

  • Barbançon, F., Evans, S. N., Nakhleh, L., Ringe, D., & Warnow, T. (2013). An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica, 30, 143–170.

  • Boc, A., Di Sciullo, A. M., & Makarenkov, V. (2010). Classification of the Indo-European languages using a phylogenetic network approach. In H. Locarek-Junge & C. Weihs (Eds.), Classification as a tool for research (pp. 647–655). Springer.

  • Calude, A. S., & Verkerk, A. (2016). The typology and diachrony of higher numerals in Indo-European. A phylogenetic comparative study. Journal of Language Evolution, 1, 91–108.

  • Cao, Z., Zhu, J., & Nakhleh, L. (2019). Empirical performance of tree-based inference of phylogenetic networks. In K. T. Huber & D. Gusfield (Eds.), 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), vol. 143 of Leibniz International Proceedings in Informatics (LIPIcs), 21:1–21:13. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

  • Chang, W., Hall, D., Cathcart, C., & Garrett, A. (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91, 194–244.

  • Choy, C., Jansson, J., Sadakane, K., & Sung, W.-K. (2004). Computing the maximum agreement of phylogenetic networks. Electronic Notes in Theoretical Computer Science, 91, 134–147.

  • Dunn, M., Terrill, A., Reesink, G., Foley, R. A., & Levinson, S. C. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309(5743), 2072–2075.

  • Francis, A. R., & Steel, M. (2015). Which phylogenetic networks are merely trees with additional arcs? Systematic Biology, 64, 768–777.

  • Gambette, P., Berry, V., & Paul, C. (2012). Quartets and unrooted phylogenetic networks. Journal of Bioinformatics and Computational Biology, 10(04), 1250004.

  • Goldstein, D. (2020). Indo-European phylogenetics with R. A tutorial introduction. Indo-European Linguistics, 8, 110–180.

  • Goldstein, D. (2022). Correlated grammaticalization. The rise of articles in Indo-European. Diachronica, 39, 658–706.

  • Gray, R. D., Drummond, A. J., & Greenhill, S. J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science, 323(5913), 479–483.

  • Gusfield, D. (2014). ReCombinatorics. The algorithmics of ancestral recombination graphs and explicit phylogenetic networks. MIT Press.

  • Gusfield, D., Eddhu, S., & Langley, C. (2004). Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2, 173–213.

  • Haak, W., Lazaridis, I., Patterson, N., Rohland, N., Mallick, S., Llamas, B., Brandt, G., Nordenfelt, S., Harney, E., Stewardson, K., et al. (2015). Massive migration from the steppe was a source for Indo-European languages in Europe. Nature, 522(7555), 207–211.

  • Jacques, G., & List, J.-M. (2019). Save the trees. Why we need tree models in linguistic reconstruction (and when we should apply them). Journal of Historical Linguistics, 9, 128–167.

  • Keijsper, J. C. M., & Pendavingh, R. A. (2014). Reconstructing a phylogenetic level-1 network from quartets. Bulletin of Mathematical Biology, 76, 2517–2541.

  • Kelly, L. J., & Nicholls, G. K. (2017). Lateral transfer in stochastic Dollo models. The Annals of Applied Statistics, 11, 1146–1168.

  • McMahon, A., & McMahon, R. (2006). Why linguists don’t do dates. Evidence from Indo-European and Australian languages. In Forster & Renfrew, 2006, 153–160.

  • Mirarab, S., Reaz, R., Bayzid, M. S., Zimmermann, T., Swenson, M. S., & Warnow, T. (2014). ASTRAL. Genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17), i541–i548.

  • Nakhleh, L., Ringe, D., & Warnow, T. (2005). Perfect phylogenetic networks. A new methodology for reconstructing the evolutionary history of natural languages. Language, 382–420.

  • Nelson-Sathi, S., List, J.-M., Geisler, H., Fangerau, H., Gray, R. D., Martin, W., & Dagan, T. (2011). Networks uncover hidden lexical borrowing in Indo-European language evolution. Proceedings of the Royal Society B: Biological Sciences, 278(1713), 1794–1803.

  • Nichols, J., & Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2, 760–820.

  • Reaz, R., Bayzid, M. S., & Rahman, M. S. (2014). Accurate phylogenetic tree reconstruction from quartets. A heuristic approach. PloS One, 9(8), e104008.

  • Rexová, K., Frynta, D., & Zrzavý, J. (2003). Cladistic analysis of languages. Indo-European classification based on lexicostatistical data. Cladistics, 19, 120–127.

  • Ringe, D., Warnow, T., & Taylor, A. (2002). Indo-European and computational cladistics. Transactions of the Philological Society, 100, 59–129.

  • Skelton, C. M. (2015). Borrowing, character weighting, and preliminary cluster analysis in a phylogenetic analysis of the ancient Greek dialects. Indo-European Linguistics, 3, 84–117.

  • Snir, S., & Rao, S. (2012). Quartet MaxCut. A fast algorithm for amalgamating quartet trees. Molecular Phylogenetics and Evolution, 62, 1–8.

  • Warnow, T. (2017). Computational phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press.

  • Warnow, T., Evans, S. N., Ringe, D., & Nakhleh, L. (2006). A stochastic model of language evolution that incorporates homoplasy and borrowing. In Forster & Renfrew, 2006, 75–90.

  • Warnow, T., Evans, S. N., & Nakhleh, L. (2023). Progress on constructing phylogenetic networks for languages. arXiv Preprint arXiv:2306.06298.


Acknowledgments

The authors wish to thank Don Ringe for the inspiration we experienced in working with him over these decades. TW also thanks the Grainger Foundation for support of this research.

Corresponding author

Correspondence to Tandy Warnow.

Appendix

We restate and then sketch proofs for Theorems 1–3.

Theorem 1

Let N be a rooted phylogenetic network, let characters evolve down N under the WERN 2023 model, and let Q be the output of QTC. Then every quartet tree in Q will be in Q(Nr). Furthermore, as the number of characters increases, with probability converging to 1, every quartet tree in Q(Nr) will appear in Q. Thus, QTC is a statistically consistent estimator of Q(Nr).

Proof

We begin by showing that every quartet tree placed in Q is also in Q(Nr). Recall that quartet tree uv|xy is included in Q if and only if some character α is found such that α(u) = α(v) ≠ α(x) = α(y) and the states α(u), α(x) are non-homoplastic. This character evolves down some tree T contained inside the network. Moreover, since the states exhibited at u, v, x, y are non-homoplastic, there is a path in T connecting u and v and another path connecting x and y and these two paths do not share any vertices. Hence, the quartet tree uv|xy is in Q(Nr).

We now show that in the limit, every quartet tree in Q(Nr) is also in Q. Let ab|cd be a quartet tree in Q(Nr). Hence, there is a rooted tree T contained in N that induces this quartet tree (when T is considered as an unrooted tree). With positive probability, a character will evolve down T. Without loss of generality, assume a and b are siblings in the rooted version of T, so that their least common ancestor, lca_T(a, b), lies strictly below the root of the tree T.

Since lca_T(a, b) lies strictly below the root, there is an edge e directly above lca_T(a, b) within T. It follows that the probability that a random character evolves down T, selects a non-homoplastic state at the root, and then changes on e, but on no other edge in T, is strictly positive. Note that for any such character α, we have α(a) = α(b) and α(c) = α(d), where α(a) and α(c) are different and both are non-homoplastic states. In such a case, Q will include quartet tree ab|cd. Thus, in the limit as the number of characters increases, with probability converging to 1, Q will contain every quartet tree in Q(Nr).
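The limiting step can be written out explicitly. The following is a sketch under the assumption that the k input characters evolve independently and identically; the symbols k, p_q, and p_min are notation introduced here for illustration and do not appear in the original text. For a quartet tree q in Q(Nr), written Q(N_r) below, let p_q > 0 be the probability that a single character displays the pattern described above for q.

```latex
\begin{align}
\Pr\bigl[\, q \notin Q \text{ after } k \text{ characters} \,\bigr]
  &\le (1 - p_q)^{k}, \\
\Pr\bigl[\, Q(N_r) \not\subseteq Q \,\bigr]
  &\le \sum_{q \in Q(N_r)} (1 - p_q)^{k}
   \le \lvert Q(N_r) \rvert \,(1 - p_{\min})^{k}
   \xrightarrow[k \to \infty]{} 0,
\qquad p_{\min} = \min_{q \in Q(N_r)} p_q .
\end{align}
```

Because Q(N_r) is a finite set, p_min is strictly positive, so the bound decays geometrically in k.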

Since in the limit Q ⊆ Q(Nr) and Q(Nr) ⊆ Q, it follows that Q = Q(Nr) with probability converging to 1.
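The inclusion rule recalled in the proof (uv|xy enters Q exactly when some character α has α(u) = α(v) ≠ α(x) = α(y) with both states non-homoplastic) can be illustrated with a minimal Python sketch. The data representation is an assumption made here for illustration: each character is a dictionary from leaf name to state, and the homoplastic states of each character are given as a set. The function names qtc_quartets and quartet are hypothetical and are not taken from the chapter.

```python
from itertools import combinations


def quartet(u, v, x, y):
    """Represent the unrooted quartet tree uv|xy as an unordered pair of unordered pairs."""
    return frozenset({frozenset({u, v}), frozenset({x, y})})


def qtc_quartets(characters, homoplastic_states):
    """Collect every quartet tree uv|xy witnessed by some character.

    characters: list of dicts mapping leaf -> state (assumed representation).
    homoplastic_states: list of sets, one per character, holding the states
        known in advance to be homoplastic for that character.
    """
    quartets = set()
    for alpha, homoplastic in zip(characters, homoplastic_states):
        # Group leaves by state, keeping only non-homoplastic states.
        groups = {}
        for leaf, state in alpha.items():
            if state not in homoplastic:
                groups.setdefault(state, []).append(leaf)
        # Two distinct non-homoplastic states, each shared by at least two
        # leaves, witness a quartet tree.
        for s, t in combinations(groups, 2):
            for u, v in combinations(groups[s], 2):
                for x, y in combinations(groups[t], 2):
                    quartets.add(quartet(u, v, x, y))
    return quartets


# Toy input: one character on five leaves.
chars = [{"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}]
witnessed = qtc_quartets(chars, [set()])
```

In this toy input only states 1 and 2 are shared by two leaves, so the single witnessed quartet tree is ab|cd; declaring state 2 homoplastic removes it.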

Theorem 2

The QBTE (Quartet-Based Topology Estimation) method is statistically consistent for estimating the unrooted topology of the network N under the WERN 2023 model when the rooted network N is a level-1 network where all cycles have length at least six; furthermore, QBTE runs in polynomial time.

Proof

By Theorem 1, as the number of characters increases, the output of QTC equals Q(Nr) with probability converging to 1. Due to space constraints, we direct the reader to Warnow et al. (2023), https://arxiv.org/abs/2306.06298, for the rest of the proof.

Theorem 3

Let N be the true unrooted level-1 network and let C0 denote the set of homoplasy-free phonological characters that exhibit both 0 and 1 at the leaves of N. Rooting N on any edge returned by Root-Network will produce a rooted network on which all characters in C0 can evolve without homoplasy, and the edge containing the true location of the root will be in the output returned by Root-Network. Furthermore, when given the unrooted topology of the true phylogenetic network as input, Root-Network is a statistically consistent estimator of the root location under the assumption that the probability of homoplasy-free phonological characters is positive.

Proof

We sketch the proof due to space constraints. It is straightforward to verify that an edge is colored red for a character α if and only if subdividing the edge and labeling the introduced node by 0 for α makes α homoplastic on every tree contained within the network. Furthermore, it is not hard to see that if we root the network on any edge that remains green throughout Root-Network, then all characters in C0 will be homoplasy-free. As a result, the first part of the theorem is established.

For the second part of the theorem, if the probability of homoplasy-free phonological characters is positive, then with probability converging to 1, for every edge in the true network, there is a character α that changes on that edge but on no other edge; hence, α will be non-constant and homoplasy-free. Let e1 and e2 be the two edges incident to the root, and suppose the input set of characters contains homoplasy-free characters α1 and α2 that change on e1 and e2, respectively. Then these two characters will mark as red every edge below e1 and e2. In the unrooted topology for N, the root is suppressed and edges e1 and e2 are merged into a single edge, e. Hence, when Root-Network is applied to the unrooted topology for N, if characters α1 and α2 are in the input, then the only edge that is not colored red will be the edge e containing the suppressed root. In conclusion, since the probability of homoplasy-free phonological characters is strictly positive, as the number of such characters increases, the probability that every edge other than the root edge is red converges to 1. Thus, Root-Network will leave only the single edge containing the suppressed root green, establishing that it is statistically consistent for locating the root in the network.
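The coloring argument can be illustrated in the simplest special case, where the network has no cycles and is therefore a tree. This is a simplified sketch and not the Root-Network procedure itself, which must consider every tree contained within the network. On a tree, subdividing an edge and labeling the new node 0 forces a character to be homoplastic exactly when state 1 appears at leaves on both sides of that edge, so the sketch colors such edges red. The adjacency-dictionary representation and the names leaves_on_side and green_edges are assumptions made for illustration.

```python
def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without crossing the edge (start, blocked)."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:  # degree-1 nodes are leaves
            found.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found


def green_edges(adj, characters):
    """Edges of an unrooted tree that no binary character colors red.

    adj: dict node -> list of neighbours (unrooted tree, assumed input format).
    characters: list of dicts leaf -> 0/1, each assumed homoplasy-free on the
        tree and exhibiting both states.
    """
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    green = []
    for u, v in edges:
        side_u = leaves_on_side(adj, u, v)
        side_v = leaves_on_side(adj, v, u)
        # Red: state 1 occurs on both sides, so a root labeled 0 on this edge
        # would require state 1 to arise twice.
        red = any(
            any(alpha[x] == 1 for x in side_u)
            and any(alpha[x] == 1 for x in side_v)
            for alpha in characters
        )
        if not red:
            green.append((u, v))
    return green


# Unrooted version of the rooted tree ((a,b),(c,d)); the suppressed root
# lies on the edge (p, q).
adj = {
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"],
    "p": ["a", "b", "q"], "q": ["c", "d", "p"],
}
alpha1 = {"a": 1, "b": 1, "c": 0, "d": 0}  # changes on e1, above {a, b}
alpha2 = {"a": 0, "b": 0, "c": 1, "d": 1}  # changes on e2, above {c, d}
remaining = green_edges(adj, [alpha1, alpha2])
```

With both characters present, the only edge left green in this toy example is (p, q), the edge containing the suppressed root, which mirrors the role of α1 and α2 in the argument above. With α1 alone, the edges leading to c and d also stay green.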

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Warnow, T., Evans, S.N., Nakhleh, L. (2024). Progress on Constructing Phylogenetic Networks for Languages. In: Eska, J.F., Hackstein, O., Kim, R.I., Mondon, JF. (eds) The Method Works. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-48959-4_3

  • DOI: https://doi.org/10.1007/978-3-031-48959-4_3

  • Publisher Name: Palgrave Macmillan, Cham

  • Print ISBN: 978-3-031-48958-7

  • Online ISBN: 978-3-031-48959-4

  • eBook Packages: Social Sciences (R0)
