The trouble with language phylogenies

A new paper by Bouckaert et al. claims to show support for the hypothesis that Anatolia was the birthplace of the Indo-European languages, going against the more popular hypothesis that the ancient inhabitants of the Steppes were the linguistic progenitors of the world’s most successful non-Sino languages. (Here’s the accompanying free website detailing their work.)

The methodology of the paper is new to me: Bayesian phylogeography. What the team did was to compile a large dataset of cognates (e.g., English water and German Wasser), which presumably show a common ancestry. They then plotted known contemporary and dead languages onto a map, based on known information about where the languages are or were spoken. They modeled the evolution of language change as “the gain and loss of cognates along the branches of an unknown family tree, using an approach called Bayesian phylogenetic inference to infer the set of language trees that makes the [known] cognate data most likely” (emphasis added). Using this data and model, the team ran a series of “Brownian walks” that, working backwards, showed how far and where a language could have spread given its known ancestry (i.e., given the constraints of the known geographies and phylogenies).
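To get a feel for what a “Brownian walk” is, here is a toy sketch in Python. It is not the paper's model: it simulates a single walk forward in time on a flat plane, whereas the authors infer locations backwards along a tree. The function name, the step size, and the starting point are all made up for illustration.

```python
import random

def brownian_walk(start, n_steps, step_sd, seed=0):
    """Simulate a 2-D random walk with Gaussian steps (a discrete Brownian walk).

    start   -- (x, y) starting position (hypothetical units)
    n_steps -- number of steps to take
    step_sd -- standard deviation of each step along each axis
    seed    -- fixed seed so the walk is reproducible
    """
    rng = random.Random(seed)
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        # Each step is an independent Gaussian nudge in both directions.
        x += rng.gauss(0.0, step_sd)
        y += rng.gauss(0.0, step_sd)
        path.append((x, y))
    return path

path = brownian_walk(start=(0.0, 0.0), n_steps=100, step_sd=0.5)
print(path[-1])  # where this particular walk ended up
```

Run it with different seeds and the end points scatter around the start; the more steps (the more time), the wider the scatter. That spread is the sort of geographic uncertainty the method has to account for.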

Essentially, they combined what is known about languages’ geographies and phylogenies and, using a lot of equations I don’t understand, worked backwards to model probable locations of where all these Indo-European languages actually came from. From the paper:

The Bayesian approach we employ means that we can directly test support for the Steppe homeland hypothesis versus the Anatolian homeland hypothesis. This is because the method we use does not produce a single answer – e.g. the homeland is at x degrees longitude and y degrees latitude. That would not be all that useful, because if you want to test between competing theories, you need some estimate of uncertainty – how sure are you that the origin is at one location versus another?

There is uncertainty in the relationships between the languages (nobody can say with absolute certainty that one particular family tree is the true one – for 103 languages there are more possible trees than there are atoms in the universe!), there is uncertainty in the time scale (we can’t know for sure exactly how fast languages change), and even if we knew the family tree and time scale exactly, there is uncertainty in the geographic expansion process so we cannot pin down the location of the root exactly.

One of the major advantages of the Bayesian approach is that we do not produce a single answer, but instead account for all those uncertainties using some clever algorithms (called Markov Chain Monte Carlo methods) that sample language trees, divergence times and locations at all points on the tree, in proportion to how likely they make our observed data . . . So we were able to run our analyses and directly compare how often the origin locations we inferred fell in the range proposed for the Steppe theory versus in the range proposed for the Anatolian theory . . . As we report in the paper, using either version of the Steppe theory, it was the Anatolian theory that came out on top.
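The “more possible trees than atoms in the universe” line is easy to check for yourself. A small sketch, assuming the trees in question are rooted, strictly bifurcating, and have labeled leaves (the quote doesn't say which kind it means), in which case the count for n leaves is the double factorial (2n - 3)!!. The 10^80 figure for atoms is the commonly cited rough estimate, not something from the paper.

```python
def rooted_binary_trees(n_leaves: int) -> int:
    """Count rooted, bifurcating trees with n labeled leaves: (2n - 3)!!"""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count

ATOMS_IN_UNIVERSE = 10 ** 80  # rough, commonly quoted order of magnitude

n_trees = rooted_binary_trees(103)
print(rooted_binary_trees(4))       # small sanity check: 4 leaves give 15 trees
print(n_trees > ATOMS_IN_UNIVERSE)  # True
```

So the claim holds with a very wide margin, which is why the method samples trees rather than trying them all.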

The map at the bottom of this page shows that, given a wide range of uncertainties about where languages travel, how fast they evolve, and which is related to which, the range of possible geographic birthplaces of Indo-European languages can be limited most probably to the Anatolian region.
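The comparison the authors describe, counting how often the sampled origin falls inside one proposed homeland versus another, boils down to something like the sketch below. Everything in it is hypothetical: the sample points are made-up numbers, the two boxes are arbitrary rectangles rather than the ranges used in the paper, and real homeland ranges would not be simple rectangles.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """A rectangular region in (longitude, latitude). Purely illustrative."""
    lon_min: float
    lon_max: float
    lat_min: float
    lat_max: float

def fraction_in_box(samples, box):
    """Fraction of sampled (lon, lat) root locations that fall inside the box."""
    inside = sum(
        1
        for lon, lat in samples
        if box.lon_min <= lon <= box.lon_max and box.lat_min <= lat <= box.lat_max
    )
    return inside / len(samples)

# Made-up stand-ins for sampled root locations; NOT results from the paper.
toy_samples = [(33.0, 39.0), (35.0, 38.5), (45.0, 48.0), (31.0, 40.0)]

region_a = Box(26.0, 42.0, 36.0, 42.0)  # hypothetical range for hypothesis A
region_b = Box(40.0, 55.0, 45.0, 52.0)  # hypothetical range for hypothesis B

print(fraction_in_box(toy_samples, region_a))  # 0.75
print(fraction_in_box(toy_samples, region_b))  # 0.25
```

With thousands of real samples instead of four toy ones, those two fractions are what let you say one hypothesis is better supported than the other.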

Like I said, I still don’t fully understand how Bayesian methods work, but I appreciate them on an epistemological level. Bayesian methods, such as the ones used here, don’t model things based on an “all things being equal” approach. Insofar as I understand them, Bayesian equations take into account known and assumed information and produce probabilistic answers, answers that change along with the known information.
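Here is about the smallest example I can come up with of “answers that change along with the known information.” It is plain Bayes' rule applied to two competing hypotheses, with invented priors and likelihoods; it has nothing to do with the paper's actual numbers.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Start out agnostic between two hypothetical hypotheses, "A" and "B".
belief = {"A": 0.5, "B": 0.5}

# First piece of (invented) evidence is four times likelier under A than B.
belief = posterior(belief, {"A": 0.8, "B": 0.2})
print(belief)  # A is now favored

# Second piece of (invented) evidence is twice as likely under B as under A.
belief = posterior(belief, {"A": 0.3, "B": 0.6})
print(belief)  # A is still ahead, but less so
```

The output of one update becomes the prior for the next, so the answer is always a set of probabilities that shifts as evidence comes in, never a single verdict.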

This paper works with two “knowns” about languages: their geographies and their phylogenies. I have no problem with the way the authors of the paper handled the former. I do think there is a problem with how they handled the latter.

A linguistic phylogeny (family tree) is often modeled like a human family tree, with a parent giving rise to children and related parents giving rise to cousins of the children, and so forth. As the authors put it in their paper:

Languages evolve through time in a manner similar to biological species. As groups of speakers become separated, their speech drifts apart forming new descendant languages, and eventually whole families of related languages. Over thousands of years this process has generated the 6000+ languages in the world today . . . We can represent the relationships between languages on a family tree, otherwise known as a ‘phylogeny’. A simple example of a phylogeny is a family tree where the leaves of the tree represent the children in a family and branches represent relationships between parent and child.

The problem with this model is that it doesn’t take into account what the geneticists might call “admixture” between separate populations. It is true that, when a language is in isolation, it evolves through time on its own; it is also true that, when speakers of a language drift apart geographically, the languages of these split populations will evolve through time in two new ways. However, the authors of the paper fail to take into consideration another known fact about language evolution: when languages come into contact, one or both of the languages will evolve in a new way due to the linguistic contact.

The problem is, if we try to put language contact into our phylogenies, we no longer have linear branching phylogenies. Rather, we are forced to view language evolution as more of a network than a linear family.
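The difference between a tree and a network can be stated very plainly in code: in a tree every language has at most one parent, and in a network some have more. The sketch below uses a deliberately crude picture of English, the same oversimplification I make in this post, and the data structure and function names are my own invention.

```python
def reticulations(parents):
    """Return the languages that have more than one parent."""
    return [lang for lang, ps in parents.items() if len(ps) > 1]

def is_tree(parents):
    """A phylogeny is tree-like only if no language has multiple parents."""
    return not reticulations(parents)

# Strict family-tree view: each language maps to a list holding its one parent.
tree_view = {
    "Proto-Germanic": [],
    "Old English": ["Proto-Germanic"],
    "Middle English": ["Old English"],
}

# Contact-aware view: Middle English also inherits from Norman French.
network_view = {
    "Proto-Germanic": [],
    "Latin": [],
    "Old English": ["Proto-Germanic"],
    "Norman French": ["Latin"],
    "Middle English": ["Old English", "Norman French"],
}

print(is_tree(tree_view))           # True
print(is_tree(network_view))        # False
print(reticulations(network_view))  # ['Middle English']
```

A method that assumes the first kind of structure has no place to put the second parent, which is the whole problem.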

We needn’t look further than Standard English to see how this works. (This graphic oversimplifies the relations, but it gets the point across . . .)


We know the recent history of English, and it sure as hell doesn’t include a straight evolution-in-time from Old English to contemporary English, from Beowulf to The Canterbury Tales, from Caedmon to Malory. Old English texts, circa 1000 AD, are incomprehensible to the modern English speaker; Middle English texts, circa 1400 AD, are quite easy to read. What happened during those 400 years? Punctuated linguistic equilibrium? Geographic separation?

No. The Norman invasion happened. The most recent ancestor of the French language (which, I believe, had a lot in common with modern Spanish) traveled across the Channel from its Latin family tree and interrupted the evolution of the Germanic tree, and the interruption left us with, among other things, the English language we know and love today. English is not pure Germanic, with a pure and direct lineage from proto-Germanic to Old English to English. Neither does it boast a purely Latinate ancestry. English is a mix of two otherwise separate family trees, the Germanic and the Latin.

Percentages come into play here; so, too, does the untangling of the two ancestral strands. Is English half German, half Latin? No, probably not. But this is indeed a fascinating question: figuring out how these internetworked phylogenies surface in the phonology, morphology, and syntax of contemporary Standard English.

(Having learned German and now learning Spanish, I am prepared to offer a broad generalization: English shares more with Spanish and French in terms of syntax and idioms; it shares more with German in terms of phonology; it also shares German’s simple, consonant-based verbal inflection system. At this point, I’d say that English is a Latin idiom without a Latin verbal system, spoken with a German accent.)

There are interesting critiques and comments on the study’s methods and outcome here and here. And anthropology blogger Dienekes pointed me to a blog post he wrote on network models of linguistic relations, which, obviously, I think are the right kinds of models for language evolution and the relations between living languages.


