Language family classifications as Newick trees with branch length

Short summary

One type of non-independence between languages is due to descent from a common ancestor, forming language families. There are several classifications of languages into language families, each with its own advantages and disadvantages, but they are relatively difficult to use with computational methods due to a lack of standardization. Moreover, phylogenetic methods usually require not only the topology of the language family tree but also information concerning the amount of evolution that has happened on the tree, represented as branch lengths, and this information is usually missing.
Here I present a method that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the Newick standard, aligns the four most widely used conventions of unique identifiers for linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information from a variety of sources (the tree’s own topology, an externally given numeric constant or a distance matrix).
The R scripts, input data and resulting Newick trees are provided in a GitHub repository in the hope that this will promote the use of advanced quantitative methods in answering questions concerning linguistic diversity and its temporal dynamics.

More information can be found in the GitHub repository, especially the README file and the paper; this blog post is just a quick summary.

Language families are groups of languages related through common descent

Languages are not independent entities for a host of reasons, probably the most important being shared ancestry and contact; in practice, this means that one cannot assume a priori that a set of languages is statistically independent, and one must adjust one’s statistical inferences accordingly.

Non-independence due to shared ancestry arises because languages usually derive from pre-existing “mother” languages (called proto-languages), just as biological species derive from ancestral species.

A well-known example is represented by the Romance languages (including French, Italian, Spanish, Portuguese, Romansh and Romanian) which, even at a superficial level, are very similar (e.g., speakers of Italian and Spanish can talk to each other, while speakers of Romanian have very little trouble learning Italian or Spanish) because they derive from the Latin spoken throughout the Roman empire about 2000 years ago. In this case, (Vulgar) Latin represents the proto-language of the Romance subfamily. I said “subfamily” because this grouping is part of a much larger set of languages, the so-called Indo-European languages, which also contains the Germanic languages (such as German, Danish and Dutch), the Indo-Aryan languages (including Hindi, Urdu and Punjabi), the Slavic languages (Russian, Czech and Polish being some examples), Greek, and several others; this larger grouping forms a language family (in this case deriving from its own proto-language, called Proto-Indo-European).

Now there are many such language families (their number and composition depend on the source; more on this below), but the point is that, just like in biology, the daughter languages derived from the same proto-language tend to be more similar than expected by chance because they tend to inherit properties from the proto-language, a similarity that tends to decrease the more time has passed since the separation from the common ancestor (for example, Italian and Spanish are obviously more similar to each other than each is to German). This is a more general issue that affects many aspects of culture and is known as Galton’s problem.

Also just like in biology, a popular representation of language families is in terms of trees that purport to show the patterns of vertical inheritance from mother languages (proto-languages) to their daughter languages, such as the tree below representing (part of) the structure of the Indo-European language family.

Figure: Indo-European language family tree inferred using modern Bayesian phylogenetic methods. See Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A., & Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337:957–960.

As a side note, such trees clearly do not capture the whole story (for example, they have trouble representing contact) and there’s a lot of research going on about better models for language history; however, trees do seem to capture something important about this history: they are very powerful inferential models and are increasingly used to shed light on the linguistic past.

Language family classifications and their use

Now, this being said, where can we find such language families, what can we use them for, how, and with what caveats?

As we saw, the central idea behind a language family is that its languages descend from a common ancestor, but in many cases we simply do not know that for sure, because of a lack of manpower or primary data, or because the situation is so complex that conflicting proposals exist. Even in the well-studied case of Indo-European, where the “golden” comparative method has a long history, things are not entirely clear, especially concerning the internal structure of the tree, the dating of the various splits and the places where they happened (but modern Bayesian phylogenetic methods, using mostly basic vocabulary cognacy judgements, help for some large and well-studied families such as Indo-European, Austronesian and Bantu).

Probably the most used databases that offer classifications of languages into “language families” are the Ethnologue, WALS, AUTOTYP, and Glottolog, but they differ in several relevant respects among which:

  1. the criteria used to classify languages into language families and the criteria used to further refine the language family’s internal structure;
  2. related, the sources of information used to make decisions based on these criteria;
  3. the number of levels in a classification (limited to e.g., maximum 4 or unlimited varying among families);
  4. the unique identifiers used for the languages (or dialects, proto-languages, etc.) that are classified;
  5. the options for downloading and the format(s) available for download.

I do not want to express too strong opinions on the first two points (mainly because I am not an expert myself), although these days I tend to rely more on the classifications contained in the Glottolog database; nevertheless, in the work presented here I treated these four databases on an equal footing.

However, for this discussion the last two points are more important. There are in fact four different standards for uniquely identifying languages (or other types of entities): ISO 639-3 codes (three letters), WALS codes (three letters), AUTOTYP LIDs (numeric), and Glottocodes (alphanumeric: four letters followed by four digits); for example, (Standard) English is identified as eng, eng, 74 and stan1293, respectively. Moreover, there is as yet (to my knowledge at least) no resource that provides mappings between these unique identifiers, which complicates the cross-linking of various datasets.

Finally, these four databases differ in how readily the language classification data is available for downloading and importing into various software programs:

  • the Ethnologue data is the most difficult to access (with the goal of further processing) because even if it allows in its Terms of Use the use of “portions” of the data for “research or educational purposes”, it requires the download of a master HTML page containing a list of all language families and links to their respective webpages, which must then be downloaded and parsed to extract the tree structure of the family, the group names, and the language names and their ISO 639-3 codes;
  • WALS provides the whole database (including language names, codes and geographic coordinates, but also values for more than 130 typological features) under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Germany (CC BY-NC-ND 2.0 DE) license; here the important columns are the WALS, ISO 639-3 and Glottolog codes, and the languages’ name, genus and family, resulting in a rather flat three-level structure;
  • the AUTOTYP trees are freely available for download, use and distribution provided that their source is clearly cited; the format of the language families is similar to the WALS in the sense that each language (row) contains the language names, the AUTOTYP LID, the Glottolog and the ISO 639-3 codes, as well as the stock, mbranch, sbranch, ssbranch and lsbranch names, each denoting more and more superficial levels (i.e., the “stock” is the highest level corresponding to the language family), and in some cases intermediate levels might be missing;
  • finally, Glottolog provides the family trees in a standardized Newick format under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.
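To illustrate why the flat WALS and AUTOTYP tables need some processing before they become trees, here is a minimal sketch in Python (not the R code from the repository). The helper name rows_to_tree and the example rows are made up for illustration. It nests each row of level names into a tree and simply skips missing intermediate levels.

```python
def rows_to_tree(rows):
    """Nest flat classification rows into a dict-of-dicts tree.

    Each row lists level names from the highest level (the family or
    stock) down to the language; empty strings or None mark missing
    intermediate levels, which are skipped.
    """
    tree = {}
    for row in rows:
        node = tree
        for name in row:
            if not name:              # missing intermediate level
                continue
            node = node.setdefault(name, {})
    return tree


# Illustrative rows only (family, genus, language); NOT real database rows.
rows = [
    ("Indo-European", "Germanic", "German"),
    ("Indo-European", "Germanic", "Dutch"),
    ("Indo-European", "Romance", "Italian"),
    ("Indo-European", "", "Greek"),
]
tree = rows_to_tree(rows)
print(tree)
```

Leaves end up as empty dictionaries, so the same structure works for the three-level WALS tables and for the deeper AUTOTYP ones.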

Given this diversity of language identifiers and forms, the use of these classifications for computational tasks is not straightforward.

Converting the language classifications to a standard Newick tree format and mapping the unique identifiers

Thus, the first two things I did were (a) to map as best as possible the four unique identifiers and to define a format of representation that makes manipulating these equivalences as easy as possible, and (b) to export these different formats into the de facto standard format for phylogenetic tree representation, namely the Newick format.

Concerning (a), I used the already existing partial mapping between these unique identifiers to create a TAB-separated CSV file listing the four codes, the four language names as well as the geographic coordinates for the languages in these four databases. Moreover, I defined a so-called Unique Universal Language IDentifier (or UULID) which has the format:

‘NAME [i-I][w-W][a-A][g-G]’

where CAPITAL LETTERS denote variables and the full node name is usually enclosed within single quotes. NAME is the entity name as given by the classification, followed by a SPACE and the four unique codes I (ISO 639-3), W (WALS), A (AUTOTYP) and G (Glottocode), any or all of which can be missing or can have multiple values (in which case the values are separated by “-”). A few examples are (from the WALS classification, the Indo-European family):

  • ‘German {Zurich} [i-gsw][w-gzu][a-1305-1306-1307-1308-1309-1310][g-swis1247]’
  • ‘Urdu [i-urd][w-urd][a-2671][g-urdu1245]’
  • ‘Romani {Sepecides} [i-][w-rse][a-][g-]’
  • ‘Germanic [i-][w-][a-][g-]’.
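As a rough sketch of how such node names can be taken apart again, the Python snippet below (not part of the repository; the function name parse_uulid and the dictionary keys are my own choices for illustration) splits a UULID into its name and its four lists of codes. It assumes that individual codes never contain “-”, which holds for the examples above.

```python
import re

# NAME, a space, then the four bracketed code lists in fixed order.
UULID_RE = re.compile(
    r"^(?P<name>.*?) "
    r"\[i-(?P<iso>[^\]]*)\]"
    r"\[w-(?P<wals>[^\]]*)\]"
    r"\[a-(?P<autotyp>[^\]]*)\]"
    r"\[g-(?P<glottocode>[^\]]*)\]$"
)


def parse_uulid(label):
    """Split a UULID node name into (name, codes).

    `codes` maps each identifier type to a (possibly empty) list of
    values; multiple values are assumed to be separated by "-".
    """
    label = label.strip().strip("'‘’")    # drop surrounding quotes, if any
    match = UULID_RE.match(label)
    if match is None:
        raise ValueError("not a UULID: %r" % label)
    codes = {
        key: (value.split("-") if value else [])
        for key, value in match.groupdict().items()
        if key != "name"
    }
    return match.group("name"), codes


name, codes = parse_uulid(
    "'German {Zurich} [i-gsw][w-gzu]"
    "[a-1305-1306-1307-1308-1309-1310][g-swis1247]'"
)
print(name, codes)
```

A group node such as ‘Germanic [i-][w-][a-][g-]’ simply yields four empty lists.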

Concerning (b), I converted the specific format given by each database (except Glottolog) into the standard Newick tree format that basically represents trees using parentheses: the subtrees are enclosed within parentheses “()” and the (optional) branch length is given as a number immediately following the branch and separated from it by “:”. Moreover, the nodes in these Newick trees follow the UULID conventions above.
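To make the parenthesis notation concrete, here is a minimal Python sketch (again, not the repository's R code) that writes a small tree as a Newick string. The tuple representation (name, branch length, children) and the function names are assumptions made for this example, and the node names are shortened to plain names rather than full UULIDs.

```python
def quote(name):
    """Quote a Newick label; embedded single quotes are doubled."""
    return "'" + name.replace("'", "''") + "'"


def to_newick(node):
    """Serialize a (name, length, children) tuple without the final ';'."""
    name, length, children = node
    text = ""
    if children:
        # A subtree is the comma-separated list of its children in "()".
        text += "(" + ",".join(to_newick(child) for child in children) + ")"
    text += quote(name)
    if length is not None:
        # The optional branch length follows the node, separated by ":".
        text += ":" + format(length, "g")
    return text


# Illustrative toy tree, not taken from any of the four databases.
toy = ("Indo-European", None, [
    ("Germanic", 1, [("German", 1, []), ("Dutch", 1, [])]),
    ("Greek", 2, []),
])
newick = to_newick(toy) + ";"
print(newick)
```

Quoting every label is what allows names with spaces, brackets and braces (as in the UULIDs) to survive a round trip through standard Newick readers.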

Branch lengths: what are they good for and adding them to the trees

These classifications give only the topologies of the language families, without any information on how long the branches in the tree are. This extra information encodes the amount of evolution that has happened on a branch and is extremely important for “advanced” phylogenetic methods such as Maximum Likelihood or Bayesian inference. How can we add branch length information?

If you are lucky and have good-quality basic vocabulary cognacy judgements (as is the case for Indo-European or Austronesian), you can compute these branch lengths yourself using, for example, Bayesian phylogenetic methods, but for most language families this is currently not feasible.

Therefore, I implemented a set of methods for adding branch length information to a phylogeny, as follows:

(a) methods that depend only on the topology: (1) constant, (2) proportional and (3) grafen,
(b) methods that generate the branch length and topology from a distance matrix: (4) nj, and
(c) methods that map a given distance matrix onto the topology: (5) nnls and (6) ga.

The methods of type (a) only need a tree topology (and possibly a numeric constant). Method (1) computes branch lengths such that the sum of the branch lengths along every path from the root to a leaf is equal to the constant (the same amount of evolution has happened along all of them); method (2) simply gives each branch the same length, such that the amount of evolution on a path is proportional to the number of splits on that path; method (3) is a classic whereby each node is first given a “height” defined as the number of leaves of its subtree minus 1 (0 for the leaves), after which the length of each branch is computed as the difference between the heights of its upper and lower nodes.
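Methods (2) and (3) are simple enough to sketch in a few lines. The Python code below is an illustration written for this summary, not the R implementation from the repository; the tree representation (a dictionary mapping each internal node to its list of children) and the function names are assumptions. Method (1) is left out because the description above does not pin down how the constant is divided among the branches of a path.

```python
def count_leaves(children, node):
    """Number of leaves in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return 1
    return sum(count_leaves(children, kid) for kid in kids)


def proportional_lengths(children, root, length=1.0):
    """Method (2): every branch gets the same length."""
    lengths = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for kid in children.get(parent, []):
            lengths[kid] = length
            stack.append(kid)
    return lengths


def grafen_lengths(children, root):
    """Method (3): height = leaves in subtree - 1; the branch above a
    node is the height of its parent minus its own height."""
    def height(node):
        return count_leaves(children, node) - 1

    lengths = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for kid in children.get(parent, []):
            lengths[kid] = height(parent) - height(kid)
            stack.append(kid)
    return lengths


# Toy topology ((A,B)AB,C)root; each length belongs to the branch ABOVE a node.
toy = {"root": ["AB", "C"], "AB": ["A", "B"]}
print(proportional_lengths(toy, "root"))
print(grafen_lengths(toy, "root"))
```

In the toy tree the Grafen heights are 2 for the root, 1 for AB and 0 for the leaves, so both root-to-leaf paths (via AB, or directly to C) add up to 2.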

Method (4) is the only one of type (b) used here and is a classic method in phylogenetics (Neighbor-Joining), a clustering method that iteratively joins taxa into higher groupings based on a matrix of distances between all the taxa.
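For readers who have not met Neighbor-Joining before, here is a compact textbook-style sketch in Python. It is not the implementation used in the repository; the function name, the use of frozenset keys for the distance matrix and the generated internal node names (node1, node2, ...) are all assumptions made for this illustration, and the distances are a made-up toy example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the unrooted NJ tree as a list of (node, node, length) edges.

    `dist` maps frozenset({x, y}) to the distance between taxa x and y.
    Assumes at least two taxa and no taxon named like "node1".
    """
    d = dict(dist)                    # working copy, extended with new nodes
    nodes = list(names)
    edges = []
    counter = 0

    def get(x, y):
        return d[frozenset((x, y))]

    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {x: sum(get(x, y) for y in nodes if y != x) for x in nodes}
        # Join the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        f, g = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        counter += 1
        u = "node%d" % counter
        len_f = get(f, g) / 2 + (r[f] - r[g]) / (2 * (n - 2))
        len_g = get(f, g) - len_f
        edges += [(f, u, len_f), (g, u, len_g)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (f, g):
                d[frozenset((u, k))] = (get(f, k) + get(g, k) - get(f, g)) / 2
        nodes = [k for k in nodes if k not in (f, g)] + [u]

    x, y = nodes                      # connect the last two nodes
    edges.append((x, y, get(x, y)))
    return edges


# Toy distances between five made-up taxa.
taxa = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in raw.items()}
edges = neighbor_joining(taxa, dist)
for edge in edges:
    print(edge)
```

Note that, unlike methods (5) and (6), the resulting topology comes from the distances alone and need not match the classification's tree.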

Methods (5) and (6) try to use both the given language family’s tree topology and the information contained in an inter-language distance matrix by computing branch lengths that best approximate the original distances. Method (5) computes the branch lengths using a non-negative least squares approach, while method (6) estimates them using a standard genetic algorithm; they produce very similar results, but there are also differences: method (5) is less robust than method (6), but method (6) is much slower, especially for very large trees, and might produce non-unique solutions.

As distance matrices between languages I have used the following:

a.  distances based on vocabulary: (1) ASJP16,
b.  distances based on geography: (2) great-circle geographic distances,
c.  distances based on WALS: (3) gower and (4) euclidean, with and without missing data imputation,
d.  distance based on AUTOTYP: (5) gower with missing data using only the variables with a single datapoint per language (this distance was computed by Balthasar Bickel), and
e.  distances based on the tree topology: a new “genetic method” applied to the WALS (6), Ethnologue (7), Glottolog (8) and AUTOTYP (9) classifications.

Briefly, these distances encode the differences between languages based on a very restricted vocabulary (1), geographic location (2), structural differences between languages (3-5) and the language family tree (6-9).
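Two of these distances are easy to write down explicitly. The Python sketch below (an illustration, not the code from the repository) computes a great-circle distance with the haversine formula and a Gower-style distance for purely categorical features with missing values. The function names, the mean Earth radius of 6371 km and the use of None for missing values are assumptions made for this example; numeric features would additionally need range scaling, which is omitted here.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def gower_categorical(x, y):
    """Gower distance for categorical features: the share of mismatches
    among the features observed (not None) in both languages.

    Returns None when the two languages share no observed feature.
    """
    shared = [(a, b) for a, b in zip(x, y)
              if a is not None and b is not None]
    if not shared:
        return None
    return sum(a != b for a, b in shared) / len(shared)


# Made-up feature vectors; None marks a missing value.
lang1 = ["SOV", "prefixing", None]
lang2 = ["SOV", "suffixing", "tonal"]
print(gower_categorical(lang1, lang2))
print(great_circle_km(0.0, 0.0, 0.0, 180.0))
```

Skipping the features that are missing in either language is one way of handling missing data; imputing the missing values first, as mentioned above for WALS, is the alternative.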

Where can I get these and why are they important?

Fair questions 🙂 All these (and much more, including a detailed description of the input data, the output files and the process, and the actual R code implementing all this) are available from a GitHub repository.

Make sure to first read the README and the paper describing the whole thing (as PDF, HTML or, if you prefer, also the R markdown source)!

Please note that I tested the code quite extensively but errors are quite possible, so use it with caution and please do report any weird things (or good suggestions)!

I hope this code and Newick trees with branch lengths will be useful (and my 2+ years of intermittent work will thus prove a good investment).