Our R package AdhereR for computing the adherence to treatment from medical databases has a new version. As usual, it is released on CRAN (for Windows, macOS and Linux), and its full source code, example data, documentation and vignettes are available in its GitHub repository.

Details about the new stuff can be found in the NEWS.md file, but in brief:

various optimizations, bug fixes and enhancements;

One type of non-independence between languages is due to descent from a common ancestor, forming language families. There are several classifications of languages into language families, each with its own advantages and disadvantages, but they are relatively difficult to use by computational methods due to a lack of standardization. Moreover, phylogeentic methods usually require not only the topology of the language family tree but also information concerning the amount of evolution that has happened on the tree represented as the branch lengths, and this information is usually missing.
Here I present a method that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the Newick standard, aligns the four most used conventions of unique identifiers for linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information form a variety of sources (the tree’s own topology, an externally given numeric constant or a distance matrix).
The R scripts, input data and resulting Newick trees are provided in a GitHub repository in the hope that this will promote the use of advanced quantitative methods in answering questions concerning linguistic diversity and its temporal dynamics.

More information can be found in the GitHub repository, especially the README.md file and the paper, this blog post being just a quick summary.

Language families are groups of languages related through common descent

Languages are not independent entities for a host of reasons, probably the most important being shared ancestry and contact; what this practically means is that one cannot a priori assume treat a set of languages are statistically independent and thus one must properly adjust their statistical inferences appropriately.

Non-independence due to shared ancestry is due to the fact that languages usually derive from pre-existing “mother” languages (called proto-languages) just as biological species derive from ancestral species.

A well-known example is represented by the Romance languages (including French, Italian, Spanish, Portuguese, Romansh and Romanian) which, even at a superficial level are very similar (e.g., speakers of Italian and Spanish can talk to each other while speakers of Romanian have very little trouble learning Italian or Spanish) because they derive from the Latin spoken throughout the Roman empire about 2000 years ago. In this case, (Vulgar) Latin represents the proto-language of the Romance subfamily; I said “subfamily” because this grouping is part of a much larger set of languages — the so-called Indo-European languages — that also contain the Germanic languages (such as German, Danish and Dutch), the Indo-Aryan languages (including Hindi, Urdu and Punjabi), the Slavic languages (Russian, Czech and Polish being some examples), Greek, and several others, grouping that forms a language family (in this case deriving from its own proto-language called Proto-Indo-European).

Now there are many such language families (their number and composition depends on the source — more on this below), but the point is that, just like in biology, the daughter languages derived from the same proto-language tend to be more similar than expected by chance due to their tendency to inherit properties from the proto-language, similarity that tends to decrease the more time has passed since that separation from the common ancestor (for example, Italian and Spanish are obviously more similar to each other than each is to German). This is a more general issue that affects many aspects of culture and is known as Galton’s problem.

Also just like in biology, a popular representation of language families is in terms of trees that purport to show the patterns of vertical inheritance from mother languages (proto-languages) to their daughter languages, such as the tree below representing (part of) the structure of the Indo-European language family.

As a side note, such trees clearly do not capture the whole story (for example, they have trouble representing contact) and there’s a lot of research going on about better models for language history; however, trees do seem to capture something important about this history, they are very powerful inferential models and are (more and more) widely used to shed light on the linguistic past.

Language family classifications and their use

Now, these being said, where can we find such language families, what can we use them for, how and with what caveats?

As we saw, the central idea behind a language family is that those languages descend from a common ancestor, but in many cases we simply do not know that for sure for lack of manpower, primary data or because the situation is so complex that conflicting proposals exist. Even in the well-studied case of Indo-European where the “golden” comparative method has a long history, things are not entirely clear especially in what concerns the internal structure of the tree, the dating of various splits and the place where this happened (but modern Bayesian phylogenetic methods using mostly basic vocabularycognacy judgements help for some large and well-studied families such as Indo-Europen, Austronesian and Bantu).

Probably the most used databases that offer classifications of languages into “language families” are the Ethnologue, WALS, AUTOTYP, and Glottolog, but they differ in several relevant respects among which:

the criteria used to classify languages into language families and the criteria used to further refine the language family’s internal structure;

related, the sources of information used to make decisions based on these criteria;

the number of levels in a classification (limited to e.g., maximum 4 or unlimited varying among families);

the unique identifiers used for the languages (or dialects, proto-languages, etc.) that are classified;

the options for downloading and the format(s) available for download.

I do not want to express too strong opinions on the first two points (mainly because I am not an expert myself) but these days I tend to rely more on the classifications contained in the Glottolog database; therefore in the work presented here I treated these four databases on an equal footing.

However, for this discussion the last two points are more important. There are in fact four different standards for uniquely identifying languages (or other type of entities): ISO 639-3 codes (tree letters), WALS codes (three letters), AUTOTYP LIDs (numeric), and Glottocodes (alphanumeric: four letters followed by four digits); for example, (Standard) English is identified as eng, eng, 74 and stan1293, respectively. Moreover, there is as yet (to my knowledge at least) no resource that provides mappings between these unique identifiers, complicating the cross-linking of various datasets.

Finally, these four databases differ in how readily the language classification data is available for download an importing in various software programs.

the Ethnologue data is the most difficult to access (with the goal of further processing) because even if it allows in its Terms of Use the use of “portions” of the data for “research or educational purposes”, it requires the download of a master HTML page containing a list of all language families and links to their respective webpages, which must then be downloaded and parsed to extract the tree structure of the family, the group names, and the language names and their ISO 639-3 codes;

WALS provides the whole database (including language name, codes, geographic coordinates but also values for more than 130 typological features) under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Germany (CC BY-NC-ND 2.0 DE); here the important columns are WALS, ISO 639-3 and Glottolog codes, the languages’ name, genus and family, resulting in a rather flat three-levels structure;

the AUTOTYP trees are freely available for download, use and distribution provided that their source is clearly cited; the format of the language families is similar to the WALS in the sense that each language (row) contains the language names, the AUTOTYP LID, the Glottolog and the ISO 639-3 codes, as well as the stock, mbranch, sbranch, ssbranch and lsbranch names, each denoting more and more superficial levels (i.e., the “stock” is the highest level corresponding to the language family), and in some cases intermediate levels might be missing;

Given this diversity of language identifiers and forms, the use of these classifications for computational tasks is not straightforward.

Converting the language classifications to a standard Newick tree format and mapping the unique identifiers

Thus, the first two things I did were (a) to map as best as possible the four unique identifiers and to define a format of representation that makes manipulating these equivalences as easy as possible, and (b) to export these different formats into the de facto standard format for phylogenetic tree representation, namely the Newick format.

Concerning (a), I used the already existing partial mapping between these unique identifiers to create a TAB-separated CSV file listing the four codes, the four language names as well as the geographic coordinates for the languages in these four databases. Moreover, I defined a so-called Unique Universal Language IDentifier (or UULID) which has the format:

‘NAME [i-I][w-W][a-A][g-G]’

where CAPITAL LETTERS denote variables and the full node name is usually included within single quotes. NAME is the entity name as given by the classification, followed by a SPACE and the four unique codes I (ISO 639-3), W (WALS), A (AUTOTYP) and G (Glottocode), where each and all can be missing or can have multiple values (in which case the values are separated by “-“). A few examples are (from the WALS classification, the Indo-European family):

Concerning (b), I converted the specific format given by each database (except Glottolog) into the standard Newick tree format that basically represents trees using parentheses: the subtrees are enclosed within parentheses “()” and the (optional) branch length is given as a number immediately following the branch and separated from it by “:”. Moreover, the nodes in these Newick trees follow the UULID conventions above.

Branch lengths: what are they good for and adding them to the trees

These classifications give only the topologies of the language families but not any information on how long the branches in the tree are. This extra information encodes the amount of evolution that has happened on a branch and is extremely important for “advanced” phylogenetic methods such as Maximum Likelihood or Bayesian. How can we add branch length information?

If you are lucky and have good-quality basic vocabulary cognacy judgements (as per Indo-European or Austronesian), you can compute these branch length yourself using for example Bayesian phylogenetic methods, but for most language families this is currently not feasible.

Therefore, I implemented a set of methods for adding branch length information to a phylogeny, as follows:

(a) methods that depend only on the topology: (1) constant, (2) proportional and (3) grafen,
(b) methods that generate the branch length and topology from a distance matrix: (4) nj, and
(c) methods that map a given distance matrix onto the topology: (5) nnls and (6) ga.

The methods of type (a) only need a tree topology (and possibly a numeric constant). Method (1) computes branch lengths such that the sum of the branch lengths for every path in the tree is equal to the constant (the same amount of evolution has happened on all branches); method (2) simply gives each branch the same length such that the amount of evolution on a path is proportional to the number of splits on that path; method (3) is a classic whereby first each node is given a “height” defined as the number of leaves of its subtree minus 1 (0 for the leaves), after which branch lengths are computed as the difference between the height of the lower and the upper nodes of the branch.

Method (4) is the only one of type (b) used here and is a classic method in phylogenetics (Neighbor-Joining), a clustering method that iteratively joins taxa into higher groupings based on distance matrix between all the taxa.

Methods (5) and (6) try to use both the given language family’s tree topology and the information contained in a inter-language distance matrix by computing branch lengths that best approximate the original distances. Method (5) computes the branch lengths by using a non-negative least squares approach, while method (6) estimates the branch lengths using a standard genetic algorithm; they produce very similar results but there are also differences: method (5) is less robust than method (6), but method (6) is much slower, especially for very large trees, and might produce non-unique solutions.

As distance matrices between languages I have used the following:

a. distances based on vocabulary: (1) ASJP16,
b. distances based on geography: (2) great-circle geographic distances,
c. distances based on WALS: (3) gower and (4) euclidean, with and without missing data imputation,
d. distance based on AUTOTYP: (5) gower with missing data using only the variables with a single datapoint per language (this distance was computed by Balthasar Bickel), and
e. distances based on the tree topology: a new “genetic method” applied to the WALS (6), Ethnologue (7), Glottolog (8) and AUTOTYP (9) classifications.

Briefly, these distances encode the differences between languages based on a very restricted vocabulary (1), geographic location (2), structural differences between languages (3-5) and the language family tree (6-9).

Make sure to read first the README.md and the paper describing the whole thing (as PDF, HTML or, if you prefer, also the R markdown source)!

Please note that I tested the code quite extensively but errors are quite possible, so use it with caution and please do report any weird things (or good suggestions)!

I hope this code and Newick trees with branch lengths will be useful (and my 2+ years of intermittent work will thus prove a good investment).

When doing phylogenetics with R one often uses (directly or indirectly) the ape library, but there’s a catch: many times one has to get rid of nodes that have a single descendant by calling the collapse.singles() function.

An example might help here:

> library(phytools)
Loading required package: apeLoading required package: maps
> tree <- read.newick(text="((a:0.1)A:0.5,(b1:0.2,b2:0.1)B:0.2);")
> tree
Phylogenetic tree with 3 tips and 3 internal nodes.
Tip labels:
[1] "a" "b1" "b2"
Node labels:
[1] "" "A" "B"
Rooted; includes branch lengths.
> plot(tree)
Error in plot.phylo(tree) : there are single (non-splitting) nodes in your tree; you may need to use collapse.singles()

The tree ((a:0.1)A:0.5,(b1:0.2,b2:0.1)B:0.2) can be displayed as:

and it can be seen that node A has only one descendant a.

By calling collapse.singles(tree) one obtains the reduced tree (a:0.6,(b1:0.2,b2:0.1)B:0.2) with the node A removed; this can now be plotted (among other things):

But one manipulation one might want to do (and I do it myself a lot!) is to change the branch length (using, for example, nnls.tree() in phangorn) but then how do you get back to the original tree topology (i.e., re-insert the collapsed nodes)?

This is where a piece of R code I wrote recently (part of a larger script that I will discuss later) might help. I basically extended the original collapse.singles()‘s code by also recording the information about the removed nodes (their ancestor — if any — and their unique descendant as well as the branch length from the ancestor — if any — and to the descendant). This information can be used later to re-insert back the removed singles.

where tree and collapsed.tree are objects of class phylo (such as tree in our example), reverse.info is a list containing the information needed to re-insert the removed singles and is returned by collapse.singles.reversible(), and restore.brlen.method controls the manner in which the length of the collapsed branch is distributed when the single is re-inserted: "original.proportion" means that the proportion between (ancestor → single) and (single → descendant) in the original tree is maintained, "equal.proportion" means that the length is divided equally between the two new branches, while "first.zero" simply sets the ancestor → single branch length to 0.

Again, an example might help:

> reverse.info <- collapse.singles.reversible( tree )
> reverse.info
$original.treePhylogenetic tree with 3 tips and 3 internal nodes.Tip labels:[1] "a" "b1" "b2"Node labels:[1] "" "A" "B"Rooted; includes branch lengths.$collapsed.treePhylogenetic tree with 3 tips and 2 internal nodes.Tip labels:[1] "a" "b1" "b2"Node labels:[1] "" "B"Rooted; includes branch lengths.$removed.singles node name prev.node next.node prev.brlen next.brlen1 5 A 4 1 0.5 0.1$original.root[1] 4
>

The list reverse.info contains the original tree (in the field original.tree), the tree with the singles removed (in the field collapsed.tree) and the information on what nodes have been removed (in the field removed.singles) as well as the original tree’s root (in original.root). You can then play with the new collapsed tree by, for example, changing the branch lengths:

> tree2$edge.length[1] <- 3.0 # alter the length of the branch leading to a
> plot(tree2)

> tree3 <- reverse.collapse.singles(tree2, reverse.info, "original.proportion")
> tree3
Phylogenetic tree with 3 tips and 3 internal nodes.
Tip labels:
[1] "a" "b1" "b2"
Node labels:
[1] "" "A" "B"
Rooted; includes branch lengths.
>

tree3 cannot of course be plotted using ape (because of the re-inserted single node A) but its structure is:

It can be seen that not only A is safely re-inserted between the root and a, but that the new branch lengths respect the original ratio: (root → A) is 2.5 (original was 0.5) and (A → a) is 0.5 (original was 0.1) giving a ratio of 5:1.

Likewise,

> tree4 <- reverse.collapse.singles(tree2, reverse.info, "equal.proportion")
> tree4
Phylogenetic tree with 3 tips and 3 internal nodes.
Tip labels:
[1] "a" "b1" "b2"
Node labels:
[1] "" "A" "B"
Rooted; includes branch lengths.
>

The code is quite well tested but bugs are of course possible! Likewise, it is relatively fast but I guess it can always be optimized. If you find any bugs, have suggestions or comments, please drop me an e-mail.

Now, the actual code, released here under a GLPv2 license (ape is also GPL ≥ 2 and I used collapse.singles()‘s code as a foundation for my own), is given below. Feel free to us it!

# Extend ape's collapse.singles() to allow restoring the removed nodes.
# Copyright (C) 2015 Dan Dediu
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.# Auxiliary functions required by reverse.collapse.singles():
# Insert a row in a matrix at a given position:
.insert.row <- function(m, row=rep(NA,ncol(m)), where=nrow(m))
{
if( is.null(m) || !is.matrix(m) || where <= 0 || length(row) != ncol(m) ) return (m); # nothing to do
rnames <- rownames(m);
if( where == 1 )
{
m <- rbind(row,m);
} else if( where <= nrow(m) )
{
m <- rbind(m[1:(where-1),], row, m[where:nrow(m),]);
} else if( where == nrow(m)+1 )
{
m <- rbind(m,row);
} else
{
m <- rbind(m,matrix(NA,ncol=ncol(m),nrow=where-nrow(m)-1),row);
}
if( is.null(rnames) ) rownames(m) <- NULL;
return (m);
}
# Insert an element in a vector at a given position:
.insert.element <- function(v, element=NA, where=length(v))
{
if( is.null(v) || where <= 0 ) return (v); # nothing to do
rnames <- names(v);
if( where == 1 )
{
v <- c(element,v);
} else if( where <= length(v) )
{
v <- c(v[1:(where-1)], element, v[where:length(v)]);
} else if( where == length(v)+1 )
{
v <- c(v,element);
} else
{
v <- c(v,rep(NA,where-length(v)-1),element);
}
if( is.null(rnames) ) names(v) <- NULL;
return (v);
}
# Collapse single nodes also returning the collapsed nodes so that the original topology can be reconstructed later on:
collapse.singles.reversible <- function(tree) # return a list containing the new tree and the collapsed singles
{
# This is shamelessly inspired from ape's original collapse.singles() function# Cache several tree properties:
elen <- tree$edge.length;
xmat <- tree$edge;
node.lab <- tree$node.label;
nnode <- tree$Nnode;
ntip <- length(tree$tip.label);
# The original tree's root:
root <- find.root(tree);
# Start processing the singles:
singles <- NA;
removed.singles <- data.frame("node"=rep(NA,nnode), "name"=NA, "prev.node"=NA, "next.node"=NA, "prev.brlen"=NA, "next.brlen"=NA); k <- 1; tree.orig <- tree;
while( length(singles) > 0 )
{
tx <- tabulate( xmat[,1] );
singles <- which( tx == 1 ); # singles are those nodes with just one descendant
if( length(singles) > 0 )
{
i <- singles[1]; # focus on the first single
prev.node <- which( xmat[,2] == i ); # the ancestor
next.node <- which( xmat[,1] == i ); # the single descendant
prev.single.edge <- which(xmat[,1] == i); # the (ancestor -> single) edge
xmat[ xmat > i ] <- xmat[ xmat > i ] - 1L; # adjust the node numbering# Store this to be removed node:
removed.singles$node[k] <- i;
if( !is.null(node.lab) ){ removed.singles$name[k] <- node.lab[i - ntip]; };
if( length(prev.node) > 0 ){ removed.singles$prev.node[k] <- xmat[prev.node,1]; removed.singles$prev.brlen[k] <- elen[prev.node]; }
if( length(next.node) > 0 ){ removed.singles$next.node[k] <- xmat[next.node,2]; removed.singles$next.brlen[k] <- elen[next.node]; }
# Connect the (ancestor) directly to the (descendant) removing the (single):
xmat[prev.node, 2] <- xmat[next.node, 2]; # connect the ancestor directly to the descendant
xmat <- xmat[ -prev.single.edge,]; # delete the (ancestor -> single) edge
elen[prev.node] <- elen[prev.node] + elen[next.node]; # the new direct branch's length is the sum of the (ancestor -> single) and the (single -> descendant)'s branch lengths
if( !is.null(node.lab) ) node.lab <- node.lab[-c(i - ntip)]; # adjust the node labels as well
nnode <- nnode - 1L; # one less node
elen <- elen[-next.node]; # and one less edge
k <- k+1; # prepare for the (possible) next removed single
}
}
removed.singles <- removed.singles[ !is.na(removed.singles$node), ]; # clean them# Update the tree...
tree$edge <- xmat;
tree$edge.length <- elen;
tree$node.label <- node.lab;
tree$Nnode <- nnode;
# ...and return it
list("original.tree"=tree.orig, "collapsed.tree"=tree, "removed.singles"=removed.singles, "original.root"=root);
}
# Reverse the collapsing of single nodes:
reverse.collapse.singles <- function(collapsed.tree, reverse.info,
restore.brlen.method=c("original.proportion","equal.proportion","first.zero")) # use the original.tree and collapsed.singles list to restore the collapsed.tree
{
if( is.null(reverse.info) || is.null(reverse.info$removed.singles) || nrow(reverse.info$removed.singles) == 0 ) return (collapsed.tree); # nothing to restore!
restore.brlen.method <- restore.brlen.method[1];
# Cache several tree properties:
elen <- collapsed.tree$edge.length;
xmat <- collapsed.tree$edge;
node.lab <- collapsed.tree$node.label;
nnode <- collapsed.tree$Nnode;
ntip <- length(collapsed.tree$tip.label);
singles <- reverse.info$removed.singles; # less typing
# The removed.singles list records the order of removal: start from the end
for( i in nrow(singles):1 )
{
s <- singles[i,]; # save typing...
# Try to locate the ancestor ("prev") and decendant ("next") nodes in the tree as we need to restore this single between them:
prev.node <- s$prev.node; next.node <- s$next.node;
if( is.na(prev.node) )
{
# This is the new root: insert it:
xmat <- .insert.row(xmat, c(s$node, next.node), 1); # insert the link# Adjust the node numbering and names:
xmat[ xmat >= s$node ] <- xmat[ xmat >= s$node ] + 1L; xmat[1,1] <- s$node; # but make sure to keep the right root node
if( !is.null(node.lab) ) node.lab <- .insert.element(node.lab, s$name, s$node - ntip);
nnode <- nnode + 1L;
# Update the brlens:
if( !is.null(elen) )
{
elen <- .insert.element(elen, (s$next.brlen), 1);
}
} else
{
# Regular internal node:
dlink <- which(xmat[,1] == prev.node & xmat[,2] == next.node); # locate the (ancestor -> descendant) direct link
if( length(dlink) != 1 )
{
warning(paste0("Error restoring collapsed single node ",s$orig.node,ifelse(!is.na(s$node.name),paste0(", '",s$node.name,"'"),"")," cannot locate branch!\n"));
return (collapsed.tree);
}
# Adjust the node numbering and names:
xmat[ xmat >= s$node ] <- xmat[ xmat >= s$node ] + 1L;
if( !is.null(node.lab) ) node.lab <- .insert.element(node.lab, s$name, s$node - ntip);
nnode <- nnode + 1L;
# Replace the direct link by two links (ancestor -> single) and (single -> ancestor):
xmat[dlink,2] <- s$node;
# Insert a new link (single -> descendant):
xmat <- .insert.row(xmat, c(s$node, ifelse(next.node >= s$node, next.node+1L, next.node)), dlink+1);
# Update the brlens:
if( !is.null(elen) )
{
if( !(restore.brlen.method %in% c("original.proportion","equal.proportion","first.zero")) )
{
warning("Unknown branch length restoration method: defaulting to original proportion\n");
restore.brlen.method <- "original.proportion";
}
k <- elen[dlink]; # the brlen to be split
if( restore.brlen.method == "original.proportion" )
{
elen[dlink] <- k * s$next.brlen / (s$prev.brlen + s$next.brlen);
} else if( restore.brlen.method == "equal.proportion" )
{
elen[dlink] <- k / 2;
} else if( restore.brlen.method == "first.zero" )
{
elen[dlink] <- 0;
} else
{
# This should never happen
}
elen <- .insert.element(elen, (k - elen[dlink]), dlink);
}
}
}
# Update the tree...
collapsed.tree$edge <- xmat;
collapsed.tree$edge.length <- elen;
collapsed.tree$node.label <- node.lab;
collapsed.tree$Nnode <- nnode;
# ... and return it:
return (collapsed.tree);
}