Forests, trees and gene expressions

Four studies of mixing science and nature

This is a rather unusual post for me as it concerns art (Chinese brush painting, to be more precise) and, on top of that, somebody else’s art.

Probably you don’t know it, but the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands (where I currently work) has a new building housing mainly the Department of Language and Genetics headed by Simon Fisher (of FOXP2 fame). Now, that’s a nice thing in itself but it also means that there’s a lot of empty, freshly painted walls waiting for art!

As it happens, Alexandra Dima, who in her daily life is a postdoctoral researcher in Health Psychology at the University of Amsterdam (she does lots of cool and useful stuff with relevance to keeping all of us healthy, mainly involving statistics), is also an artist, lately specializing in Chinese brush painting. Her name might ring a bell as she created the cover of my book An introduction to genetics for language scientists, and this is how she was invited to also create art for the new building.

The challenge was twofold: not only to create aesthetically pleasing paintings that fit well with the new building’s (internal and external) environment, but to create something that speaks to the research and the people that work there. So, this is how it started…


The final result is displayed on the second floor and looks like this:

scrolls-all
The four scrolls on the second floor’s wall (there are four windows just in front of them, too and, please, ignore me: I’m there just to give an impression of the size).

There are some points worth making. First, the space is quite large (and, therefore, the paintings themselves are also large, at 70cm x 140cm without the frames). Second, just opposite the paintings are four large windows facing roughly south-west (so that there’s plenty of sun in the afternoon) overlooking a large forested park (dominated by deciduous trees such as oak). Third, this is a very interdisciplinary place where we study language, genetics, neurobiology, using all sorts of cool methods ranging from wet-lab-based to statistical. And this is where the paintings start speaking…

The four paintings in all their glory (and high resolution) are below:

scroll_1

scroll_2

scroll_3

scroll_4


I will give here just a few clues, leaving most of the messages and metaphors hidden for you to discover. For example, if you really look at the branches of those trees, do they remind you of something else? Something to do with the brain, the way neurons connect (and talk) to each other?

detail-branches
Are these tree branches or what?

And, really, really, what are those patches of colour behind the trees? Leaves in the autumn? City lights? Or maybe a reference to a much-used way of showing the patterns of gene expression (more generally known as heat maps; for the geeks, this is one way of drawing them in R)? Or maybe they remind you of the patterns one sees in DNA sequencing or even doing PCR?

radioactive_fluorescent_seq1
DNA sequencing (courtesy of Wikipedia)

Wait, that’s not all! There are weird scribblings on the trees! Some look like some sort of old Chinese writing:

detail-seal-script
Seal script “talk”

Some look like cuneiforms, more precisely Ugaritic from about the 14 century BCE…

detail-cuneiform
Ugaritic

… and this one translates to GATCACAGGT, the origin of the so-called Cambridge Reference Sequence (CRS) of the human mitochondrial DNA (mtDNA).

Some might recognize these

detail-runes
Runes

as runes (or the Futhark alphabet, used in Northern Europe on wood, stone and metal), meaning TATAAA, the so-called TATA box, one of the most important promoter sequences (where gene transcription starts and a crucial locus of gene regulation).

But for me the most fascinating are the

detail-ogham
Ogham

Ogham inscriptions, used to write Early (4-6 century CE) and Old (6-9 CE) Irish on wood and stone, with their long and serpentine form; this one here reads also GATCACAGGT.

There’s more on those trees for you to enjoy and I won’t spoil the surprises here (but I do at the end so please skip that if you want to discover them yourself!), but there are more inscriptions in these alphabets but also in Etruscan (8-7 century BCE in Italy) and Phoenician (about 11 century BCE in the Mediterranean), and more messages with particular relevance to genetics (hint: one concerns FOXP2, another the first genome to be completely sequenced, but there’s more).


The technique (“spontaneous style” or xieyi 寫意) is very demanding, particularly for such large paintings forming a coherent set, because there’s no fixing of errors: once committed to paper a stroke stays there! So, if errors happen they must be either integrated or everything starts anew. The paper is also very thin and absorbent, which means that mounting such large paintings was very difficult and risky. Alexandra used as original materials (paper, ink, brushes) as possible and her technique and style is also faithful to the traditional masters.

The small red square with Chinese seal script is Alexandra’s artist seal:

seal


Thus, Alexandra’s paintings on the second floor try to speak to genetics and language using Chinese brush painting conventions and style, but the paintings would not look so good without the skill and dedication of Jos Witsiers of Peter van Ginkel in Arnhem who not only framed the paintings but helped in choosing the right colours, glass and frames.


Finally, Peter Sansom created three very beautiful paintings (completely different style but very warm and peaceful) for the first floor of the same building. Here is the summary of the unveiling giving more details about the event itself.


Spoiler alert!!!

Here is the list of all the scripts used in Alexandra’s paintings and of the genetic (or otherwise) messages they represent:

The scripts were mostly used for carving on wood or stone, as their shapes fit best idea of inscriptions on trees; they are: Ogham, Futhark (runes), Ugaritic (cuneiforms), Etruscan, Phoenician, and four Chinese seal script inscriptions: 問 (“ask”), 士 (“scholar”), 說 (“talk”), 基因 (“gene”), and 明白 (“understand”).

The genetic messages are: TATAAA (the “TATA box“), AUG/TAC (the START codon where transcription begins), CAAATT (FOXP2 core binding sequence), ATTA (Homeo-domain binding sequence), GATCACAGGT (reference human mtDNA sequence origin), and GAGTTTTATC GCTTCCATGA (the origin of the ΦX174 genome, the first ever genome to be completely sequenced in 1977).


Disclaimer: I have been involved in designing these paintings in an advisory capacity and I am actually married to Alex:)

 

Errata for ‘An introduction to genetics for language scientists’ (Update 1)

This post is about various errata concerning the first edition (2015) of my book An introduction to genetics for language scientists published by Cambridge University Press. I thought it would be good to keep my reader updated so here it is…

Many thanks to those who have signalled these errors, especially to Pierre Dupont (Machine Learning Group, ICTEAM Institute, Louvain School of Engineering, Louvain-la-Neuve, Belgium), Scott Moisik (Language and Genetics Department, Max Planck for Psycholinguistics, Nijmegen, The Netherlands) and Mark Rosenstein.

Needless to say, but if you do find errors or things that are not as clear as they should be, please do drop me an e-mail at Dan.Dediu@mpi.nl with the page and the description of the problem. Thanks in advance!


Now, the actual comments & corrections are in a separate document: as HTML, as PDF, and even the R Markdown source.

Language family classifications as Newick trees with branch length

Short summary

One type of non-independence between languages is due to descent from a common ancestor, forming language families. There are several classifications of languages into language families, each with its own advantages and disadvantages, but they are relatively difficult to use by computational methods due to a lack of standardization. Moreover, phylogeentic methods usually require not only the topology of the language family tree but also information concerning the amount of evolution that has happened on the tree represented as the branch lengths, and this information is usually missing.
Here I present a method that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the Newick standard, aligns the four most used conventions of unique identifiers for linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information form a variety of sources (the tree’s own topology, an externally given numeric constant or a distance matrix).
The R scripts, input data and resulting Newick trees are provided in a GitHub repository in the hope that this will promote the use of advanced quantitative methods in answering questions concerning linguistic diversity and its temporal dynamics.

More information can be found in the GitHub repository, especially the README.md file and the paper, this blog post being just a quick summary.

Language families are groups of languages related through common descent

Languages are not independent entities for a host of reasons, probably the most important being shared ancestry and contact; what this practically means is that one cannot a priori assume treat a set of languages are statistically independent and thus one must properly adjust their statistical inferences appropriately.

Non-independence due to shared ancestry is due to the fact that languages usually derive from pre-existing “mother” languages (called proto-languages) just as biological species derive from ancestral species.

A well-known example is represented by the Romance languages (including French, Italian, Spanish, Portuguese, Romansh and Romanian) which, even at a superficial level are very similar (e.g., speakers of Italian and Spanish can talk to each other while speakers of Romanian have very little trouble learning Italian or Spanish) because they derive from the Latin spoken throughout the Roman empire about 2000 years ago. In this case, (Vulgar) Latin represents the proto-language of the Romance subfamily; I said “subfamily” because this grouping is part of a much larger set of languages — the so-called Indo-European languages — that also contain the Germanic languages (such as German, Danish and Dutch), the Indo-Aryan languages (including Hindi, Urdu and Punjabi), the Slavic languages (Russian, Czech and Polish being some examples), Greek, and several others, grouping that forms a language family (in this case deriving from its own proto-language called Proto-Indo-European).

Now there are many such language families (their number and composition depends on the source —  more on this below), but the point is that, just like in biology, the daughter languages derived from the same proto-language tend to be more similar than expected by chance due to their tendency to inherit properties from the proto-language, similarity that tends to decrease the more time has passed since that separation from the common ancestor (for example, Italian and Spanish are obviously more similar to each other than each is to German). This is a more general issue that affects many aspects of culture and is known as Galton’s problem.

Also just like in biology, a popular representation of language families is in terms of trees that purport to show the patterns of vertical inheritance from mother languages (proto-languages) to their daughter languages, such as the tree below representing (part of) the structure of the Indo-European language family.

IE tree
Indo-European language family tree using modern Bayesian phylogenetic methods. From http://language.cs.auckland.ac.nz; see Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A., & Atkinson, Q. D.* (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337:957–960.

As a side note, such trees clearly do not capture the whole story (for example, they have trouble representing contact) and there’s a lot of research going on about better models for language history; however, trees do seem to capture something important about this history, they are very powerful inferential models and are (more and more) widely used to shed light on the linguistic past.

Language family classifications and their use

Now, these being said, where can we find such language families, what can we use them for, how and with what caveats?

As we saw, the central idea behind a language family is that those languages descend from a common ancestor, but in many cases we simply do not know that for sure for lack of manpower, primary data or because the situation is so complex that conflicting proposals exist. Even in the well-studied case of Indo-European where the “golden” comparative method has a long history, things are not entirely clear especially in what concerns the internal structure of the tree, the dating of various splits and the place where this happened (but modern Bayesian phylogenetic methods using mostly basic vocabulary cognacy judgements help for some large and well-studied families such as Indo-Europen, Austronesian and Bantu).

Probably the most used databases that offer classifications of languages into “language families” are the Ethnologue, WALS, AUTOTYP, and Glottolog, but they differ in several relevant respects among which:

  1. the criteria used to classify languages into language families and the criteria used to further refine the language family’s internal structure;
  2. related, the sources of information used to make decisions based on these criteria;
  3. the number of levels in a classification (limited to e.g., maximum 4 or unlimited varying among families);
  4. the unique identifiers used for the languages (or dialects, proto-languages, etc.) that are classified;
  5. the options for downloading and the format(s) available for download.

I do not want to express too strong opinions on the first two points (mainly because I am not an expert myself) but these days I tend to rely more on the classifications contained in the Glottolog database; therefore in the work presented here I treated these four databases on an equal footing.

However, for this discussion the last two points are more important. There are in fact four different standards for uniquely identifying languages (or other type of entities): ISO 639-3 codes (tree letters), WALS codes (three letters), AUTOTYP LIDs (numeric), and Glottocodes (alphanumeric: four letters followed by four digits); for example, (Standard) English is identified as eng, eng, 74 and stan1293, respectively. Moreover, there is as yet (to my knowledge at least) no resource that provides mappings between these unique identifiers, complicating the cross-linking of various datasets.

Finally, these four databases differ in how readily the language classification data is available for download an importing in various software programs.

  • the Ethnologue data is the most difficult to access (with the goal of further processing) because even if it allows in its Terms of Use the use of “portions” of the data for “research or educational purposes”, it requires the download of a master HTML page containing a list of all language families and links to their respective webpages, which must then be downloaded and parsed to extract the tree structure of the family, the group names, and the language names and their ISO 639-3 codes;
  • WALS provides the whole database (including language name, codes, geographic coordinates but also values for more than 130 typological features) under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Germany (CC BY-NC-ND 2.0 DE); here the important columns are WALS, ISO 639-3 and Glottolog codes, the languages’ name, genus and family, resulting in a rather flat three-levels structure;
  • the AUTOTYP trees are freely available for download, use and distribution provided that their source is clearly cited; the format of the language families is similar to the WALS in the sense that each language (row) contains the language names, the AUTOTYP LID, the Glottolog and the ISO 639-3 codes, as well as the stock, mbranch, sbranch, ssbranch and lsbranch names, each denoting more and more superficial levels (i.e., the “stock” is the highest level corresponding to the language family), and in some cases intermediate levels might be missing;
  • finally, Glottolog provides the family trees in a standardized Newick format under a Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0) license.

Given this diversity of language identifiers and forms, the use of these classifications for computational tasks is not straightforward.

Converting the language classifications to a standard Newick tree format and mapping the unique identifiers

Thus, the first two things I did were (a) to map as best as possible the four unique identifiers and to define a format of representation that makes manipulating these equivalences as easy as possible, and (b) to export these different formats into the de facto standard format for phylogenetic tree representation, namely the Newick format.

Concerning (a), I used the already existing partial mapping between these unique identifiers to create a TAB-separated CSV file listing the four codes, the four language names as well as the geographic coordinates for the languages in these four databases. Moreover, I defined a so-called Unique Universal Language IDentifier (or UULID) which has the format:

‘NAME [i-I][w-W][a-A][g-G]’

where CAPITAL LETTERS denote variables and the full node name is usually included within single quotes. NAME is the entity name as given by the classification, followed by a SPACE and the four unique codes I (ISO 639-3), W (WALS), A (AUTOTYP) and G (Glottocode), where each and all can be missing or can have multiple values (in which case the values are separated by “-“). A few examples are (from the WALS classification, the Indo-European family):

  • ‘German {Zurich} [i-gsw][w-gzu][a-1305-1306-1307-1308-1309-1310][g-swis1247]’
  • ‘Urdu [i-urd][w-urd][a-2671][g-urdu1245]’
  • ‘Romani {Sepecides} [i-][w-rse][a-][g-]’
  • ‘Germanic [i-][w-][a-][g-]’.

Concerning (b), I converted the specific format given by each database (except Glottolog) into the standard Newick tree format that basically represents trees using parentheses: the subtrees are enclosed within parentheses “()” and the (optional) branch length is given as a number immediately following the branch and separated from it by “:”. Moreover, the nodes in these Newick trees follow the UULID conventions above.

Branch lengths: what are they good for and adding them to the trees

These classifications give only the topologies of the language families but not any information on how long the branches in the tree are. This extra information encodes the amount of evolution that has happened on a branch and is extremely important for “advanced” phylogenetic methods such as Maximum Likelihood or Bayesian. How can we add branch length information?

If you are lucky and have good-quality basic vocabulary cognacy judgements (as per Indo-European or Austronesian), you can compute these branch length yourself using for example Bayesian phylogenetic methods, but for most language families this is currently not feasible.

Therefore, I implemented a set of methods for adding branch length information to a phylogeny, as follows:

(a) methods that depend only on the topology: (1) constant, (2) proportional and (3) grafen,
(b) methods that generate the branch length and topology from a distance matrix: (4) nj, and
(c) methods that map a given distance matrix onto the topology: (5) nnls and (6) ga.

The methods of type (a) only need a tree topology (and possibly a numeric constant). Method (1) computes branch lengths such that the sum of the branch lengths for every path in the tree is equal to the constant (the same amount of evolution has happened on all branches); method (2) simply gives each branch the same length such that the amount of evolution on a path is proportional to the number of splits on that path; method (3) is a classic whereby first each node is given a “height” defined as the number of leaves of its subtree minus 1 (0 for the leaves), after which branch lengths are computed as the difference between the height of the lower and the upper nodes of the branch.

Method (4) is the only one of type (b) used here and is a classic method in phylogenetics (Neighbor-Joining), a clustering method that iteratively joins taxa into higher groupings based on distance matrix between all the taxa.

Methods (5) and (6) try to use both the given language family’s tree topology and the information contained in a inter-language distance matrix by computing branch lengths that best approximate the original distances. Method (5) computes the branch lengths by using a non-negative least squares approach, while method (6) estimates the branch lengths using a standard genetic algorithm; they produce very similar results but there are also differences: method (5) is less robust than method (6), but method (6) is much slower, especially for very large trees, and might produce non-unique solutions.

As distance matrices between languages I have used the following:

a.  distances based on vocabulary: (1) ASJP16,
b.  distances based on geography: (2) great-circle geographic distances,
c.  distances based on WALS: (3) gower and (4) euclidean, with and without missing data imputation,
d.  distance based on AUTOTYP: (5) gower with missing data using only the variables with a single datapoint per language (this distance was computed by Balthasar Bickel), and
e.  distances based on the tree topology: a new “genetic method” applied to the WALS (6), Ethnologue (7), Glottolog (8) and AUTOTYP (9) classifications.

Briefly, these distances encode the differences between languages based on a very restricted vocabulary (1), geographic location (2), structural differences between languages (3-5) and the language family tree (6-9).

Where can I get these and why are they important?

Fair questions:) All these (and much more, including a detailed description of the input data, the output files and the process, and the actual R code implementing all this) are available from a GitHub repository.

Make sure to read first the README.md and the paper describing the whole thing (as PDF, HTML or, if you prefer, also the R markdown source)!

Please note that I tested the code quite extensively but errors are quite possible, so use it with caution and please do report any weird things (or good suggestions)!

I hope this code and Newick trees with branch lengths will be useful (and my 2+ years of intermittent work will thus prove a good investment).

Tired of what ape’s collapse.singles() does? Here’s a way to put those nodes back…

R logo

When doing phylogenetics with R one often uses (directly or indirectly) the ape library, but there’s a catch: many times one has to get rid of nodes that have a single descendant by calling the collapse.singles() function.

An example might help here:

> library(phytools)
Loading required package: ape
Loading required package: maps
> tree <- read.newick(text="((a:0.1)A:0.5,(b1:0.2,b2:0.1)B:0.2);")
> tree

Phylogenetic tree with 3 tips and 3 internal nodes.

Tip labels:
[1] "a"  "b1" "b2"
Node labels:
[1] ""  "A" "B"

Rooted; includes branch lengths.
> plot(tree)
Error in plot.phylo(tree) : 
  there are single (non-splitting) nodes in your tree; you may need to use collapse.singles()

The tree ((a:0.1)A:0.5,(b1:0.2,b2:0.1)B:0.2) can be displayed as:

<tree> (3 tips, 2 levels)
├─────A : 0.500
│     └─a : 0.100
└──B : 0.200
   ├──b1 : 0.200
   └─b2 : 0.100

and it can be seen that node A has only one descendant a.

By calling collapse.singles(tree) one obtains the reduced tree (a:0.6,(b1:0.2,b2:0.1)B:0.2) with the node A removed; this can now be plotted (among other things):

The tree with the single node A removed.
The tree with the single node A removed.

But one manipulation one might want to do (and I do it myself a lot!) is to change the branch length (using, for example, nnls.tree() in phangorn) but then how do you get back to the original tree topology (i.e., re-insert the collapsed nodes)?

 

This is where a piece of R code I wrote recently (part of a larger script that I will discuss later) might help. I basically extended the original collapse.singles()‘s code by also recording the information about the removed nodes (their ancestor — if any — and their unique descendant as well as the branch length from the ancestor — if any — and to the descendant). This information can be used later to re-insert back the removed singles.

The code is implemented as a pair of functions

collapse.singles.reversible(tree)

and

reverse.collapse.singles(collapsed.tree, reverse.info, restore.brlen.method=c("original.proportion","equal.proportion","first.zero"))

where tree and collapsed.tree are objects of class phylo (such as tree in our example), reverse.info is a list containing the information needed to re-insert the removed singles and is returned by collapse.singles.reversible(), and restore.brlen.method controls the manner in which the length of the collapsed branch is distributed when the single is re-inserted: "original.proportion" means that the proportion between (ancestor → single) and (single  → descendant) in the original tree is maintained, "equal.proportion" means that the length is divided equally between the two new branches, while "first.zero" simply sets the ancestor → single branch length to 0.

Again, an example might help:

> reverse.info <- collapse.singles.reversible( tree )
> reverse.info
$original.tree

Phylogenetic tree with 3 tips and 3 internal nodes.

Tip labels:
[1] "a"  "b1" "b2"
Node labels:
[1] ""  "A" "B"

Rooted; includes branch lengths.

$collapsed.tree

Phylogenetic tree with 3 tips and 2 internal nodes.

Tip labels:
[1] "a"  "b1" "b2"
Node labels:
[1] ""  "B"

Rooted; includes branch lengths.

$removed.singles
  node name prev.node next.node prev.brlen next.brlen
1    5    A         4         1        0.5        0.1

$original.root
[1] 4
> 

The list reverse.info contains the original tree (in the field original.tree), the tree with the singles removed (in the field collapsed.tree) and the information on what nodes have been removed (in the field removed.singles) as well as the original tree’s root (in original.root). You can then play with the new collapsed tree by, for example, changing the branch lengths:

> tree2 <- reverse.info$collapsed.tree

> plot(tree2)

> tree2$edge.length[1] <- 3.0 # alter the length of the branch leading to a

> plot(tree2)
collapsed-tree-new-brlen
> tree3 <- reverse.collapse.singles(tree2, reverse.info, "original.proportion")
> tree3

Phylogenetic tree with 3 tips and 3 internal nodes.

Tip labels:
[1] "a"  "b1" "b2"
Node labels:
[1] ""  "A" "B"

Rooted; includes branch lengths.
>

tree3 cannot of course be plotted using ape (because of the re-inserted single node A) but its structure is:

<tree3> (3 tips, 2 levels)
├─────────────────────────A : 2.500
│                         └─────a : 0.500
└──B : 0.200
   ├──b1 : 0.200
   └─b2 : 0.100

It can be seen that not only A is safely re-inserted between the root and a, but that the new branch lengths respect the original ratio: (root → A) is 2.5 (original was 0.5) and (Aa) is 0.5 (original was 0.1) giving a ratio of 5:1.

Likewise,

> tree4 <- reverse.collapse.singles(tree2, reverse.info, "equal.proportion")
> tree4

Phylogenetic tree with 3 tips and 3 internal nodes.

Tip labels:
[1] "a"  "b1" "b2"
Node labels:
[1] ""  "A" "B"

Rooted; includes branch lengths.
>

is:

<tree4> (3 tips, 2 levels)
   ├───────────────A : 1.500
   │                 └───────────────a : 1.500
   └──B : 0.200
      ├──b1 : 0.200
      └─b2 : 0.100

with the two new branches equal to 1.5.

 

The code is quite well tested but bugs are of course possible! Likewise, it is relatively fast but I guess it can always be optimized. If you find any bugs, have suggestions or comments, please drop me an e-mail.

 

Now, the actual code, released here under a GLPv2 license (ape is also GPL ≥ 2 and I used collapse.singles()‘s code as a foundation for my own), is given below. Feel free to us it!

# Extend ape's collapse.singles() to allow restoring the removed nodes.
# Copyright (C) 2015  Dan Dediu

# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.


# Auxiliary functions required by reverse.collapse.singles():
# Insert a row in a matrix at a given position:
.insert.row <- function(m, row=rep(NA,ncol(m)), where=nrow(m))
{
  if( is.null(m) || !is.matrix(m) || where <= 0 || length(row) != ncol(m) ) return (m); # nothing to do
  rnames <- rownames(m);
  if( where == 1 )
  {
    m <- rbind(row,m);
  } else if( where <= nrow(m) )
  {
    m <- rbind(m[1:(where-1),], row, m[where:nrow(m),]);
  } else if( where == nrow(m)+1 )
  {
    m <- rbind(m,row);
  } else
  {
    m <- rbind(m,matrix(NA,ncol=ncol(m),nrow=where-nrow(m)-1),row);
  }
  if( is.null(rnames) ) rownames(m) <- NULL;
  return (m);
}

# Insert an element in a vector at a given position:
.insert.element <- function(v, element=NA, where=length(v))
{
  if( is.null(v) || where <= 0 ) return (v); # nothing to do
  rnames <- names(v);
  if( where == 1 )
  {
    v <- c(element,v);
  } else if( where <= length(v) )
  {
    v <- c(v[1:(where-1)], element, v[where:length(v)]);
  } else if( where == length(v)+1 )
  {
    v <- c(v,element);
  } else
  {
    v <- c(v,rep(NA,where-length(v)-1),element);
  }
  if( is.null(rnames) ) names(v) <- NULL;
  return (v);
}

# Collapse single nodes also returning the collapsed nodes so that the original topology can be reconstructed later on:
collapse.singles.reversible <- function(tree) # return a list containing the new tree and the collapsed singles
{
  # This is shamelessly inspired from ape's original collapse.singles() function
  # Cache several tree properties:
  elen <- tree$edge.length;
  xmat <- tree$edge;
  node.lab <- tree$node.label;
  nnode <- tree$Nnode;
  ntip <- length(tree$tip.label);
  
  # The original tree's root:
  root <- find.root(tree);
  
  # Start processing the singles:
  singles <- NA;
  removed.singles <- data.frame("node"=rep(NA,nnode), "name"=NA, "prev.node"=NA, "next.node"=NA, "prev.brlen"=NA, "next.brlen"=NA); k <- 1; tree.orig <- tree; 
  while( length(singles) > 0 )
  {
    tx <- tabulate( xmat[,1] );
    singles <- which( tx == 1 ); # singles are those nodes with just one descendant
    if( length(singles) > 0 )
    {
      i <- singles[1]; # focus on the first single
      
      prev.node <- which( xmat[,2] == i ); # the ancestor
      next.node <- which( xmat[,1] == i ); # the single descendant
      prev.single.edge <- which(xmat[,1] == i); # the (ancestor -> single) edge

      xmat[ xmat > i ] <- xmat[ xmat > i ] - 1L; # adjust the node numbering

      # Store this to be removed node:
      removed.singles$node[k] <- i;
      if( !is.null(node.lab) ){ removed.singles$name[k] <- node.lab[i - ntip]; };
      if( length(prev.node) > 0 ){ removed.singles$prev.node[k] <- xmat[prev.node,1]; removed.singles$prev.brlen[k] <- elen[prev.node]; }
      if( length(next.node) > 0 ){ removed.singles$next.node[k] <- xmat[next.node,2]; removed.singles$next.brlen[k] <- elen[next.node]; }
      
      # Connect the (ancestor) directly to the (descendant) removing the (single):
      xmat[prev.node, 2] <- xmat[next.node, 2]; # connect the ancestor directly to the descendant        
      xmat <- xmat[ -prev.single.edge,]; # delete the (ancestor -> single) edge

      elen[prev.node] <- elen[prev.node] + elen[next.node]; # the new direct branch's length is the sum of the (ancestor -> single) and the (single -> descendant)'s branch lengths 
      if( !is.null(node.lab) ) node.lab <- node.lab[-c(i - ntip)]; # adjust the node labels as well
      nnode <- nnode - 1L; # one less node
      elen <- elen[-next.node]; # and one less edge
      
      k <- k+1; # prepare for the (possible) next removed single
    }
  }
  removed.singles <- removed.singles[ !is.na(removed.singles$node), ]; # clean them
  
  # Update the tree...
  tree$edge <- xmat;
  tree$edge.length <- elen;
  tree$node.label <- node.lab;
  tree$Nnode <- nnode;
  
  # ...and return it
  list("original.tree"=tree.orig, "collapsed.tree"=tree, "removed.singles"=removed.singles, "original.root"=root);
}

# Reverse the collapsing of single nodes:
reverse.collapse.singles <- function(collapsed.tree, reverse.info, 
                                     restore.brlen.method=c("original.proportion","equal.proportion","first.zero")) # use the original.tree and collapsed.singles list to restore the collapsed.tree
{
  if( is.null(reverse.info) || is.null(reverse.info$removed.singles) || nrow(reverse.info$removed.singles) == 0 ) return (collapsed.tree); # nothing to restore!
  
  restore.brlen.method <- restore.brlen.method[1];
  
  # Cache several tree properties:
  elen <- collapsed.tree$edge.length;
  xmat <- collapsed.tree$edge;
  node.lab <- collapsed.tree$node.label;
  nnode <- collapsed.tree$Nnode;
  ntip <- length(collapsed.tree$tip.label);
  singles <- reverse.info$removed.singles; # less typing
  
  # The removed.singles list records the order of removal: start from the end
  for( i in nrow(singles):1 )
  {
    s <- singles[i,]; # save typing...
    # Try to locate the ancestor ("prev") and decendant ("next") nodes in the tree as we need to restore this single between them:
    prev.node <- s$prev.node; next.node <- s$next.node;
    if( is.na(prev.node) )
    {
      # This is the new root: insert it:
      xmat <- .insert.row(xmat, c(s$node, next.node), 1); # insert the link
      
      # Adjust the node numbering and names:
      xmat[ xmat >= s$node ] <- xmat[ xmat >= s$node ] + 1L; xmat[1,1] <- s$node; # but make sure to keep the right root node
      if( !is.null(node.lab) ) node.lab <- .insert.element(node.lab, s$name, s$node - ntip);
      nnode <- nnode + 1L;
      
      # Update the brlens:
      if( !is.null(elen) )
      {
        elen <- .insert.element(elen, (s$next.brlen), 1);
      }
    } else
    {
      # Regular internal node:
      dlink <- which(xmat[,1] == prev.node & xmat[,2] == next.node); # locate the (ancestor -> descendant) direct link
      if( length(dlink) != 1 )
      {
        warning(paste0("Error restoring collapsed single node ",s$orig.node,ifelse(!is.na(s$node.name),paste0(", '",s$node.name,"'"),"")," cannot locate branch!\n"));
        return (collapsed.tree);
      }
      
      # Adjust the node numbering and names:
      xmat[ xmat >= s$node ] <- xmat[ xmat >= s$node ] + 1L;
      if( !is.null(node.lab) ) node.lab <- .insert.element(node.lab, s$name, s$node - ntip);
      nnode <- nnode + 1L;
      
      # Replace the direct link by two links (ancestor -> single) and (single -> ancestor):
      xmat[dlink,2] <- s$node; 
      # Insert a new link (single -> descendant):
      xmat <- .insert.row(xmat, c(s$node, ifelse(next.node >= s$node, next.node+1L, next.node)), dlink+1);
      
      # Update the brlens:
      if( !is.null(elen) )
      {
        if( !(restore.brlen.method %in% c("original.proportion","equal.proportion","first.zero")) )
        {
          warning("Unknown branch length restoration method: defaulting to original proportion\n");
          restore.brlen.method <- "original.proportion";
        }
        k <- elen[dlink]; # the brlen to be split
        if( restore.brlen.method == "original.proportion" )
        {
          elen[dlink] <- k * s$next.brlen / (s$prev.brlen + s$next.brlen);
        } else if( restore.brlen.method == "equal.proportion" )
        {
          elen[dlink] <- k / 2;
        } else if( restore.brlen.method == "first.zero" )
        {
          elen[dlink] <- 0;
        } else
        {
          # This should never happen
        }
        elen <- .insert.element(elen, (k - elen[dlink]), dlink);
      }
    }
  }
  
  # Update the tree...
  collapsed.tree$edge <- xmat;
  collapsed.tree$edge.length <- elen;
  collapsed.tree$node.label <- node.lab;
  collapsed.tree$Nnode <- nnode;
  
  # ... and return it:
  return (collapsed.tree);
}

“An Introduction to Genetics for Language Scientists” is out!

My new book, An Introduction to Genetics for Language Scientists, published by Cambridge University Press, is out now!

An introduction to genetics cover

There’s no doubt that genetics (genomics, bioinformatics, and other related disciplines) have a major impact on many aspects of science and popular culture. And that, more and more, we need them to even ask relevant questions about language and speech: I can’t even imagine thinking about say language origins and evolution, the patterning of linguistic diversity or first language acquisition, to mention just a few, without the methods, concepts and results coming from genetics. But there’s no easy way to learn these as a student or researcher interested in language; most curricula do not include them and, if you’re lucky, you might hear brief mentions here and there.

When I took up this project many years ago, I missed a concise resource that would allow the language scientists (this is a very diverse audience that I take to include, among others, theoretical linguists, typologists, historical linguists, phoneticians, but also speech therapists and computational linguists) to acquire not only the foundations of genetics relevant to them, but also some of the latest findings in the field. But I also wanted it to explain the ideas and methods, not just contain a collection of results. And I wanted it to be interesting for people that work on language (cancer genetics or obesity are fascinating topics but not terribly relevant to them). And it had to be short… (oh, did I already mention that?)

I hope the result will be useful to students but also to more senior academics and, why not, to the general public. I am covering here topics such as the nature – nurture and innateness debate, the heritability of language and speech, discuss fundamental molecular concepts and mechanisms, how genes are found using linkage and association studies, illustrate some fascinating examples of genes relevant to speech and language showing the complexity and beauty of how genes affect the phenotype, but I also cover population and evolutionary genetics and the co-evolution between biology and culture. Thus, quite a bit…

And, while making it accessible without requiring too much preexisting background in mathematics and biology, I did not hide the beauty and complexity of the statistical techniques and molecular processes involved. I would like to give its reader now only knowledge about findings but a deeper understanding of how the things work and how we can study them. As I say in the introduction, I’d like to allow the readers to not only follow the literature but to actively and meaningfully contribute to it. I strongly believe that no mater how powerful and cheap our genetic technologies will become, it is essential that in those huge teams there are scientists who really understand language as well.

The Table of Contents, the Introduction and some book extracts are available for free online here. There’s also a pretty comprehensive Glossary as well, and a huge References section containing pointers to introductory materials, foundational readings but also cutting-edge findings and methods; there’s also lots of links to online resources, tools and software (and even some R code in the appendices).

I hope you will enjoy it!

However, things move very fast and newer relevant findings will undoubtedly be published, new software made available and new methods discovered. I will try to post updates here, on the BITsaying blog, trying to keep you up-to-date with what is happening. But I’d like to invite you to also post updates, comments and suggestions here, hopefully creating a community with shared interests.

Finally, what about the cover? It’s a very cool Chinese brush painting by Alexandra Dima (who’s both an artist and a scientist) and there’s more to it than meets the eye. The title hints at something (“Πάντα ῥεῖ under Darwin’s tree“), but there’s more…