Tuesday, June 27, 2017

Trees do not necessarily help in linguistic reconstruction

In historical linguistics, "linguistic reconstruction" is a rather important task. It can be divided into several subtasks, like "lexical reconstruction", "phonological reconstruction", and "syntactic reconstruction" — it comes conceptually close to what biologists would call "ancestral state reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources. The term lexical reconstruction is less frequently used, but it obviously points to the reconstruction of whole lexemes in the proto-language, and requires sub-tasks, like semantic reconstruction where one seeks to identify the original meaning of the ancestral word form from which a given set of cognate words in the descendant languages developed, or morphological reconstruction, where one tries to reconstruct the morphology, such as case systems, or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction only points to phonological reconstruction, which is something like the holy grail of computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights. Bouchard-Côté et al. (2013) use language phylogenies to climb a language tree from the leaves to the root, using sophisticated machine-learning techniques to infer the ancestral states of words in Oceanic languages. Hruschka et al. (2015) start from sites in multiple alignments of cognate sets of Turkic languages to infer both a language tree and the ancestral states, along with the sound changes that regularly occurred at the internal nodes of the tree. Both approaches show that phylogenetic methods could, in principle, be used to automatically infer which sounds were used in the proto-language; and both approaches report rather promising results.

Neither approach, however, is fully convincing, for both practical and methodological reasons. First, they are applied to language families that are considered to be rather "easy" to reconstruct. The tough cases are larger language families with more complex phonology, like Sino-Tibetan or any of its subbranches, including even shallow families like Sinitic (Chinese), or Indo-European, where the greatest achievements of the classical methods for language comparison have been made.

Second, they rely on the wrong assumption that the sounds used in a set of attested languages are necessarily the pool of sounds that would also be the best candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages, the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃, and which leave complex traces in the vocalism and the consonant systems of some Indo-European languages. Ever since then, it has been a standard assumption that it is always possible that none of the ancestral sounds in a given proto-language is still attested in any of its descendants.

A third interesting point, which I consider a methodological problem of the methods, is that both of them are based on language trees, which are either given to the algorithm or inferred during the process. Given that most if not all approaches to ancestral state reconstruction in biology are based on some kind of phylogeny, even if it is a rooted evolutionary network, it may sound strange that I criticize this point. But in fact, when linguists use the classical methods to infer ancestral sounds and ancestral sound systems, phylogenies do not necessarily play an important role.

The reason for this lies in the highly directional nature of sound change, especially in the consonant systems of languages, which often makes it extremely easy to predict the ancestral sound without invoking any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct a *k for the proto-language, since they know that [k] can easily become [ts] but not vice versa. The same holds for many sound correspondence patterns that can be frequently observed among all languages of the world, including cases like [p] and [f], [k] and [x], and many more. Why should we bother about any phylogeny in the background, if we already know that it is much more likely that these changes occurred independently? Directed character-state assessments make a phylogeny unnecessary.
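The reasoning can be sketched in a few lines of Python. This is only a minimal illustration, not an established method: the `PREFERRED` table is a hypothetical, hand-made stand-in for a catalogue of directed sound changes, filled with just the three pairs mentioned above, and `reconstruct` is an assumed helper name. No tree is consulted; the languages are treated as a star phylogeny.

```python
# Hypothetical catalogue of directed sound changes: source -> likely outcomes.
# Only the pairs mentioned in the text are listed; a real catalogue would be far larger.
PREFERRED = {
    "k": {"ts", "x"},
    "p": {"f"},
}

def reconstruct(sounds):
    """Return every sound from which all observed sounds can be derived.

    A candidate explains an observed sound if it is identical to it or can
    change into it in one step. Candidates need not be attested themselves.
    """
    observed = set(sounds)
    pool = observed | set(PREFERRED)
    return sorted(
        s for s in pool
        if observed <= {s} | PREFERRED.get(s, set())
    )

# One alignment site across four (made-up) languages.
site = {"A": "k", "B": "k", "C": "ts", "D": "ts"}
print(reconstruct(site.values()))   # ['k']

# An unattested proto-sound can also be recovered.
print(reconstruct(["ts", "x"]))     # ['k']
```

Because the candidate pool includes sounds that are not observed in any descendant, the sketch does not assume that the proto-sound must survive somewhere, which is the assumption criticized above.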

Sound change in this sense is not well treated in any paradigm that assumes some kind of parsimony, as the same change occurs independently too often. The question is less acute with vowels, where scholars have observed cycles of change in ancient languages that are attested in written sources. Even more problematic is the change of tones, where scholars have even less intuition regarding preference directions or preference transitions; and also because ancient data does not describe the tones in the phonetic detail we would need in order to compare it with modern data. In contrast to consonant reconstruction, where we can do almost exclusively without phylogenies, phylogenies may indeed provide some help to shed light on open questions in vowel and tone change.

But one should not underestimate this task, given the systemic pressure that may crucially affect vowel and tone systems. Since there are considerably fewer empty spots in the vowel and tone space of human languages, it can easily happen that the most natural paths of vowel or tone development (if they exist in the end) are counteracted by systemic pressures. Vowels can be more easily confused in communication, and this holds even more for tones. Even if changes are "natural", they could create conflict in communication, if they produce very similar vowels or tones that are hard for speakers to distinguish. As a result, these changes could provoke mergers in sounds, with speakers no longer distinguishing them at all; or alternatively, changes that are less "natural" (physiologically or acoustically) could be preferred by a speech community in order to maintain the effectiveness of the linguistic system.

In principle, these phenomena are well-known to trained linguists, although it is hard to find any explicit statements in the literature. Paradoxically, linguistic reconstruction (in the sense of phonological reconstruction) is hard for machines precisely because it is easy for trained linguists. Every historical linguist has a catalogue of existing sounds in their head as well as a network of preference transitions, but we lack a machine-readable version of those catalogues. This is mainly because transcription systems differ widely across subfields and families, and because no efforts to standardize these transcriptions have been successful so far.

Without such catalogues, however, any efforts to apply vanilla-style methods for ancestral state reconstruction from biology to linguistic reconstruction in historical linguistics will be futile. We do not need the trees for linguistic reconstruction, but the network of potential pathways of sound change.

  • Bouchard-Côté, A., D. Hall, T. Griffiths, and D. Klein (2013): Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11. 4224–4229.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1: 1-9.
  • Saussure, F. (1879): Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.

Tuesday, June 20, 2017

Cichlids, species and trees

Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.

The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.

Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:
Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, Richard Durbin (2017) Whole genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. BioRxiv 143859.
These authors summarize the situation like this:
We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.
The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.

The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".

For data analysis, they proceed as follows:
To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.
So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".
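The arithmetic behind this remark is simple: 2543 windows yielding 2542 distinct topologies leaves exactly one topology that occurs twice. The tally itself can be sketched as follows. This is an illustrative sketch only: the toy trees are made up rather than taken from the paper, the trees are treated as rooted for simplicity, and `topology` is a hypothetical helper that ignores branch lengths and the order of children.

```python
from collections import Counter

def topology(tree):
    """Canonical form of a rooted tree given as nested tuples of leaf names.

    Child order is ignored, so ("A", ("B", "C")) equals (("C", "B"), "A").
    """
    if isinstance(tree, str):
        return tree
    return frozenset(topology(child) for child in tree)

# Toy stand-ins for per-window ML trees (not data from the paper).
window_trees = [
    ("A", ("B", "C")),
    (("C", "B"), "A"),   # same topology as the first, written differently
    ("B", ("A", "C")),
    ("C", ("A", "B")),
]

counts = Counter(topology(t) for t in window_trees)
shared = sum(n for n in counts.values() if n > 1)

print(len(window_trees), "windows,", len(counts), "distinct topologies")
print(shared, "windows share their topology with another window")
```

With the toy input, four windows give three distinct topologies, and two windows share one of them, which mirrors the 2543 versus 2542 situation on a small scale.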

Example phylogeny from Malinsky et al. (2017)

The authors continue:
The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.
Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.

The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.

Coincidentally, another recent paper tackles the same problems:
Britta S. Meyer, Michael Matschiner, Walter Salzburger (2017) Disentangling incomplete lineage sorting and introgression to refine species-tree estimates for Lake Tanganyika cichlid fishes. Systematic Biology 66: 531-550.
The authors describe their work, on the same fish group but in a lake further north-west, as follows:
Because of the rapid lineage formation in these groups, and occasional gene flow between the participating species, it is often difficult to reconstruct the phylogenetic history of species that underwent an adaptive radiation. In this study, we present a novel approach for species-tree estimation in rapidly diversifying lineages, where introgression is known to occur, and apply it to a multimarker data set containing up to 16 specimens per species for a set of 45 species of East African cichlid fishes (522 individuals in total), with a main focus on the cichlid species flock of Lake Tanganyika. We first identified, using age distributions of most recent common ancestors in individual gene trees, those lineages in our data set that show strong signatures of past introgression ... We then applied the multispecies coalescent model to estimate the species tree of Lake Tanganyika cichlids, but excluded the lineages involved in these introgression events, as the multispecies coalescent model does not incorporate introgression. This resulted in a robust species tree.
Once again, phylogeny = species tree.

Tuesday, June 13, 2017

Bayesian inference of phylogenetic networks

Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

Network from Radice (2012)

The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.
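To give a sense of how much gene-tree incongruence ILS alone can produce, the sketch below evaluates the standard multispecies-coalescent result for the simplest case of three species. A gene tree matches the species tree with probability 1 − (2/3)·exp(−T), where T is the length of the internal branch in coalescent units, and each of the two discordant topologies has probability (1/3)·exp(−T). The function name and the example values of T are illustrative assumptions, not taken from any of the papers discussed here.

```python
import math

def triplet_probabilities(t):
    """Gene-tree topology probabilities for a three-species tree under the
    multispecies coalescent.

    t is the internal branch length in coalescent units.
    Returns (concordant, discordant_1, discordant_2).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0
    return 1.0 - 2.0 * discordant, discordant, discordant

# Illustrative branch lengths only.
for t in (0.0, 0.5, 1.0, 3.0):
    concordant, minor, _ = triplet_probabilities(t)
    print(f"T={t:.1f}  concordant={concordant:.3f}  each discordant={minor:.3f}")
```

Under ILS alone, the two discordant topologies are expected to be equally frequent, and neither exceeds one third; gene flow is one process that can break this symmetry, which is why it needs to be modeled separately.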

The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Tuesday, June 6, 2017

Bears, genomes and gene flow

It has traditionally been assumed that speciation occurs when gene flow between populations ceases. However, nothing in biology ever remains simple — the more we study any biological phenomenon the more complex it becomes. So, speciation with gene flow is becoming a more commonly discussed topic. This is especially so with the advent of genome sequencing, which allows us to study the extent of gene flow in the past, rather than solely in the present.

A case in point is the recent paper by:
Vikas Kumar, Fritjof Lammers, Tobias Bidon, Markus Pfenninger, Lydia Kolter, Maria A. Nilsson and Axel Janke (2017) The evolutionary history of bears is characterized by gene flow across species. Scientific Reports 7: 46487.
This paper considers the evolutionary relationships among seven species of bears, with multiple genome samples from four of those species. The coalescent species tree (based on 18,621 genome fragments > 25 kb), which accounts for incomplete lineage sorting (ILS), is well supported, as shown here.

However, numerous individual genome-fragment trees support alternative topologies. For example, 38% of the trees support a topology where the Asiatic black bear is the sister to the American black - Brown - Polar bear clade. This suggests that there is more than simply ILS that creates the conflicting genome trees.

The authors applied several different data analyses to investigate the possibility of gene flow among the species. They found considerable evidence for gene flow, as shown in the network (the arrow colors represent different analyses).

Indeed, each of the six in-group species could conceivably be connected by gene flow to each of the other five species. The network shows evidence that the Brown, Asiatic black and Sloth bears might have all five connections, while the Polar and Sun bears have four, and the American black bear has three.
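These counts can be checked with a little arithmetic. Six species allow 6 × 5 / 2 = 15 pairwise connections, and the stated numbers of connections per species sum to 26, which corresponds to 13 connected pairs. The following is a minimal sketch: the species labels follow the text, and the inference about which pairs are missing follows only from these counts, not from reading the paper's figure.

```python
from itertools import combinations

# Number of other in-group species each bear is connected to, as stated in the text.
degree = {
    "Brown": 5, "Asiatic black": 5, "Sloth": 5,
    "Polar": 4, "Sun": 4, "American black": 3,
}

possible_pairs = len(list(combinations(degree, 2)))
connected_pairs = sum(degree.values()) // 2   # each connected pair is counted twice
n_missing = possible_pairs - connected_pairs
print(possible_pairs, connected_pairs, n_missing)   # 15 13 2

# Species with five connections are linked to everyone, so a missing pair
# can only involve species with fewer than five connections.
incomplete = [s for s, d in degree.items() if d < 5]
candidates = list(combinations(incomplete, 2))

# Keep the sets of missing pairs that reproduce the stated counts.
solutions = []
for missing in combinations(candidates, n_missing):
    lost = {s: sum(s in pair for pair in missing) for s in incomplete}
    if all(degree[s] + lost[s] == 5 for s in incomplete):
        solutions.append(missing)

print(solutions)
```

Under these assumptions there is a single solution: the only pairs without a connection are Polar and American black, and Sun and American black.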

As the authors note, some of this potential gene flow cannot have occurred directly between species, because they live in different habitats. Instead, it may be remnants of ancestral gene flow, or gene flow through a vector species. In particular, the strongest signal of gene flow connects the Asiatic black bear with the ancestor of the American black - Brown - Polar bear clade.

Ancestral gene flow is of considerable importance when studying evolution. Charles Darwin was perhaps the first to note (in his notebooks) that we should always treat ancestors as species, not as taxonomic groups, no matter how big the groups of descendants now are. Whole kingdoms and phyla were once a single species, if the contemporary groups are monophyletic.