Tuesday, April 26, 2016

Phylogeny of a dataset

Phylogenetic methods have been applied to all sorts of research fields, including biology, linguistics, stemmatology and archaeology. There are many posts in this blog discussing examples of these applications, both good and bad.

However, some time ago a paper appeared that tried to apply these methods to datasets themselves, instead:
Andrea K. Thomer, Nicholas M. Weber (2014) The phylogeny of a dataset. In: Andrew Grove (ed.) Proceedings of the 77th ASIS&T Annual Meeting: Connecting Collections, Cultures, and Communities, Volume 51. ASIS&T, Silver Spring, Maryland 20910, USA.
The authors do a creditable job of describing phylogenetics for the uninitiated, but I am not convinced that their empirical application to "digital objects" works particularly well.

They describe their application as follows:
The digital objects under examination are different versions of the International Comprehensive Ocean and Atmosphere dataset (ICOADS).
ICOADS data consist of marine surface measurements and observations (e.g. sea-surface temperature, sea-level pressure, wave swell, wind direction, etc.) that have been digitized from historical ship logs, or taken from floating buoys. As a result of the broad time periods that the dataset covers (approximately 450 years, 1662–2014) the quality and reliability of the data varies considerably.
Much like a piece of software, ICOADS is an evolving dataset with intermittent releases. Version 1.0 – called simply COADS – was publically [sic] released in 1987, and contained almost 100 million historical observations starting in 1854 and continuing to 1979.
Thus, understanding the ways in which ICOADS evolved into new versions, and gave rise to "offspring" datasets over a thirty-year period is the focus of the case study presented below.
The significant properties being used as phylogenetic characters included: Entry Title, Entry ID, Summary, Geographic Coverage, Start Date, End Date, Geographic Resolution, Temporal Resolution, Scientific Keywords (often dataset parameters), Geographic Keywords, Sources (platform of data collection), and Instruments. Once collected, each field was converted into binary codes for "presence" or "absence" of individual keywords.
The problem here is that there is no implication that any of these characters are phylogenetically informative (ie. inherited), and thus that shared features might represent synapomorphies. In applications to linguistics, stemmatology and archaeology, on the other hand, it is at least likely that shared similarities might represent synapomorphies.

Given these data, the analyses cluster the datasets based on similarity — indeed, the authors explicitly refer to their tree-based analyses as "clustering algorithms". However, this form of analysis does not necessarily reveal history, in the sense that none of the analyses are explicitly historical. Historical patterns will be included in the outcome, but they will not necessarily be separable from patterns resulting from any other source. The resulting groups of datasets may or may not have historical meaning. The authors do, however, have a series of hypotheses (the groups) that can now be subject to scrutiny for possible historical interpretations.
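To make the coding step concrete, here is a minimal sketch of presence/absence coding followed by a simple pairwise distance. The dataset names and keywords are invented for illustration and are not taken from the paper, and the Hamming distance is my own choice rather than the authors' method. The point is only that the resulting numbers measure similarity, and say nothing on their own about inheritance.

```python
from itertools import combinations

# Hypothetical keyword sets for three made-up dataset versions
# (illustrative only; these are not the ICOADS metadata records).
datasets = {
    "release_A": {"sea surface temperature", "wind direction", "ships"},
    "release_B": {"sea surface temperature", "wind direction", "ships", "buoys"},
    "derived_C": {"sea level pressure", "buoys"},
}

def binary_matrix(records):
    """Code each keyword as 1 (present) or 0 (absent) for every dataset."""
    keywords = sorted(set().union(*records.values()))
    rows = {name: [int(k in kws) for k in keywords]
            for name, kws in records.items()}
    return keywords, rows

def hamming(a, b):
    """Count the characters at which two binary rows differ."""
    return sum(x != y for x, y in zip(a, b))

keywords, rows = binary_matrix(datasets)

# Pairwise distances: small values mean similar metadata, nothing more.
distances = {(p, q): hamming(rows[p], rows[q])
             for p, q in combinations(sorted(rows), 2)}

for pair, d in distances.items():
    print(pair, d)
```

Any clustering built on such a distance table will group the two "releases" together, but it would do so whether the shared keywords were inherited from a common ancestor or simply assigned independently by two cataloguers.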

For our purposes it is also worth noting that the authors do recognize one limitation of their analytic approach when applied to datasets:
A purely tree-based phylogenetic approach is also incapable of showing the exchange of traits between different lineages of digital objects, or cases in which several organisms merge into one; thus a reticulating network may be needed in lieu of a bifurcating tree.

Tuesday, April 19, 2016

Who first drew a family tree as a tree?

The Online Etymology Dictionary indicates that the English-language expression "Family tree" in the sense of "graph of ancestral relations" is first attested from 1752, in the novel A Genuine Account of the Life and Transactions of Howell ap David Price (which is available in Google Books).

Such pedigree diagrams have a much longer history, of course, but they were not called family trees, nor were they drawn with any particular tree-like imagery (except for the religious Tree of Jesse, pictures of which started appearing in the 10th century).
This leaves open the question of who first drew a tree-like family tree. [Note: see also the later post Drawing family trees as trees.]

Ernest H. Wilkins (1925. The genealogy of the genealogical trees of the Genealogia deorum. Modern Philology 23: 61-65) has suggested that it might be the Italian author and poet Giovanni Boccaccio (1313-1375), in his Genealogia Deorum Gentilium (On the Genealogy of the Gods of the Gentiles).

This Renaissance book was an "encyclopedic compilation of the tangled family relationships of the classical pantheons of Ancient Greece and Rome" (according to Wikipedia). It was written in Latin, apparently starting in c. 1350, and then continuously corrected and revised until the author's death. In c. 1370 an apograph [ie. perfect copy] was made of an autograph manuscript [ie. in the author's own hand], and from that first apograph other copies were made.

The 1370 autograph is not known to still exist; but a second autograph manuscript, showing later revisions, is in the Laurentian Library in Florence (MS. LII, 9). There are some three dozen extant apographs from the 1300s and 1400s, all based on the lost first autograph. The first printed edition was produced in Venice in 1472, followed by an edition of 1473 printed in Leuven. At least seven other editions appeared during the 1400s and 1500s. A French translation was published in Paris in 1498, and an Italian translation appeared in Venice in 1547. (See Ernest H. Wilkins. 1919. The genealogy of the editions of the Genealogia Deorum. Modern Philology 17: 425-438.)

The illustrations shown here are from various versions of the book.

Wilkins (1925) notes:
The extant autograph manuscript of the Genealogia Deorum of Boccaccio is illustrated by thirteen genealogical trees, designed certainly and drawn in all probability by Boccaccio himself. At the top of each tree is a large circle, in which is written the name of a divinity. From this circle descends a stem which now expands into other lesser circles, now sends forth leaves, and now branches, which in their turn expand into circles and send forth leaves and lesser branches. In the center of each circle or leaf a name is written. The circles are used for those divinities whose progeny is represented in the same tree; the leaves, for divinities whose progeny is not represented. In the circles the words qui genuit [ie. who fathered] follow each masculine name, and the words quae peperit [ie. who bore] each feminine name. Similar trees certainly appeared in the earlier lost autograph, from which all the apograph manuscripts are derived; and similar trees appear in several apographs, and in the fourth and all later editions of the Genealogia.
So far as I can ascertain, Boccaccio's trees are the earliest secular genealogical trees properly so called: that is to say, the first non-biblical genealogical charts in which stems, branches, and leaves appear.

This claim of priority has apparently gone unchallenged by later workers; eg. Christiane Klapisch-Zuber (1991. The genesis of the family tree. I Tatti Studies in the Italian Renaissance 4: 105-129) notes:
It may well be that Boccaccio was the first to combine the old graphic system of medallions in the descending order typical of medieval genealogies with the implications of a vegetal theme.

The vegetal image is quite obvious, although the leaves do vary widely in form within any one manuscript, and also from copy to copy. In the autograph they are palmately five-lobed. In some trees the different generations are indicated by variation in the colour of the branches.

To me, each of these diagrams looks more like a vine than a tree, especially with the root at the top.

Moreover, some of the printed editions do not contain the genealogies, and in others their form is modified. For example, some have a portrait of the progenitor divinity, and others bear scrolls or circles instead of leaves. Some of the trees have extra (empty) leaves or scrolls. It is thus quite clear that the tree metaphor for the pedigrees was not seen as important at the time.

Nevertheless, it is important to note that in the first two editions of the Italian translation by Giuseppe Betussi (1547 & 1554; but not in later editions) the first genealogy is drawn as an actual tree rooted in the ground, with the name of the progenitor appearing at the base of the trunk. Klapisch-Zuber notes:
In comparison with Boccaccio's divinely radiant foliage, this image must strike us as mean and desiccated. And yet, it is the triumph of the genealogical tree as we know it, planted right side up; and any one in the modern world can use it to evoke his ancestors and to express his faith in the survival of his lineage.

Wednesday, April 13, 2016

Monogenesis, polygenesis, and militant agnosticism

When playing the cognate hunting game or the etymology identification game in historical linguistics, there are many different rules that one needs to keep in mind. Words that look similar are not necessarily related — they could be simple look-alikes (Trask 2000:202). If words are too similar, they could be borrowings. If we quote colleague X from the camp of linguists believing in theory t₁ we should make sure that we also quote colleague Y from the camp of linguists believing in the theory t₂, especially if we do not know the peer reviewers, etc.

A particularly important rule that is often surprising for biologists is the rule that says we can only compare languages that we know are related. We could, of course, compare all languages in the world (and people do compare all languages in the world), but the point is that we are not allowed to compare languages historically unless we know that they share a common origin. This rule is reflected in a long-standing debate regarding the question of how we can prove that two languages are related. Here, we have basically two opposing camps, one claiming that only grammar can prove language relationship, and one claiming that only the lexicon is suitable for that task (Dybo and Starostin 2008, Campbell and Poser 2008).

That we have to prove that two or more languages are related before we can start to compare them is in strong contrast to biology. The idea of multiple origins as an alternative to a single origin has also been discussed in evolutionary biology (David has shown this in an earlier blogpost dealing with networks with multiple roots). In linguistics, however, we are largely agnostic regarding the common origin of all languages, and the degree of agnosticism may even go so far that it acquires a missionary zeal. Attempts to explain how language evolved, that is, how language originated as a means for communication, always run the danger of being ridiculed by the linguistic community. Under very bad circumstances, they can even cast a very dark shadow on the linguistic reputation of those who proposed them.

Affirming our disinterest in the origin of language has a long tradition. In its Statuts from 1866 (published in 1871), the Société de Linguistique de Paris declared that it would not support any research on the origin of language. Even August Schleicher, the father of the language tree, affirmed this attitude in a letter to Ernst Haeckel (Schleicher 1863: 22), where he wrote:
It is impossible to presuppose a material descent of all languages from a single proto-language. (My translation, original text: "Eine so zu sagen materielle Abstammung aller Sprachen von einer einzigen Ursprache können wir also unmöglich voraussetzen.")
Although it is not explicitly spelled out nowadays, these statutes are still active in most linguistic institutes.
Being agnostic about the origin of language means that we cannot exclude the possibility that two languages, like, say, Chinese and English, are ultimately not related at all. And if they are ultimately not related, it would be futile to compare them in the hope of finding linguistic material that goes back to a common ancestor. Biologists, who usually take the Tree of Life for granted (albeit a bush in the end), might ask about the reasoning behind this agnosticism in linguistics. The reasons are rather simple to state: If we make the very conservative assumption, based on archaeological records, that human language originated about 100,000 years ago (Dediu and Levinson 2013), and contrast it with the first written records of languages (about 5,000 years ago), and the presumed time depth of our current comparative method (Meillet 1925, Weiss 2014), which optimistically allows us to reach back 10,000 years, we simply do not have the means to make any qualified linguistic hypothesis regarding the origin of all those 7,000 and more languages spoken today (count based on Hammarström et al. 2015).

The reasons why linguists prefer to maintain an agnostic attitude are completely comprehensible to me. Whether it is good to be agnostic is another question. And whether it is good to be as militant as some linguists are regarding the question of language origin is yet another one. For the context of evolutionary biology, for example, a little bit of agnosticism regarding the Tree of Life might bring up interesting dynamics. The same could be said about a little bit of "faith" in linguistics, be it that one believes that language originated independently in multiple places at the same or different times, or be it that one supports a monophyletic origin of a "Language of Eden". Neither of the theories has immediate impact on the way we pursue our historical comparison of languages. Even under a monogenesis assumption we would still need to prove a close affinity between languages before we could start comparing them with our traditional methods.

In the long run, however, it might help us to get some of the tension out of our long-standing debates. If we took monogenesis for granted, for example, people would be less afraid of comparing random pairs of languages, and in the long run we could gain new insights into distant relationships. If we rejected monogenesis, on the other hand, we could try to identify how many times language originated independently.

It is (and here you see my own agnostic attitude) not really important whether we stick to monogenesis or polygenesis in the end. What is important is that we are clear about the consequences that either of these two theories might have on our research in the future. Agnosticism is a useful attitude as long as it does not prevent us from asking questions. Following up on David's earlier blogpost, it seems clear to me that linguists especially might profit a lot from rooted network approaches that allow for multiple roots, since these would allow us to keep our agnosticism without suppressing our curiosity.

  • Campbell, L. and W. Poser (2008): Language Classification: History and Method. Cambridge University Press: Cambridge.
  • Dediu, D. and S. Levinson (2013): On the antiquity of language: the reinterpretation of Neandertal linguistic capacities and its consequences. Frontiers in Psychology 4.397. 1-17.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Hammarström, H., R. Forkel, M. Haspelmath, and S. Bank (2015): Glottolog. Max Planck Institute for Evolutionary Anthropology: Leipzig. http://glottolog.org.
  • Meillet, A. (1954): La méthode comparative en linguistique historique [The comparative method in historical linguistics]. Honoré Champion: Paris.
  • Schleicher, A. (1863): Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Hermann Böhlau: Weimar.
  • Société de Linguistique de Paris (1871): Statuts. Approuvés par décision ministérielle du 8 Mars 1866. Bulletin de la Société de Linguistique de Paris 1. III-IV.
  • Trask, R. (2000): The Dictionary of Historical and Comparative Linguistics. Edinburgh University Press: Edinburgh.
  • Weiss, M. (2014): The comparative method. In: Bowern, C. and N. Evans (eds.): The Routledge Handbook of Historical Linguistics. Routledge: New York. 127-145.

Monday, April 4, 2016


The drawing of large genealogies is not easy, and phylogeneticists (among others) have tried a number of solutions, including circular diagrams as well as interactively zoomable displays. One interesting solution that does not appear to have yet been used in phylogenetics is the concept of GeneaQuilts.

These were introduced by the Visual Analytics Project:
A. Bezerianos, P. Dragicevic, J.-D. Fekete, J. Bae, B. Watson (2010) GeneaQuilts: a system for exploring large genealogies. In: IEEE InfoVis '10: IEEE Transactions on Visualization and Computer Graphics, Oct 2010, Salt-Lake City, USA.
The web page has a video introducing the concept, which does a better job than I can do here. The basic idea is to abandon the tree / network representation, and to use a diagonally-filled matrix instead, where the rows are individuals and the columns show parent-offspring relationships.

Here is an example genealogy, based on the reported relationships among the Greek Gods.

If the relationships are tree-like then the diagram will be concentrated on the diagonal of the matrix. However, network relationships (inbreeding) will cause off-diagonal elements, two of which are shown in the example: one involves Hades and his niece Persephone.
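The matrix idea can be sketched in a few lines. This is a much simplified illustration of the layout concept, not the GeneaQuilts algorithm itself; the input format and the function names are hypothetical, and the three families are a small hand-picked subset of the Greek pantheon. Each column is one parent-offspring grouping, with "P" marking a parent and "C" a child.

```python
# Hypothetical input: one entry per family (column), listing parents and children.
families = [
    {"parents": ["Cronus", "Rhea"], "children": ["Zeus", "Demeter", "Hades"]},
    {"parents": ["Zeus", "Demeter"], "children": ["Persephone"]},
    {"parents": ["Hades", "Persephone"], "children": []},
]

def row_order(families):
    """List individuals in order of first appearance, family by family."""
    order = []
    for fam in families:
        for person in fam["parents"] + fam["children"]:
            if person not in order:
                order.append(person)
    return order

def quilt(families):
    """Build a rows-by-columns grid: 'P' = parent, 'C' = child, '.' = empty."""
    people = row_order(families)
    grid = {p: ["."] * len(families) for p in people}
    for col, fam in enumerate(families):
        for parent in fam["parents"]:
            grid[parent][col] = "P"
        for child in fam["children"]:
            grid[child][col] = "C"
    return people, grid

people, grid = quilt(families)
for person in people:
    print(f"{person:<11}" + " ".join(grid[person]))
```

In the printed grid, most individuals have their marks in adjacent columns, which is what keeps a tree-like genealogy near the diagonal. Hades is a child in the first column and a partner in the third, skipping the second, and it is this kind of gap that shows up as an off-diagonal element.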

Several much larger examples are displayed on the GeneaQuilts website. There is also a downloadable program, which takes standard family-history files as its input.

There seems to be no intrinsic reason why this display form could not also be used in phylogenetics.