## Wednesday, June 26, 2019

### Taxonomic ranks: lots of arbitrary conventions but also some real useful information

Quanta Magazine's Christie Wilcox talked to comparative biologist Andreas Hejnol and others while writing
What’s in a Name? Taxonomy Problems Vex Biologists
In 1735, Carl von Linné (1707-1778) published Systema Naturae, where he introduced the clumping of species into groups and subgroups. That was done more than a century before the theory of evolution emerged – but Linné surely noticed some "family relationships" between organisms on Earth, which could have allowed him to discover Darwin's theory well before others.

The English names of these categories are memorized with the mnemonic
Dear King Philip Came Over For Great Spaghetti While Queen Elizabeth Always Prefers To Devour the Nuts in the Living Room
OK, the female part was added by your humble correspondent because of affirmative action (but the assertion is more accurate than the proposition about her husband). The categories are
domain, kingdom, phylum, division, class, order, family, genus, species
which is almost exactly copied from Latin:
regio, regnum, phylum, divisio, classis, ordo, familia, genus, species.
Note that a "family" and others are sometimes generalized to "superfamilies" and "subfamilies" or "infrafamilies", groups that are larger or smaller or much smaller than the original ones.

Only the first two ranks in the list above, domain and kingdom, differ noticeably between English and Latin. The others are Anglicized versions of the Latin words – the last two (genus, species) are completely identical. English biologists could have picked "region" and "reignerdom" to follow the Latin roots chosen for the kings (yes, Russians use "tsarstvo" for the kingdom, from a Tsar).

While English is a Germanic language, I find it rather amazing that all such elementary words have been taken from Romance languages – either from Latin or, more directly and recently, from French. That's not how it works in Czech. Our names of these groups are
doména, říše, kmen, oddělení, třída, řád, čeleď, rod, druh.
Only "doména" is imported from Latin – all the other words are Slavic in origin. "Říše" is an empire outside taxonomy, "kmen" is a tree trunk or tribe or strain, "oddělení" is a department, "třída" is a class, "řád" is an order, "čeleď" is an archaic name for a bunch of servants in a wealthy person's house, "rod" is a genus or something like a birth (although prefixes are usually added), "druh" is a species or a type.

I believe that the Czech language still has the world's second most complete naming system for the species after Latin – a nice side effect of the Czech National Revival launched (against Germanization) by the patriotic intellectuals some 250 years ago. All names are a noun followed by an adjective (more rarely, two adjectives). This is an unusual, archaic, scientific ordering of the words – which our Polish cousins bizarrely consider normal, modern, and casual, so for example, "hedgehog" is translated as "kaktus pochodowi" (the marching cactus) into Polish. ;-) According to a well-known joke, of course. A joke whose main idea is correct, however.

OK, Wilcox's article says that the grouping into the Linné categories is imperfect, arbitrary, often illogical, and that one may raise tons of complaints against it. And I completely agree with these objections. I realized them as a kid and I was happy and proud about the revelation. In fact, I do think it's rather important to make sure that people understand why these boxes can't be taken literally as some perfect, canonical categorization that exists in Nature.

You know, the distance or difference between two "families" in one "phylum" isn't necessarily "the same" as the distance or difference between two "families" in another "phylum". In fact, the difference between two families in one phylum may be comparable to the difference between two orders – or, on the contrary, two genera – in another phylum etc.

So the very notion of the "scale" really depends on the "location" in the space of possible DNAs. Well, more precisely, you probably can't invent any good enough universal definition of the "distance" that would be relevant and reasonable everywhere – which is why not only is the current taxonomy imperfect, but a perfect one cannot exist even in principle.

Also, whenever one of the groups is divided into more than two smaller subgroups, they're never quite equal (even though the taxonomy indicates that three or more equal subgroups of a larger group are almost omnipresent). Why? Because a group has never split into several subgroups simultaneously. Some phylum or division or class or... separated from the common ancestor first, and this split was primary or more important, and then these two subgroups may have separated further. In most cases, one of them tended to be much more conservative while the other wanted to split further, quickly, and repeatedly.

So the more correct classification of the species into groups should really be binary – species 10100101000100111 would indicate whether the ancestors at the important junctions went into the left direction or the right direction – but one must also understand that some of these binary vertices were much deeper than others. There isn't even a canonical way to decide whether the genetic difference between two siblings is large enough to justify the split in the taxonomy.
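The binary labeling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bit strings are invented (they are not real phylogenetic data) and `split_depth` is a hypothetical helper. Each bit records a left (0) or right (1) turn at a junction, so the length of the shared prefix of two codes counts the junctions that two lineages passed together before they parted.

```python
from os.path import commonprefix

# Hypothetical binary codes: each bit is a left (0) or right (1) turn
# at a junction of the evolutionary tree. The values are invented.
codes = {
    "hedgehog": "10100",
    "mole": "10101",
    "sparrow": "1011",
    "oak": "0",
}

def split_depth(a: str, b: str) -> int:
    """Number of junctions two lineages passed together before parting."""
    # commonprefix compares the strings character by character.
    return len(commonprefix([codes[a], codes[b]]))

print(split_depth("hedgehog", "mole"))     # 4
print(split_depth("hedgehog", "sparrow"))  # 3
print(split_depth("hedgehog", "oak"))      # 0
```

The count of shared junctions is well defined, but nothing in the codes says how deep each junction was in time or in genetic difference, which is exactly the caveat raised in the paragraph above.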

Once again, it is true and very important to realize that there are lots of imperfections in this classification – and this classification by Linné depends on many arbitrary conventions that prove that it wasn't directly copied from Nature. There is no guarantee whatsoever that biologists from a civilization that has never studied our papers would clump the species in an isomorphic way.

On the other hand, the clumping also carries some tangible and genuine content. Aside from the information that reflects conventions and that is social or unphysical because it fails to be gauge-invariant, there are also some gauge-invariant, physical quantities hiding in the existing taxonomy.

In any "locus" of the possible species, you may be pretty sure that the split into divisions was deeper than – and took place before – the corresponding split into classes, let alone orders, families, genera, or species. There are millions of species on Earth and you may learn them "at some resolution" by learning all kingdoms or all phyla or all divisions or all classes (it's getting harder). All these categories describe some resolution at which you are looking at the differences between terrestrial organisms. There is no universal quantification of the "degree of differences" between these organisms that would be valid for all of them – the taxonomy is "state-dependent", if I use a metaphor involving quantum gravity – but there's still something that works.
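The idea of looking at organisms "at some resolution" can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the binary codes are invented, each bit stands for a left (0) or right (1) turn at a junction of the tree, and `groups_at_resolution` is a hypothetical helper. Cutting every code after a fixed number of junctions clumps the species into coarser or finer groups.

```python
from collections import defaultdict

# Hypothetical binary codes (invented for illustration, not real data).
codes = {
    "hedgehog": "10100",
    "mole": "10101",
    "sparrow": "1011",
    "oak": "0",
}

def groups_at_resolution(codes: dict[str, str], depth: int) -> dict[str, list[str]]:
    """Clump species whose codes agree on the first `depth` junctions."""
    groups = defaultdict(list)
    for name, code in codes.items():
        # Codes shorter than `depth` are kept whole by the slice.
        groups[code[:depth]].append(name)
    return dict(groups)

print(groups_at_resolution(codes, 1))
# {'1': ['hedgehog', 'mole', 'sparrow'], '0': ['oak']}
print(groups_at_resolution(codes, 4))
# {'1010': ['hedgehog', 'mole'], '1011': ['sparrow'], '0': ['oak']}
```

A fixed `depth` is the crude part of this sketch: in the real tree, the same number of junctions may correspond to very different amounts of divergence in different branches, which is why the ranks cannot be compared literally across distant groups.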

That's why I also completely disagree with the "extremists" mentioned here:
“I’m not as extreme as some who think we should just get rid of them completely,” de Queiroz said. But he does see the traditional ranked system as a “weird, outmoded way of thinking” that emphasizes placement in the hierarchy over biological meaning.
Those who want to get rid of the ranks completely would surely throw the baby out with the bath water. There are a lot of arbitrary social conventions in the hierarchy and the names chosen for it. But there's a lot of real beef, too. If you don't have any viable replacement for the current taxonomy, it means that you want to throw much of the knowledge hiding in the taxonomy, if not all of it, out of the window. And that's bad because there's a lot of real information in it.

The second part of the quote above says that the hierarchical classification is a "weird, outmoded way of thinking" that emphasizes placement in the hierarchy over biological meaning. That's a new flavor of a complaint and it deserves a discussion, too.

Even with the hierarchies, the taxonomy involves lots of arbitrary words. Clearly, the individual words – and I have mentioned the names for the categories in Latin, English, Czech, and perhaps Russian – are "purely" social constructs and there's nothing in them. However, the clumping of the words already carries some information, although the "difference or distance between the groups of various size" can't be taken absolutely.

The groups of organisms differ by some very specific, technical know-how – almost by some ingenious ideas in applied physics, we could say – and the details of how this physics know-how works obviously aren't completely encoded in the names. For example, vertebrates are a subphylum – it's weird that such a well-defined group doesn't coincide with one of the basic ranks and it needs a sub- instead. Just to be sure, vertebrates are almost all of the species in the higher Chordata phylum.

Even though most people and almost all university employees are spineless, people generally belong to the subphylum of vertebrates. At some point, mutations in Nature created some backbone and/or notochord and after some time, perhaps earlier than after a few generations, the contest choosing the fittest survivor showed that the backbones etc. were a rather good idea. So the organisms with this trait were spreading – and then splitting into subgroups.

There's a lot of interesting "physics of the backbones" that you may discuss. Almost none of it is included in the mere word "vertebrates". So of course if you learn just the names, you will know almost nothing about the physics of backbones and similar natural achievements of applied physics. So the "people of the linguistic type" who just love to memorize the names will know almost nothing after they learn the name, as Feynman loved to emphasize with the example of the "names of the birds" (which was another attack on Gell-Mann, a keen birdwatcher, I concluded).

On the other hand, the people who love to memorize the names may also love to know "definitions that are a few words long" – I know some of these bookworms. ;-) So they would probably like to learn the etymology of "vertebrates". "Vertebra" means "the joint of the spine" in Latin, they would memorize, which already gives them a sketch of the physics of vertebrates. They don't learn much but, at the very least, by learning the word "vertebrates", they have created a placeholder that tells them "I should learn some physics of the backbone" at some point. Well, only some of the bookworms understand even the very general concept that there's usually more real knowledge waiting to be learned beyond their 5-word memorized "definitions".

Once again, it's completely right and desirable for students – and people in general – to understand why the taxonomy is imperfect and why lots of the information that it conveys is just a collection of rather arbitrarily chosen social conventions that have depended on some random people's decisions in history. Not all the information included in the categories is "real" and reflects some objective data on Nature.

On the other hand, the correlation is nonzero. The categorization also conveys some real information. And if your response to the "imperfection" were to ban all of Linné's taxonomy, you would surely harm science – the people's understanding of the natural world. Even if you presented just a proposal to modify the terminology, you should think twice – and others will hopefully think twice – about whether the reform is actually an improvement. If the reform were created just to emphasize the general point that the current taxonomy is (or will have been) imperfect and dependent on many conventions, the newly proposed terminology would almost certainly be inferior. It would be an awkward ideological newspeak that would pump the proposition "Linné's terminology was imperfect" into the name of every species or the higher groups if there were any. And a thesis shouldn't be repeated this often because it's a waste of time. The idea that Linné's system is imperfect is just one deep but simple insight and people shouldn't repeat it thousands of times whenever they talk about a hedgehog.

I also want to say another thing: It would be good for the terminology to faithfully copy the important events in the history of the evolution of species – if that is possible, which I doubt. However, evolution isn't a necessary condition for the taxonomy to have some value. Linné's taxonomy already had some value a century before evolution was understood – as a classification of patterns in the landscape of the currently living species on Earth – and it still has a lot of value that is independent of the evolutionary history. One shouldn't always be absolutely obsessed with the history leading to the contemporary species. Knowing what the patterns in the landscape are, regardless of the history, is nontrivial knowledge, too.

The ranks, groups, and subgroups don't describe just objective and state-independent properties of species in Nature. But they make sense and the content of this well-defined framework carries some information about the natural world. And that's the optimum situation that any terminological system designed for a very complex and non-uniform system of objects and knowledge (such as life on Earth) may achieve. I would argue that this conclusion of mine is much more general and people should realize it in many fields, not just in "comparative biology". (Clearly, the political correctness attempting to delegitimize or ban the talk about groups of humans is a major target of mine right here. The discussion about human races or dog breeds – the Czech language uses the term "rasa" for both – is clearly nothing else than a finer-resolution, sub-species continuation of the debate above.)

The human language is a highly imperfect and often arbitrary image of the patterns in Nature (or the society) but it is still useful and correlated with Nature. The elimination of the terminology for categories isn't a good "solution" to the largely unavoidable imperfection of the human language.

And that's the memo.

Bonus: concerning the social conventions and scientific terminology, I must remind you that Planck's constant and other fundamental constants already have precisely known numerical values in the SI system, which is great – they are just known multipliers away from the adult physicists' units with $$1=c=\hbar=\dots$$. On May 20th, 2019, the redefinition of one kilogram and other units that I had been proposing for many years (in fact, well before 2012) came into effect. A small victory of mine (of course, I guess I wasn't the only one, otherwise it couldn't have taken place).
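To make the "known multipliers" concrete, here is a minimal Python sketch. The numerical values of $h$, $c$, and $e$ are the exact ones fixed by the 2019 SI redefinition; the choice of the electronvolt as the energy unit and the variable names are just assumptions made for this illustration.

```python
import math

# Exact values fixed by the 2019 SI redefinition.
h = 6.62607015e-34      # Planck constant, J*s
c = 299_792_458         # speed of light, m/s
e = 1.602176634e-19     # elementary charge, C (so 1 eV equals e joules)

# The reduced Planck constant follows from h by a purely numerical factor.
hbar = h / (2 * math.pi)

# Energy equivalent of one kilogram (E = m c^2), expressed in eV.
# In units with c = 1 this is simply "1 kg in eV".
kg_in_eV = 1.0 * c**2 / e

print(f"hbar = {hbar:.6e} J*s")
print(f"1 kg = {kg_in_eV:.6e} eV (with c = 1)")
```

Since the constants are exact by definition, the only inexactness in such conversions comes from floating-point rounding and from the irrational factor $2\pi$.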