## Monday, May 21, 2018

### Laurel Yanny is a sister of Marilyn Einstein

A Lumosque introduction to the spectral analysis of vowels and consonants

There are two famous people named Marilyn Einstein. One of them is the trans-sexual daughter of Marilyn Monroe and Albert Einstein (yes, Marilyn had relationships with many famous men):

Her genes are 50% from Einstein, 50% from Monroe, and the percentages don't change much. But the other Marilyn Einstein is this one:

Ze differs from the first one because ze can be close to 100% Einstein or close to 100% Monroe, depending how we look at zir.

If your eyesight is sharp and you can see the short-distance, high-resolution details, you may see Einstein's mustache and wrinkles and you prefer to conclude it's Einstein. When your eyesight is fuzzy enough, you average the areas, the mustache and wrinkles disappear, and what is left is the big-picture shape of the head with the haircut which is closer to Marilyn Monroe.

A week or two ago, millions of people got obsessed with a nearly isomorphic auditory illusion:

In 2007, opera singer Jay Jones recorded the pronunciation of 200,000 words for vocabulary.com. Those included this word that was meant to be "Laurel" (the name of Stan Laurel, a man from the Laurel-Hardy duo of comedians from the silent era of the movies). When your hearing is good and sensitive to the high-resolution details – which are stored especially at high frequencies because that's where the bitrate is higher – then you can hear that what he ended up saying is actually "Yanny".

If your hearing of high frequencies is poor or if you artificially amplify the low frequencies relative to the high ones, you may hear "Laurel", as originally intended and as shown in this second audio file. There's nothing shocking about it – hearing "Laurel" is analogous to seeing Marilyn; hearing "Yanny" is analogous to seeing Einstein.

Both the visual and audible files may be filtered in ways that suppress or amplify the low frequencies relative to the high ones, or vice versa, and that's how you can make them look or sound closer to Marilyn, Yanny, Laurel, or Einstein, whatever you prefer.

Sources such as CBS claim that the audio "technically" says "Laurel", so the answer "Laurel" is "technically" correct. I think that this formulation tries to sound fancy but it is technically deceitful. There is nothing "technical" about the sense in which "Laurel" could be preferred. The word was intended to sound like Laurel.

But if you analyze it with the best voice recognition technology you can find (and I think that the use of "technology" is what could be said to determine the "technically correct" answer), I am convinced that the answer will be "Yanny". So I think it's more accurate to say that the word is technically Yanny, not Laurel, than to say the opposite, as CBS did.

Some people may be shocked that different people hear different words. I am not shocked at all. I have struggled with Americans' incomprehensible sounds, especially vowels, for a decade, and many of them struggled with mine. It's a rule, not an exception, that some sounds sound incomprehensible.

All vowels are complicated and determining which vowel is "actually" heard is an analog problem (a measurement of continuous observables). People love to imagine that things are simple and digital but they're not. Discrete physicists are crackpots – Stephen Wolfram and a few other great men should replace the C-word with a more diplomatic one but I can't think of any accurate replacement right now. ;-)

How do sounds differ from each other? Well, a sound is captured by some mostly fluctuating pressure $$p(t)$$ that is a function of time. If the pressure goes like $$p(t)=p_0 \cos \omega t$$, then you hear a harmonic sound of a fixed frequency. Most generic speech has changing frequencies and many of them are combined in every period of time (I can't really say "at each moment" because the sound at a "single short moment" cannot have a well-defined frequency, basically due to the "uncertainty principle"). But even when a perfectionist opera singer sings vowels that are meant to have a fixed frequency, there are other frequencies in the sound, too.
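To make the "uncertainty principle" remark slightly more quantitative, here is the standard Fourier inequality, written as a sketch under the assumption that $$\Delta t$$ and $$\Delta\omega$$ denote the standard deviations of $$|p(t)|^2$$ and $$|\tilde p(\omega)|^2$$ (these two symbols are added notation, not something defined in the text above):

$$\Delta t \cdot \Delta \omega \geq \frac{1}{2}, \qquad \text{i.e.} \qquad \Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

The bound is saturated by a Gaussian envelope. So a snippet of sound that lasts about $$\Delta t = 10\,{\rm ms}$$ cannot have its frequency defined much better than $$\Delta f \approx 8\,{\rm Hz}$$.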

Here you have the spectral analysis of the Canadian vowels. What are the graphs? They drew $$p(t)$$, the pressure near the mouth, as a function of time, and performed the Fourier transform to get $$\tilde p(\omega)$$, a function of frequencies. That shows which frequencies are maximally represented. The graphs above (the individual ones describe the sounds "i,u; e,o; ae,are"; check the pictures to see what I mean) depict $$\log |\tilde p(\omega)|$$ where $$\omega=2\pi f$$. There's the logarithm because the vertical axis is said to be in decibels.
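A minimal sketch of how such a decibel spectrum could be computed, assuming a sampled signal rather than the continuous $$p(t)$$: a naive discrete Fourier transform followed by $$20\log_{10}$$ of the magnitude. The sample rate, the 440 Hz test tone, and the function name `dft_magnitude_db` are illustrative assumptions, not the procedure that was actually used to produce the graphs.

```python
import cmath
import math

def dft_magnitude_db(samples, floor=1e-12):
    """Naive O(N^2) DFT; returns 20*log10(|P_k|/N) for bins k = 0 .. N/2."""
    n = len(samples)
    spectrum_db = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        magnitude = abs(acc) / n
        # Clamp tiny magnitudes so that the logarithm stays finite.
        spectrum_db.append(20 * math.log10(max(magnitude, floor)))
    return spectrum_db

# Assumed test signal: a pure 440 Hz cosine, p(t) = cos(omega * t).
SAMPLE_RATE = 8000          # samples per second (assumption)
N = 400                     # 0.05 s of sound, so the bin spacing is 20 Hz
tone = [math.cos(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(N)]

spectrum_db = dft_magnitude_db(tone)
peak_bin = max(range(len(spectrum_db)), key=spectrum_db.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(f"strongest frequency: {peak_hz} Hz")
```

For a pure harmonic tone the spectrum collapses to a single sharp peak, which is the delta-function-like case that real vowels are compared against.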

So you see that the graphs are very far from a delta-function. Instead, it looks like the graph is smooth and the width or error margin in the frequency is comparable to the frequencies themselves. How could the opera singers claim that such inaccurate vowels correspond to particular tones?

Well, if you focus on the Yanny Einstein aspect of the graphs above, you will see that the graphs are actually not smooth. They correspond to the sum of lots of delta-functions. All of them are localized at multiples of 100 hertz. So all these vowels sung by an opera singer will be represented as the rather deep sound of frequency 100 hertz – despite the fact that the graphs tell you about the amplitudes for the frequencies comparable to thousands of hertz!
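The claim that a sound with a fixed fundamental frequency only contains integer multiples of that frequency can be checked numerically. The following is a sketch under stated assumptions: the harmonic amplitudes in `HARMONIC_AMPLITUDES` are made up for illustration (they are not measured from any singer), and the sample rate and duration are chosen so that the DFT bins fall exactly on multiples of 10 Hz.

```python
import cmath
import math

F0_HZ = 100.0               # fundamental frequency, as in the text
SAMPLE_RATE = 4000          # samples per second (assumption)
N = 400                     # 0.1 s of sound, so the bin spacing is 10 Hz

# Hypothetical amplitudes for a few harmonics: {harmonic number: amplitude}.
HARMONIC_AMPLITUDES = {1: 1.0, 2: 0.6, 3: 0.3, 10: 0.4, 15: 0.2}

def periodic_sample(index):
    """Sum of cosines at integer multiples of F0_HZ, sampled at `index`."""
    t = index / SAMPLE_RATE
    return sum(a * math.cos(2 * math.pi * h * F0_HZ * t)
               for h, a in HARMONIC_AMPLITUDES.items())

def dft_magnitudes(samples):
    """Naive DFT magnitudes |P_k|/N for bins k = 0 .. N/2."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))) / n
            for k in range(n // 2 + 1)]

signal = [periodic_sample(i) for i in range(N)]
magnitudes = dft_magnitudes(signal)

# Keep only the bins that carry a non-negligible amplitude.
peaks_hz = [k * SAMPLE_RATE / N for k, m in enumerate(magnitudes) if m > 1e-6]
print(peaks_hz)
```

Every surviving peak sits on a multiple of 100 Hz, and the information about the "vowel" is carried entirely by the relative heights of those peaks.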

What does it mean? It means that to determine which vowel is sung by an opera singer when he produces a tone of a fixed frequency $$f$$, you need to analyze the relative representation of the tenth or twentieth harmonic $$10f$$ or $$20f$$ included in the noise! All these very high harmonics matter a great deal.

And indeed, if you look at the relative representation of the lower and higher harmonics, you will generally see that the higher frequencies become more represented as you go from the initial sounds to the final ones in the sequence:
U - O - A - E - Y - Í
Well, I wrote the Czech vowels. (I and Y are actually pronounced exactly the same, and so are Í and Ý, so it's really the difference between the long and short vowels I/Y that changes the character of the noise. I wrote the two different sounds as Y-Í to convey the point both in the "intuitive" [Y should be "lower pitch" than I] as well as the "audible" sense.) You may translate the vowels above as "OO - AW - ARE - EH - Y - EE": the English or otherwise non-Czech spelling really sucks. Try to pronounce these six vowels and you will see that your mouth increasingly resembles a thin horizontal slit. It's the variable complex geometry of your mouth and throat, an echo chamber, that produces the higher harmonics with adjustable amplitudes.

Now, to classify the vowels by a single parameter, like I did (the coordinate along the UOAEYÍ axis), is another simplification – but it's more accurate than to talk about the fundamental frequency only. In reality, "the most general vowel" is classified by the relative representation of all higher harmonics, so you need infinitely many (or at least dozens of) real numbers (with some accuracy) to describe the character of a vowel. Two parameters are usually enough to describe the vowel rather well. You may approximate your mouth by an ellipse with semi-axes $$a,b$$ – and these two semi-axes give you the two additional parameters that, along with the fundamental frequency, describe a vowel. So you may see lots of 2D charts, basically charts in the $$ab$$-plane, where the vowels may be attached to individual points.
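One hedged way to turn "the relative representation of the higher harmonics" into a single number is the amplitude-weighted mean frequency, usually called the spectral centroid. This is a swapped-in illustration rather than the coordinate along the UOAEYÍ axis itself, and both amplitude profiles below are hypothetical numbers chosen only to show the direction of the effect.

```python
def spectral_centroid(harmonic_amplitudes, f0_hz):
    """Amplitude-weighted mean frequency of harmonics 1, 2, 3, ... of f0_hz."""
    total = sum(harmonic_amplitudes)
    weighted = sum(a * (i + 1) * f0_hz
                   for i, a in enumerate(harmonic_amplitudes))
    return weighted / total

F0_HZ = 100.0  # the same fundamental for both sounds

# Hypothetical profiles: amplitudes of harmonics 1..6 (not measured data).
dark_profile = [1.0, 0.8, 0.4, 0.1, 0.0, 0.0]    # weight on low harmonics
bright_profile = [1.0, 0.2, 0.1, 0.1, 0.5, 0.9]  # weight on high harmonics

dark_centroid = spectral_centroid(dark_profile, F0_HZ)
bright_centroid = spectral_centroid(bright_profile, F0_HZ)
print(dark_centroid, bright_centroid)
```

Two sounds with the same fundamental frequency end up at different positions on this one-parameter axis, which is the sense in which a single number already says more than the fundamental frequency alone.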

Now, you shouldn't be surprised that if you artificially lower the loudness of the high frequency sounds in the Yanny/Laurel file, you increase the ratio of low-to-high frequencies' volumes, and that shifts you to the left on the UOAEYÍ axis. For example, the first vowel in Yanny is "a" – but that's very close to my Czech "E". But if you increase the low-frequency sounds, you may get through "A" up to "O" – and indeed, "AU" is pronounced as the Czech "O". So it makes complete sense. The emphasis on the low frequencies moves the first vowel from "A" to "AU" (Czech: from "E" to "O") and similarly, the second vowel is moved from "Y" to "E" (both in Czech and English).
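The effect of suppressing the high frequencies can be sketched with the simplest possible low-pass filter. Everything here is an assumption made for illustration: a two-tone test signal stands in for the actual recording, the one-pole filter and its coefficient `alpha` are arbitrary choices, and `tone_amplitude` is a hypothetical helper that measures the amplitude at one frequency.

```python
import cmath
import math

SAMPLE_RATE = 16000         # samples per second (assumption)
N = 1600                    # 0.1 s of sound
LOW_HZ, HIGH_HZ = 200.0, 3000.0

def tone_amplitude(samples, freq_hz):
    """Amplitude of the component at freq_hz (a single DFT-like projection)."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * t / SAMPLE_RATE)
              for t, x in enumerate(samples))
    return 2 * abs(acc) / len(samples)

def one_pole_lowpass(samples, alpha):
    """y[n] = y[n-1] + alpha * (x[n] - y[n-1]); smaller alpha cuts more treble."""
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

# Test signal: equal amounts of a low and a high tone.
signal = [math.cos(2 * math.pi * LOW_HZ * t / SAMPLE_RATE)
          + math.cos(2 * math.pi * HIGH_HZ * t / SAMPLE_RATE)
          for t in range(N)]
filtered = one_pole_lowpass(signal, alpha=0.2)

ratio_before = tone_amplitude(signal, LOW_HZ) / tone_amplitude(signal, HIGH_HZ)
ratio_after = tone_amplitude(filtered, LOW_HZ) / tone_amplitude(filtered, HIGH_HZ)
print(f"low/high amplitude ratio: {ratio_before:.2f} -> {ratio_after:.2f}")
```

The low-to-high ratio grows after the filtering, which corresponds to the shift towards the "Laurel" end of the axis; a high-pass filter would push the ratio in the opposite direction.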

Similar comments apply to some consonants. Foreigners sometimes say that Czechs like syllables without vowels – like the Hebrew folks with their JHVH except that the Czechs really mean it and don't pronounce any vowels! ;-) Now, this is a deceitful simplification. We can write syllables without the normal vowels or vowel pairs such as AEIOUY, ÁÉÍÓÚÝ, AU, OU... but Czech and Slovak have some "replacement vowels" in these syllables, namely the liquid consonants.

So they are sounds considered consonants but if you think about it, they may be sounded for a prolonged time just like vowels, so they effectively may behave as vowels. They include especially the syllabic R but also the syllabic L and, in a few exceptional words, a syllabic M (and there's arguably a syllabic N in imported words such as "schlafen"). What do I mean?

The Czech word for "seven" is "sedm". It's usually pronounced as "sedum" (English: "sedoom"). However, when people try to sound kosher, they really say "sedm" and it has two syllables. The second syllable uses a syllabic M in the role of a vowel. Check it, it can be done. You can sing two tones on SE-DM. Your mouth is shut during the second tone but you still produce a sound.

The case of the syllabic L is much more widespread. In past tense verbs, you often find things like "KOPL" (he kicked) with a syllabic L. By far the most important Czech word with a syllabic L that you can encounter is Motl, of course. ;-)

However, the syllabic R is by far the most frequent one. You may construct incredibly long sentences which not only lack any AEIOUY-style vowels: the syllabic R (which is pronounced as an intensely trilled R in Czech, so R really sounds like a HaRRley Davidson's engine) is in fact their only "vowel"! The canonical minimum tongue twister of this kind is Push your finger through your throat (strč prst skrz krk). But you may construct much, much longer sentences with lots of animals and actions in them. One example in my Quora answer I just linked to says:
Škrt plch z mlh Brd pln skvrn z mrv prv hrd scvrnkl z brzd skrz trs chrp v krs vrb mls mrch srn čtvrthrst zrn.
Well, it uses a syllabic L thrice, too, the rest is a syllabic R. Now, I hope that you will appreciate the efficiency of the Czech language. The tongue twister above may be translated to English as:
A cheapskate dormouse, richly dotted by manure, who hails from the mists of Brdy (hills 25 miles east of Pilsen in the Czech Republic) at first proudly flicked a snack for those goddamn deer – consisting of a quarter of a cupped hand of corn – from brakes through a tuft of cornflowers into dwarf willows.
Some of you may need to convey exactly this information tomorrow – although the percentage of such people may not be too high – so it may be a good idea for you to learn Czech and do it much more effectively.

So L,R,M,N are usually "short sounds" which is why we use them as consonants but their spectral analysis is rather similar to the vowels above. The sound "Y" in "Yanny" may be considered a consonant – written as "J" in Czech and other languages – but the Fourier analysis is the same as the analysis of the vowel "Y" (or, more precisely in Czech, "Í").

These sounds have various profiles of amplitudes for the higher harmonics and just like the vowels A,Y in "Yanny" become AU,E in "Laurel", "Y" at the beginning may become "L" if you increase the amplitude of the low frequency components. And "NN" may become "R" – they're also consonants that are close to vowels and may be syllabic, with some extra "explosion".

"Laurel" also has an extra consonant at the end which is not present in "Yanny" at all. But there's no "sharp discontinuity" of the sound in "Yanny", either, so they're roughly compatible. But I see no simple verbal explanation why the ends of the words "Yanny" and "Laurel" may morph to each other. In the written form, the conversion sounds more natural because "L" exists both at the beginning and end of "Laurel", and so does "Y" in "Yanny", so according to the written form of the words, it looks like the same analysis may be applied at the beginning of the words and at the end, too.

I have discussed the fate of vowels like AEIOUY and potentially syllabic consonants such as LMNR. The remaining consonants contain noise that doesn't respect any fundamental frequency so they can't be clearly sung as a given tone.

Some of the consonants – FTPKSŠ – are voiceless. And they have corresponding voiced partners – VDBGZŽ – which combine the voiceless partners with a neutral vowel from the throat, basically with a Schwa. Except for H, which is a really deep Schwa-like vowel used briefly as a consonant, and CH [KH] which is a noisy version of H, I have basically exhausted the full Czech alphabet! Well, I also need to discuss C,Č,Q,X – but they're just shortened composed sounds TS,TŠ,KV,KS.

Well, and indeed, I have forgotten Ř, the terrifying Czech sound that makes Czechs spit at you and you can't learn it. Ř may be either voiced or voiceless – it's written as the same Ř. Some of the sounds above (CČ-FSŠ/VZŽ) are sibilants and they may "last"; others are "clicks" that simply happen in a split second whether you like it or not (PTK/BDG).

Even the unvoiced noisy consonants FTPKSŠ – while they don't respect any well-defined fundamental frequency (so you couldn't decode a melody from a song that only contains these consonants) because they are composed of noise of all frequencies, not just some higher harmonics – depend on the relative representation of different frequencies. So obviously if you suppress low or high frequencies, they may start to sound like different consonants. For example, Š (SH) is probably rather close to a lower-pitch S while PTK – with barriers created by liPs, Tongue, and Krk (throat) – are analogous sounds with increasing frequencies because the echo chamber gets smaller as the barrier moves towards the throat.

The spectral analysis of the sounds is fun. The sounds belong to some continuum and different languages prefer different "sweet spots" in this multi-dimensional space of the relative amplitudes of the higher harmonics. Different cultures also choose words for different "sweet spots" in continuous spaces of other types, including colors. For example, Russians love to use two words for "blue" – they're basically light blue and dark blue except that if you analyzed the expected admixture of green or red in these colors, Russians would expect something slightly different than other nations.

In other words, there is some conversion of continuous/analog quantities to digital ones going on when the real world is transformed into human languages. And this conversion means a simplification and the precise rules for this simplification depend on cultures and languages. And they are affected by continuous adjustments of the signal.