A Lumosque introduction to the spectral analysis of vowels and consonants



There are two famous people named Marilyn Einstein. One of them is the trans-sexual daughter of Marilyn Monroe and Albert Einstein (yes, Marilyn had relationships with many famous men):







Her genes are 50% from Einstein, 50% from Monroe, and the percentages don't change much. But the other Marilyn Einstein is this one:







Ze differs from the first one because ze can be close to 100% Einstein or close to 100% Monroe, depending on how we look at zir.









If your eyesight is sharp and you can see the short-distance, high-resolution details, you may see Einstein's mustache and wrinkles and conclude that it's Einstein. When your eyesight is fuzzy enough, you average over small areas, the mustache and wrinkles disappear, and what is left is the big-picture shape of the head and the haircut, which is closer to Marilyn Monroe.









A week or two ago, millions of people became obsessed with a nearly isomorphic auditory illusion:









In 2007, opera singer Jay Jones recorded the pronunciation of 200,000 words for vocabulary.com. Those included this word that was meant to be "Laurel" (the name of Stan Laurel, one half of the Laurel and Hardy comedy duo from the silent-film era). When your hearing is good and sensitive to the high-resolution details, which are stored especially at high frequencies because that's where the bitrate is higher, you can hear that what he ended up saying is actually "Yanny".









If your hearing of high frequencies is poor or if you artificially amplify the low frequencies relative to the high ones, you may hear "Laurel", as originally intended and as shown in this second audio file. There's nothing shocking about it: hearing "Laurel" is analogous to seeing Marilyn; hearing "Yanny" is analogous to seeing Einstein.



Both the visual and audible files may be filtered in ways that suppress or amplify the low frequencies relative to the high ones, or vice versa, and that's how you can make them look or sound closer to Marilyn, Yanny, Laurel, or Einstein, whatever you prefer.
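The filtering can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under my own assumptions: the function name `tilt_spectrum`, the sharp cutoff at 1000 Hz, the gain of 0.1, and the two-tone toy signal are all hypothetical choices, not the processing that anyone applied to the actual recording. For a picture, the same idea would use a two-dimensional transform such as `np.fft.fft2`.

```python
import numpy as np

def tilt_spectrum(signal, sample_rate, cutoff_hz, low_gain=1.0, high_gain=1.0):
    """Scale the Fourier components below and above cutoff_hz by separate gains."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gains = np.where(freqs < cutoff_hz, low_gain, high_gain)
    return np.fft.irfft(spectrum * gains, n=len(signal))

# Toy signal (an assumption): a 200 Hz tone plus a 3000 Hz tone of equal amplitude.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
toy = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)

# "Laurel-like" version: high frequencies suppressed by a factor of ten.
low_version = tilt_spectrum(toy, sample_rate, cutoff_hz=1000, high_gain=0.1)
# "Yanny-like" version: low frequencies suppressed by a factor of ten.
high_version = tilt_spectrum(toy, sample_rate, cutoff_hz=1000, low_gain=0.1)
```

A hard step in the gain is the crudest possible choice; a smooth transition around the cutoff would sound less artificial on real audio.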



Sources such as CBS claim that the audio "technically" says "Laurel", so the answer "Laurel" is "technically" correct. I think that this formulation tries to sound fancy but it is technically deceitful. There is nothing "technical" about the sense in which "Laurel" could be preferred. The word was intended to sound like Laurel.



But when you analyze it with the best voice recognition technology you can find (and I think that the usage of "technology" could be said to determine the "technically correct" answer), I am convinced that the answer will be "Yanny". So I think it's more accurate to say that the word is technically Yanny, not Laurel, than to say the opposite, as CBS did.



Some people may be shocked that different people hear different words. I am not shocked at all. I have struggled with Americans' incomprehensible sounds, especially vowels, for a decade, and many of them struggled with mine. It's a rule, not an exception, that some sounds sound incomprehensible.



All vowels are complicated and the determination of which vowel is "actually" heard is an analog problem (a measurement of continuous observables). People love to imagine that things are simple and digital but they're not. Discrete physicists are crackpots. For Stephen Wolfram and a few other great men, the C-word should be replaced with a more diplomatic one, but I can't think of any accurate replacement right now. ;-)



How do sounds differ from each other? Well, a sound is captured by some mostly fluctuating pressure \(p(t)\) that is a function of time. If the pressure goes like \(p(t)=p_0 \cos \omega t\), then you hear a harmonic sound of a fixed frequency. Most generic speech has changing frequencies and many of them are combined in every period of time (I can't really say "at each moment" because the sound at a "single short moment" cannot have a well-defined frequency, basically due to the "uncertainty principle"). But even when a perfectionist opera singer sings vowels that are meant to have a fixed frequency, there are other frequencies in the sound, too.
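The remark about the "uncertainty principle" can be checked numerically. The sketch below is my own illustration, not anything from the recordings: it takes a cosine burst of a given duration, computes its spectrum, and measures the full width at half maximum of the peak. The function name `peak_width_hz`, the 1000 Hz test tone, the 48 kHz sampling rate, and the zero-padding length are all assumed values.

```python
import numpy as np

def peak_width_hz(duration_s, freq_hz=1000.0, sample_rate=48000, pad_s=2.0):
    """Full width at half maximum (in Hz) of the spectral peak of a cosine burst."""
    n = int(round(duration_s * sample_rate))
    t = np.arange(n) / sample_rate
    burst = np.cos(2 * np.pi * freq_hz * t)
    # Zero-pad to pad_s seconds so the spectrum is sampled on a fine grid.
    n_fft = int(pad_s * sample_rate)
    spectrum = np.abs(np.fft.rfft(burst, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    above_half = freqs[spectrum >= 0.5 * spectrum.max()]
    return above_half.max() - above_half.min()

for duration in (0.01, 0.05, 0.1):
    width = peak_width_hz(duration)
    print(f"duration {duration:.2f} s -> peak width {width:.1f} Hz, "
          f"product {duration * width:.2f}")
```

The product of the duration and the width stays roughly constant, so a ten times shorter burst has a roughly ten times wider peak. The precise constant depends on the shape of the burst and on how the width is defined.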







Here you have the spectral analysis of the Canadian vowels. What are the graphs? They drew \(p(t)\), the pressure near the mouth, as a function of time, and performed the Fourier transform to get \(\tilde p(\omega)\), a function of frequencies. That shows which frequencies are maximally represented. The graphs above (the individual ones describe the sounds "i,u; e,o; ae,are"; check the pictures to see what I mean) depict \(\log |\tilde p(\omega)|\) where \(\omega=2\pi f\). There's the logarithm because the vertical axis is said to be in decibels.
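A minimal sketch of how such a graph may be produced from a sampled \(p(t)\) follows. It is not the code behind the pictures: the function name `spectrum_db`, the Hann window, the 8 kHz sampling rate, and the two-tone test signal are my assumptions. The decibel scale for an amplitude is \(20\log_{10}|\tilde p|\), so a tenfold drop in amplitude corresponds to 20 dB.

```python
import numpy as np

def spectrum_db(pressure, sample_rate):
    """Return (frequencies in Hz, 20*log10|p~|) for a real sampled signal."""
    window = np.hanning(len(pressure))            # taper the ends to reduce leakage
    p_tilde = np.fft.rfft(pressure * window)
    freqs = np.fft.rfftfreq(len(pressure), d=1.0 / sample_rate)
    magnitude = np.abs(p_tilde)
    # Clip tiny magnitudes so the logarithm never sees an exact zero.
    level_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12))
    return freqs, level_db

# Test signal (an assumption): a 440 Hz tone and a ten times weaker 1320 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
p = 1.0 * np.cos(2 * np.pi * 440 * t) + 0.1 * np.cos(2 * np.pi * 1320 * t)

freqs, level_db = spectrum_db(p, sample_rate)
peak_hz = freqs[np.argmax(level_db)]
print("strongest frequency:", peak_hz, "Hz")
```

Only differences along the vertical axis are meaningful here, because the absolute level depends on the recording gain and on the normalization of the transform.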



So you see that the graphs are very far from a delta-function. Instead, it looks like the graph is smooth and the width or error margin in the frequency is comparable to the frequencies themselves. How could the opera singers claim that such inaccurate vowels correspond to particular tones?



Well, if you focus on the Yanny Einstein aspect of the graphs above, you will see that the graphs are actually not smooth. They correspond to the sum of lots of delta-functions. All of them are localized at multiples of 100 hertz. So all these vowels sung by an opera singer will be represented as the rather deep sound of frequency 100 hertz – despite the fact that the graphs tell you about the amplitudes for the frequencies comparable to thousands of hertz!



What does it mean? It means that to determine which vowel is sung by an opera singer when he produces a tone of a fixed frequency \(f\), you need to analyze the relative representation of the tenth or twentieth harmonic \(10f\) or \(20f\) included in the noise! All these very high harmonics matter a great deal.
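To make this concrete, here is a small sketch that builds two tones with the same fundamental of 100 Hz but differently weighted harmonics, and then reads the harmonic amplitudes back from the Fourier transform. Everything specific is an assumption made for illustration: the helper names, the Gaussian `bump` envelopes, and the peak positions of 500 Hz and 2500 Hz are hypothetical and are not measured from the graphs above or from any real vowel.

```python
import numpy as np

def harmonic_tone(f0_hz, envelope, sample_rate=16000, duration_s=1.0, n_harmonics=40):
    """Sum of cosines at k*f0 whose amplitudes follow envelope(frequency)."""
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    ks = np.arange(1, n_harmonics + 1)
    amps = envelope(ks * f0_hz)
    return (amps[:, None] * np.cos(2 * np.pi * f0_hz * ks[:, None] * t)).sum(axis=0)

def harmonic_amplitudes(signal, f0_hz, sample_rate=16000, n_harmonics=40):
    """Read the amplitude at each multiple of f0 from the FFT of the signal."""
    spectrum = np.abs(np.fft.rfft(signal)) * 2.0 / len(signal)
    bin_hz = sample_rate / len(signal)
    bins = np.round(np.arange(1, n_harmonics + 1) * f0_hz / bin_hz).astype(int)
    return spectrum[bins]

def bump(center_hz, width_hz=300.0):
    """Hypothetical smooth spectral envelope peaked at center_hz."""
    return lambda f: np.exp(-0.5 * ((f - center_hz) / width_hz) ** 2)

# Same 100 Hz pitch, different weights of the harmonics.
dark = harmonic_tone(100.0, bump(500.0))      # low harmonics dominate
bright = harmonic_tone(100.0, bump(2500.0))   # high harmonics dominate

for name, tone in (("dark", dark), ("bright", bright)):
    amps = harmonic_amplitudes(tone, 100.0)
    print(name, "strongest harmonic number:", int(np.argmax(amps)) + 1)
```

Both signals repeat every 10 milliseconds, so they have the same pitch. Only the harmonic weights distinguish them, which is the kind of information that has to be used to tell the vowels apart.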



And indeed, if you look at the relative representation of the lower and higher harmonics, you will generally see that the higher frequencies become more represented as you go from the initial sounds to the final ones in the sequence:



U - O - A - E - Y - Í



