Thursday, August 02, 2018

Lumo spiral visualization of sound for the deaf

...and for all other true lovers of music and sounds...

In a previous text, I proposed a visualization of sounds that would allow deaf people to hear – including distinguishing vowels, accents, different people's voices, consonants, other noises, music, frequencies, octaves, melodies, different musical instruments, chords, several people talking or singing simultaneously, and everything else.

The idea is that the brain connected to the eye just gets trained to evaluate very similar information as the information that is coming from the ear. Ideally, the eye (plus a piece of the brain connected to it) should work "almost the same" as the ear.

Someone has mentioned the Winamp visualizations (or MilkDrop), with a link to a video example. Yes, that's roughly what I mean, except that this video example and most others don't allow me to "hear" anything. They look like pretty pictures that are only slightly affected by the sound being played, and there's no straightforward way to extract the precise sound from the picture.

I have had some trouble getting the spectrum from a trimmed audio file in Mathematica – some audio commands don't seem to exist in my version of Mathematica at all. So let me describe what I have in mind almost exactly so that a better programmer may write the full code.

You take the latest 0.1 seconds of your music and do the Fourier decomposition (the image should be updated 60 times a second, once per screen refresh). Because the window is 0.1 seconds long, all the Fourier components have frequencies that are integer multiples of 10 Hz. You may want to make the spectrum continuous by Fourier transforming the whole sound for \(t\lt 0\), but with the power multiplied by \(\exp(t/t_0)\) for \(t\lt 0\) where \(t_0\) is 1/10 of a second, OK?
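The windowed decomposition above can be sketched in a few lines of Python. This is only a minimal illustration, not the intended Mathematica program: the function names, the low 8000 Hz sample rate, and the synthetic 440 Hz test tone are all assumptions made for the demo, and the exponentially weighted variant is not covered.

```python
import cmath
import math

def latest_window(signal, rate, seconds=0.1):
    """Return the most recent `seconds` of the signal (a list of samples)."""
    n = int(round(rate * seconds))
    return signal[-n:]

def window_power(samples, rate, freq):
    """Power of the Fourier component at `freq` (Hz) over the window."""
    n = len(samples)
    acc = 0j
    for k, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * freq * k / rate)
    return abs(acc / n) ** 2

# Demo with assumed values: one second of a 440 Hz sine sampled at 8000 Hz.
RATE = 8000
signal = [math.sin(2 * math.pi * 440 * k / RATE) for k in range(RATE)]

window = latest_window(signal, RATE)      # 800 samples = 0.1 s
bin_spacing = RATE / len(window)          # spacing of the Fourier components in Hz
powers = {f: window_power(window, RATE, f) for f in (220, 430, 440, 450, 880)}
```

In this demo `bin_spacing` comes out as 10 Hz, and only the 440 Hz component carries appreciable power because the tone fits an integer number of cycles into the window. A real program would use a fast Fourier transform rather than this direct sum.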

Now, you need to visualize the powers \(P(f)\) as a color in the polar coordinates \((r,\phi)\) in the plane. For a given frequency \(f\), the coordinates are given by\[

\begin{aligned}
\phi &= 2\pi \cdot \frac{\log (f/10\,{\rm Hz})}{\log 2},\\
r &= 10-\frac{\phi}{2\pi} \pm 0.5
\end{aligned}

\] The term \(\pm 0.5\) means that the power of a given frequency is visualized along a whole radial line interval of unit length, centered on a point of the spiral. The spiral has \(r\) linear in \(\phi\): the radius decreases by one per revolution.
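As a quick check of the two formulas, here is a small Python sketch of the mapping from a frequency to the spiral. The function names and the `offset` parameter (standing for the \(\pm 0.5\) term) are illustrative assumptions.

```python
import math

F0 = 10.0  # Hz, the frequency that sits at phi = 0

def spiral_polar(f, offset=0.0):
    """Return (r, phi) for frequency f in Hz; offset lies in [-0.5, 0.5]."""
    phi = 2 * math.pi * math.log2(f / F0)
    r = 10.0 - phi / (2 * math.pi) + offset
    return r, phi

def spiral_xy(f, offset=0.0):
    """Cartesian position of the same point, for drawing."""
    r, phi = spiral_polar(f, offset)
    return r * math.cos(phi), r * math.sin(phi)

r_low, phi_low = spiral_polar(10.0)      # outermost point of the spiral
r_oct, phi_oct = spiral_polar(20.0)      # one octave higher: one turn further in
r_top, phi_top = spiral_polar(10240.0)   # ten octaves higher: at the origin
```

With these definitions, 10 Hz lands at \(r=10\), 20 Hz at \(r=9\) after one full turn, and \(10\times 2^{10}\) Hz at \(r=0\).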

If you add some component of the sound with a doubled frequency \(2f\), it will be visualized in the next arc of the spiral, i.e. for \(\phi\to \phi+2\pi\) and \(r\to r-1\).

The term \(10\) in the radial coordinate says that the very low frequencies around \(f\approx 10\,{\rm Hz}\) that correspond to \(\phi=0\) will be seen as points with \(r\approx 10\), very far from the origin of the coordinates. Conversely, the highest frequencies shown will be about \(10\times 2^{10}\,{\rm Hz}\), or roughly 10 kilohertz, and they will be represented by the features of the image near the origin.

The spiral will be wound around about 10 times, which corresponds to the 10 octaves (a factor of 1024 in frequencies) that this "ear" will be able to "hear". I find it natural that the high-pitch sounds are going to be drawn near the point at the origin.

The visualization along the spiral, where \(\phi\) is linear in the logarithm of the frequency, guarantees that if you take any sound and uniformly multiply its frequencies by a factor, i.e. you raise the tone or the pitch, the corresponding picture will just be rotated relative to the original one. Helpfully enough, one octave is composed of 12 half-tones, so one half-tone will correspond to the same angle as one hour on the clock!
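The rotation property can be illustrated numerically. The sketch below is only an example: the chord frequencies are approximate equal-temperament values chosen for the demo, and the function name is an assumption.

```python
import math

def rotation_deg(f_old, f_new):
    """Angle in degrees by which a component rotates when f_old becomes f_new."""
    return (360.0 * math.log2(f_new / f_old)) % 360.0

SEMITONE = 2 ** (1 / 12)   # frequency ratio of one half-tone

# Assumed demo chord: approximately C, E, G around middle C (in Hz).
chord = [261.63, 329.63, 392.00]
shifted = [f * SEMITONE for f in chord]

# Every component rotates by the same angle, so the picture turns as a whole.
rotations = [rotation_deg(a, b) for a, b in zip(chord, shifted)]
```

All three entries of `rotations` are 30 degrees up to rounding, which is one hour on a clock face. Only the angle is computed here; the radial coordinate also shifts slightly inward when the pitch rises.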

The coefficient relating \(\phi\) and \(\log f\) is chosen in such a way that one revolution corresponds to one octave. There are ten octaves visible in the spiral – about ten turns of the spiral will make it to the picture.

To make the sound even more readable, you should visualize the power \(P(f)\) as the intensity of the red color at the point \((r,\phi)\), and the powers \(P(3f/4)\) and \(P(5f/4)\) should be drawn as the intensity of green and blue at the same point, respectively. Maybe the green and blue should be reserved for the ratios \(P(3f/4)/P(f)\) and \(P(5f/4)/P(f)\), I am not sure. An addition like that could make "nice chords" (like C+E, C+G) immediately distinguishable from the "ugly chords" (C+H, where H is the Central European name for the note B).
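One possible reading of this color rule, using the plain powers rather than the ratios, is sketched below. The function names, the `scale` normalization parameter, and the toy two-tone spectrum are assumptions for illustration only.

```python
def rgb_at(power, f, scale=1.0):
    """Color at the spiral point of frequency f.

    power: a callable returning P(f); scale: the power mapped to full intensity.
    Red shows P(f), green shows P(3f/4), blue shows P(5f/4), each clipped to [0, 1].
    """
    def clip(v):
        return max(0.0, min(1.0, v))
    return (clip(power(f) / scale),
            clip(power(0.75 * f) / scale),
            clip(power(1.25 * f) / scale))

def toy_power(f):
    """Assumed toy spectrum: two pure tones at 400 Hz and 500 Hz (ratio 5/4)."""
    return 1.0 if any(abs(f - peak) < 1.0 for peak in (400.0, 500.0)) else 0.0

color_low = rgb_at(toy_power, 400.0)    # red from 400 Hz, blue from 500 Hz
color_high = rgb_at(toy_power, 500.0)   # red only: nothing at 375 Hz or 625 Hz
```

In this toy case the lower tone is drawn in magenta because its 5/4 partner is present, while the upper tone stays pure red. The clipping is the crudest way to keep intensities in range; a real program would choose `scale` adaptively.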

But the main point is that if you have a uniform sound that lasts, like some tone or a vowel, there should be a corresponding image that looks like a particular, not rotationally invariant, object – with a particular fingerprint of color – that sits on a spiral clock. When you just increase the frequency of the sound, the object should get rotated and nothing else.

If the details are adjusted correctly, the vowels U, O, A, E, I (I mean OO, AW, AH, EH, EE) should be visualized as unforgettable objects that you may train yourself to recognize, just like you recognize the vowels by ear. If the speaker's accent changes, these objects slightly change, too. You may train yourself to recognize these small deformations, too. Similarly, the consonants would correspond to some noisy pictures. The consonants probably contain mostly higher frequencies, so they would be objects sitting close to the origin. And they don't have very well-defined "main" frequencies, so their character would be more rotationally symmetric and boring.

You know, the goal should be that from the characteristic pictures that you generate for given sounds, the deaf person will be able to hear the melody, the chords, which instruments play them, and whether the singer has good pitch (whether the clock hands sit exactly at the whole hours). And when two people sing together, the visualization above should still respect some superposition principle, so the deaf person should see-hear pictures overlapping on top of each other, which might still be separated. A deaf person who is not tone-deaf will learn to appreciate complicated chords from the pictures, too.

Maybe you want to draw more information extracted from the power spectrum – in between the turns of the spiral and elsewhere. But I think that you should respect 1) the superposition principle (sounds that combine and are heard together should be translated to overlapping images), 2) the fact that the uniform increase of the frequency corresponds to a simple rotation/scaling (well, deformed scaling because I decided to make the spiral linear, not exponential, after all; you may want to reparameterize the \(r\) coordinate by some nonlinear transformation), 3) the requirement that the full power spectrum \(P(f)\) can be decoded from the picture at each moment.

If someone has understood what I am saying and can write the program, it would be nice if he could create a video file with the visualization of some speech, a song, or a symphony. Needless to say, all the powers and colors should be adjusted so that the picture is neither too dim nor exceeds the maximum possible intensity of the color, etc. It must be chromatically balanced to transmit maximum information.

I believe that the ear effectively works as some touch-sensitive organ that perceives pretty much equivalent shapes that you draw with your program sketched above, but perceives them by "touch", not by "vision" – so the ear basically works like the fingers' skin that reads the visualization you are going to code as if it were Braille (the writing system for the blind people, to make it more confusing LOL). But the relative representation of different frequencies must be perceived by the ear as some "characteristic fingerprint" that touches different places of the touch sensors in the ear – which analyze the frequencies.

You can make the whole system stereo – separate images for the left eye and the right eye that are calculated from the left speaker and the right speaker, respectively. At the very least, you may draw both of them next to each other for starters.

Note that deaf people who keep their head and the display straight would have perfect pitch. To make it really nice, concert pitch (the frequency 440 Hz, the A above middle C) should be rotated so that it points towards "12" on the clock. ;-)

Just a consistency check that you understood the basic idea: if you play C, C#, D, D#, E, F, F#, G, G#, A, A#, H, C on the piano (in equal temperament, where the frequency ratios are powers of the 12th root of two), your visualization should simply show the short hour hand jumping from 12:00 to 1:00, 2:00, ... and back to 12:00 (taking C at 12 for this example). Music should be pretty and logical! Chords would be like several short hour hands added on a clock: the chord C+E+G should look like three hands on a clock pointing to 12, 4, 7. The shape of these short hour hands would represent the musical instrument (the relative representation of higher harmonics), and so on.
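This consistency check is easy to run in Python. In the sketch below, `clock_hour` and its `f_twelve` parameter (the frequency that points at 12) are illustrative names; C is placed at 12 to match the example, but passing 440 Hz instead would put the concert pitch A at 12.

```python
import math

def clock_hour(f, f_twelve):
    """Hour (1 to 12) that frequency f points to when f_twelve points at 12."""
    semitones = round(12 * math.log2(f / f_twelve)) % 12
    return 12 if semitones == 0 else semitones

# Middle C in equal temperament, assuming the A above it is tuned to 440 Hz.
C4 = 440.0 * 2 ** (-9 / 12)

# The thirteen keys from C up to the next C, each a 12th root of two apart.
scale = [C4 * 2 ** (k / 12) for k in range(13)]

hours = [clock_hour(f, C4) for f in scale]
chord_hours = [clock_hour(scale[i], C4) for i in (0, 4, 7)]   # C, E, G
```

Here `hours` runs 12, 1, 2, ..., 11, 12 and `chord_hours` is 12, 4, 7. The rounding to the nearest hour is a simplification made for the demo; the actual picture would show exactly how far a slightly mistuned note sits from the whole hour.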

I hope to make more progress with the Mathematica deconstruction of the audio files in the coming days or weeks if no one creates it before me.
