Sine-wave speech can be thought of as a sort of auditory illusion, a sensory edge case in which one’s experience has a clear “before” and “after” moment, like going through a one-way door.
Sine-wave speech (SWS) is intentionally degraded audio. Here are the samples, and here's what to do:
- Choose a sample and listen to the SWS version. Most people will perceive an unintelligible mix of tones and beeps.
- Listen to the original version of the sentence.
- Now listen to the SWS version again.
Most people will hear only some tones and beeps when first listening to sine-wave speech. But after hearing the original version once, the SWS version suddenly becomes intelligible (albeit degraded-sounding).
These samples were originally part of research by [Chris Darwin] into speech perception, but the way one's experience of an SWS sample can flip is interesting in its own right. The idea is that upon listening to the original sample, the brain — fantastic prediction and learning engine that it is — now knows what to expect, and applies that knowledge without the listener being consciously aware. In fact, if one listens to enough different SWS samples, one begins to gain the ability to understand the SWS versions without having to be exposed to the originals. In his recent book The Experience Machine: How Our Minds Predict and Shape Reality, Andy Clark discusses how this process may be similar to how humans gain fluency in a new language, perceiving things like pauses, breaks, and word forms that are unintelligible to a novice.
This is in some ways similar to the “Green Needle / Brainstorm” phenomenon, in which a viewer hears a voice saying either “green needle” or “brainstorm” depending on which word they are primed to hear. We’ve also previously seen other auditory strangeness in which the brain perceives ever-increasing tempo in music that isn’t actually there (the Accelerando Illusion, about halfway down the list in this post).
Curious about the technical details behind sine-wave speech, and how it was generated? We sure hope so, because we can point you to details on SWS as well as to the (free) Praat software that [Chris] used to generate his samples, and the Praat script he wrote to actually create them.
Kinda reminds me of Silbo Gomero (q.v. wikipedia)
We are also a model based on training.
I got the effect on the first three samples, but after that I was able to understand the last three samples on the first listen, without having heard the undistorted version.
Same here. I guess we’re fast auditory learners? Sharp ears? I’ve always had a good “radio in my head,” and my dad’s been a professional musician my whole life, though I never had any formal music training.
Then now might be a good time to learn telegraphy. That way, the hearing and the brain’s pattern recognition can be trained, maybe. A talent for music and rhythm shouldn’t hurt, either. :)
Someone needs to do it out of order; I caught on by the kettle one.
Yeah, I found the first sample just about understandable myself. I wasn’t exactly right, but it turns out I did pick out some of it correctly (which very much surprised me; I wasn’t confident at all). And after that first one I had way more confidence on the rest.
I wonder if it is because the voice is relatively familiar to us anyway. The accent, tone, and pace of the normal version of the speech sound like somebody I could know, so we are already listening and trying to make it match our expectations.
When I recorded my voice one day on a guy’s recording machine (a music synthesizer; it was portable, by the way), my voice was terrible, but a few hours later it returned to normal. Not sure why. Maybe I was upset, as they say, not right. I guess that’s why stress is bad for a singer’s voice, even if you don’t use it for a while. Maybe I should have just done better in my education when I was younger; then the world could have seen me better.
Likewise: I half got the first one, but after hearing the original for it I could hear the remaining five fine on the first listen.
I was getting my ear in by sample 5 and got the last one perfectly.
Okay. Sounds like a bird trying to speak.
It’s as if merely the changes in pitch are being recorded, rather than full speech.
It’s as if a metallophone is being used to mimic speech.
What makes me wonder is which kind of speech is best suited here.
The stereotypically stiff or more defined British English (BE) accent, or the American English (AE) one?
Also, is a male voice better or a female one?
Say, the male voice is usually deeper and clearer, while the female one is higher and squeakier.
I suspect all of that depends very much on how your brain is already wired. If you spend your time listening to and speaking the local dialect, and this is recorded in that dialect, I expect you have a huge leg up on everyone else. So for global ease of comprehension, I suspect the BBC newsreader style of speaking is probably going to be more familiar than any American accent.
What it makes ME wonder is: how little additional information would be needed to make the speech intelligible? This demonstrates how close you can get with an extremely low bitrate. How much more data does it take to add plosives and sibilants, and would that be enough?
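For a rough sense of scale, here’s a hypothetical back-of-envelope calculation in Python. The frame rate and precision are assumed, illustrative values, not figures from [Chris]’s script:

```python
# Back-of-envelope bitrate for sine-wave speech (all values assumed):
# three formant tracks, each carrying a frequency and an amplitude,
# updated 100 times per second at 8-bit precision.
formants = 3
values_per_formant = 2   # frequency + amplitude
frame_rate = 100         # frames per second
bits_per_value = 8

bitrate = formants * values_per_formant * frame_rate * bits_per_value
print(bitrate, "bit/s")  # 4800 bit/s
```

That lands squarely in classic vocoder territory; LPC-10, for comparison, runs at 2400 bit/s.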
There’s no single British English accent.
Btw, most British accents are non-rhotic. I would half expect that to make them harder to pick out. And most of them aren’t any more defined or stiff.
Woah. Flashbacks. Reminds me of all those hours I spent messing with Dennis Klatt’s speech synthesis code in the late 80s (later made famous by Stephen Hawking). One of the first things I found on the primordial internet. Fun stuff. Apparently Praat’s author Boersma thought so too — his PhD thesis appeared a few years later.
Try the player piano doing this: not sine waves, not electronic. Freaking cool. It’s on the tube.
Try to mumble “Idunno” and you’ll get the same effect :)
Interesting… now suddenly it makes sense why some people can understand what R2D2 is saying.
The original sounds a bit like a young Bill Nighy (the British actor from such great films as The Best Exotic Marigold Hotel, Love Actually, The World’s End, and the amazingly funny Hot Fuzz).
“Most people will hear only some tones and beeps when first listening to sine-wave speech. But after hearing the original version once, the SWS version suddenly becomes intelligible (albeit degraded-sounding).”
Like the little voice in our head reading over the input, filling in the blanks.
What is “Sine Wave Speech?” Why is it named that? How is it made?
https://www.mrc-cbu.cam.ac.uk/people/matt.davis/sine-wave-speech
Sine-wave speech is a form of artificially degraded speech first developed at Haskins Laboratory.
Generating Sine-Wave Speech:
Sine-wave speech is generated by using a formant tracker to detect the formant frequencies found in an utterance, and then synthesising sine waves that track the centre of these formants.
Best regards,
A/P Daniel F. Larrosa
(Montevideo – Uruguay)
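For the curious, that pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not [Chris]’s Praat script: LPC root-finding stands in for a proper formant tracker, and `utterance.wav` is a placeholder file name.

```python
# Minimal sine-wave speech sketch (assumed approach, not the Praat
# script): estimate a few formant frequencies per frame via LPC,
# then replace the frame with pure tones at those frequencies.
import numpy as np
import librosa

def formant_freqs(frame, sr, order=12, n_formants=3):
    """Crude formant estimate: angles of the LPC poles."""
    a = librosa.lpc(frame, order=order)
    poles = [p for p in np.roots(a) if p.imag > 0]
    freqs = sorted(np.angle(poles) * sr / (2 * np.pi))
    return [f for f in freqs if f > 90][:n_formants]  # skip DC-ish poles

def sine_wave_speech(y, sr, frame_len=1024, hop=256):
    out = np.zeros(len(y))
    phase = np.zeros(3)                     # keep tones continuous
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * np.hanning(frame_len)
        amp = np.sqrt(np.mean(frame ** 2))  # crude per-frame level
        if amp < 1e-4:                      # skip near-silence
            continue
        for i, f in enumerate(formant_freqs(frame, sr)):
            t = np.arange(hop)
            out[start:start + hop] += amp * np.sin(
                phase[i] + 2 * np.pi * f * t / sr)
            phase[i] += 2 * np.pi * f * hop / sr
    return out

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file
sws = sine_wave_speech(y, sr)
```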
Thanks for the link!
This sounds like a less robotic version of tiny speech.
I think the more interesting question would be how, algorithmically, to go from this back to something sounding like the original.
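There’s no unique inverse: the sine tracks keep the vocal-tract resonances but throw away the excitation (pitch, voicing, noisiness) entirely. One plausible direction, in the spirit of the Klatt-style formant synthesis mentioned elsewhere in these comments, is to treat the tracks as formant targets for resonators driven by a guessed pulse train. A toy sketch with made-up, static formant values:

```python
# Toy formant resynthesis (assumed approach): tune second-order IIR
# resonators to the formant frequencies the sine tracks encode, and
# excite them with a guessed glottal pulse train.
import numpy as np
from scipy.signal import lfilter

sr = 16000
f0 = 120  # guessed pitch; SWS discards this information entirely
excitation = np.zeros(sr // 2)          # half a second
excitation[::sr // f0] = 1.0            # one impulse per glottal period

def resonator(x, freq, bw, sr):
    """Second-order resonator centred on one formant."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    return lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], x)

# Cascade three resonators at made-up vowel formants (roughly an /a/).
y = excitation
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:
    y = resonator(y, freq, bw, sr)
y /= np.abs(y).max()
```

The guessed f0 is exactly what SWS throws away, so the best you can hope for is a plausible-sounding reconstruction, not the original speaker.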
Another kind of degraded speech is produced by cochlear implants. Only a few frequencies, yet it doesn’t take that long to understand.
Would that imply that, as density and computing power on a chip continue to improve, along with power economy/capability, future cochlear implants could have greater resolution, resulting in truer reproduction?
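The implant comparison can actually be simulated in software. Noise-vocoded speech, in the spirit of Shannon et al.’s classic experiments, keeps only a handful of band envelopes, much as an implant’s electrode channels do, and raising the channel count is precisely the “greater resolution” being asked about. A rough sketch, assuming a mono input file and illustrative band edges:

```python
# Noise-band vocoder sketch (assumed parameters throughout): split the
# speech into a few bands, keep only each band's amplitude envelope,
# and use the envelopes to modulate band-limited noise.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt, hilbert

y, sr = sf.read("utterance.wav")        # placeholder; assumes mono
edges = np.geomspace(100, 6000, 5)      # 4 log-spaced channels
out = np.zeros_like(y)
for lo, hi in zip(edges[:-1], edges[1:]):
    sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
    band = sosfilt(sos, y)
    envelope = np.abs(hilbert(band))               # amplitude envelope
    noise = sosfilt(sos, np.random.randn(len(y)))  # band-limited noise
    out += envelope * noise
out /= np.abs(out).max()
sf.write("vocoded.wav", out, sr)
```

Intelligibility climbs steeply with channel count, which matches the comment above about implant users learning to understand speech from only a few frequency channels.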
There’s a similar phenomenon where if you use wav2midi or similar to convert audio of a song to a MIDI file, people who are familiar with the song can discern the lyrics while those who aren’t cannot. Mark Rober did a demonstration of something similar with a MIDI-controlled player piano a few years ago, where by providing subtitles the piano’s “speech” was totally intelligible but without them it was difficult/impossible to tell what it was saying.
I wonder if this is related to the perception aspects of a cochlear implant?
So that is how R2D2 is communicating…
Like most pop music. There are always parts that are ambiguous, and since the lyrics are often poetic and we don’t know the context in which they were written, it can be difficult to figure out what they are. Listening to the music again after reading the lyrics changes it.
It’s the ambiguous parts that make both music and poetry work. These are the parts that transform the consumer into a participant.
You misunderstood. I meant ambiguous in that a word will be heard differently by different people. Some of those interpreted lyrics are incorrect and don’t match the actual lyrics. Ambiguity in meaning is a separate discussion, as that can be done on purpose.
This seems really familiar. Was this same method used to generate the voice of some sort of alien or android in some old 70s or early 80s sci-fi?
What if I retrain my brain on enough distorted recordings of phonemes and phoneme combinations?
Too much information has been removed to recover the original without adding more metadata. You can interpret the sine-wave speech in multiple ways. Only by adding more information, such as lip reading, a script, context, or the original audio, can your brain fill in the gaps.
You probably could train your brain to become better at it. People with cochlear implants can only hear a limited number of frequencies, but they can learn to understand speech.
Then you will have gained a skill that will never serve you.
Like an arts degree from Harvard?
:-)
Remember the old claims of backward masking in songs, and how those who said they existed would always tell you what you were supposed to hear before playing it? This was kind of like that. I did listen to each processed version repeatedly before I listened to the original. A time or two, I got pretty close, but not often. (Wish I’d written them down.)