Sine-wave Speech Demonstrates An Auditory One-way Door

Sine-wave speech can be thought of as a sort of auditory illusion, a sensory edge case in which one’s experience has a clear “before” and “after” moment, like going through a one-way door.

Sine-wave speech (SWS) is intentionally-degraded audio. Here are the samples, and here’s what to do:

  1. Choose a sample and listen to the SWS version. Most people will perceive an unintelligible mix of tones and beeps.
  2. Listen to the original version of the sentence.
  3. Now listen to the SWS version again.

Most people will hear only some tones and beeps when first listening to sine-wave speech. But after hearing the original version once, the SWS version suddenly becomes intelligible (albeit degraded-sounding).

These samples were originally part of research by [Chris Darwin] into speech perception, but the curious way in which one’s experience of an SWS sample can change is pretty interesting. The idea is that upon listening to the original sample, the brain — fantastic prediction and learning engine that it is — now knows better what to expect, and applies that without the listener being consciously aware. In fact, if one listens to enough different SWS samples, one begins to gain the ability to understand the SWS versions without having to be exposed to the originals. In his recent book The Experience Machine: How Our Minds Predict and Shape Reality, Andy Clark discusses how this process may be similar to how humans gain fluency in a new language, perceiving things like pauses and breaks and word forms that are unintelligible to a novice.

This is in some ways similar to the “Green Needle / Brainstorm” phenomenon, in which a viewer hears a voice saying either “green needle” or “brainstorm” depending on which word they are primed to hear. We’ve also previously seen other auditory strangeness in which the brain perceives ever-increasing tempo in music that isn’t actually there (the Accelerando Illusion, about halfway down the list in this post).

Curious about the technical details behind sine-wave speech, and how it was generated? We sure hope so, because we can point you to details on SWS as well as to the (free) Praat software that [Chris] used to generate his samples, and the Praat script he wrote to actually create them.
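For a feel of how the generation works without firing up Praat, here is a rough Python sketch of the same idea: estimate per-frame formant frequencies (here with a simple LPC root-finding method, standing in for a proper formant tracker) and resynthesize sinusoids that follow those centres. This is an illustration under assumptions, not [Chris]’s actual Praat script — all function names and parameters are invented for the example, and a real implementation would want overlap-add, smoothed formant tracks, and per-formant amplitudes.

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC predictor coefficients via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations R a = r[1..p]; tiny ridge term for stability
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])

def formants(frame, fs, order=8, n_formants=3):
    """Pick candidate formant frequencies from the LPC polynomial roots."""
    a = lpc(frame, order)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 50][:n_formants]      # drop near-DC roots

def sine_wave_speech(x, fs, frame_len=512, n_formants=3):
    """Resynthesize x as a sum of sinusoids tracking the formant centres."""
    out = np.zeros(len(x))
    phases = np.zeros(n_formants)              # keep phase continuous per track
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        amp = np.sqrt(np.mean(frame ** 2))     # crude per-frame loudness
        t = np.arange(frame_len)
        for k, f in enumerate(formants(frame, fs, n_formants=n_formants)):
            out[start:start + frame_len] += amp * np.sin(
                phases[k] + 2 * np.pi * f * t / fs)
            phases[k] = (phases[k] + 2 * np.pi * f * frame_len / fs) % (2 * np.pi)
    return out
```

Fed real speech, the output is exactly the kind of whistling, birdlike signal described above: all the broadband detail (plosives, sibilants, voicing) is discarded, and only the formant trajectories survive.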

40 thoughts on “Sine-wave Speech Demonstrates An Auditory One-way Door”

    1. Same here. I guess we’re fast auditory learners? Sharp ears? I’ve always had a good “radio in my head,” and my dad’s been a professional musician my whole life, though I never had any formal music training.

      1. Then now might be a good chance to learn telegraphy. That way, the hearing and the brain’s pattern recognition can be trained, maybe. A talent for music and rhythms shouldn’t hurt, either. :)

    2. Yeah, I found the first sample just about understandable myself. I wasn’t exactly right, but it turns out I did pick out some of it correctly (which very much surprised me; I wasn’t confident at all). And after that first one I had way more confidence on the rest.

      I wonder if it is because the voice is relatively familiar to us anyway – the accent, tone and pace of the normal version of the speech seem like somebody I could know – so we are already listening and trying to make it match our expectations.

      1. When I did a voice recording one day on a guy’s recording machine (a music synthesizer – portable, by the way), my voice was terrible, but a few hours later it returned to normal. Not sure why; maybe I was upset, as they say. I guess that’s why stress is bad for a singer’s voice, even if you don’t use it for a while. Maybe I should have just done better in my education when I was younger, then the world could have seen me better.

  1. Okay. Sounds like a bird trying to speak.
    It’s as if merely the changes in pitch are being recorded, rather than full speech.
    It’s as if a metallophone is being used to mimic speech.

  2. What makes me wonder is which kind of speech is best suited here.
    The stereotypical stiff or more defined BE accent, or the AE one?
    Also, is a male voice better or a female one?
    Say, a male voice usually is deeper and more clear, while a female one is higher and squeakier.

    1. I suspect all of that depends very much on how your brain is already wired – if you spend your time listening to and speaking the local dialect, and this is recorded in that dialect, I expect you have a huge leg up on everyone else. So for something globally easy to comprehend, I suspect the BBC newsreader style of speaking is probably going to be more familiar than any American accent.

    2. What it makes ME wonder is: how little additional information would it take to make the speech intelligible? This demonstrates how close you can get with an extremely low bitrate. How much more data does it take to add plosives and sibilants, and is that enough?

    3. There’s no single British English accent.

      Btw, most British accents are non-rhotic. I would half expect that might make them harder to pick out. And most of them aren’t any more defined or stiff.

  3. Woah. Flashbacks. Reminds me of all those hours I spent messing with Dennis Klatt’s speech synthesis code in the late 80s (later made famous by Stephen Hawking). One of the first things I found on the primordial internet. Fun stuff. Apparently Praat’s author Boersma thought so too — his PhD thesis appeared a few years later.

  4. The original sounds a bit like a young Bill Nighy (the British actor from such great films as The Best Exotic Marigold Hotel, Love Actually, The World’s End, and the amazingly funny Hot Fuzz).

  5. “Most people will hear only some tones and beeps when first listening to sine-wave speech. But after hearing the original version once, the SWS version suddenly becomes intelligible (albeit degraded-sounding).”

    Like the little voice in our head reading over the input, filling in the blanks.


      Sine-wave speech is a form of artificially degraded speech first developed at Haskins Laboratories.

      Generating Sine-Wave Speech:
      Sine-wave speech is generated by using a formant tracker to detect the formant frequencies found in an utterance, and then synthesising sine waves that track the centre of these formants.

      Best regards,

      A/P Daniel F. Larrosa
      (Montevideo – Uruguay)

  6. This sounds like a less robotic version of tiny speech.
    I think the more interesting question would be how to algorithmically go from this back to something sounding like the original.

    1. Would that imply that perhaps as density and computing power on a chip continue to improve, along with power economy/capability, future cochlear implants could have greater resolution, resulting in truer reproduction?

  7. There’s a similar phenomenon where if you use wav2midi or similar to convert audio of a song to a MIDI file, people who are familiar with the song can discern the lyrics while those who aren’t cannot. Mark Rober did a demonstration of something similar with a MIDI-controlled player piano a few years ago, where by providing subtitles the piano’s “speech” was totally intelligible but without them it was difficult/impossible to tell what it was saying.

  8. Like most pop music. There are always parts that are ambiguous and since the lyrics are often poetic and we don’t know the context in which the lyrics were made up it can be difficult to figure out what the lyrics are. Listening to music again after reading the lyrics changes it.

      1. You misunderstood. I mean ambiguous in that a word will be heard differently by different people. Some of those interpreted lyrics are incorrect and don’t match the actual lyrics. Ambiguity in meaning is a separate discussion as this can be done on purpose.

    1. Too much information has been removed to recover the original without adding more metadata. You can interpret the sine-wave speech in multiple ways. Only by adding more information, such as lip reading, a script, context, or the original audio, can your brain fill in the gaps.
      You probably could train your brain to become better at it. People with cochlear implants can only hear a limited number of frequencies, but they can learn to understand speech.

  9. Remember the old claims of backward masking in songs, and how those who said they existed would always tell you what you were supposed to hear before playing it? This was kind of like that. I did listen to each processed version repeatedly before I listened to the original. A time or two, I got pretty close, but not often. (Wish I’d written them down.)
