The Audio of Things, Part 2: What's the hold up?

This is the second of a three-part series that explores the role of audio in our evolving relationship with machines.


What do flying cars and sound-based HMIs have in common?

They're both super convenient, they're both challenged by the laws of physics, and growing up, I definitely thought they'd both be everywhere by now!  OK, so gravity is tough to beat, but what gives with voice interfaces? Why are we just now starting to see the spread of Machine Hearing? 


Working in a very noisy world

If it were just you and a voice-controlled device in an anechoic chamber, the voice interface would work almost flawlessly. There’d be no music playing, no one else talking, no reverberation and no background noise. The system’s microphones would get a perfectly clean signal to send back to the servers.

But out here in the real world, a voice control system has a tough job to do. Think about the number of things that can interfere with the user’s voice. Of course, there are the voices of other people. But the TV is also playing the game, the vent fan over the stove is running, a YouTube video is playing on one kid’s phone, and a video game is being played on another. Then there’s the background hum of the HVAC system, the dishwasher, and the refrigerator. Plus echoes of all these things (including the user’s voice) off the walls, ceiling, floors, windows, and furniture.

The job is even harder in wearable and portable products, where you have little idea what kind of noise might be interfering with the voice, or what kind of acoustical environment the speaker will be in. Will the device be outside in a quiet field? Inside an office? At a loud restaurant, or a packed baseball stadium, or a busy shopping mall?

Talking to a Speaker

It’s not just the random noise of modern life that may drown out a user’s voice commands - many voice-controlled devices themselves make a lot of noise! Consider a smart speaker blasting some music or a soundbar playing an action movie. The microphones on that device are nearly touching the loudspeakers, so they are dominated by sound made by the device itself! (If you've ever listened to music with earphones, then you know what it's like to have speakers really close to your "mics".)

Cancelling out the direct sound of its own speaker as it leaks into the microphones is reasonably easy, because the device knows what that sound is and when to expect it. But cancelling the reflections of that sound is immensely complicated, because the device doesn’t know how long they’ll take to bounce back, how many times they’ll bounce, or how the acoustics of the room will change them. (The general technique is called Acoustic Echo Cancellation, or AEC.)
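To make the echo problem a little more concrete, here is a minimal sketch of one common textbook approach: a normalized least-mean-squares (NLMS) adaptive filter. It is not necessarily what any particular product uses. The device knows what it sent to the loudspeaker, so it keeps adjusting a guess at the room's echo path until that guess, subtracted from the microphone signal, leaves as little as possible behind. The function name, the tap count, the step size, and the simulated "room" are all assumptions made up for illustration.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=16, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal.

    far_end: samples sent to the loudspeaker (known to the device)
    mic:     samples captured by the microphone (echo plus anything else)
    Returns the residual signal left after the estimated echo is removed.
    """
    weights = [0.0] * taps   # current guess at the room's echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # what is left after cancelling
        power = sum(h * h for h in history) + eps  # normalization term
        step = mu * e / power
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def energy(samples):
    return sum(s * s for s in samples)


random.seed(0)
# Stand-in for music: white noise sent to the loudspeaker
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Hypothetical room: direct sound plus two delayed, quieter reflections
echo_path = [0.6, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, -0.15]
mic = [
    sum(g * far_end[n - k] for k, g in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]

residual = nlms_echo_cancel(far_end, mic)

# Compare the last 500 samples, after the filter has had time to adapt
echo_energy = energy(mic[-500:])
residual_energy = energy(residual[-500:])
print(echo_energy, residual_energy)
```

This toy room is short, fixed, and silent apart from the echo, which is why the filter does so well. A real room has reflections lasting thousands of samples, changes whenever someone moves, and contains the talker the device is actually trying to hear.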

In the car, it’s even harder. Obviously you’ve now got road and wind noise to contend with, but even worse, you have as many as 20 speaker drivers spreading sound all over the car. That’s a lot of audio signals to cancel, and every channel you add roughly doubles the difficulty. (Now you know why nearly all smart speakers are either mono or stereo.)


You can't fight the laws of physics

When you’re dealing with sound moving through air and interacting with its environment, you’re not dealing with easily manipulated 1s and 0s, but with the laws of physics. The complex fluid dynamics of sound waves moving through an environment -- bouncing off some objects, being absorbed by others, and interacting with the other sound waves -- is far more difficult to predict than the passage of electrons through a circuit or the trace of a software function.

Complicating the task even further is that it usually requires not just one microphone, but an array of multiple microphones. Getting two to seven microphones to work together requires acoustical, electrical and mechanical expertise -- choosing the appropriate mics, deciding on the optimum number and array geometry, ensuring they’re properly mounted and gasketed, and designing the overall acoustic / product chassis without mechanical coupling between the microphones and loudspeakers. A lot can go wrong!
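Why bother with several microphones at all? One of the simplest array techniques, delay-and-sum beamforming, gives a feel for the payoff. The sketch below assumes we already know how many samples later the voice reaches each microphone, and that the noise at each microphone is independent of the others. Both are idealizations. The delays, noise level, and helper names are hypothetical.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each microphone channel by its integer sample delay, then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def noise_power(signal, reference):
    """Mean squared difference between a signal and the clean reference."""
    pairs = list(zip(signal, reference))
    return sum((s - r) ** 2 for s, r in pairs) / len(pairs)


random.seed(1)
num_samples = 2000
# Stand-in for a voice: a slow sine wave
voice = [math.sin(2 * math.pi * 0.01 * n) for n in range(num_samples)]

# Hypothetical arrival delays (in samples) at each of four microphones
delays = [0, 2, 4, 6]

channels = []
for d in delays:
    delayed = [0.0] * d + voice[: num_samples - d]
    # Each microphone picks up the delayed voice plus its own independent noise
    channels.append([s + random.gauss(0.0, 0.5) for s in delayed])

output = delay_and_sum(channels, delays)

single_mic_noise = noise_power(channels[0], voice)
beamformed_noise = noise_power(output, voice)
print(single_mic_noise, beamformed_noise)
```

Once the channels are lined up, the voice adds up coherently while the independent noise partly averages out. In a real product the noise is never fully independent across microphones and the delays have to be estimated, which is part of why mic choice, spacing, and mounting matter so much.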

Brain vs Machine

Most of us have had that feeling of disappointment after trying to snap a photo of an amazing sunset or cozy fireside moment, because the photo never comes out as well as what our eyes see. The perceptive ability of our eyes and brain - the depth of focus, field of view, and dynamic range - is just so much better than that of a camera.

The same is true for our ears. We take it for granted, but consider the number of things your brain does instantly to respond to someone asking you a question from across the room:

  1. Identify that someone has uttered your name
  2. Determine the location of that person
  3. Focus perceptual attention on that person’s voice
  4. Start ignoring all the other noise sources (and acoustic echoes of those noises)
  5. Extract meaning and intent from the person’s language and intonation 


In part 3, we’ll look at some technologies that are finally enabling Machine Hearing to work just as well as the human ear … and maybe even better.




About DSP Concepts

DSP Concepts, Inc. provides embedded audio digital signal processing solutions delivered through its Audio Weaver® embedded processing platform. DSP Concepts specializes in microphone as well as playback processing and is the leading supplier to top tier brands in consumer and automotive products. Founded by Dr. Paul Beckmann in 2003, DSP Concepts is headquartered in Santa Clara, California with offices in Boston and Stuttgart.