In this third and final part of the AoT blog series, we’ll take a look at some of the recent innovations that are finally bringing voice-based human/machine interfaces (HMIs) to the majority of consumer products.
While there’s been continual improvement in cloud-based technologies like automatic speech recognition (ASR) and natural language understanding (NLU), it’s largely been “edge-based” innovations that have knocked down the final barriers to the spread of voice-based HMIs.
Smaller, Faster, Cooler, Cheaper
Moore’s Law says the number of transistors in an IC roughly doubles every two years - which means every two years or so, the same processing power can be delivered in a chip that’s half the size. This is significant because smaller chips cost less to produce (each silicon wafer yields more saleable product), and over time, less expensive products can afford to include signal processing in their bill of materials.
For reference, in the not too distant past, the 200MHz digital signal processor (DSP) in your home theater receiver might have cost $10 at high volumes. Today, IC vendors like ST Micro are able to offer that same audio-processing power in Cortex-M based MCUs for just a few dollars!
Chip size also impacts another important budget - a product’s power budget. As transistors shrink, so does the power required to flip all those 1’s and 0’s millions of times per second. This allows audio processing to be built into ever smaller products (thanks to ever smaller batteries) and opens the door for entirely new applications where commercial viability requires extreme battery life.
For context, that venerable old 200MHz DSP might have burned something like 1.4 Watts - or 7,000 µW/MHz. Today, companies like Ambiq Micro are delivering power-optimized MCUs (with audio-processing capabilities) that operate at less than 20 µW/MHz!
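The arithmetic behind that comparison can be checked in a few lines. This is only a back-of-the-envelope sketch: the 1.4 W, 200 MHz, and 20 µW/MHz figures come from the paragraph above, while the function and variable names are made up for illustration.

```python
def microwatts_per_mhz(power_watts: float, clock_mhz: float) -> float:
    """Return power efficiency in microwatts per MHz of clock rate."""
    return power_watts * 1_000_000 / clock_mhz


# Figures quoted in the text for the older stand-alone DSP.
legacy_dsp = microwatts_per_mhz(power_watts=1.4, clock_mhz=200)

# Upper bound quoted in the text for a modern power-optimized MCU.
modern_mcu_bound = 20.0  # µW/MHz

# Because 20 µW/MHz is an upper bound, this ratio is a lower bound.
improvement = legacy_dsp / modern_mcu_bound

print(f"Legacy DSP: {legacy_dsp:,.0f} µW/MHz")
print(f"Improvement: at least {improvement:.0f}x")
```

In other words, the per-MHz power cost has dropped by a factor of at least 350 under these numbers.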
Beyond price and power, there’s also been a big change in the variety of processors that can even process audio in the first place. Traditionally, if you had audio processing to do, you’d have to add a stand-alone DSP to the board as an audio coprocessor - MCUs (which might have done a single math problem in the time a DSP could do twenty) just didn’t have the horsepower.
While modern DSP cores are still the most efficient way to get audio processing done (particularly in low-power and high-channel-count applications), today’s system architects also have other options. Thanks to instruction set enhancements like Arm’s Neon for Cortex-A and Helium for Cortex-M, most processors now support the floating-point & SIMD operations needed for efficient audio processing - which means most products now have access to the compute power needed to integrate voice control.
As evidence of this shift, consider that the AWE Core™ audio-processing engine from DSP Concepts (the highly-optimized embedded software component at the heart of the Audio Weaver platform) is available on ICs from over 20 different silicon vendors - yet fewer than half of those ICs include traditional DSP cores!
The Edge of the Edge
One of the final obstacles to widespread Voice UI adoption has been poor speech-recognition performance. The inability of “the stupid thing” to hear you has now been solved by innovations “where the rubber meets the road” for voice-controlled devices: the Audio Front End (AFE). The AFE is the functional block that sits between a device’s microphones and the rest of the voice-processing ecosystem; it takes the raw, noisy audio from the microphones (typically 2-6) and attempts to create a single audio output stream containing only the user’s voice commands.
Fixed-function and hardware-based AFEs have been used for years, but they can be difficult to integrate into a custom form factor and their performance can be underwhelming. Now, with the Audio Weaver® processing platform available on nearly every IC, software-based AFEs like TalkTo™ are enabling machines that can match and sometimes exceed a human’s ability to understand speech in a noisy environment!
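To give a feel for how several noisy microphone streams can be turned into one cleaner stream, here is a minimal sketch of delay-and-sum beamforming, a classic textbook building block. It is not a description of how TalkTo or any particular AFE works. It assumes the per-microphone arrival delays are already known and that the noise at each microphone is independent; the signal, delays, and noise level are all invented for illustration.

```python
import math
import random


def delay_and_sum(mic_signals, delays):
    """Align each microphone by its integer sample delay, then average."""
    length = min(len(sig) - d for sig, d in zip(mic_signals, delays))
    return [
        sum(sig[n + d] for sig, d in zip(mic_signals, delays)) / len(mic_signals)
        for n in range(length)
    ]


def noise_power(signal, clean):
    """Mean squared difference between a signal and the clean reference."""
    return sum((s - c) ** 2 for s, c in zip(signal, clean)) / len(signal)


random.seed(1)
num_samples = 2000
voice = [math.sin(2 * math.pi * 0.01 * n) for n in range(num_samples)]
delays = [0, 2, 4, 6]  # hypothetical arrival delays (in samples) at four mics

# Each microphone hears the voice late by its delay, plus its own noise.
mics = []
for d in delays:
    delayed_voice = [0.0] * d + voice[: num_samples - d]
    noise = [random.gauss(0.0, 0.5) for _ in range(num_samples)]
    mics.append([v + z for v, z in zip(delayed_voice, noise)])

output = delay_and_sum(mics, delays)

single_mic_noise = noise_power(mics[0], voice)
beamformed_noise = noise_power(output, voice)
print(f"Noise power, one mic:    {single_mic_noise:.3f}")
print(f"Noise power, beamformed: {beamformed_noise:.3f}")
```

Once the copies of the voice are aligned they add coherently, while the independent noise partially averages out. Real rooms, with reverberation and correlated noise sources, are far less forgiving, which is why production AFEs go well beyond this.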
TalkTo’s breakthrough performance is largely due to two innovations: DSP Concepts’ Multichannel Acoustic Echo Canceller (AEC) and proprietary Adaptive Interference Canceller™ (AIC). AEC, which cancels the “known” sounds made by the device’s own speakers, has historically only been possible in mono and stereo products - because the computational load rises exponentially with channel count. Today, thanks to several algorithmic innovations in Audio Weaver’s AEC module, TalkTo has made voice control possible even in high-channel-count systems like Samsung’s Q800T soundbar, which offers 3.1.2-channel Dolby Atmos playback.
While AEC handles the “known” noises, it’s often all the “unknown” sounds in the room that keep your machine from hearing you. This is where AIC comes in. TalkTo’s AIC uses machine learning and advanced microphone-processing techniques to continually map and characterize the ambient sound-field. Based on this model, TalkTo is able to identify any voice data and reconstruct a pristine, voice-only signal for use by the rest of the voice ecosystem.
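The basic idea of cancelling a “known” sound can be sketched with a single-channel normalized LMS (NLMS) adaptive filter, a standard textbook approach. This is an illustrative assumption and not DSP Concepts’ algorithm: the echo path, filter length, and step size below are hypothetical, and the microphone hears only the echo so that the cancellation is easy to see.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    weights = [0.0] * taps   # current estimate of the speaker-to-mic echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after removing the estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: nudge the weights toward a smaller error.
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def power(signal):
    """Mean squared value of a signal."""
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
speaker = [random.uniform(-1.0, 1.0) for _ in range(4000)]  # known playback
echo_path = [0.6, 0.3, -0.2, 0.1]  # hypothetical room response

# The microphone picks up the playback filtered by the room.
mic = [
    sum(echo_path[k] * speaker[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(speaker))
]

residual = nlms_echo_cancel(speaker, mic)
print(f"Echo power at the mic:     {power(mic[-500:]):.4f}")
print(f"Echo power after the AEC:  {power(residual[-500:]):.2e}")
```

In a conventional multichannel design, a filter like this is typically needed for every speaker-microphone pair, which gives some intuition for why the cost climbs so quickly as channels are added.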
Voice-control Without a Net
One final innovation enabling the spread of voice-based HMIs also leverages the audio-processing power now available in most edge devices: the ability to detect a wake-word (e.g. “Alexa”) and user-intent (e.g. “Turn off the AC”) entirely on the device itself.
This ability is delivered by innovations in neural networks for embedded systems from companies like Sensory, whose TrulyHandsfree™ and TrulyNatural™ technologies can operate with acoustic models an order of magnitude smaller than the previous state-of-the-art. This has not only opened the door to voice interfaces on products with resource-constrained ICs, it also enables a myriad of use-cases where an internet connection is either unavailable or unwanted.
This is the final installment in a three-part blog series that explores the growing role of audio in our evolving relationship with machines. In part one, we looked at the evolution of HMIs and some of the fundamental problems that have emerged as our devices have grown ever more capable and flexible. In part two, we explored the unique challenges of sound-based interfaces that have, until now, limited the broader adoption of voice control.
About DSP Concepts
DSP Concepts, Inc. provides embedded audio digital signal processing solutions delivered through its Audio Weaver® embedded processing platform. DSP Concepts specializes in microphone as well as playback processing and is the leading supplier to top tier brands in consumer and automotive products. Founded by Dr. Paul Beckmann in 2003, DSP Concepts is headquartered in Santa Clara, California with offices in Boston and Stuttgart.