Cascade mode

Overview

Wake word (WW) detection is said to operate in cascade mode when a relatively permissive first-stage WW detector is followed by a more conservative second-stage WW detector. This page describes cascade-mode operation in more detail.

Description

In cascade mode, on-device wake word detection occurs in two stages. The first stage runs a smaller, permissive wake word model that is tuned for fewer false rejects at the cost of more false accepts, while the second stage runs a larger, more accurate model to perform a second-pass verification. Cascade (two-stage) detection is typically required when the device must operate under low-power conditions, for instance in tablets and other battery-powered devices. When wake word detection is cascaded, the same level of accuracy can be achieved with considerable power savings, because the second stage is invoked only when the smaller first-stage model makes a detection. The first-stage detector can also be combined with a low-cost Voice Activity Detector (VAD) that triggers the first-stage detector only when there is valid speech activity. This further reduces power consumption, since most of the time only the VAD is running.
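
The two-stage decision described above can be sketched as follows. This is a minimal illustration, not the API of any actual wake word engine: the scoring callables, the function name cascade_detect, and both threshold values are assumptions made for the example.

```python
from typing import Callable, Sequence

Audio = Sequence[float]


def cascade_detect(
    audio: Audio,
    first_stage_score: Callable[[Audio], float],   # hypothetical small model
    second_stage_score: Callable[[Audio], float],  # hypothetical large model
    first_threshold: float = 0.3,   # assumed value: low, so permissive
    second_threshold: float = 0.8,  # assumed value: high, so conservative
) -> bool:
    """Return True only if both stages accept the audio."""
    # Stage 1: the small model is always evaluated.
    if first_stage_score(audio) < first_threshold:
        return False  # the second stage is never run
    # Stage 2: the larger model runs only after a first-stage detection.
    return second_stage_score(audio) >= second_threshold
```

Because the function returns early on a first-stage reject, the cost of the larger model is paid only for audio that the permissive model has already flagged.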

Typical Use Case: Cascade WW Detection using VAD

VAD and a first-stage wake word engine (50K model) run on the DSP. The SoC is powered down.
Most of the time is spent in VAD, so the DSP can be clocked at an extremely low rate.
When VAD triggers, the DSP clock speed is increased and the DSP runs the 50K (more permissive) first-stage model.
If the first stage triggers, the SoC is powered up, audio is transferred to it, and WW detection is performed with a larger model.
Power consumption is greatly reduced because most of the time is spent running only VAD.
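
The sequence above can be modeled as a small state machine. The sketch below is illustrative only: the State names, the next_state function, and the three detector callables are hypothetical, and the clock changes, SoC power-up, and audio transfer are reduced to comments. Falling back to VAD-only when speech ends without a first-stage trigger is also an assumption.

```python
from enum import Enum, auto
from typing import Callable, Sequence, Tuple

Frame = Sequence[float]
Detector = Callable[[Frame], bool]


class State(Enum):
    VAD_ONLY = auto()      # DSP at a very low clock rate, SoC powered down
    FIRST_STAGE = auto()   # DSP clock raised, small model running, SoC powered down
    SECOND_STAGE = auto()  # SoC powered up, large model verifying


def next_state(
    state: State,
    frame: Frame,
    vad: Detector,
    first_stage: Detector,
    second_stage: Detector,
) -> Tuple[State, bool]:
    """Advance the cascade by one frame; return (new_state, wake_word_accepted)."""
    if state is State.VAD_ONLY:
        # Only the VAD runs; speech activity raises the DSP clock.
        return (State.FIRST_STAGE if vad(frame) else State.VAD_ONLY), False
    if state is State.FIRST_STAGE:
        if first_stage(frame):
            # First-stage trigger: power up the SoC and hand over the audio.
            return State.SECOND_STAGE, False
        # Assumption: drop back to VAD-only once speech activity stops.
        return (State.FIRST_STAGE if vad(frame) else State.VAD_ONLY), False
    # SECOND_STAGE: the large model decides, then the system returns to VAD-only.
    return State.VAD_ONLY, second_stage(frame)
```

Only the SECOND_STAGE branch can report an accepted wake word, which mirrors the role of the larger model as the final verifier.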

However, using this power-saving cascade configuration has the following trade-offs:

Increased latency due to two-pass verification. The exact amount depends on how quickly the system hardware transfers audio from the first-stage chip to the second-stage chip.
System complexity: two wake word engines and models, plus code to coordinate them across two chips.
Extra memory on the DSP for a 2-second audio ring buffer.
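
To give a feel for the memory cost of the ring buffer, the sketch below sizes a 2-second buffer and shows a minimal byte-oriented implementation. The audio format (16 kHz, 16-bit mono) is an assumption chosen for illustration and is not specified on this page; the names buffer_bytes and RingBuffer are likewise hypothetical.

```python
def buffer_bytes(seconds: float, sample_rate_hz: int, bytes_per_sample: int) -> int:
    """Memory needed to hold `seconds` of mono PCM audio."""
    return int(seconds * sample_rate_hz * bytes_per_sample)


class RingBuffer:
    """Fixed-size buffer that keeps only the most recent bytes written."""

    def __init__(self, capacity: int) -> None:
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._write = 0   # index of the next byte to overwrite
        self._filled = 0  # number of valid bytes, at most capacity

    def write(self, data: bytes) -> None:
        for b in data:
            self._buf[self._write] = b
            self._write = (self._write + 1) % self._capacity
        self._filled = min(self._capacity, self._filled + len(data))

    def snapshot(self) -> bytes:
        """Return the buffered audio, oldest byte first."""
        if self._filled < self._capacity:
            return bytes(self._buf[: self._filled])
        # Buffer is full: the oldest byte sits at the write index.
        return bytes(self._buf[self._write:] + self._buf[: self._write])


# Assumed format: 16 kHz sample rate, 2 bytes per sample, mono.
TWO_SECONDS = buffer_bytes(2.0, 16_000, 2)  # 64000 bytes under these assumptions
```

On a first-stage trigger, the contents returned by snapshot() would be what gets transferred to the second-stage chip, so the larger model can examine audio that was captured before the SoC was powered up.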

FAQ

See the Cascade Mode section of the FAQ.