Wake Word Pre-Roll Integration Guide

Background

For the definition and requirements for pre-roll for AVS devices, see the Overview.

Implementation

The basic idea is to place the pre-roll start index 500 milliseconds before the wake word start index, as illustrated in the AVS Streaming Requirements Shared Memory Ring Buffer recommendation.

ring_buffer_pre_roll_illustration
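The index arithmetic described above can be sketched as follows. This is a minimal illustration, not code from the AVS Device SDK: the names `SampleIndex` and `prerollStartIndex` are hypothetical, indices are assumed to be absolute sample counts since the ring buffer started filling, and the wake word itself is assumed to still be held in the buffer.

```cpp
#include <cstdint>

// Hypothetical type: absolute sample count since the ring buffer started filling.
using SampleIndex = std::uint64_t;

// Hypothetical helper: returns the index at which the pre-roll segment begins.
// oldestAvailable is the oldest sample index still held in the ring buffer.
// Assumes wakeWordBegin >= oldestAvailable.
SampleIndex prerollStartIndex(SampleIndex wakeWordBegin,
                              SampleIndex oldestAvailable,
                              unsigned sampleRateHz) {
    const SampleIndex preroll = sampleRateHz / 2;  // 500 ms worth of samples
    // If less than 500 ms of history precedes the wake word, start at the
    // oldest sample instead of letting the unsigned subtraction wrap around.
    if (wakeWordBegin < oldestAvailable + preroll) {
        return oldestAvailable;
    }
    return wakeWordBegin - preroll;
}
```

At 16 kHz, a wake word starting at index 20000 gives a pre-roll start of 12000. A wake word starting at index 3000 in a freshly started buffer gives a pre-roll start of 0, which is shorter than the required 500 milliseconds.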

A detailed reference is also available in the AVS Device SDK to guide general implementation and integration.

The rest of this section highlights how the AVS Device SDK prepends pre-roll to wake word audio for AVS Recognize events. An audio ring buffer is the recommended way to maintain the history of audio associated with the PryonLite wake word engine, but integrators are free to meet the pre-roll requirements by other means.

Basic AVS Device SDK Architecture

The SDK's AudioInputProcessor implements the AVS SpeechRecognizer interface. The SDK relies on an instance of a Shared Data Stream to collect and share audio data associated with (among other things) wake word detection. The SDK's Keyword Detector emits detection events identifying the wake word segment within the shared data stream. The AudioInputProcessor Capability Agent then rolls the start reference back by 500 milliseconds to identify where the pre-roll segment starts, for the purpose of composing the AVS Recognize event.

The pre-roll is applied in the executeRecognize method, which contains the following:

begin -= preroll;

where preroll is defined as

// 500ms preroll.
avsCommon::avs::AudioInputStream::Index preroll = provider.format.sampleRateHz / 2;

Note that the 500 millisecond pre-roll duration is equivalent to 8000 samples at the 16 kHz sampling rate used for wake word detection. In addition, a 500 millisecond pre-roll definition

/// Preroll duration is a fixed 500ms.
static const std::chrono::milliseconds PREROLL_DURATION = std::chrono::milliseconds(500);

is also used to adjust timestamps in related parts of the audio input processor:

startOfStreamTimestamp -= PREROLL_DURATION;
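The relationship between the duration-based and sample-based pre-roll definitions can be checked with a small conversion helper. This is an illustrative sketch only: `kPrerollDuration` and `durationToSamples` are hypothetical names and do not appear in the AVS Device SDK.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical constant mirroring the 500 ms pre-roll duration.
constexpr std::chrono::milliseconds kPrerollDuration{500};

// Hypothetical helper: converts a duration to a sample count at the given rate.
constexpr std::uint64_t durationToSamples(std::chrono::milliseconds duration,
                                          unsigned sampleRateHz) {
    return static_cast<std::uint64_t>(duration.count()) * sampleRateHz / 1000;
}

// 500 ms at 16 kHz is 8000 samples, the same value as sampleRateHz / 2.
static_assert(durationToSamples(kPrerollDuration, 16000) == 8000,
              "500 ms at 16 kHz should be 8000 samples");
static_assert(durationToSamples(kPrerollDuration, 16000) == 16000 / 2,
              "duration-based and sample-based pre-roll should agree");
```

Deriving the sample count from the duration keeps the two definitions consistent if the sample rate ever differs from 16 kHz.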

For further details on the AVS C++ Device SDK, and integration of the PryonLite wake word engine as a KeywordDetectorProvider adapter, see the AVS Device SDK section.

Collecting audio from an empty state

Q: What about the initial startup period where audio history is being collected?

A: If (a) the audio history collection buffer and the wake word engine start receiving audio at the same time index, and (b) a wake word occurs very early in that audio stream, the PryonLite wake word detector may emit a detection event whose wake word start index lies within the first 500 milliseconds of audio streamed to the engine (which is the same audio collected in the audio history buffer). In that case, less than 500 milliseconds of pre-roll has been collected. Both cases are illustrated below.

Case where the wake word occurs late enough in the audio stream for sufficient pre-roll to be available:

sufficient_pre_roll

Case where the wake word occurs so early in the audio stream that the pre-roll collected is insufficient:

insufficient_pre_roll
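An integration can detect the two cases above at detection time, for example to log how often the insufficient condition occurs. The following is a minimal sketch under stated assumptions: `PrerollCheck` and `checkPreroll` are hypothetical names, and indices are assumed to be absolute sample counts in the audio history buffer.

```cpp
#include <cstdint>

// Hypothetical result type describing how much pre-roll precedes a wake word.
struct PrerollCheck {
    std::uint64_t availableSamples;  // history held before the wake word start
    std::uint64_t shortfallSamples;  // samples missing relative to the requirement
    bool sufficient;                 // true when the requirement is met
};

// Hypothetical helper. wakeWordBegin and oldestAvailable are absolute sample
// indices in the history buffer; requiredSamples is 8000 for 500 ms at 16 kHz.
// Assumes wakeWordBegin >= oldestAvailable.
PrerollCheck checkPreroll(std::uint64_t wakeWordBegin,
                          std::uint64_t oldestAvailable,
                          std::uint64_t requiredSamples) {
    const std::uint64_t available = wakeWordBegin - oldestAvailable;
    const bool sufficient = available >= requiredSamples;
    const std::uint64_t shortfall = sufficient ? 0 : requiredSamples - available;
    return PrerollCheck{available, shortfall, sufficient};
}
```

For a wake word starting 3000 samples into a freshly started 16 kHz stream, this reports 3000 samples available and a shortfall of 5000 samples.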

Design your integration to minimize or eliminate the insufficient pre-roll condition. Avoid unnecessary tear-downs and restarts of the audio history collection mechanism, which flush the collected history, and avoid unnecessary tear-downs and restarts of the wake word engine. With a wake word detector instance supporting an always-listening use case, the periods in which insufficient pre-roll may have been collected should be confined to a one-time occurrence at the beginning of the listening period. Keep in mind that each interruption of streaming to the wake word engine can lead to a service outage that impairs the "always listening" nature of the AVS product.

Even if you absolutely cannot avoid the insufficient pre-roll condition, prepending zero samples to make up for any pre-roll history shortfall is not advised.

If a wake word engine tear-down and restart is absolutely necessary, one can still ensure that the necessary pre-roll is available by avoiding a tear-down of the associated audio history collector:

no_flush_of_audio_history_collector

The middle of the diagram above shows that if the audio history collector retains its state, sufficient pre-roll history is available even when a wake word is detected very early in the audio stream of a restarted wake word detector.
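One way to get this behavior is to give the audio history collector a lifetime that is independent of the wake word engine. The sketch below is illustrative only: `AudioHistory`, `DetectorSession`, `restartDetector`, and `availablePreroll` are hypothetical names rather than PryonLite or AVS Device SDK APIs, and the history capacity is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical audio history collector: a fixed-capacity ring buffer addressed
// by absolute sample index. It is owned independently of any wake word engine.
class AudioHistory {
public:
    explicit AudioHistory(std::size_t capacitySamples)
        : m_buffer(capacitySamples), m_written(0) {}

    void write(const std::int16_t* samples, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            m_buffer[(m_written + i) % m_buffer.size()] = samples[i];
        }
        m_written += count;
    }

    // One past the index of the most recently written sample.
    std::uint64_t newestIndex() const { return m_written; }

    // Oldest index that has not yet been overwritten.
    std::uint64_t oldestIndex() const {
        return m_written > m_buffer.size() ? m_written - m_buffer.size() : 0;
    }

    // index must lie in [oldestIndex(), newestIndex()).
    std::int16_t at(std::uint64_t index) const {
        return m_buffer[index % m_buffer.size()];
    }

private:
    std::vector<std::int16_t> m_buffer;
    std::uint64_t m_written;
};

// Hypothetical stand-in for one wake word engine session. Restarting the engine
// creates a new session; the AudioHistory is left untouched.
struct DetectorSession {
    std::uint64_t streamStartIndex;  // history index of the session's first sample
};

DetectorSession restartDetector(const AudioHistory& history) {
    return DetectorSession{history.newestIndex()};
}

// History available before a wake word detected offsetInSession samples into
// the session's audio stream.
std::uint64_t availablePreroll(const AudioHistory& history,
                               const DetectorSession& session,
                               std::uint64_t offsetInSession) {
    return session.streamStartIndex + offsetInSession - history.oldestIndex();
}
```

Because the restarted session begins at the history's current write position, a wake word detected at offset 0 still has all retained history in front of it.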

Note that, on the left of the diagram above, there is still a period just after the audio history collection buffer starts filling during which there may be insufficient history to meet pre-roll requirements. To cover that case, one can offset the streaming of audio to the wake word detector, such that the first 500 milliseconds of audio in the audio ring buffer are not transferred as part of the audio stream to the wake word detector:

offset_streaming_and_collection

We can see that in the diagram above, even if the wake word engine detects a wake word at the very beginning of the audio stream it receives, the surrounding history collection framework is guaranteed to be able to meet the AVS system pre-roll requirements because it retains audio state older than that streamed to the wake word detector. Note that while this delays the time from which wake words can be detected by 500 milliseconds, it does not introduce any latency to wake word detection itself, nor does it delay the immediate transfer of new audio samples from the source microphone to the wake word detector. Be careful not to implement any mechanisms that would unnecessarily introduce any such buffering latency in the microphone-to-wake-word-detector path, just to meet pre-roll requirements.
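The offset described above can be sketched as a gate that always feeds the history collector but withholds only the first 500 milliseconds from the wake word detector. This is a minimal sketch under stated assumptions: `PrerollGate` and its `Sink` callbacks are hypothetical and stand in for whatever history collector and wake word engine push interface an integration uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical gate placed between the microphone source and its two consumers.
class PrerollGate {
public:
    using Sink = std::function<void(const std::int16_t*, std::size_t)>;

    PrerollGate(std::uint64_t prerollSamples, Sink history, Sink detector)
        : m_remaining(prerollSamples),
          m_history(std::move(history)),
          m_detector(std::move(detector)) {}

    // Called for every block of audio delivered by the microphone.
    void onAudio(const std::int16_t* samples, std::size_t count) {
        m_history(samples, count);  // the history collector receives everything

        if (m_remaining >= count) {
            // Still inside the initial pre-roll window: withhold the whole block.
            m_remaining -= count;
            return;
        }
        // Forward the part of the block past the pre-roll window in the same
        // call, so no buffering latency is added on the detector path.
        const std::size_t skip = static_cast<std::size_t>(m_remaining);
        m_remaining = 0;
        m_detector(samples + skip, count - skip);
    }

private:
    std::uint64_t m_remaining;  // samples still to withhold from the detector
    Sink m_history;
    Sink m_detector;
};
```

Once the initial window has passed, every block reaches the detector in the same call that delivers it to the history collector. If the history is retained across a detector restart, the gate can be constructed with a pre-roll of 0 samples.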

For most products, a delay of about 500 milliseconds before "always listening" wake word detection is fully engaged should be acceptable. Also note that if subsequent tear-downs and restarts do not flush any collected audio history, this 500 millisecond delay is unnecessary, because the audio history collection buffer is not starting from a completely empty state.