SMART VOICE ACTIVITY DETECTION USING ADAPTIVE CUSTOM LANGUAGE MODELS

IP.com Disclosure Number: IPCOM000249837D
Publication Date: 2017-Apr-13
Document File: 4 page(s) / 290K

Publishing Venue

The IP.com Prior Art Database

Related People

Dario Cazzani: AUTHOR

Abstract

Current solutions for voice user interfaces do not adapt to user habits and sentences. Adaptive language models and voice activity detection algorithms are combined to provide a custom adaptive language model linked to each user of a voice service to better predict when the spoken command ends.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Copyright 2017 Cisco Systems, Inc.

DETAILED DESCRIPTION

Digital assistants, bots, and devices configured to interface via voice need to know when a command has been spoken in its entirety in order to process the command and provide the proper feedback to the user. Voice activity detection (VAD) systems determine when a user has finished speaking. However, VAD systems are oblivious to the words being spoken and instead base their determination on recognizing whether the analyzed sound is speech (as opposed to non-speech noise).
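The frame-level decision just described can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the frame length, the patience value, and the `is_speech` classifier are all assumptions.

```python
# Hypothetical sketch: end-of-command detection with a fixed patience window.
# is_speech() stands in for any frame-level VAD classifier (e.g. energy-based);
# the frame count below assumes 20 ms frames, so 40 frames is roughly 0.8 s.

VAD_PATIENCE_FRAMES = 40  # trailing non-speech frames that end the command (assumed)

def detect_end_of_command(frames, is_speech):
    """Return the index of the frame at which the command is judged complete,
    or None if the utterance never ends within `frames`."""
    silent = 0
    for i, frame in enumerate(frames):
        # Any speech resets the counter; silence accumulates.
        silent = 0 if is_speech(frame) else silent + 1
        if silent >= VAD_PATIENCE_FRAMES:
            return i  # enough consecutive non-speech: command has ended
    return None
```

For example, with 5 speech frames followed by silence, the end is reported on the 40th silent frame; a stream that never goes silent yields `None`.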

Conventional techniques take advantage of the language model used in the speech-to-text engine to detect whether the spoken utterance is a complete sentence. The amount of time required to wait for non-speech (VAD_PATIENCE) is shortened or lengthened accordingly. However, these techniques do not take into account that different users have different ways of issuing commands, and above all they do not adapt to a user's common sentences and way of speaking. The user experience could be greatly improved if the VAD_PATIENCE adapted to user habits.

It is possible to use a VAD system, or a combination of a VAD system with a language model, in order to detect when a user has terminated a statement (e.g., finished saying a command). The VAD system is responsible for detecting whether, in the current timeframe, the input is speech or non-speech. If the VAD has detected non-speech for a sufficient amount of time (VAD_PATIENCE), the end of the command is determined to have been reached. The VAD_PATIENCE may be increased or decreased based on whether the words so far uttered by the user constitute a finished sentence/command. In order to make this estimation, a language model may predict whether the most recent spoken word is intended as the last spoken word.
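One way the patience adjustment could work is to interpolate the wait time between a minimum and a maximum using the language model's estimate that the words heard so far form a complete command. The bounds and the `end_probability` interface below are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical sketch: scale the non-speech wait by the language model's
# estimate that the utterance is already complete. `end_probability` stands
# in for any model that maps a word sequence to a probability in [0, 1].

MIN_PATIENCE_MS = 200   # likely-finished command: cut the wait short (assumed)
MAX_PATIENCE_MS = 1600  # likely mid-sentence: wait longer (assumed)

def adapted_patience(words, end_probability):
    """Shorten VAD_PATIENCE when the utterance looks finished, lengthen it
    when the language model expects more words."""
    p_end = end_probability(words)  # probability the last word ends the command
    # Linear interpolation: p_end = 1 -> MIN_PATIENCE_MS, p_end = 0 -> MAX_PATIENCE_MS
    return MAX_PATIENCE_MS - p_end * (MAX_PATIENCE_MS - MIN_PATIENCE_MS)
```

A command judged complete (probability 1.0) would wait only 200 ms of silence before closing; an utterance the model expects to continue would wait the full 1600 ms.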

When building a voice-interfaced system, it is possible to use third-party speech-to-text engines. However, this solution does not enable a smart VAD that can adjust the VAD_PATIENCE, because the language model used in the third-party speech-to-text engine is not customizable/modifiable. Described herein is a solution that creates a custom and adaptive language model for each user.
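A per-user adaptive model could be as simple as bigram counts over that user's past commands, with a special end-of-command token. The sketch below is one possible shape under that assumption; the class, token, and method names are illustrative and do not come from the disclosure.

```python
# Hypothetical sketch of a per-user adaptive model: unsmoothed bigram counts
# over each user's finished commands, with "</s>" marking end-of-command.

from collections import defaultdict

END = "</s>"  # sentinel token appended after every finished command

class UserCommandModel:
    """One instance per user; grows with every command the user issues."""

    def __init__(self):
        # counts[prev_word][next_word] -> number of times next_word followed prev_word
        self.counts = defaultdict(lambda: defaultdict(int))

    def learn(self, words):
        """Update the user's model with one finished command."""
        for prev, nxt in zip(words, words[1:] + [END]):
            self.counts[prev][nxt] += 1

    def end_probability(self, words):
        """Estimated probability that the command ends after the last word heard."""
        if not words:
            return 0.0
        following = self.counts[words[-1]]
        total = sum(following.values())
        return following[END] / total if total else 0.0
```

Each time the user finishes a command, `learn` is called with the transcript, so words that typically end that user's commands drive `end_probability` (and hence VAD_PATIENCE) toward an early cutoff.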

This adaptiveness is used every time the user issues commands, and the language model may be shaped around the domain...