GENERATING TEXT AT THE CLIENT TO COPE WITH VOICE SAMPLE LOSS

IP.com Disclosure Number: IPCOM000249822D
Publication Date: 2017-Apr-11
Document File: 3 page(s) / 290K

Publishing Venue

The IP.com Prior Art Database

Related People

Pascal Thubert: AUTHOR [+3]

Abstract

Text or other forms of compressed speech (e.g., stenography) are generated at a user device using speech-to-text or lip reading technology. The text is time-stamped with the voice samples but is sent separately (e.g., through different networks) to a conference system. Conference software may present the text directly to the other users as a prompt, or use it to help regenerate missing samples.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Copyright 2017 Cisco Systems, Inc.

AUTHORS: Pascal Thubert, Gabriel Bouvigne, Patrick Wetterwald

CISCO SYSTEMS, INC.


DETAILED DESCRIPTION

Upstream links that carry audio (e.g., voice) streams to a conference system often have poor quality, thereby causing loss or delay of voice samples. This results in voices “breaking up.” The conferencing system described herein copes with voice sample loss by regenerating the missing voice samples based on hints from the client. These hints may be in the form of text or images. Text may be generated at the client device based on lip reading and/or speech-to-text technology. The text is sent to the conference system for voice re-creation (e.g., to change the voice or to restore missing samples) or is presented to the other participants to help them understand the speech.
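The pairing of time-stamped text hints with the audio stream can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the 20 ms frame interval, the class and field names, and the gap-detection logic are all illustrative assumptions.

```python
from dataclasses import dataclass, field

# Assumed packetization interval: one voice packet every 20 ms, a common
# frame duration for conferencing codecs (illustrative, not from the source).
FRAME_MS = 20

@dataclass
class TextHint:
    ts_ms: int   # timestamp shared with the voice samples
    text: str    # client-generated transcript fragment

@dataclass
class ConferenceReceiver:
    """Sketch of the server side: voice samples and text hints arrive
    separately (possibly over different networks) but carry the same
    timestamps, so gaps in the audio can be bridged with text."""
    received_ts: set = field(default_factory=set)
    hints: dict = field(default_factory=dict)

    def on_voice_sample(self, ts_ms: int) -> None:
        self.received_ts.add(ts_ms)

    def on_text_hint(self, hint: TextHint) -> None:
        # Index hints by frame number so they line up with audio frames.
        self.hints[hint.ts_ms // FRAME_MS] = hint.text

    def missing_frames(self, start_ms: int, end_ms: int) -> list:
        """Frame timestamps in [start_ms, end_ms) whose audio never arrived."""
        return [ts for ts in range(start_ms, end_ms, FRAME_MS)
                if ts not in self.received_ts]

    def prompts_for_gap(self, start_ms: int, end_ms: int) -> list:
        """Text to show the other participants (or feed to a speech
        synthesizer) for each lost frame that has a matching hint."""
        return [self.hints[ts // FRAME_MS]
                for ts in self.missing_frames(start_ms, end_ms)
                if ts // FRAME_MS in self.hints]
```

In a fuller system the prompts for a gap would drive voice re-creation; here they are simply returned, which corresponds to the fallback of presenting the text directly to the other users.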

In one embodiment, a microphone in a conference room has speech-to-text capabilities and generates encoded phonemes. The encoding may include expressiveness/emotions (e.g., surprise, anger, etc.). Optionally, user vocabulary may be recognized on the client device.
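One way such encoded phoneme hints might be carried is as a small timestamped packet alongside the audio. The JSON layout and every field name below are illustrative assumptions; the disclosure does not specify a wire format.

```python
import json

def encode_phoneme_hint(phonemes, ts_ms, emotion=None, vocab_hits=None):
    """Pack a client-side phoneme hint for transmission.

    phonemes   -- list of phoneme symbols (e.g., ARPAbet strings)
    ts_ms      -- timestamp shared with the voice samples
    emotion    -- optional expressiveness tag (e.g., "surprise", "anger")
    vocab_hits -- optional user-specific words recognized on-device
    All names here are hypothetical, chosen for illustration.
    """
    packet = {"ts_ms": ts_ms, "phonemes": phonemes}
    if emotion is not None:
        packet["emotion"] = emotion
    if vocab_hits is not None:
        packet["vocab"] = vocab_hits
    return json.dumps(packet).encode("utf-8")

def decode_phoneme_hint(raw):
    """Recover the hint dict on the conference-system side."""
    return json.loads(raw.decode("utf-8"))
```

Because the packet is tiny compared to audio, it can plausibly be sent over a separate, more reliable path, which is what lets the conference system use it when the audio path drops samples.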

In another embodiment, the camera on the client couples lip reading with voice recognition. This enables the system to generate and send text in parallel with the speech. Metadata may be inferred from the user's expression. This embodiment may handle situations involving trouble with the microphone (e.g., if the impuls...