SINGLE-MICROPHONE SPEAKER TRACKING

IP.com Disclosure Number: IPCOM000239554D
Publication Date: 2014-Nov-14
Document File: 4 page(s) / 161K

Publishing Venue

The IP.com Prior Art Database

Related People

Haohai Sun: AUTHOR

Abstract

A speaker tracking solution for video conference systems is presented herein that has a low implementation (hardware) cost, small form factor, and low complexity. It is particularly useful for small spaces.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

AUTHORS:

Haohai Sun

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

    The system has at least one microphone and one camera. The system is connected to a cloud/server, either directly or via a coder/decoder (codec) or a software client. The cloud/server has speaker/speech recognition and face recognition capabilities and can identify most people in an organization using their speech/face profiles.

Figure 1 below illustrates the equipment useful for this solution.
Figure 1

Copyright 2014 Cisco Systems, Inc.

    Figure 2 below shows a block diagram of a system that includes the equipment for multiple instances of the endpoint shown in Figure 1, a conference server (cloud/server) and a communication network.

Figure 2

    In one example, there are N people participating in a meeting with the proposed speaker tracking system.

Embodiment 1

    The camera captures images/snapshots and sends them to the cloud/server. The cloud/server detects, locates, and recognizes N1 meeting participants using face detection and face recognition, where N1 <= N. The microphone captures speech signals and sends them to the cloud/server. The cloud/server tries to recognize the speaking participant by searching a limited speech profile database containing only the N1 speech profiles of the recognized participants.

    If the speech matches the speech profile of one identified participant, the cloud/server informs the tracking system to crop to the speaking participant (using digital or mechanical pan/tilt/zoom) for a period of time.

    If the speech does not match the speech profile of any of the N1 recognized participants, the cloud/server informs the tracking system to show an overview or best overview of the meeting room/environment.
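
    The Embodiment 1 decision described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the names Participant, Framing, and choose_framing, the use of cosine similarity between voice embeddings as the speech-matching step, and the threshold and hold-time values are all assumptions introduced for the example. The disclosure does not specify how speech profiles are compared.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in the camera frame


@dataclass(frozen=True)
class Participant:
    """One of the N1 participants already recognized by face (hypothetical type)."""
    name: str
    face_box: Box
    speech_profile: Sequence[float]  # assumed: a voice embedding vector


@dataclass(frozen=True)
class Framing:
    """Instruction sent to the tracking system (hypothetical type)."""
    mode: str                  # "speaker" or "overview"
    box: Optional[Box] = None  # region to crop to when mode == "speaker"
    hold_seconds: float = 0.0  # how long to keep the crop


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed matching score; any speaker-recognition score could be used."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_framing(
    speech_embedding: Sequence[float],
    recognized: Sequence[Participant],
    threshold: float = 0.8,     # assumed value
    hold_seconds: float = 5.0,  # assumed value
) -> Framing:
    """Search only the N1 recognized participants' speech profiles."""
    best: Optional[Participant] = None
    best_score = threshold
    for participant in recognized:
        score = cosine_similarity(speech_embedding, participant.speech_profile)
        if score >= best_score:
            best, best_score = participant, score
    if best is None:
        # No speech profile matched: fall back to the room overview.
        return Framing(mode="overview")
    # Matched a recognized participant: crop to that participant's face.
    return Framing(mode="speaker", box=best.face_box, hold_seconds=hold_seconds)
```

    Restricting the search to the N1 profiles of participants already seen by the camera keeps the matching step small, which is consistent with the low-complexity goal stated in the abstract.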

Embodiment 2

    The microphone captures speech signals and sends them to the cloud/server. The cloud/server recognizes the speaking participant. The camera captures images/snapshots and sends them to the cloud/server. The cloud/server tries to find, recognize, and locate the speaking participant's face.

    If one face matches the speaker recognition r...