METHODS AND APPARATUSES FOR AUDIO/VIDEO SYNCHRONIZATION IN A HYBRID VIDEO CONFERENCING SYSTEM

IP.com Disclosure Number: IPCOM000248428D
Publication Date: 2016-Nov-28
Document File: 7 page(s) / 1M

Publishing Venue

The IP.com Prior Art Database

Related People

Hank Peng: AUTHOR [+4]

Abstract

Described herein are methods and devices to provide audio/video (A/V) synchronization in a hybrid conferencing system in which the audio stream is mixed and the video stream is switched. These may be implemented as conferencing systems/services move to the cloud.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 33% of the total text.

AUTHORS:

Hank Peng
Rui Zhang
Keith Yan
Wilford Wang

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

    In certain video conferencing architectures, an A/V synchronization problem arises when the video signal is switched on a media server while the audio signal is mixed on the media server. This problem will become more widespread as conferencing systems move to the cloud and switch-based or hybrid architectures become more mainstream.

    The Real-time Transport Protocol (RTP) and RTP Control Protocol (RTCP) standards allow the A/V media streams to be synchronized at the receiving client for a better user experience. In a point-to-point call, each RTP packet carries a timestamp in its RTP header, and the sender issues an RTCP sender report (SR) for each stream that pairs an RTP timestamp with a reference time, typically the wall clock time at the sender. With this pairing, the receiver can place the audio and video timestamps on a common timeline, which enables the A/V streams to be synchronized. Figure 1 below illustrates lip sync in a point-to-point call.

Copyright 2016 Cisco Systems, Inc.

Figure 1
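
The point-to-point mapping described above can be sketched as follows. This is a minimal illustration, not code from the disclosure: the `SenderReport` class, the function names, and the use of float seconds for the wall clock time are assumptions made for readability (a real SR carries a 64-bit NTP timestamp, and the clock rate comes from the payload format rather than from the SR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    """Hypothetical, simplified view of the pairing carried in an RTCP SR."""
    rtp_timestamp: int        # RTP timestamp sampled when the SR was generated
    wallclock_seconds: float  # sender wall clock time for that RTP timestamp
    clock_rate: int           # RTP clock rate in Hz (known from the payload format)


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport) -> float:
    """Map an RTP timestamp to sender wall clock time using the latest SR."""
    # RTP timestamps are 32-bit and wrap, so take a signed 32-bit difference.
    delta = (rtp_ts - sr.rtp_timestamp) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr.wallclock_seconds + delta / sr.clock_rate


def av_skew_ms(audio_ts: int, audio_sr: SenderReport,
               video_ts: int, video_sr: SenderReport) -> float:
    """Positive result: the video sample was captured later than the audio sample."""
    video_time = rtp_to_wallclock(video_ts, video_sr)
    audio_time = rtp_to_wallclock(audio_ts, audio_sr)
    return (video_time - audio_time) * 1000.0
```

A receiver could use such a skew value to delay whichever stream is ahead before rendering. The comparison is meaningful only because both SRs refer to the same sender clock.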

    In traditional video conferencing systems, a central media server/bridge performs audio mixing and video transcoding/composition. The media server regenerates the A/V streams with respect to a "wall" clock for the downlink to enable the receiver to synchronize the A/V streams. Figure 2 below illustrates lip sync with audio mixing and video composition.

Figure 2

    In recent years, video conferencing systems have evolved from on-premises software deployments to cloud-based deployments, and video switching-based architectures have become more popular as a way to reduce operational costs. In an example, the video stream can be forwarded directly to the receiving client, while the audio stream can be (1) forwarded if the format is allowed and the client is able to perform mixing locally, or (2) mixed on the server. In this hybrid case (i.e., where the audio stream is mixed and the video stream is switched), the timestamps of the A/V streams come from different sources (the media server for the mixed audio and the original sender for the switched video). As such, A/V synchronization is difficult in the hybrid case. Figure 3 below illustrates this challenge in the hybrid case.

Figure 3
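
The difficulty can be made concrete with a small numeric sketch. It illustrates the problem only and is not the method proposed in this disclosure. All names and numbers are hypothetical, the clock rates are assumed values, and server mixing delay is ignored for simplicity.

```python
VIDEO_RATE = 90_000  # assumed video RTP clock rate in Hz
AUDIO_RATE = 48_000  # assumed audio RTP clock rate in Hz


def to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_wallclock: float,
                 clock_rate: int) -> float:
    """Map an RTP timestamp to wall clock time via an SR pairing (no wraparound)."""
    return sr_wallclock + (rtp_ts - sr_rtp_ts) / clock_rate


def apparent_skew_seconds(sender_clock_offset: float) -> float:
    """Skew a receiver would compute for audio and video captured at the same instant.

    sender_clock_offset is how far the original sender's clock runs ahead of
    the media server's clock; the receiver has no way to know this value.
    """
    # Switched video keeps the original sender's timestamps and SR, so its
    # SR wall clock time is expressed on the sender's clock.
    video_time = to_wallclock(VIDEO_RATE, 0, 9.0 + sender_clock_offset, VIDEO_RATE)
    # Mixed audio is regenerated by the server, so its SR wall clock time is
    # expressed on the server's clock.
    audio_time = to_wallclock(AUDIO_RATE, 0, 9.0, AUDIO_RATE)
    return video_time - audio_time
```

With identical clocks the computed skew is zero, but any offset between the sender's and the server's clocks appears directly as a false skew, even though the audio and video were captured together.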

    This problem does not arise if the audio stream is switched by the server and locally mixed at the receiver side. However, there are certain situations in which the audio stream needs to be mixed on the server side (e.g., when the audio codecs differ between clients such that the receiver cannot decode every format, when the receiver's processing capability is too low to perform local audio mixing,...