Near real-time automatic speech recognition using speaker change events
Publication Date: 2010-Nov-21
The IP.com Prior Art Database
Disclosed is a method for practical Automatic Speech Recognition (ASR) in audio (and video) multi-party conference meetings that maintains low delay and recognition accuracy.
Disclosed is a method for near real-time ASR to be used in
multi-party conference calls. The need for results in
real time or near real time is strong, for example when a
person wants to catch up on what he missed from the start of
the meeting, or when there is a need to find relevant live
meetings. The disclosure addresses this need and offers a
practical solution whose results are acceptable compared to
post-processing, and in many cases even better than
post-processing of a mixed audio stream.
Automatic speech recognition (ASR) in the context of
audio (and video) conference calls has been under development
for a long time, with mixed success:
Systems that were extensively trained for a specific user
("user enrollment") give good results when the user is
identified reliably, but this is not practical for
server-side processing in systems with thousands of clients.
Systems that process speech in post-processing (multi-pass
decoder) have improved accuracy but do not fulfill all
the needs, since in many cases real-time or near real-time
results are required.
Systems that process speech in real time have unacceptable
accuracy (even if they have built-in adaptation,
because the environment is multi-speaker).
The proposed system processes a meeting audio stream in
real time, splitting it into segments that are small but
still large enough for an effective ASR process.
The split is performed using predefined thresholds for minimal
and maximal segment length together with active-speaker change
events, in order to create optimal segments. The goal is to
create relatively small segments, each containing a minimal
number of different speakers. The ASR service is usually better
able to update its model for one or a small number of speakers
than for a large number of speakers. With small threshold
values, there is a good chance that a segment contains only
one speaker.
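The splitting rule described above can be sketched as follows.
This is only an illustration under stated assumptions, not the
disclosed implementation: the function name split_segments, the
parameters min_len and max_len (in seconds), and the exact cut
policy (cut at a speaker change once the segment has reached
min_len, force a cut when it reaches max_len) are hypothetical
choices. The sketch works on a finished list of event times for
brevity, whereas a live system would apply the same rule
incrementally as events arrive.

```python
from typing import List, Tuple


def split_segments(change_times: List[float], duration: float,
                   min_len: float, max_len: float) -> List[Tuple[float, float]]:
    """Cut the interval [0, duration] into (start, end) segments.

    Assumed policy (not specified in the disclosure):
      * a speaker-change event closes the current segment only if the
        segment is already at least min_len long;
      * a segment is never allowed to grow beyond max_len.
    """
    if not 0 < min_len <= max_len:
        raise ValueError("need 0 < min_len <= max_len")

    # Ignore events outside the stream; the stream end acts as a final event.
    events = [t for t in sorted(change_times) if 0 < t < duration]
    events.append(duration)

    segments: List[Tuple[float, float]] = []
    start = 0.0
    for t in events:
        # No usable speaker change within max_len: force a cut.
        while t - start > max_len:
            segments.append((start, start + max_len))
            start += max_len
        is_end = (t == duration)
        # Cut at the speaker change if the segment is long enough;
        # the end of the stream always closes a non-empty segment.
        if t - start >= min_len or (is_end and t > start):
            segments.append((start, t))
            start = t
    return segments


if __name__ == "__main__":
    # Speaker changes at 4 s, 5 s, 12 s and 40 s in a 45 s stream.
    for seg in split_segments([4.0, 5.0, 12.0, 40.0], 45.0, 3.0, 15.0):
        print(seg)
```

In this sketch the change at 5 s is skipped because the segment
that began at 4 s would be shorter than min_len, and the long
single-speaker stretch from 12 s to 40 s is cut at 27 s by the
max_len threshold. A forced cut can leave a segment that spans
a speaker change; that is the price of bounding the delay.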
Near real-time processing gives much better results than
real-time processing.