A System and a Method for Unsupervised Sentence Boundary Detection Using Syntactic Parsers

IP.com Disclosure Number: IPCOM000203888D
Publication Date: 2011-Feb-08
Document File: 4 page(s) / 101K

Publishing Venue

The IP.com Prior Art Database

Abstract

Speech-enabled applications use manual or automatic text transcripts of speech that typically contain no punctuation or sentence boundaries. As a result, the abundance of such data in call center settings is subjected to linguistic and text processing tasks like POS tagging, parsing, named entity recognition, and information extraction with limited accuracy and validity. In this paper we explore the sentence boundary detection task specifically for such datasets. While the typical approach to this task is to learn sequential tagging models like CRFs or HMMs and to employ prosodic features, we explore unsupervised techniques that need no human supervision. We use the widespread availability of linguistic parsers for English as a proxy for human supervision. We take a black-box approach to parsing text transcripts and present a method that uses the parser's cost function to dynamically detect sentences in continuous text fragments.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 23% of the total text.


Introduction: Spoken speech in enterprise customer care and call center environments is a rich source of data feeding into speech-enabled applications like call routing, quality monitoring, and agent assistance. These applications consume the speech data after manual (expensive) or automatic (noisy) transcription. However, both manual transcripts and automatic speech recognition (ASR) output are very noisy and lack punctuation and sentence boundaries. Sentence boundary detection was first identified as a problem by [1], which also proposed an initial solution. Further research on this and similar problems [2] usually uses supervised sequential modeling techniques like HMMs and CRFs. However, we note three factors that motivate our work: 1) such supervision does not transfer well across domains, so it is expensive to train models for every single task at hand; 2) the expertise required to deploy and tune such models is simply not easily available; 3) the problem is highly subjective, especially in call center settings where a valid goal is to infer troubleshooting steps or instructions from the call center agent's speech.

Providing supervision by tagging sentence boundaries in continuous text is not an easy task. Consider the following text snippets, respectively a synthetic sentence, a newswire fragment, and an ASR utterance, where (.) marks a position at which a sentence boundary could plausibly be placed:

Synthetic: i went to school(.) yesterday(.) it was fun.

News: the group said it may buy more shares and plans to study Robeson's operations(.) afterwards(.) it may recommend that management make changes in its operations.

ASR: yeah uh here i just got it again(.) so maybe you can help me get it again. I'm I'm preparing for a customer call at umm four o'clock(.) so I'm in kinda dire straits.
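
The ambiguity in these snippets can be made concrete with a small sketch. The code below is not the disclosure's algorithm, which this excerpt does not reproduce; it is one generic way to choose boundaries when a parser is treated as a black box that returns a cost for any token span. The names `segment`, `toy_cost`, `SUBJECTS`, and `max_len` are hypothetical, and `toy_cost` is a hand-written stand-in for a real parser score, used only so the example runs without a parser installed.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a parser cost: NOT a real parser score.
SUBJECTS = {"i", "it"}


def toy_cost(segment_tokens: Sequence[str]) -> float:
    """Toy cost: cheap when a span starts with a subject word and contains no other."""
    cost = 1.0  # fixed cost per sentence, discourages over-splitting
    if segment_tokens[0] not in SUBJECTS:
        cost += 2.0  # penalise spans that do not open with a subject word
    # penalise each extra subject word buried inside the span
    cost += 2.0 * sum(tok in SUBJECTS for tok in segment_tokens[1:])
    return cost


def segment(
    tokens: Sequence[str],
    cost: Callable[[Sequence[str]], float],
    max_len: int = 20,
) -> List[List[str]]:
    """Split tokens into contiguous spans that minimise the summed cost."""
    n = len(tokens)
    best = [0.0] + [float("inf")] * n  # best[i]: minimum cost of tokens[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last span
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + cost(tokens[start:end])
            if candidate < best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end to recover the chosen spans.
    sentences: List[List[str]] = []
    end = n
    while end > 0:
        start = back[end]
        sentences.append(list(tokens[start:end]))
        end = start
    sentences.reverse()
    return sentences


tokens = "i went to school yesterday it was fun".split()
for sentence in segment(tokens, toy_cost):
    print(" ".join(sentence))
# prints:
# i went to school yesterday
# it was fun
```

With this toy cost, "yesterday" attaches to the first sentence only because the stand-in rewards spans that open with a subject word. A real parser cost could prefer the other reading, which is exactly the subjectivity the snippets illustrate.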

In all...