
System and Method for Application-Oriented Hybrid Word Segmentation with Supervised and Unsupervised Data

IP.com Disclosure Number: IPCOM000240823D
Publication Date: 2015-Mar-05
Document File: 2 page(s) / 41K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system and method to improve word segmentation for natural language applications. The novel contribution is a hybrid multi-pass word segmentation framework that takes advantage of both human-segmented data and monolingual or bilingual data without human annotations.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.


System and Method for Application-Oriented Hybrid Word Segmentation with Supervised and Unsupervised Data

Word segmentation is a necessary step in many natural language applications for languages that do not delimit words in written text, such as Chinese and Japanese. The quality of word segmentation can significantly affect the performance of downstream systems such as machine translation, speech recognition, and information extraction.

Existing approaches to the word segmentation problem can be divided into supervised and unsupervised approaches, and both have limitations. Supervised approaches usually train a statistical segmentation model [1] on human-segmented training data. Because such data is costly and therefore limited, supervised approaches have difficulty correctly segmenting unseen words. Unsupervised approaches [2], on the other hand, do not rely on human-segmented data, but they perform worse than supervised approaches because they lack guidance from gold references. Another limitation is that fine-grained segmentation decisions are sometimes application dependent, and most word segmentation methods today do not explicitly take the specific application into account.

Therefore, a better solution is needed to improve the word segmentation quality and make the approaches more application aware.

The novel contribution is a hybrid multi-pass word segmentation framework that takes advantage of both human-segmented data and monolingual or bilingual data without human annotations. A statistical model is first trained on the limited human-segmented data. The model is then used to segment a large amount of text as the first pass. A word list can be generated from the segmented data, with or without using additional text from another language (e.g., the English translation of the original Chinese text). The word list can also be interactively r...
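
The first pass described above (train on human-segmented data, segment a large unannotated text, collect a word list) can be sketched roughly as follows. This is only an illustration under stated assumptions: the disclosure does not specify the statistical model, so a simple lexicon with forward maximum matching stands in for it here, and the function names, the `max_len` and `min_count` parameters, and the toy sentences are all hypothetical. The bilingual and interactive steps are not covered.

```python
from collections import Counter


def train_lexicon(segmented_sentences):
    """Stand-in for the trained model (assumption): count the words
    that appear in the human-segmented training sentences."""
    return Counter(word for sent in segmented_sentences for word in sent)


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def build_word_list(unlabeled_texts, lexicon, min_count=2):
    """First pass: segment the unannotated text, then keep every word
    that occurs at least min_count times (hypothetical threshold)."""
    counts = Counter()
    for text in unlabeled_texts:
        counts.update(segment(text, lexicon))
    return sorted(w for w, c in counts.items() if c >= min_count)


# Toy data, invented for illustration only.
human_segmented = [["我", "喜欢", "北京"], ["北京", "大学"]]
unlabeled = ["我喜欢北京大学", "北京大学很大"]

lexicon = train_lexicon(human_segmented)
print(segment(unlabeled[0], lexicon))       # ['我', '喜欢', '北京', '大学']
print(build_word_list(unlabeled, lexicon))  # ['北京', '大学']
```

Note that the stand-in segmenter splits the unseen string "很大" into single characters, which illustrates the unseen-word difficulty of purely supervised approaches mentioned earlier.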