System and Method for Application-Oriented Hybrid Word Segmentation with Supervised and Unsupervised Data
Publication Date: 2015-Mar-05
The IP.com Prior Art Database
Disclosed is a system and method to improve word segmentation processes for natural language applications. The novel contribution is a hybrid multi-pass word segmentation framework that takes advantage of both human-segmented data and the monolingual or bilingual data without human annotations.
Word segmentation is a necessary step in many natural language applications for languages whose written text does not delimit words, such as Chinese and Japanese. The quality of word segmentation can significantly affect the performance of downstream systems such as machine translation, speech recognition, and information extraction.
Existing approaches to the word segmentation problem can be divided into supervised and unsupervised approaches, and both have limitations. Supervised approaches usually train a statistical segmentation model on human-segmented training data. Because such data is costly to produce and therefore limited, supervised approaches have difficulty correctly segmenting unseen words. Unsupervised approaches, on the other hand, do not rely on human-segmented data, but perform worse than supervised approaches because they lack guidance from gold references. Another limitation is that subtle word segmentation decisions are sometimes application dependent, and most word segmentation methods today are not designed with a specific application in mind.
Therefore, a better solution is needed to improve the word segmentation quality and make the approaches more application aware.
The novel contribution is a hybrid multi-pass word segmentation framework that takes advantage of both human-segmented data and the monolingual or bilingual data without human annotations. A statistical model is trained on limited human-segmented data first. Then, such a model can be used to segment a large amount of text as the first pass. A word list can be generated from the segmented data, with or without using the additional text from another language (e.g., the English translation of the original Chinese text). The word list can also be interactively r...
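The first two steps described above (train on human-segmented data, then segment unannotated text and collect a word list) can be illustrated with a minimal Python sketch. The disclosure does not specify the statistical model, so a smoothed unigram model with a Viterbi search is used here purely as a stand-in. The function names, the `max_len` and `min_count` parameters, and the toy data are all hypothetical. The bilingual option and any later passes are not shown.

```python
import math
from collections import Counter


def train_unigram(segmented_sentences):
    """Count word frequencies in human-segmented sentences (lists of words)."""
    return Counter(w for sent in segmented_sentences for w in sent)


def segment(text, counts, max_len=4):
    """First pass: best-scoring segmentation under the unigram stand-in model.

    Assumption: unseen single characters receive a small floor score so that
    every string can be segmented; unseen multi-character strings are rejected.
    """
    total = sum(counts.values())

    def log_score(word):
        c = counts.get(word, 0)
        if c == 0 and len(word) > 1:
            return None  # unseen multi-character candidate is not allowed
        return math.log((c + 1) / (total + 1))

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            s = log_score(text[start:end])
            if s is None:
                continue
            if best[start] + s > best[end]:
                best[end] = best[start] + s
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


def build_word_list(raw_texts, counts, min_count=1):
    """Collect words seen at least min_count times in the first-pass output."""
    freq = Counter(w for t in raw_texts for w in segment(t, counts))
    return {w for w, c in freq.items() if c >= min_count}


# Toy human-segmented data (hypothetical).
gold = [["我", "喜欢", "北京"], ["北京", "很", "大"]]
model = train_unigram(gold)
first_pass = segment("我喜欢北京", model)
word_list = build_word_list(["北京很大", "我喜欢北京"], model, min_count=2)
```

In this sketch the frequency threshold `min_count` is one simple way to keep only words that recur in the first-pass output; a real system would likely use a stronger model and richer filtering criteria.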