Robust Chinese Forum Search by PinYin based Query Expansion and Bi-index scheme

IP.com Disclosure Number: IPCOM000029102D
Original Publication Date: 2004-Jun-16
Included in the Prior Art Database: 2004-Jun-16
Document File: 2 page(s) / 148K

Publishing Venue

IBM

Abstract

Because Chinese forum document collections contain many homophone input errors, full-text search over those collections is troublesome. To solve this problem, this invention proposes a special query expansion technique based on PinYin and a bi-index scheme to support fuzzy search.

This is the abbreviated version, containing approximately 52% of the total text.

   BBS (Bulletin Board System) is an open online forum where people communicate with each other, and it is very popular in China, especially in colleges. Most Chinese universities have their own BBS, and some run several systems. For example, Tsinghua University has a college-level BBS, SMTH (bbs.smth.org), with more than 200,000 registered user IDs (as of August 9, 2003) and sometimes more than 10,000 users online. Every day SMTH generates thousands of new posts, and some of them are saved to the "essential area" for permanent storage because they are especially useful to other users. Searching this document store is therefore a very important feature for users.

   The PinYin input method is the most important Chinese input method. According to statistics ( http://www.pcworld.com.cn/99/9948/4831.asp ), about 97% of people use PinYin as their Chinese input method. However, Chinese has a great many homophones, and a BBS is a free area where people post documents without any quality control, so homophone typos are very common on BBSs. In addition, people from southern China may confuse certain PinYin sounds (such as "平舌音", the flat-tongue sounds, and "翘舌音", the retroflex sounds), which leads to more errors. Some errors are caused by carelessness; some are even made for fun, for example "大侠" (expert) and "大虾" (big shrimp)...

In summary, this invention claims a robust search method for Chinese forums that contains the following key points:

1. A bi-index scheme
2. A query expansion method based on PinYin

   As discussed above, Chinese BBS document collections contain many homophone errors, which cause serious trouble for full-text search. To solve this problem, this disclosure proposes a special query expansion technique based on PinYin and a bi-index scheme to support this kind of search.
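
   The abbreviated text does not spell out the data structures, so the following Python sketch is only one plausible reading of the idea: keep one inverted index keyed by the characters as written and a second one keyed by their toneless PinYin, then expand the query to its PinYin so that homophone typos still match. The tiny PINYIN table, the function names, and the use of character bigrams as index items are all assumptions made for illustration, not details taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical, hand-written toneless PinYin table (assumption); a real
# system would need a full character-to-PinYin dictionary.
PINYIN = {"大": "da", "侠": "xia", "虾": "xia", "你": "ni", "好": "hao"}

def to_pinyin(text):
    """Map each character to its toneless PinYin (unknown characters map to themselves)."""
    return tuple(PINYIN.get(ch, ch) for ch in text)

def grams(text, n=2):
    """Overlapping character n-grams used as index items (assumption)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_bi_index(docs, n=2):
    """Build two inverted indexes: one keyed by written n-grams,
    one keyed by the PinYin of those n-grams."""
    char_index = defaultdict(set)
    pinyin_index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in grams(text, n):
            char_index[gram].add(doc_id)
            pinyin_index[to_pinyin(gram)].add(doc_id)
    return char_index, pinyin_index

def lookup(index, keys):
    """Documents that contain every key (empty set if there are no keys)."""
    postings = [index.get(k, set()) for k in keys]
    return set.intersection(*postings) if postings else set()

def search(query, char_index, pinyin_index, n=2):
    """Return (exact, fuzzy): exact character matches, plus extra
    documents reached only through the PinYin expansion."""
    query_grams = grams(query, n)  # assumes len(query) >= n
    exact = lookup(char_index, query_grams)
    expanded = lookup(pinyin_index, [to_pinyin(g) for g in query_grams])
    return exact, expanded - exact

docs = {1: "你好大侠", 2: "你好大虾"}
char_index, pinyin_index = build_bi_index(docs)
exact, fuzzy = search("大侠", char_index, pinyin_index)
print(exact, fuzzy)  # {1} {2}
```

   Returning the exact and PinYin-only hits separately is a design choice of this sketch: it lets a caller rank documents that match the written characters above those that match only by sound.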

   Before going further, we introduce some preliminary definitions that are used in the subsequent discussion.

   Inverted Index: An index, built over a set of texts, of the words in those texts. The index is accessed by some search method. Each index entry gives the word and a list of the texts in which the word occurs, possibly with the locations within each text. Please refer to the attachment part.

Index Item: An index is a map from index items to articles. An index item is often represented by a keyword or phrase.

PinYin: Each Chinese character has its corresponding PinYin, determined by its pronunciation.
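
   To make the inverted index definition concrete, here is a minimal Python sketch that maps each index item to the articles containing it, together with the positions where it occurs. The whitespace tokenization, the article IDs, and the function name are assumptions chosen to keep the example short; Chinese text would first need a word segmentation step.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index item (here: a whitespace-separated word)
    to {article_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

docs = {"a1": "forum search on forum posts", "a2": "search engine"}
index = build_inverted_index(docs)
print(index["forum"])   # {'a1': [0, 3]}
print(index["search"])  # {'a1': [1], 'a2': [0]}
```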

   A general-purpose full-text search engine follows the steps below (S1) to build the full-text index and to perform searches:
a) Word segmentation
   i. With a dictionary: keyword-based segmentation, where the dictionary is represented by a list of keywords.
   ii. Without a dictionary: N-gram model based segmentation.
b) Inverted index building, to build index I
   i. See the attachment part.
c) Full-text search interface provided.
   i. For the given query q , extrac...
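
   The two segmentation options in step a) can be sketched in Python as follows. The text only says that dictionary segmentation is keyword based; forward maximum matching is used here as a common stand-in and is an assumption, as are the function names and the sample keyword list.

```python
def ngram_segment(text, n=2):
    """Segmentation without a dictionary: overlapping character n-grams."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def dict_segment(text, keywords):
    """Segmentation with a dictionary, by forward maximum matching:
    at each position take the longest matching keyword, falling back
    to a single character when nothing matches."""
    max_len = max((len(k) for k in keywords), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in keywords:
                tokens.append(piece)
                i += size
                break
    return tokens

keywords = {"清华", "大学", "清华大学"}  # hypothetical dictionary
print(ngram_segment("清华大学"))             # ['清华', '华大', '大学']
print(dict_segment("清华大学好", keywords))   # ['清华大学', '好']
```

   The resulting tokens would serve as the index items from which the inverted index I of step b) is built.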