A new indexing method based on Named-Entity-Recognition and Alias Generation for Chinese Search
Original Publication Date: 2006-Sep-30
Included in the Prior Art Database: 2006-Sep-30
An indexing method based on named entity recognition (NER) and entity alias generation. Aliases for the entities found in the original text are generated offline and built into the inverted index in advance. Entity aliases are constructed iteratively using a set of representation, abbreviation, substitution, or induction rules.
1. Background: What is the problem solved by your invention? Describe known solutions to this problem (if any). What are the drawbacks of such known solutions, or why is an additional solution required? Cite any relevant technical documents or references.
A search engine can be viewed as a matching system that connects a submitted query to indexed content. Search results are evaluated by two main measures, precision and recall. To improve recall, a search engine application needs to handle many kinds of queries, including abbreviations and even typos. For example, when the submitted query is "IBM", the user also wants results of the form "International Business Machines Corp." or "INTL BUSINESS MACH". In a map search application, when the submitted query is "建外外外", "建建建外外外" is also a correct answer. For typos, the submitted query could be "章章章" while the actual results could be "章章章", "彰章章", or "彰章章". To solve this problem, current systems apply a fuzzy search approach; for example, edit distance is used to measure the similarity between terms.
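To make the query-time cost of the fuzzy approach concrete, the following is a minimal Python sketch of edit-distance matching. It uses the standard Levenshtein distance; the function names `edit_distance` and `fuzzy_search`, the flat term list, and the `max_distance` threshold are illustrative assumptions and are not taken from any particular system described here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query: str, terms: list[str], max_distance: int) -> list[str]:
    """Hypothetical query-time scan: compare the query against every indexed term."""
    return [t for t in terms if edit_distance(query, t) <= max_distance]
```

Each query in this sketch costs one distance computation per indexed term, and each computation is proportional to the product of the two string lengths, which is the kind of per-query work that grows with the size of the vocabulary.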
Current approaches have two major issues. The first is slow response time at the query stage, caused by the high computational cost of fuzzy matching. The second is that fuzzy matching does not fit easily into the classic retrieval framework based on an inverted index. This disclosure proposes a new indexing method to solve these problems. The basic idea is to pre-build aliases at the indexing stage, based on named entity recognition and alias generation. For example, if the content contains the entity "建建建外外外", then "建外外外" and "建外" are generated as aliases and are also built into the index. If a user then submits the query "建外", the results about "建建建外外外" are returned as well. No fuzzy matching is needed at the query stage, which removes that computational cost, and a general search engine framework can be used without any modification.
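The idea of posting aliases at indexing time can be sketched in Python as follows. This is only an illustration under stated assumptions: the entity lists stand in for the output of a real NER step, `generate_aliases` applies a single toy abbreviation rule (initial letters of each word) instead of the disclosure's full rule set, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def generate_aliases(entity: str) -> set[str]:
    """Toy abbreviation rule (an assumption): join the initial letter of each word."""
    words = entity.split()
    if len(words) < 2:
        return set()
    return {"".join(w[0].upper() for w in words)}


def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Build an inverted index from entity (or alias) to document ids.

    `docs` maps a document id to the entities already recognized in it;
    NER itself is out of scope for this sketch.
    """
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, entities in docs.items():
        for entity in entities:
            index[entity].add(doc_id)
            for alias in generate_aliases(entity):
                index[alias].add(doc_id)  # alias is posted offline, at indexing time
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Exact lookup only: no fuzzy matching is performed at query time."""
    return index.get(query, set())
```

With `docs = {1: ["International Business Machines"], 2: ["IBM"]}`, a lookup for "IBM" in this sketch reaches both documents through an ordinary exact-match query, because the alias was added to the posting lists when the index was built.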
2. Summary of Invention: Briefly describe the core idea of your invention (saving the details for questions #3 below). Describe the advantage(s) of using your invention instead of the known solutions described above.
First, some preliminary definitions are introduced that are useful for the subsequent discussion.
Inverted Index: An index into a set of texts of the words in the texts. The index is accessed by some search method. Each index entry gives the word and a list of texts, possibly with locations within the text, where the word occurs.
Entity : an object or an event about which information is sto...