Double tokenization based on linguistic analysis and N-gram for better global search experience
Original Publication Date: 2004-Aug-30
Included in the Prior Art Database: 2004-Aug-30
Employing both tokenization techniques, stemming backed by linguistic analysis and N-gramming, better prepares a search engine for queries in any of the world's languages.
Disclosed is a technique that better prepares a search engine for
finding online content when the query may be written in a
different language from the one used in the source content.
Current text search technology is based on either linguistic
analysis or simple "N-gram" tokenization. The linguistic approach
parses the text, segments it into words, then either lemmatizes
or stems them before storing the "base form" of the words in the
index.
Optionally the original forms are also stored in order to support
"exact match." Upon querying, the query string is parsed and
transformed to the "base forms" and looked up in the index for
matches. For example, let us say a source document contains a
sentence: "I saw boys running." During the indexing phase, the
sentence becomes a series of base-form words like "I", "see",
"boy", and "run." This enables the engine to find this document
for a query like "Did they run yesterday?", whose base forms also
include "run." On the other hand, N-gramming is a simpler text
segmentation method that requires neither a dictionary nor
linguistic analysis. It therefore runs very fast but produces
many more tokens, because it simply extracts every possible
sequence of 2 ("bi-gram") or 3 ("tri-gram") consecutive
characters as indexable tokens. N-gramming works
well for Asian languages that in many cases do not mark word
boundaries explicitly, unlike languages written in the Latin
alphabet, which use spaces, commas, and so on. Extracting words
out of such Asian text "correctly" is still a difficult task for
computers because it requires advanced morphological analysis and
dictionary lookups, which take a lot of computing power, and the
results still contain errors that are apparent even to untrained
eyes. Let us say "ABCDEF" is a valid sentence in
an "N-gram language" where each character is ideographic.
Tri-gramming this text yields the following tokens: "ABC",
"BCD", "CDE", and "DEF." A query needs to be tri-grammed, too,
before looking for matches. For instance, the query "BCDX" would
produce "BCD" and "CDX" as searchable tokens; "BCD" matches an
indexed token, so the system returns the document as a hit. Both
tokenization techniques work very well when the contents and the
query are in the same language. However, bringing contents and
queries in other languages into the picture poses new challenges.
Accurate language detection and proper tokenization are crucial,
but getting them right every time is difficult.
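The tri-gram example above can be reproduced with a minimal
Python sketch. The function names char_ngrams and shared_ngrams
are illustrative assumptions, not part of the disclosure, and a
real engine would store the tokens in an inverted index rather
than a per-document set.

```python
def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngrams(document, query, n=3):
    """Return the query n-grams that also occur in the document."""
    indexed = set(char_ngrams(document, n))
    return [gram for gram in char_ngrams(query, n) if gram in indexed]


print(char_ngrams("ABCDEF"))            # ['ABC', 'BCD', 'CDE', 'DEF']
print(char_ngrams("BCDX"))              # ['BCD', 'CDX']
print(shared_ngrams("ABCDEF", "BCDX"))  # ['BCD']
```

Text shorter than n characters yields no tokens in this sketch,
which is one reason an engine might index bi-grams as well.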
The technique proposed here is to always apply both linguistic
analysis and N-gramming to the source contents and to queries. By
tokenizing the contents in two ways, the system captures "words"
in more ways for later lookups, and therefore has a greater
chance to find them. However, this also means such a search
engine needs to sift through and rank more matches to produce the
"top hits" that end users care most about.
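As a rough illustration of the double tokenization idea, the
following Python sketch indexes each document under both
base-form words and character tri-grams, and tokenizes queries
the same way. Everything in it is an assumption made for the
example: the LEMMAS table is a toy stand-in for real linguistic
analysis, text is lowercased before tokenizing, and ranking by
the number of shared tokens is only one simple possibility that
the disclosure does not prescribe.

```python
import re
from collections import defaultdict

# Toy lemma table standing in for real linguistic analysis (assumption).
LEMMAS = {"saw": "see", "boys": "boy", "running": "run", "did": "do"}


def base_forms(text):
    """Split text into lowercase words and map each to its base form."""
    return [LEMMAS.get(w, w) for w in re.findall(r"\w+", text.lower())]


def char_ngrams(text, n=3):
    """Return every run of n consecutive characters of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def double_tokenize(text, n=3):
    """Tokenize text both ways; tag tokens so the two kinds never collide."""
    tokens = {("word", w) for w in base_forms(text)}
    tokens |= {("gram", g) for g in char_ngrams(text, n)}
    return tokens


def build_index(docs):
    """Build an inverted index mapping each token to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in double_tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return doc ids ordered by how many query tokens they share."""
    hits = defaultdict(int)
    for token in double_tokenize(query):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))


index = build_index({"en": "I saw boys running.", "xx": "ABCDEF"})
print(search(index, "Did they run yesterday?"))  # ['en']
print(search(index, "BCDX"))                     # ['xx']
```

In this sketch the English query reaches its document through
the shared base form "run" (plus two incidental tri-grams), while
the unsegmented query "BCDX" reaches its document through the
tri-gram "bcd" alone, with no language detection on the query.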
Language identification plays an important role both in
tokenizing the source text correctly and in ranking results.
However, because a query is usually very short, often one or two