
Online Phrase Representation System
Disclosure Number: IPCOM000243004D
Publication Date: 2015-Sep-08
Document File: 2 page(s) / 34K

Publishing Venue

The Prior Art Database


This disclosure proposes a new method for representing phrases as vectors that also supports online generation of vectors for newly arriving phrases. Representing phrases as vectors is an important task in natural language processing: once phrases are embedded in a vector space, phrases with higher semantic similarity lie closer together, so a good representation is useful for measuring the semantic similarity of text phrases. Existing methods generate the vector representations of phrases offline, using neural-network-based or matrix-factorization-based methods. However, because the number of phrases is huge and new phrases keep appearing, the coverage of offline methods is poor. To address this drawback, we propose an online method for generating the vector representation of phrases. The key idea is to generate the vector representations of a set of core words offline, and then use these core-word representations to generate the vector representation of a new phrase online.


Online Phrase Representation System

Our method is divided into two steps: a training process and an implementation process. The training process is conducted offline. During training, we collect a large training corpus. Wikipedia is a good data source for the training corpus because of its broad coverage of topics. We then count the occurrences of each word in the corpus and select the top-k most frequent words as the core words. In our implementation, we set k to 15,000 to obtain a good balance between offline and online processing. Finally, we generate the context matrix for these core words and apply matrix factorization to obtain a vector for each core word.
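
The first two offline steps, core-word selection and context-matrix construction, can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the symmetric co-occurrence window, the alphabetical tie-breaking, and the toy corpus are all assumptions made for the example. The factorization step is not shown.

```python
from collections import Counter


def select_core_words(sentences, k):
    """Return the k most frequent words (ties broken alphabetically; an assumption)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def build_context_matrix(sentences, core_words, window=2):
    """Count how often each core word co-occurs with each other core word.

    Entry [i][j] is the number of times core word j appears within
    `window` tokens of core word i. The window size is a hypothetical
    parameter; the disclosure does not specify how context is defined.
    """
    index = {word: i for i, word in enumerate(core_words)}
    size = len(core_words)
    matrix = [[0] * size for _ in range(size)]
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            if word not in index:
                continue
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for other in range(lo, hi):
                if other != pos and sentence[other] in index:
                    matrix[index[word]][index[sentence[other]]] += 1
    return matrix


# Toy corpus standing in for the large training corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
core_words = select_core_words(corpus, k=3)
context_matrix = build_context_matrix(corpus, core_words, window=2)
```

In a full pipeline, a low-rank factorization of `context_matrix` (truncated SVD is one common choice, assumed here rather than stated in the disclosure) would then yield one vector per core word.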

For the implementation process, we build a larger corpus from Web data sources and index it with an inverted index. Given an incoming phrase, we look up the context words of the phrase in the index. Finally, the vector of the phrase is generated from the distribution of these context words and their vectors, which are taken from the pre-calculated vectors of the core words.
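
One plausible reading of the online step is sketched below: the phrase vector is the average of the core-word vectors of its context words, weighted by how often each context word appears near the phrase. The weighting scheme, the window size, the function names, and the toy data are assumptions for illustration; the disclosure does not give the exact formula.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(documents):
        for token in tokens:
            index[token].add(doc_id)
    return index


def context_distribution(documents, index, phrase, core_vectors, window=2):
    """Relative frequency of core words found within `window` tokens of the phrase."""
    postings = [index.get(word, set()) for word in phrase]
    # Only documents containing every word of the phrase can contain the phrase.
    candidates = set.intersection(*postings) if postings else set()
    counts = Counter()
    n = len(phrase)
    for doc_id in sorted(candidates):
        tokens = documents[doc_id]
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != phrase:
                continue
            end = start + n
            neighbours = tokens[max(0, start - window):start] + tokens[end:end + window]
            counts.update(w for w in neighbours if w in core_vectors)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def phrase_vector(distribution, core_vectors):
    """Weighted average of core-word vectors; None if no context was found."""
    if not distribution:
        return None
    dim = len(next(iter(core_vectors.values())))
    vector = [0.0] * dim
    for word, weight in distribution.items():
        for i, value in enumerate(core_vectors[word]):
            vector[i] += weight * value
    return vector


# Toy Web corpus and hypothetical 2-dimensional core-word vectors.
documents = [
    "we drink hot tea daily".split(),
    "they drink iced tea often".split(),
    "hot tea is served".split(),
]
core_vectors = {"drink": [1.0, 0.0], "is": [0.0, 1.0], "daily": [0.0, 1.0]}

index = build_inverted_index(documents)
distribution = context_distribution(documents, index, ["hot", "tea"], core_vectors)
vector = phrase_vector(distribution, core_vectors)
```

Because only the index lookup and a weighted sum happen at query time, a phrase never seen during training can still receive a vector, as long as it appears in the indexed corpus near at least one core word.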

