Application of context question tests as a golden rule on supervising the latent semantic analysis of Chinese language

IP.com Disclosure Number: IPCOM000237629D
Publication Date: 2014-Jun-27
Document File: 5 page(s) / 245K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method that applies a set of context questions as feedback for optimizing the parameters of latent semantic analysis of the Chinese language, in the field of natural language processing.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 43% of the total text.

Within Natural Language Processing (NLP), latent semantic analysis [1-3] is a technique for analyzing the relationships between a set of documents and the terms they contain. It converts these relationships into a matrix of word counts per paragraph, in which rows represent unique words and columns represent paragraphs. The raw matrix is then put through singular value decomposition, which reduces its dimension to improve the signal-to-noise ratio. The number of dimensions retained in the singular value decomposition is therefore the key parameter for optimizing the information retrieval [4] results of latent semantic analysis. For English, the optimal dimension is about 300, as independently validated by psycholinguistic experimental methods. For the Chinese language [5], however, no such validation has been obtained.
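As a minimal sketch of the construction described above, the following Python code builds a word-by-paragraph count matrix and reduces it with a truncated SVD. The function names (`count_matrix`, `truncated_svd`), the toy English tokens, and the choice of k = 2 are illustrative assumptions and not part of the disclosure; real Chinese text would first need word segmentation.

```python
import numpy as np

def count_matrix(paragraphs):
    """Build a word-by-paragraph count matrix from pre-tokenized paragraphs.

    Rows are unique words (sorted), columns are paragraphs.
    """
    vocab = sorted({w for p in paragraphs for w in p})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(paragraphs)))
    for j, p in enumerate(paragraphs):
        for w in p:
            A[index[w], j] += 1
    return vocab, A

def truncated_svd(A, k):
    """Keep only the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical toy corpus: three already-tokenized paragraphs.
paragraphs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
vocab, A = count_matrix(paragraphs)

# k is the dimension that the disclosure proposes to tune.
U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A
```

The retained dimension k trades detail against noise: a larger k keeps more of the raw counts, while a smaller k forces words that occur in similar paragraphs closer together.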

The novel contribution is a method to apply the Context Question Tests as a golden rule to optimize the dimensionality in the Latent Semantic Analysis of the Chinese language.

The method employs a combination of Latent Semantic Analysis of the Chinese language and/or Singular Value Decomposition. It also applies the Context Question Tests as a golden rule for supervising the Latent Semantic Analysis. The design includes software that implements the methods for optimizing the Latent Semantic Analysis of the Chinese language. The system is applied to the analysis of, or information retrieval from, Chinese-language text.

Figure 1 illustrates the Latent Semantic Analysis supervised by the golden rule of Context Question Tests. The original Chinese text documents or corpus (101) are preprocessed to identify each sentence and to segment each sentence into words (102). In 103, a word-by-sentence matrix A is constructed such that each element A(i,j) corresponds to the frequency with which word i appears in sentence j, with various local and global weightings applied [6]. In 104, singular value decomposition (SVD) is applied to A with a pre-chosen dimension. Using the output of the SVD process in 104, Context Question Tests are performed (explained in detail in Fig. 2), and performance is evaluated by the number of questions answered correctly (105). If the result is good, the SVD and latent semantic analysis are considered optimized (106) and verified as useful for other information retrieval applications. Otherwise, 104 is repeated with a different SVD dimension, followed by 105, until the result is acceptable.
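The text does not say which local and global weightings are applied in step 103. One scheme commonly used with latent semantic analysis is log-entropy weighting; the sketch below assumes that scheme purely for illustration, and `log_entropy_weight` is a hypothetical helper name.

```python
import numpy as np

def log_entropy_weight(A):
    """Apply log-entropy weighting to a word-by-sentence count matrix.

    Local weight:  log(1 + count)
    Global weight: 1 + sum_j p_ij * log(p_ij) / log(n),
                   where p_ij = count_ij / (total count of word i)
                   and n is the number of sentences (columns).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[1]
    if n < 2:
        raise ValueError("log-entropy weighting needs at least two columns")
    totals = A.sum(axis=1, keepdims=True)
    # Words that never occur keep p = 0 everywhere.
    p = np.divide(A, totals, out=np.zeros_like(A), where=totals > 0)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    global_weight = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n)
    return np.log1p(A) * global_weight

# Word 0 occurs in only one sentence; word 1 is spread evenly over both.
W = log_entropy_weight([[3, 0], [1, 1]])
```

Under this assumed scheme, a word spread evenly over every sentence receives a global weight of zero, while a word concentrated in one sentence keeps its full local weight.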

Figure 1: Schematic of process of Latent Semantic Analysis supervised by the golden rule of Context Question Tests
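The loop through steps 104 to 106 can be sketched in Python as follows. This is a sketch under stated assumptions, not the disclosed implementation: the question format (context words, candidate words, correct word), the scoring rule (choose the candidate whose vector has the highest cosine similarity to the sum of the context word vectors), the acceptance threshold, and all function names are hypothetical.

```python
import numpy as np

def word_vectors(U, s, k):
    """Word vectors in the k-dimensional latent space (rows of U_k scaled by s_k)."""
    return U[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def answer(vectors, index, context, candidates):
    """Assumed scoring rule: pick the candidate closest to the summed context words."""
    ctx = np.zeros(vectors.shape[1])
    for w in context:
        if w in index:
            ctx += vectors[index[w]]
    return max(candidates, key=lambda c: cosine(vectors[index[c]], ctx))

def accuracy(vectors, index, questions):
    """Fraction of questions answered correctly (step 105)."""
    hits = sum(answer(vectors, index, ctx, cands) == gold
               for ctx, cands, gold in questions)
    return hits / len(questions)

def search_dimension(A, vocab, questions, dims, threshold):
    """Try each SVD dimension in `dims` until accuracy reaches `threshold`."""
    index = {w: i for i, w in enumerate(vocab)}
    # Step 104: one full SVD, truncated to k columns for each trial dimension.
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    best_k, best_acc = None, -1.0
    for k in dims:
        acc = accuracy(word_vectors(U, s, k), index, questions)
        if acc > best_acc:
            best_k, best_acc = k, acc
        if acc >= threshold:  # step 106: result judged acceptable
            break
    return best_k, best_acc

# Hypothetical toy data: words "a" and "b" share sentences, as do "c" and "d".
vocab = ["a", "b", "c", "d"]
A = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
questions = [(["a"], ["b", "d"], "b"), (["c"], ["a", "d"], "d")]
best_k, best_acc = search_dimension(A, vocab, questions, dims=[1, 2], threshold=1.0)
```

Computing the full SVD once and slicing it for each trial dimension avoids repeating the decomposition, because the truncated factors for any k are the leading k columns of the full ones.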

Figure 2 shows a schematic example of the Context Question Test. Items 201-205 are five different words, and 206-210 are five different sentences, each with one word missing. Words 201-205 correspond one-to-one to t...