Partial Matching for Natural Language Processing using TF/IDF, Token Matching, and Concept Matching

IP.com Disclosure Number: IPCOM000241540D
Publication Date: 2015-May-08
Document File: 4 page(s) / 51K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for finding stronger matches between a search text and the elements of a dataset while reducing the level of noise in the response. The method combines the existing concept-based matching in Smart Meta Data (SMD) with preprocessed Term Frequency / Inverse Document Frequency (TF/IDF) results and a measure of how strongly the question tokens match the column names in the dataset.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

When relating natural language to data, a sentence rarely matches the available data exactly in spelling and phrasing.

A current solution relies on concept matching to trigger a match: both the dataset and the search text must have elements that converge on the same concept or hierarchy of concepts. For example, the concept Date can be related to its child concept Year. This approach, however, limits the types of questions and matches allowed, because words must be listed and explicitly related to concepts. A more flexible approach that includes other matching criteria is needed.
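
As a minimal sketch of concept matching through a hierarchy, assuming a hand-built word-to-concept table and a child-to-parent table (the names WORD_CONCEPT, PARENT, and concept_match are hypothetical and are not part of SMD):

```python
# Hypothetical tables: every word must be listed explicitly to be matchable.
WORD_CONCEPT = {"date": "Date", "year": "Year"}
PARENT = {"Year": "Date"}  # child concept -> parent concept


def lineage(concept):
    """Return the concept followed by all of its ancestors."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def concept_match(word_a, word_b):
    """True if both words map to concepts on the same parent/child line."""
    a = WORD_CONCEPT.get(word_a.lower())
    b = WORD_CONCEPT.get(word_b.lower())
    if a is None or b is None:
        return False  # unlisted words can never match
    return a in lineage(b) or b in lineage(a)


print(concept_match("Date", "Year"))      # True: Year is a child of Date
print(concept_match("Employee", "Year"))  # False: "employee" is not listed
```

The second call shows the inflexibility described above: a word that has not been explicitly related to a concept cannot contribute to a match.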

The novel solution finds stronger matches of a search text to elements of a dataset while reducing the level of noise in the response. The method scores matches on a one-to-one (1-1) basis, evaluating each token from the question against each column in the dataset. The algorithm is three-fold: it combines the existing concept-based matching in Smart Meta Data (SMD) with preprocessed Term Frequency / Inverse Document Frequency (TF/IDF) results and a measure of the strength of the match between the question tokens and the column names in the dataset.
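
The one-to-one evaluation of tokens against columns can be pictured as a grid of (token, column) pairs. The sketch below is only an illustration of that structure: the scorer is a placeholder exact-token test, and summing the pair scores per column is an assumption, since the abbreviated text does not show how the three measures are combined. All function names are hypothetical.

```python
def exact_token_score(token, column):
    """Placeholder measure: 1.0 if the token is a word of the column name."""
    return 1.0 if token.lower() in column.lower().split() else 0.0


def score_pairs(question, columns, scorer=exact_token_score):
    """Score every (question token, column) pair independently."""
    tokens = question.lower().split()
    return {(t, c): scorer(t, c) for t in tokens for c in columns}


def column_totals(pair_scores):
    """Assumed aggregation: add up the pair scores for each column."""
    totals = {}
    for (_, column), score in pair_scores.items():
        totals[column] = totals.get(column, 0.0) + score
    return totals


pairs = score_pairs("Count of employee exempt",
                    ["Is Employee Exempt", "Employee Code"])
totals = column_totals(pairs)
print(totals)  # {'Is Employee Exempt': 2.0, 'Employee Code': 1.0}
```

Any other per-pair measure (a concept match or a uniqueness weight, for instance) could be passed in as the scorer without changing the pairwise loop.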

SMD provides the ability to assign a category (concept) to a word. For example, anything related to a date has the concept Date assigned to it. Words with similar categories are therefore potential matches. On its own, however, this does not provide enough information to determine the strength of the match with much certainty.

Example: A particular dataset contains the following 12 columns, each of which includes the keyword "employee".
{"Is Employee Exempt", "Is Employee Parttime", "Employee Common Stock Target", "Employee Common Stock Mid Point", "Employee Common Stock Minimum", "Employee Common Stock Maximum",
"Employee Restricted Stock Target", "Employee Restricted Stock Mid Point", "Employee Restricted Stock Minimum", "Employee Restricted Stock Maximum", "Employee Code", "Employee Type"}

The question "Count of employee exempt" returns all 12 columns because every one of them is related to the keyword "employee". However, the query is actually asking only for "Is Employee Exempt".
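
The noise in this example can be reproduced with a naive matcher that returns any column sharing at least one token with the question. This is an illustrative stand-in for keyword-level matching, not the SMD implementation; the function name keyword_matches is hypothetical.

```python
COLUMNS = [
    "Is Employee Exempt", "Is Employee Parttime",
    "Employee Common Stock Target", "Employee Common Stock Mid Point",
    "Employee Common Stock Minimum", "Employee Common Stock Maximum",
    "Employee Restricted Stock Target", "Employee Restricted Stock Mid Point",
    "Employee Restricted Stock Minimum", "Employee Restricted Stock Maximum",
    "Employee Code", "Employee Type",
]


def keyword_matches(question, columns):
    """Return every column that shares at least one token with the question."""
    question_tokens = set(question.lower().split())
    return [c for c in columns if question_tokens & set(c.lower().split())]


matches = keyword_matches("Count of employee exempt", COLUMNS)
print(len(matches))  # 12: the shared token "employee" matches every column
```

Because the result carries no notion of strength, "Is Employee Exempt" is indistinguishable from the other 11 columns.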

The core novelty is to use different measures to address the level (i.e., the strength) of the relationship between the question and the column. Using all of the column headings of the data as documents for the TF/IDF formula, the approach preprocesses the tokens of every column to determine how frequently each word occurs. Some datasets might have many columns with the token "Employee", which makes its importance for a match fairly low.
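
A minimal sketch of this preprocessing step, treating each column heading as a document and using the textbook inverse document frequency, log(N / df). The exact formula used by the disclosed method is not shown in the abbreviated text, so the formula and the function name are assumptions.

```python
import math
from collections import Counter

COLUMNS = [
    "Is Employee Exempt", "Is Employee Parttime",
    "Employee Common Stock Target", "Employee Common Stock Mid Point",
    "Employee Common Stock Minimum", "Employee Common Stock Maximum",
    "Employee Restricted Stock Target", "Employee Restricted Stock Mid Point",
    "Employee Restricted Stock Minimum", "Employee Restricted Stock Maximum",
    "Employee Code", "Employee Type",
]


def inverse_document_frequency(columns):
    """Compute log(N / df) for each token, with column headings as documents."""
    n = len(columns)
    doc_freq = Counter()
    for column in columns:
        # Count each token once per heading (document frequency).
        doc_freq.update(set(column.lower().split()))
    return {token: math.log(n / df) for token, df in doc_freq.items()}


idf = inverse_document_frequency(COLUMNS)
print(idf["employee"])  # 0.0: appears in all 12 headings, so it carries no weight
print(idf["exempt"] > idf["stock"] > idf["employee"])  # True
```

Under this weighting, "exempt" (one heading) outweighs "stock" (eight headings), and "employee" (all twelve) contributes nothing, which is the behavior the paragraph above describes.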

Because TF/IDF determines frequency, the novel method takes the inverse of the TF/IDF results to measure the uniqueness of a word. Uniqueness is defined as a ranking of how infrequently a word appears i...