Method and System for Optimizing Weight Hyper-Parameters of Keyword-Feature Groups based on Labeled Data from Source

IP.com Disclosure Number: IPCOM000255289D
Publication Date: 2018-Sep-14
Document File: 3 page(s) / 130K

Publishing Venue

The IP.com Prior Art Database

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

To solve a term-matching problem, a machine learning model is trained on terms from a customer-defined business glossary. The weight value of each keyword feature derived from the business terms is calculated using term frequency–inverse document frequency (tf-idf), which reflects how important a word is to a document in a collection or corpus. However, weight values calculated by tf-idf alone do not necessarily reflect how relevant some keywords are for a particular customer. For example, a full-form keyword may carry a high weight value even though the customer’s dataset contains only an abbreviation or acronym of that keyword. Therefore, there exists a need for tuning the weight values of keywords based on the source/customer’s dataset.
Disclosed is a method and system for optimizing weight hyper-parameters of keyword-feature groups based on labeled data from the source. In accordance with the method and system, a tf-idf technique is applied to calculate a base weight value of each keyword feature for each business term instance. The method and system then considers the keyword type and applies a corresponding regularizing weight ratio on top of the calculated base weight. In an aspect of the method and system, the weight ratio values are not hardcoded, but are continuously tuned by learning from the customer’s labeled data (i.e., term assignments). The method and system is then updated with the values that tune the machine learning model to achieve the highest accuracy score on a testing dataset (a subset of the customer’s labeled data).
In an embodiment, the keywords extracted from different sources are managed in different bags, and the weight values of keyword features are calculated separately for each bag with tf-idf. Subsequently, one weight matrix is generated for each type of keyword.
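The flow described above (per-bag tf-idf base weights, a weight ratio per keyword type, and ratios chosen by accuracy on labeled term assignments) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the function names, the smoothed idf variant, the additive scoring rule, the exhaustive grid search used as the tuning procedure, and the toy glossary and labels. For brevity the sketch scores the ratios on the same small labeled set, whereas the disclosure evaluates on a testing subset of the customer's labeled data.

```python
import math
from collections import Counter
from itertools import product


def tfidf_weights(bag):
    """Base tf-idf weights for one bag: {term_id: [keyword, ...]}.

    Uses a smoothed idf, ln((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n_docs = len(bag)
    doc_freq = Counter(kw for kws in bag.values() for kw in set(kws))
    weights = {}
    for term_id, kws in bag.items():
        counts = Counter(kws)
        weights[term_id] = {
            kw: (c / len(kws)) * (math.log((1 + n_docs) / (1 + doc_freq[kw])) + 1.0)
            for kw, c in counts.items()
        }
    return weights


def score(query, term_id, base, ratios):
    """Sum of matched base weights, each scaled by its keyword type's ratio."""
    return sum(
        ratios[ktype] * sum(base[ktype].get(term_id, {}).get(tok, 0.0) for tok in query)
        for ktype in base
    )


def accuracy(labeled, term_ids, base, ratios):
    """Fraction of labeled queries whose top-scoring term is the assigned one."""
    hits = 0
    for query, label in labeled:
        predicted = max(term_ids, key=lambda t: score(query, t, base, ratios))
        hits += predicted == label
    return hits / len(labeled)


def tune_ratios(labeled, term_ids, base, grid):
    """Grid search (assumed tuning strategy) for the most accurate ratios."""
    ktypes = list(base)
    best_ratios, best_acc = None, -1.0
    for combo in product(grid, repeat=len(ktypes)):
        ratios = dict(zip(ktypes, combo))
        acc = accuracy(labeled, term_ids, base, ratios)
        if acc > best_acc:  # keep the first combination that reaches the best score
            best_ratios, best_acc = ratios, acc
    return best_ratios, best_acc


# Hypothetical glossary: one bag per keyword type, weighted separately.
bags = {
    "term_name": {"T1": ["customer", "identifier"], "T2": ["account", "number"]},
    "abbreviation": {"T1": ["cust", "id"], "T2": ["acct", "no"]},
}
base = {ktype: tfidf_weights(bag) for ktype, bag in bags.items()}

# Hypothetical labeled term assignments from a customer who relies on abbreviations.
labeled = [(["cust", "number"], "T1"), (["acct", "identifier"], "T2")]

best_ratios, best_acc = tune_ratios(labeled, ["T1", "T2"], base, grid=[0.5, 1.0, 3.0])
print(best_ratios, best_acc)
```

In this toy data each query matches one term by its abbreviation and the other by a full-form name, so only ratio settings that weight the abbreviation bag above the term-name bag classify both labeled examples correctly. That mirrors the motivation given above for tuning weights to the customer's dataset rather than relying on tf-idf alone.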
In an exemplary scenario, there are 12 different types of keywords, namely “Term Names, Term Description, Term Category, Column Name, Column Description, Database Name, Schema Name, Table Name, Data Class, Abbreviation Matching, Stemmed Words, Synonym”. Accordingly...