Using Unstructured Data for Identifying Trends to Provide Predictive Insights

IP.com Disclosure Number: IPCOM000243447D
Publication Date: 2015-Sep-22
Document File: 5 page(s) / 123K

Publishing Venue

The IP.com Prior Art Database


Disclosed is a system that automatically identifies trends to provide predictive insights in massive unstructured data.

This is the abbreviated version, containing approximately 40% of the total text.

Limitations in current analytics for structured and unstructured data mean that identifying patterns in dissimilar data requires significant human interaction. With currently available tooling, the user is responsible for acquiring the data, manually reading a significant number of entries, recalling trends that appear around points of interest, formulating a description of those trends, and confirming them with keyword searches or manual calculations. Because it relies on keyword searches, this approach fails to capture the complexity of the patterns within the data and produces no quantifiable insight into the nature of the trends. With the current method, it is not possible to view patterns in real time, nor is it possible to make comparisons between the past and the future. The result is anecdotal and difficult to confirm, and cross-referencing the results is not possible.

The system discussed herein automatically identifies trends to provide predictive insights in massive unstructured data. The system receives unstructured data, usually text, as input. To extract useful information from the unstructured data, the system extracts features (useful information), eliminates noise (useless information), and transforms the text into word vectors by leveraging natural language processing (NLP) techniques. After obtaining the useful information represented by the word vectors, the system discovers significant patterns (human-readable knowledge) by leveraging data mining techniques. The patterns are ranked by metrics based on domain knowledge, and this ranking provides the predictive insights for the future.
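The flow described above (text to word vectors, pattern discovery, ranking) can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The stop-word list, the use of co-occurring term pairs as "patterns", and ranking by document support are all assumptions made for illustration; the disclosure ranks by domain-knowledge metrics that are not specified in this excerpt.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real system would use a fuller NLP resource.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on", "was", "with"}

def to_word_vector(text):
    """Turn one text into a bag-of-words vector (term -> count), dropping noise words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def mine_patterns(vectors, min_support=2):
    """Keep term pairs that co-occur in at least `min_support` documents (assumed pattern type)."""
    pair_counts = Counter()
    for vec in vectors:
        for pair in combinations(sorted(vec), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

def rank_patterns(patterns, n_docs):
    """Rank by support (fraction of documents containing the pair); a stand-in ranking metric."""
    scored = ((p, c / n_docs) for p, c in patterns.items())
    return sorted(scored, key=lambda item: (-item[1], item[0]))

docs = [
    "The invoice was delayed for the customer",
    "Customer complained the invoice is delayed",
    "Shipping was on time for the customer",
]
vectors = [to_word_vector(d) for d in docs]
ranked = rank_patterns(mine_patterns(vectors), len(docs))
for pair, support in ranked:
    print(pair, round(support, 2))
```

In this toy input, the pairs built from "customer", "delayed", and "invoice" each appear in two of the three texts, so they are the only patterns that pass the support threshold.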

Using this system does not require previous knowledge of the material within the data, nor does it require input in a known form. It creates relational data and fields from unstructured and structured data. The system is dynamic, defining patterns within the data and then acting upon those patterns without making assumptions from previous output. The system provides statistically relevant accuracy and removes analyst bias in the output.

Figure 1: Probability Models for Ranking (PMR) Data Analytics


The system can be divided into seven major parts (Figure 1).

Unstructured Data 101 is the input to the whole system; in business environments, it usually consists of the text of emails and phone call records. Unstructured Data 101 contains a lot of valuable information; however, that information is hidden in massive, noisy data.

Feature Extraction 102 formats and cleans the data, eliminates noisy data (useless text), and extracts informative text, which constitutes the features of the input. Feature Extraction 102 contains two main components: Format Parser 102a and Natural Language Processor 102b.
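The two-stage structure of Feature Extraction 102 can be sketched as a small Python pipeline. This is an illustration only. The `label: value` record layout, the set of useful labels, the stop-word list, and the function names are hypothetical and are not taken from the disclosure.

```python
import re

# Hypothetical labels whose content is treated as informative text.
USEFUL_LABELS = {"subject", "body"}

# Hypothetical stop-word list standing in for a real NLP resource.
STOP_WORDS = frozenset({"the", "is", "a", "to", "please"})

def format_parser(raw_record):
    """Stand-in for Format Parser 102a: keep only text stored under useful labels."""
    kept = []
    for line in raw_record.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() in USEFUL_LABELS:
            kept.append(value.strip())
    return " ".join(kept)

def natural_language_processor(text, stop_words=STOP_WORDS):
    """Stand-in for Natural Language Processor 102b: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def extract_features(raw_record):
    """Run the rough cleaning first, then the language processing step."""
    return natural_language_processor(format_parser(raw_record))

record = """From: alice@example.com
Subject: Renewal quote
X-Mailer: ExampleMail 1.0
Body: Please send the renewal quote to the client."""

features = extract_features(record)
print(features)
```

Separating the rough, format-level cleaning from the language-level processing keeps header noise such as the mailer line out of the tokenizer entirely, which is the division of labor the two components suggest.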

Format Parser 102a performs a rough cleaning of Unstructured Data 101. It extracts useful text based on the labels or tags associated with the text. The labels or tags usu...