Using NLP, Content Analytics, And Predictive Analytics To Analyze And Group Similar Documents From A Larger Pool Of Disorganized Documents

IP.com Disclosure Number: IPCOM000239116D
Publication Date: 2014-Oct-13
Document File: 2 page(s) / 41K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system that automatically groups documents by topics and features to reduce the amount of user effort required to organize and locate items. The system uses Natural Language Processing (NLP), Content Analytics, and Predictive Analytics to analyze and group similar documents from a larger pool of disorganized documents, uses document-clustering algorithms to identify features common to groups of documents/emails, and then uses those features as document tags for processing and organizing.

This is the abbreviated version, containing approximately 52% of the total text.

The disclosed method assists users with keeping documents in electronic file folders or email folders organized.

According to embodiments of the present invention, documents are grouped by topics and features in two ways: supervised and unsupervised. When a user enters freeform text into an interface, the text is processed into groups of data, which are then stored in a database. The method automatically creates a folder-like structure for unstructured documents/emails, which greatly reduces the user effort required to organize and locate items.

The method uses Natural Language Processing (NLP), Content Analytics, and Predictive Analytics to analyze and group similar documents from a larger pool of disorganized documents. Document clustering algorithms identify features common to groups of documents/emails. These features are then used as tags for the documents/emails. Tags allow a document/email to exist in multiple locations, which is an improvement over existing filing systems.
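The clustering-and-tagging idea can be illustrated with a small sketch. The disclosure does not name a specific clustering algorithm, so the code below substitutes a simple greedy grouping by Jaccard word overlap as a stand-in. The stopword list, the 0.2 threshold, the file names, and the sample texts are all assumptions made up for illustration.

```python
# Illustrative stand-in only: the disclosure does not specify an algorithm.
STOPWORDS = {"the", "a", "for", "of", "and", "to", "on", "is", "in"}

def tokens(text):
    """Lowercased word set of a document, minus stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, else it starts a new cluster."""
    clusters = []  # each cluster is a list of document names
    for name in docs:
        for group in clusters:
            if jaccard(tokens(docs[name]), tokens(docs[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def tags_for(group, docs):
    """Tags are the words shared by every document in the cluster."""
    return set.intersection(*(tokens(docs[n]) for n in group))

# Hypothetical sample documents.
DOCS = {
    "a.txt": "budget report for the third quarter",
    "b.txt": "quarter budget forecast and report",
    "c.txt": "team picnic schedule",
    "d.txt": "picnic schedule and menu for the team",
}

clusters = cluster(DOCS)

# Tag index: one document can be listed under several tags at once,
# which is what lets it "exist in multiple locations".
tag_index = {}
for group in clusters:
    for tag in tags_for(group, DOCS):
        tag_index.setdefault(tag, set()).update(group)
```

In this sketch, looking up `tag_index["budget"]` or `tag_index["report"]` returns the same two documents, so a single file is reachable through several tags instead of living in exactly one folder.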

Machine learning may be used to identify spam and other types of documents. Words that are common to spam emails, for example, become "features", the presence of which can indicate that a particular email is a spam email. While the word "sale" may be common to spam emails, it is also common in non-spam emails. Therefore, a feature vector is constructed using several indicator words as features, and machine learning techniques are applied (minimizing the classification error via a golden data set) to reach a determination as to the spam content of the email.
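As a minimal sketch of the feature-vector approach described above, the following code builds a binary vector from a few indicator words and fits a logistic-regression classifier by gradient descent on a tiny labelled set. Only the word "sale" comes from the disclosure; the other indicator words, the choice of logistic regression, the training texts, and the hyperparameters are assumptions for illustration.

```python
import math

# Hypothetical indicator words; the disclosure names only "sale".
INDICATORS = ["sale", "free", "winner", "click"]

def feature_vector(text):
    """Return 1.0 or 0.0 per indicator word, depending on its presence."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in INDICATORS]

def predict(weights, bias, x):
    """Logistic model: estimated probability that x describes spam."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(golden, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(INDICATORS), 0.0
    for _ in range(epochs):
        for text, label in golden:
            x = feature_vector(text)
            err = predict(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Made-up golden data set: (text, 1 = spam, 0 = not spam).
GOLDEN = [
    ("free sale click now winner", 1),
    ("winner click here for free prize", 1),
    ("click for free sale", 1),
    ("garage sale on saturday", 0),
    ("quarterly sale figures attached", 0),
    ("meeting notes from monday", 0),
]

weights, bias = train(GOLDEN)

def is_spam(text):
    """Classify as spam when the estimated probability is at least 0.5."""
    return predict(weights, bias, feature_vector(text)) >= 0.5
```

Because "sale" appears in both spam and non-spam training examples, the classifier cannot rely on it alone: `is_spam("sale report for review")` is false, while `is_spam("free winner click")` is true because several indicator words occur together.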

In a similar fashion, a document set can be analyzed using text analytics annotators, generating annotation metadata (...