Multi-input, pluggable, extensible classification agent

IP.com Disclosure Number: IPCOM000178879D
Original Publication Date: 2009-Jan-28
Included in the Prior Art Database: 2009-Jan-28
Document File: 3 page(s) / 30K

Publishing Venue

IBM

Abstract

Disclosed is a solution that will accurately classify any content arriving on a user’s system. It will accomplish this by 1) using all content that a user views from his system (even files on remote systems) to form the corpus used to accurately classify new content for that user and by 2) using the learned correlation between words in a document to determine accurate classification of new content.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

A solution is disclosed that will accurately classify any content arriving on a user's system, based on a new way of forming the corpus used for comparison and a way for applications and users to associate classification actions with content. Users accumulate a lot of content on their personal systems, including documents they download and create and emails they receive. Managing this growing content is difficult and causes many problems. For example, users receive junk email that they must identify and remove (or leave to take up space on their system or mail server). Another example is managing the location of documents on a user's system. When a user saves a file, he chooses where the file will go and attempts to place related files in the same location for ease of management. This becomes problematic when the number of locations and files is very large, as it is for typical users: duplicate documents are created in different locations, some documents are not filed in the right location and are then difficult to find, etc.

What is needed is a way to classify content coming onto a user's system, so that it is automatically handled appropriately. For example, junk email should be deleted, and files being saved to the system should be placed in the correct directory without extensive user action.

There are some solutions to these problems. Email spam filters automatically file or delete junk email. However, these solutions are inadequate. Today's trainable spam filters perform document vector computations and auto-flag an incoming message based on its proximity in n-dimensional space to clusters of other messages already known to be spam. Currently, filters either a) associate individual words with spam/not-spam probabilities and then calculate a combined score for the whole document from the probabilities of the words in it, or b) index the document based on the frequency (and possibly position) of individual words and perform vector math against other "averaged" clouds of spam documents (http://email.about.com/cs/bayesianfilters/a/bayesian_filter.htm). This compares the words found in one message with the classification of each word based on all other messages examined, but does not consider correlation between pairs or groups of words within a message when determining that an email is junk (a popular trick used by spammers today is to combine nonsense phrases with "good" words to throw off typical spam filters).
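To make the weakness of approach a) concrete, the following is a minimal Python sketch of a per-word probability filter. It is an illustration under stated assumptions, not the filter described in the linked article or the method proposed in this disclosure: the function names train and spam_score, the add-one smoothing, the log-odds combination, and the toy training messages are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def train(messages):
    """Estimate a spam probability for each word seen in training.

    messages: iterable of (text, is_spam) pairs (hypothetical toy format).
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        words = set(text.lower().split())  # count each word once per message
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1

    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
        probs[word] = p_word_given_spam / (p_word_given_spam + p_word_given_ham)
    return probs


def spam_score(text, probs):
    """Combine per-word probabilities as if words were independent."""
    log_odds = 0.0
    for word in set(text.lower().split()):
        p = probs.get(word, 0.5)  # unseen words are treated as neutral
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy training data, invented for illustration only.
TRAINING = [
    ("buy cheap pills now", True),
    ("cheap pills discount offer", True),
    ("meeting agenda for project review", False),
    ("project review notes attached", False),
]

probs = train(TRAINING)
plain = spam_score("cheap pills", probs)
padded = spam_score("cheap pills meeting agenda project review notes", probs)
print(plain, padded)  # the padded message receives the lower score
```

Because each word contributes to the score independently, appending words that were seen only in legitimate mail pulls the combined score down, even though those words have no relationship to the rest of the message. This is the behavior that the "good words" trick exploits.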

Existing "desktop" solutions index all files that contain text on a system so that they can be searched by users. However, they do not attempt to classify any content, and they index only content in local files on the system.
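As a rough illustration of what such indexing involves, the Python sketch below builds a simple inverted index from words to file paths and answers keyword queries against it. It is an assumption-laden example, not a description of any particular desktop search product: the function names build_index and search, the whitespace tokenization, and the sample paths are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of paths whose text contains it.

    documents: dict of path -> text (hypothetical in-memory stand-in
    for files read from the local disk).
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for word in text.lower().split():
            index[word].add(path)
    return index


def search(index, query):
    """Return the paths that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Sample local files, invented for illustration only.
docs = {
    "/home/user/a.txt": "project review notes",
    "/home/user/b.txt": "vacation notes",
}
idx = build_index(docs)
matches = search(idx, "project notes")
```

An index like this answers "which files contain these words" but assigns no category to any file, and it covers only the documents it was given, which mirrors the two limitations noted above.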

Proposed is a solution that will more accurately classify content arriving on a user's system. It will accomplish this by 1) Using all content that a user views from his system (even files on remote systems) to form th...