Cooperative e-mail classification based on selective in-band notification within pattern based contextized groups
Original Publication Date: 2005-Oct-06
Included in the Prior Art Database: 2005-Oct-06
We have made two major additions to the original idea to address its implementation and to enhance its practical value. In essence, this disclosure describes a system for collaborative e-mail classification; spam classification is contained within the system's generic e-mail ranking function.

1. Context-based classification of users, with confidence measures limited to these contexts.

One of the problems pointed out during the evaluation of our previous disclosure was that confidence measures may not be valid, because a user can receive conflicting recommendations from different peers. We overcome this problem by defining confidence only within a context. In our refined approach, every recommendation carries a confidence measure and an indication of the context to which it applies.

What is a context? A context is a dynamic grouping of users based on similarities that our system derives from different parameters, including:
- historical behavior (for example, interactions between users)
- implicit grouped behavior (for example, being on the same Lotus Notes mailing "group")
- patterns mined from legitimate messages (for example, both Rob and Bob receive a number of messages about WebSphere)

Users are hence grouped not only by their historical behavior but also by message contexts (this is described in the summary section). Within a particular context, confidence is established from user interaction patterns. This is an effective way of deriving knowledge implicitly: by placing the knowledge in context, we further refine the learning.

2. Our idea now goes beyond the initial, narrow spam notification space to e-mail ranking and classification. This widens the practical applicability of the idea, and we find it more useful for ranking e-mail by relevance to the contexts deduced by the system.
By leveraging implicit collaboration and using suggested heuristics, again within the context of message similarity, we can make e-mail a more efficient tool. Like any heuristic system, this system will not always produce perfectly accurate results, but the classification, even when imperfect, will help users deal with massive amounts of e-mail. Our approach of implicit heuristic classification, together with selective notification and notification propagation, can be extended to many other areas. We have chosen to focus on e-mail classification because that is the area where we see this idea benefiting the IBM business tremendously, as a unique and useful value-add to IBM's offerings. The following is from an IBM Academy study proposal on spam technologies: "To win in this area, where Microsoft is investing aggressively, IBM will need to enunciate a complete vision for fighting spam, and deliver a superior suite of antispam technologies and solutions to the market."
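The context-scoped confidence described in point 1 can be illustrated with a small sketch. It is not the disclosed implementation: the names `Recommendation` and `classify`, the label strings, and the rule of summing confidences per label are all assumptions made for illustration. The only idea taken from the text is that each recommendation carries a confidence and a context, and that confidences are compared only within the same context.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record for one peer recommendation."""
    peer: str          # user who issued the recommendation
    context: str       # context the recommendation applies to
    label: str         # e.g. "spam" or "relevant" (illustrative labels)
    confidence: float  # meaningful only inside `context`


def classify(recommendations, recipient_contexts):
    """Return the winning label for each context the recipient belongs to.

    Recommendations tagged with a context the recipient is not part of are
    ignored, so advice from unrelated contexts is never mixed together.
    Summing confidences per label is an assumed aggregation rule.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for rec in recommendations:
        if rec.context in recipient_contexts:
            scores[rec.context][rec.label] += rec.confidence
    # Within each context, pick the label with the highest total confidence.
    return {ctx: max(labels, key=labels.get) for ctx, labels in scores.items()}


if __name__ == "__main__":
    recs = [
        Recommendation("rob", "websphere", "relevant", 0.9),
        Recommendation("bob", "websphere", "spam", 0.3),
        Recommendation("eve", "marketing", "spam", 0.8),
    ]
    # The recipient shares only the "websphere" context with these peers.
    print(classify(recs, {"websphere"}))
```

In this sketch, the conflicting recommendation from the "marketing" context never reaches a recipient who is only in the "websphere" context, which is the sense in which limiting confidence to a context avoids conflicting advice from different peers.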
Cooperative e-mail classification based on selective in-band notification within pattern based contextized groups
Spam e-mail is considered to be one of the biggest hurdles to productivity. As is well known, the amount of spam e-mail will soon surpass the amount of legitimate e-mail being passed around. According to a Ferris Research white paper, such unsolicited commercial e-mail makes up 30% of all e-mail exchanged today. The same paper points out that the mobile messaging market is becoming a breeding ground for spam, and notes the amount of time, effort and money corporations spend on maintaining their spam defenses.
A number of systems are in place to control spam. Some of these systems are server-based and filter spam using blacklists while allowing mail from senders listed on global whitelists. Other approaches use different filtering mechanisms. Very few existing solutions work on the user's side; many of those that do allow the user to specify what is spam and what is not.
Existing mechanisms to control and remove spam have the following issues:
- Existing approaches are centralized and classify spam at a global level, without regard to individual users' perceptions and analyses. When user perceptions are taken into account, they take effect globally.
- A centralized system (particularly one that takes user perceptions into account) poses scalability problems. There is a need to automate the process of classifying spam by allowing the system of users to perform it implicitly and collaboratively.
- The problem is worse in mobile messaging markets, where intermittent connectivity means that spam updates are not guaranteed to reach all users, and users may not be able to update the centralized blacklists in time.
- In existing centralized approaches, the time to update the blacklists or filtering knowledge is higher than the network latency alone, because of the time required to determine the information that will then be pushed onto the actual centralized classifiers.
- Centralized approaches that allow for user feedback have a built-in latency while they wait for more indications of the same problem, with the inherent problem of starvation: a user notification may never be useful until there are more. This situation worsens with the size of the organization because most thresholds are relative.
- High effectiveness leads to higher false positive rates. Corporate organizations will accept only a false positive rate between 0.001% and 0.01% (Ferris Research White Paper). False positives may become more of a problem if there is no agreement between e-mail users in an enterprise on what is spam and what is not.
- Finally, existing spam mechanisms use different approaches (genetic signatures, word-based filtering, rule-based filtering, etc.) and do not interoperate, requiring deployment of one or more mechanisms on an enterprise-wide basis, restricting user choice and reducing the effectiveness t...