Building a Spam Corpus Automatically

IP.com Disclosure Number: IPCOM000019303D
Original Publication Date: 2003-Sep-10
Included in the Prior Art Database: 2003-Sep-10
Document File: 3 page(s) / 44K

Publishing Venue

IBM

Abstract

Three techniques for extending spam and non-spam corpus files automatically. The first customizes the non-spam corpus using the user's internet browsing history. The second builds the spam corpus from email deletion history. The third uses the words from a user's chat sessions to help build the non-spam corpus.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 46% of the total text.

Disclosed are several methods for automatically seeding a non-spam corpus with material of interest to the individual user. Seeding shortens the learning period a Bayesian filter needs before it effectively separates unsolicited bulk email from the email the user wants to read.

Unsolicited bulk email is alternatively known as unsolicited commercial email and colloquially known as spam. A leading information technology industry analyst projects that the total number of email messages sent daily will exceed 60 billion worldwide in 2006, almost double the estimated 31 billion in 2002. Unfortunately, the analyst also expects that approximately 50 percent of this email will be spam. Spam email is even more prevalent among internet service providers. A prominent spam-filtering service company reports that, on average, 67 percent of the email its customers received on a typical day was unsolicited commercial email. The company also says that spam has increased 65 percent over January 2002.

Due to the proliferation of spam, spam filtering has become widespread. Frequently, these spam filters use Bayesian filtering; a simplified pseudo-Bayesian algorithm is described below. Many open source and proprietary spam-fighting solutions use Bayesian filtering technology.

Bayesian spam filters work by creating two lists. The first is a list of words that occur in spam email, each with its occurrence probability. It is referred to as the spam corpus. The second is a list of words that occur in non-spam email, each with its occurrence probability. It is referred to as the non-spam corpus (colloquially, the "ham" corpus). (For performance reasons, an actual implementation may opt to keep only a relatively small number of the most highly rated words in the corpora.)
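The two lists described above can be sketched in Python as follows. This is an illustrative sketch only, not the implementation from the disclosure: the function names, the tokenizing rule, the choice of total token count as the corpus size, and the use of raw frequency to pick the words to keep are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def build_corpus(messages, max_words=None):
    """Return (word_counts, corpus_size) for a list of message strings.

    Assumption: corpus_size is the total number of tokens seen.
    Assumption: when max_words is given, the most frequent words are
    the ones kept; the disclosure only says "most highly rated".
    """
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    size = sum(counts.values())
    if max_words is not None:
        counts = Counter(dict(counts.most_common(max_words)))
    return counts, size


def occurrence_probability(word, counts, size):
    """Occurrences of the word divided by the corpus size."""
    return counts.get(word, 0) / size if size else 0.0


# Hypothetical training messages, one list per corpus.
spam_counts, spam_size = build_corpus(["Buy cheap pills now", "cheap loans now"])
ham_counts, ham_size = build_corpus(["Project meeting moved to Monday"])
```

The same function builds both corpora; only the source of the messages differs.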

In the pseudo-Bayesian algorithm, the spam index to categorize a single word in an email is calculated as:

sp = [occurrences in spam] / [spam corpus size]
np = [occurrences in nonspam] / [nonspam corpus size]
spamliness = sp / (sp + np)
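The per-word calculation above can be written as a short Python function. This is a minimal sketch under stated assumptions: the function name, the dictionary-based corpora, and the sample counts are hypothetical, and the neutral value returned for a word found in neither corpus is a choice the disclosure does not specify.

```python
def spamliness(word, spam_counts, spam_size, nonspam_counts, nonspam_size):
    """Return sp / (sp + np) for one word, following the formula above."""
    if spam_size <= 0 or nonspam_size <= 0:
        raise ValueError("both corpus sizes must be positive")
    sp = spam_counts.get(word, 0) / spam_size
    np_ = nonspam_counts.get(word, 0) / nonspam_size
    if sp + np_ == 0:
        # Assumption: a word seen in neither corpus is treated as neutral.
        return 0.5
    return sp / (sp + np_)


# Hypothetical counts, for illustration only.
spam_counts = {"cheap": 30, "meeting": 2}
nonspam_counts = {"cheap": 5, "meeting": 40}

cheap_score = spamliness("cheap", spam_counts, 1000, nonspam_counts, 2000)
meeting_score = spamliness("meeting", spam_counts, 1000, nonspam_counts, 2000)
```

With these sample counts, "cheap" scores close to 1 (about 0.92) and "meeting" scores close to 0 (about 0.09). Dividing by each corpus size keeps a large spam corpus from dominating a small non-spam corpus.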

What all implementations of the Bayesian filtering algorithms have in common is that the user trains the lists manually, in one or more steps, by reading each email and categorizing it as either spam or non-spam. Collections of spam email are also available for download for quick training of the spam corpus, and many spam filtering service providers supply default spam/non-spam word lists. However, since the non-spam corpus is inherently very individual, there is no quick way to train the filter for non-spam mail unless the user has previously saved a high-quality archive of their most interesting email. Even then, the preexisting email archive may weaken the non-spam corpus if the user only weakly pruned the incoming email or if the user's interests have changed significantly.

It is very time consuming to categorize ema...