Method and System for Improved Spam Discrimination
Disclosure Number: IPCOM000021447D
Original Publication Date: 2004-Jan-19
Included in the Prior Art Database: 2004-Jan-19
Document File: 2 page(s) / 37K

A program is disclosed that uses the "Received:" lines, which document the path a piece of e-mail takes from its sender to its receiver, as part of a spam discrimination mechanism (Bayesian classification or any other method). It uses only those "Received:" lines that describe the path of the e-mail before its final entry into the receiving enterprise. A system that ignores all "Received:" lines cannot take advantage of information indicating that a piece of e-mail may have been routed through known spam relay systems (or, conversely, that every step in the path is a known well-behaved system). A system that uses all "Received:" lines can be confused by information about systems within the enterprise, which is of no value in determining whether an e-mail is spam.

This is the abbreviated version, containing approximately 52% of the total text.

Method and System for Improved Spam Discrimination

Spam (broadly defined for the purpose of this document as "unwanted electronic mail (e-mail) from outside the enterprise") is a major problem for users of e-mail. The US Federal Trade Commission, in announcing a forum on the problem of spam, cited estimates that more than one third of all e-mail is spam [1]. There are many techniques for separating spam from desired e-mail; one of the more successful is "Bayesian classification", as proposed by Paul Graham in "A Plan for Spam" [2] and "Better Bayesian Filtering" [3] and as implemented in many tools.

However, existing Bayesian classification tools can be confused by analyzing information pertaining to the routing of e-mails within the receiving enterprise. This causes mis-classification of the e-mail in certain circumstances; in particular, e-mails which originate outside the enterprise and which consist solely (or almost entirely) of attached documents may easily be misdiagnosed as spam.

This disclosure provides an improvement in the analysis of the routing information provided with an e-mail so that information within the enterprise is ignored but information about routing outside the enterprise (which can, for example, show that the e-mail was routed through known spam relays) is used.

This improvement relies on the behavior of mail transfer agents implementing the Simple Mail Transfer Protocol (SMTP), as defined in RFC 2821 [4]. A compliant mail transfer agent adds one header line to each piece of e-mail it processes; this line, called the "Received:" line, is added before any other "Received:" line in the e-mail and has a standard format (defined in RFC 2821). In particular, the line identifies both the system receiving the e-mail (the "by" system) and the system from which the mail was received (the "from" system).
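
The following is a minimal sketch of how the "from" and "by" systems might be read out of the "Received:" lines and how the lines describing hops inside the enterprise might be dropped. It is an illustration under stated assumptions, not the disclosed implementation: the suffix test in is_internal, the domain ".example.com", the function names, and the sample message are all hypothetical, and the rule for locating the final entry into the enterprise (the newest line whose "by" system is internal and whose "from" system is not) is one plausible reading of the description above.

```python
import re
from email import message_from_string

# Hypothetical enterprise domain suffix (an assumption for illustration).
INTERNAL_SUFFIX = ".example.com"

# Host names end at whitespace, ';' or a parenthesised comment.
FROM_RE = re.compile(r"^\s*from\s+([^\s;()]+)", re.IGNORECASE)
BY_RE = re.compile(r"\sby\s+([^\s;()]+)", re.IGNORECASE)


def parse_received(value):
    """Return (from_host, by_host) for one Received: value; either may be None."""
    frm = FROM_RE.search(value)
    by = BY_RE.search(value)
    return (frm.group(1) if frm else None, by.group(1) if by else None)


def is_internal(host):
    """Assumed test: a host is internal if its name ends with the enterprise suffix."""
    return host is not None and host.lower().endswith(INTERNAL_SUFFIX)


def external_received(raw_message):
    """Return the (from, by) hops up to and including the final entry into the enterprise.

    Received: lines are prepended, so the newest hop comes first. The first
    line whose "by" host is internal and whose "from" host is external marks
    the final entry; every line above it describes internal routing only.
    """
    msg = message_from_string(raw_message)
    hops = [parse_received(v) for v in msg.get_all("Received", [])]
    for i, (frm, by) in enumerate(hops):
        if is_internal(by) and frm is not None and not is_internal(frm):
            return hops[i:]
    # Assumption: if no entry hop is found, keep every hop unchanged.
    return hops


# Hypothetical message with two internal hops above two external ones.
SAMPLE = (
    "Received: from hub.example.com by mailbox.example.com; Mon, 19 Jan 2004 10:00:03 -0500\n"
    "Received: from gw.example.com by hub.example.com; Mon, 19 Jan 2004 10:00:02 -0500\n"
    "Received: from relay.spam.test (unknown [192.0.2.7]) by gw.example.com; Mon, 19 Jan 2004 10:00:01 -0500\n"
    "Received: from origin.invalid by relay.spam.test; Mon, 19 Jan 2004 10:00:00 -0500\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
```

Calling external_received(SAMPLE) keeps only the two oldest hops, so a classifier would see relay.spam.test and origin.invalid but not the internal hub and mailbox systems. A real deployment would likely need a more careful notion of "internal" than a name suffix, since the "from" name is supplied by the sending system and can be forged.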

When processing a piece of e-mail to determine if it is spam or not, knowledge of the systems through whi...