Large-Scale Spam Filtering
Original Publication Date: 2004-May-06
Included in the Prior Art Database: 2004-May-06
Current mechanisms to filter spam include black-/whitelists, challenge-response systems, and Bayesian classifiers. The spam load, the rate of false positives and false negatives, and the number of challenge mails should all be reduced. A combination scheme is presented below that uses a Bayesian filter as a front end to a challenge-response system.
Spam (aka Unsolicited Commercial E-Mail, UCE) is a growing problem. Many users today receive over 100 spam messages a day.
Traditional mechanisms include blacklists, whitelists, domain-/IP-address-based filters, challenge-response systems, and "self-learning artificial intelligence" (AI) mechanisms.
Static black-/whitelists are too inflexible and always lag behind the spammers' efforts.
Domain-/address-based filters are often too coarse-grained (e.g., ORDB.org, RBL (
Challenge-response systems are a burden on users, who need to take multiple steps before a message reaches the intended recipient. Once a sender is authenticated, it can send messages without further challenges. As this does not work for automatically generated mail and mailing lists, pre-authenticated addresses can be used for subscription to these services (e.g., TMDA.net).
AI systems are not accurate; depending on the thresholds set, the number of false positives or false negatives can be significant. Except for useless threshold settings, there will always be both false positives and false negatives (e.g., SpamAssassin.org).
Certified spam databases are user-collaboration tools against spam. There, a user may register a received mail as spam; all other users can then check their mails against the database to find out whether a message has been blacklisted there. Disadvantages include the delay until spam has been identified as such, the continuous database requests, which may not be suitable for large mail systems, and potential privacy risks (e.g., Vipul's Razor, http://razor.sourceforge.net/).
Therefore, the current systems are not adequate when mail delivery needs to be reliable and spam needs to be filtered out efficiently at the same time.
The idea is to combine AI with a challenge-response system. At the external gateway(s), each message is first sent through an AI system for classification (the AI stage might include domain-/address-based filters, which may in turn verify whether the sender domain has a reachable mail server). In addition, the administrator chooses two thresholds; in effect, messages are divided by their spam potential into three groups: low, medium, and high.
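The two-threshold split described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the names `SpamBand` and `classify`, the assumption that the AI stage yields a numeric score where higher means more spam-like, and the handling of scores that fall exactly on a threshold are all hypothetical choices made for the example.

```python
from enum import Enum


class SpamBand(Enum):
    """The three groups of spam potential named in the text."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def classify(score: float, low_threshold: float, high_threshold: float) -> SpamBand:
    """Map a classifier score to one of three bands using two thresholds.

    Assumption: a higher score means the message is more likely spam.
    Assumption: a score exactly equal to a threshold falls into the
    higher of the two adjacent bands.
    """
    if low_threshold > high_threshold:
        raise ValueError("low_threshold must not exceed high_threshold")
    if score < low_threshold:
        return SpamBand.LOW
    if score < high_threshold:
        return SpamBand.MEDIUM
    return SpamBand.HIGH


if __name__ == "__main__":
    # Illustrative thresholds only; real values are chosen by the administrator.
    for s in (0.05, 0.50, 0.95):
        print(s, classify(s, low_threshold=0.2, high_threshold=0.9).value)
```

Moving the two thresholds apart widens the medium band, so more messages fall between the clear-cut cases; moving them together shrinks it.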
Messages with a low rating are di...