
Improve traditional Phishing blacklist mechanism Disclosure Number: IPCOM000244391D
Publication Date: 2015-Dec-09
Document File: 8 page(s) / 313K

Publishing Venue

The Prior Art Database


An approach to update the blacklist and improve the URL (Uniform Resource Locator) matching algorithm from an exact match to a regular-expression match methodology. The idea of this invention is to turn the traditional exact match, in which an incoming URL must match a blacklist URL character for character to be effective, into a regular-expression-based matching mechanism. To do this, a new segmentation process and segment-rule generation are applied. The advantages of this change are: improved performance over the traditional exact-match process; a greatly reduced blacklist size, since many highly similar URLs collapse into one rule; less time taken to discover a problematic URL among millions of blacklist entries; and a flexible performance index that can be applied for aggressive matching tuning.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.


The purpose of this disclosure is to address the current challenges faced by URL filtering rules (URL blacklists) in detecting phishing URLs. The current challenges are:

1. A traditional blacklist uses exact matching, which is easily evaded by modifying the phishing URL (for example, its session ID).

For example, if a phishing URL "" exists in the blacklist, the phisher can easily evade the match by changing the phishing URL to "".

2. If we put every possible phishing URL into the blacklist, the blacklist grows too fast and too large, and becomes hard to maintain.

3. The efficiency of the blacklist drops dramatically when it is filled with too many invalid records.

So a new solution is needed. The proposed approach refines the existing blacklist and improves the URL matching algorithm from an exact match to a regular-expression-based match, so that a performance improvement can be achieved.
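As a minimal sketch of the difference between the two matching modes (the URLs and the session-ID pattern below are invented for illustration, not taken from the disclosure):

```python
import re

# Hypothetical phishing URLs; real blacklist entries are not reproduced here.
blacklist_exact = {
    "http://phish.example/login.php?session=1111",
    "http://phish.example/login.php?session=2222",
}

incoming = "http://phish.example/login.php?session=3333"

# Exact match: the new session ID evades the blacklist.
exact_hit = incoming in blacklist_exact

# One regular-expression rule covers the whole family of URLs.
rule = re.compile(r"^http://phish\.example/login\.php\?session=\d+$")
regex_hit = rule.match(incoming) is not None

print(exact_hit, regex_hit)  # False True
```

A single rule here replaces an unbounded set of exact entries, which is the size reduction the disclosure targets.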

After a survey of related work, we found one, and only one, interesting paper. That paper separates a URL into three segments: hostname, directory, and file name. It then uses a group of malicious URLs to extract lexical patterns for each of the three segments. Below are the top-ranking segments found by the paper's algorithms:
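A sketch of that three-way segmentation using Python's standard URL parsing (the example URL is invented):

```python
from urllib.parse import urlparse
import posixpath

def segment(url):
    # Split a URL into the three segments used in the surveyed paper:
    # hostname, directory path, and file name.
    parts = urlparse(url)
    directory, filename = posixpath.split(parts.path)
    return parts.hostname, directory, filename

print(segment("http://evil.example/account/update/verify.php"))
# ('evil.example', '/account/update', 'verify.php')
```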



According to the paper's experiments, it uses the malicious probability ratio as a performance index:
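That index can be read as the fraction of a pattern's matches that are known-malicious; a sketch under that reading (the function name, sample URLs, and pattern are assumptions, not from the paper):

```python
import re

def malicious_ratio(pattern, malicious_urls, benign_urls):
    # Fraction of all URLs matched by `pattern` that come from the
    # malicious set; a ratio near 1.0 suggests the pattern may be
    # safe to promote into a blacklist rule.
    rx = re.compile(pattern)
    m = sum(1 for u in malicious_urls if rx.search(u))
    b = sum(1 for u in benign_urls if rx.search(u))
    return m / (m + b) if m + b else 0.0

malicious = ["http://a.test/login.php?sid=1", "http://b.test/login.php?sid=9"]
benign = ["http://shop.test/cart.php"]
print(malicious_ratio(r"login\.php\?sid=\d+", malicious, benign))  # 1.0
```

The cutoff applied to this ratio is exactly the kind of tunable performance index the disclosure mentions.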



Compared to that paper, our approach differs in several points:

We separate the hostname at each dot to produce more accurate blacklist rules and avoid a high FP (false positive) rate.
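One way the per-dot hostname treatment could work is to generalize only the labels that differ between two phishing hostnames while keeping the rest exact (a sketch; the wildcard token `[\w-]+` and the pairwise merge are assumptions for illustration):

```python
import re

def generalize_hostnames(h1, h2):
    # Split both hostnames on dots; labels that agree stay literal,
    # labels that differ become a wildcard token. Hostnames with a
    # different label count are not merged (returns None).
    a, b = h1.split("."), h2.split(".")
    if len(a) != len(b):
        return None
    return r"\.".join(
        re.escape(x) if x == y else r"[\w-]+" for x, y in zip(a, b)
    )

rule = generalize_hostnames("login1.phish.example", "login2.phish.example")
print(rule)                                            # [\w-]+\.phish\.example
print(bool(re.fullmatch(rule, "login3.phish.example")))  # True
```

Keeping the unchanged labels literal is what holds the false-positive rate down compared with wildcarding the whole hostname.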

We use a customized configuration (performance indexes and tuning training sets) to create more meaningful blacklist rules, fulfilling the needs of different users.

We have improved the efficiency of generating blacklist rules by refining the process.

We separate a single URL into several blocks, and compare lengths against an incoming URL; if the lengths differ, no further comparison is done.
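That length pre-check can be sketched as an index keyed by URL length, so most rules are skipped without running any regex (the `(length, regex)` rule structure is an assumption for illustration):

```python
from collections import defaultdict
import re

def build_index(rules):
    # Bucket each compiled rule under the URL length it was built for,
    # so a lookup only runs regexes from the matching-length bucket.
    index = defaultdict(list)
    for length, rx in rules:
        index[length].append(rx)
    return index

def is_blacklisted(index, url):
    # Rules in other length buckets are skipped without any comparison.
    return any(rx.match(url) for rx in index.get(len(url), []))

rules = [(28, re.compile(r"http://p\.test/login\?sid=\d{4}"))]
index = build_index(rules)
print(is_blacklisted(index, "http://p.test/login?sid=1234"))   # True
print(is_blacklisted(index, "http://p.test/login?sid=12345"))  # False
```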

We consolidate the original blacklist and use a regular-expression token structure to represent phishing URLs with high similarity, which helps to maintain and c...