
Bulk Website Monitoring Robot
Disclosure Number: IPCOM000020250D
Original Publication Date: 2003-Nov-06
Included in the Prior Art Database: 2003-Nov-06

Publishing Venue



For many years, law enforcement organisations and child advocacy organisations have worked to close down websites hosting illegal pictures of children. In the past this has mainly been done by acting on reports from the public, which is very slow given the number of sites; it also exposes investigators to harmful code, such as the JavaScript implemented at some of these sites, and is expensive because of the amount of network traffic generated. ISPs must also be wary of what their customers are publishing on the web, and need an easier way to monitor the large number of sites owned by each customer (which can be up to 65,000), as they generally do not want to be on the wrong side of the law.

This application stores the details of the owner of an IP address or block of IP addresses, scans each of those addresses to locate the associated websites, and collects information about which JPG images appear on each of the pages of those sites. Because the page source is never executed, there is no chance of any malicious code being run. The technique also avoids loading unnecessary bitmaps, sound files and so on, which cuts down on network traffic and allows a large number of sites to be scanned in a short period.

Once this information has been collected, it is a simple matter to view the images associated with the websites in the form of a slide show. Images can be flagged as needing to be reported or as to be ignored. At the end of this processing, a report can be generated and sent to the appropriate law enforcement agency and/or the website's owner.

Some websites lure users by embedding popular keywords that search engines pick up. The user is drawn to such a site because it appears to be exactly what they are looking for, only to find that the website has modified their registry, corrupted files, and so on.
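The disclosure gives no source code, but the core collection step it describes, treating the page source as plain text so that no script in it can execute, and recording each JPG reference only once, can be sketched as follows. This is a minimal illustration in Python; the sample page content is hypothetical and the real robot would fetch source over the network rather than from a string.

```python
from html.parser import HTMLParser


class JpegCollector(HTMLParser):
    """Collect JPG image references from raw page source.

    The source is processed purely as text, so any <script> blocks
    it contains are never executed. A set is used so that each image
    is recorded only once, even if several pages refer to it.
    """

    def __init__(self):
        super().__init__()
        self.images = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value and value.lower().endswith((".jpg", ".jpeg")):
                    self.images.add(value)


# Hypothetical page source: duplicate JPG reference, a non-JPG image,
# and an embedded script that is never run.
page = ('<html><body><img src="/pics/a.jpg"><img src="/pics/a.jpg">'
        '<img src="b.png"><script>alert(1)</script></body></html>')

collector = JpegCollector()
collector.feed(page)
print(sorted(collector.images))  # → ['/pics/a.jpg']
```

Only the image *references* are gathered at this stage; the images themselves are fetched later for the manual slide-show review, which is how the robot avoids downloading unnecessary content during the scan.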
This robot can also be configured to search for potentially harmful JavaScript code embedded in these sites. A set of potentially harmful code fragments is stored in the robot's database, and a string search for each fragment is performed on every page being examined. When a fragment is found, the location of that code is recorded in the database, and at the end of the robot's run a report is produced (see below). The robot cannot remove the potentially offending code because (1) the websites are accessed in a read-only manner, and (2) the fragment may not necessarily be bad; registry updates, for example, may be done for legitimate reasons.

Summary of advantages:
(1) Page source (HTML/ASP etc.) is not executed, so there is no risk of damage to the computer. (Virus scans to fix infections can take an hour to run, and the time previously spent reinstalling damaged software is no longer lost.)
(2) The information can be summarised without attendance by the user, which is the bulk of the work; only the checking of the collected data remains manual.
(3) Far less data is handled. The savings are hard to quantify, as they depend on the amount of graphics on each page, but on average a saving of up to 50% is possible.
(4) By law, a minimum of three offending images is required for a site to be reported. Once the user has flagged three images from a site, they can choose to ignore the remainder for that site, which is psychologically easier on the user than having to view a large number of images via Netscape or a similar browser.
(5) Each image is recorded only once, further cutting down on traffic, since some websites refer to the same image from multiple pages.

Current Limitations
IP addresses are the identifiers for servers, routers and other items on the Internet. A single server can be used to host many websites. This application currently only checks the default website for each IP address and therefore misses the other websites on that server.
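The fragment search described above is a plain string scan, not execution or parsing of the script. A rough sketch in Python follows; the fragment list and the sample page are hypothetical stand-ins for the robot's database contents, and a hit only records a location for later manual review, since, as noted, a fragment may be present for legitimate reasons.

```python
# Hypothetical fragment table; the real robot stores these in its database.
HARMFUL_FRAGMENTS = [
    "document.cookie",
    "RegWrite",
    "ActiveXObject",
]


def scan_page(url, source, fragments=HARMFUL_FRAGMENTS):
    """Return (url, fragment, offset) records for every occurrence found.

    A plain substring search over the raw page text: the source is
    never executed, and each record merely marks a location so a human
    can review it later.
    """
    hits = []
    for fragment in fragments:
        offset = source.find(fragment)
        while offset != -1:
            hits.append((url, fragment, offset))
            offset = source.find(fragment, offset + 1)
    return hits


# Hypothetical page containing two of the stored fragments.
page = ('<script>var sh = new ActiveXObject("WScript.Shell");'
        ' sh.RegWrite("HKCU\\\\Test", 1);</script>')

for record in scan_page("http://192.0.2.1/index.html", page):
    print(record)
```

Each record would be written to the database, and the end-of-run report is simply a dump of these records grouped by site.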
Research is currently being done to find a workaround for this problem. Because of it, this application only scans relative addresses (i.e. href="/mydir/mypage.html"), since it is currently hard to determine whether two differently expressed addresses refer to the same website. A solution has been planned and will be tested and implemented in a later version.
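The relative-address restriction amounts to a simple filter on the links extracted from each page. A minimal sketch, assuming the common URL schemes of the time; the exact scheme list the robot checks is not given in the disclosure.

```python
def is_relative(href):
    """True for links such as "/mydir/mypage.html" that stay on the
    same server. Absolute URLs are skipped because the robot cannot
    yet tell whether two different hostnames resolve to the same
    website (the limitation described above).
    """
    return not href.lower().startswith(("http://", "https://", "ftp://", "//"))


# Hypothetical links harvested from a page.
links = ["/mydir/mypage.html", "http://other.example/page.html", "pics/a.jpg"]
print([h for h in links if is_relative(h)])
# → ['/mydir/mypage.html', 'pics/a.jpg']
```

Relative links are resolved against the site currently being scanned, so the robot never leaves the server it started on.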