Browse Prior Art Database

Bulk Website Monitoring Robot

IP.com Disclosure Number: IPCOM000020250D
Original Publication Date: 2003-Nov-06
Included in the Prior Art Database: 2003-Nov-06
Document File: 18 page(s) / 89K

Publishing Venue

IBM

Abstract

For many years, law enforcement and child advocacy organisations have worked to close down websites hosting illegal pictures of children. In the past, this has mainly been done by acting on reports from the public, which is very slow given the number of sites, and it puts investigators at risk from harmful code, such as the javascript implemented at some of these sites. It is also expensive due to the amount of network traffic generated. ISPs likewise have to be wary of what their customers publish on the web, and need an easier way to monitor the large number of sites owned by each customer (which can be up to 65,000), as they generally do not want to be on the wrong side of the law.

This application stores the details of the owner of an IP address or block of IP addresses, scans through each of these IP addresses to locate the associated websites, and collects information about the JPG images on each of the pages associated with each site. Because the technique never executes the page source, there is no chance of malicious code being run. It also avoids loading unnecessary bitmaps, sound files etc, which cuts down on network traffic and allows a large number of sites to be scanned in a short period. Once this information has been collected, it is a simple matter to view the images associated with the websites in the form of a slide show. Images can be flagged as needing to be reported or to be ignored. At the end of this processing, a report can be generated and sent to the appropriate law enforcement agency and/or the website's owner.

Some websites lure users by embedding popular keywords which the search engines find. The user is drawn to such a site because it appears to be exactly what they are looking for, only to find that the website has modified their registry, corrupted files etc.

This robot can also be configured to search for potentially harmful javascript code embedded in these sites. This is done by storing a set of potentially harmful code fragments in the robot's database and performing a string search for each fragment in every page being examined. Once a fragment has been found, the location of that code is recorded in the database, and at the end of the robot's run a report is produced (see below). The robot cannot remove the potentially offending code because (1) the websites are accessed in a read-only manner, and (2) the fragment may not necessarily be bad; e.g. registry updates may be done for legitimate reasons.

Summary of advantages:
(1) Page source (HTML/ASP etc) is not executed, so there is no risk of damage to the computer. (It can take an hour to run virus scan programs to fix infections; the time previously spent reinstalling damaged software is also no longer lost.)
(2) The information can be summarised without attendance by the user, which is the bulk of the work. The checking of the collected data, however, remains manual.
(3) Much less data is handled. The savings are hard to quantify, as they depend on the amount of graphics etc on each page, but on average a saving of up to 50% is possible.
(4) By law, a minimum of three offending images is required for a site to be reported. Once the user has flagged three images from a site, they can choose to ignore the remainder for that site, which is psychologically easier on the user than having to view a large number of images via Netscape etc.
(5) Each image is recorded only once, further cutting down on traffic, since some websites refer to the same image from multiple pages.

Current limitations: IP addresses are the identifiers for servers, routers and other items on the internet, and a single server can host many websites. This application currently checks only the default website for each IP address and is therefore missing other websites on that server. Research is currently being done to find a workaround for this problem. Because of this problem, the application scans only relative addresses (e.g. href="/mydir/mypage.html"), since it is currently hard to determine whether http://203.201.105.03/mydir/mypage.html is the same as http://www.mysite.com/mydir/mypage.html. A solution has been planned and will be tested and implemented in a later version.
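The core scanning pass described above, collecting relative JPG references and string-searching the page source for stored fragments without ever executing it, can be sketched as follows. The fragment table, its script keys and the code fragments are hypothetical stand-ins, not the robot's actual data:

```python
import re

# Hypothetical fragment table: script key -> (code fragment, description).
FRAGMENTS = {
    "Abcdefghi": ("DeleteFile(", "Deletes essential system files"),
    "Xyzkkkkk": ("RegWrite(", "Corrupts registry"),
}

def scan_page(source):
    """Scan raw page source as plain text; nothing is ever executed.

    Returns the relative JPG references found on the page and any
    harmful-fragment matches, mirroring the robot's two passes.
    """
    # Collect relative JPG references only (absolute URLs are skipped,
    # per the current limitation noted above).
    images = re.findall(r'(?:src|href)="(/[^"]+\.jpe?g)"', source, re.IGNORECASE)
    # Plain string search for each stored fragment.
    hits = [(key, desc) for key, (frag, desc) in FRAGMENTS.items()
            if frag in source]
    return images, hits
```

Because the source is treated purely as text, a page full of malicious script is no more dangerous to scan than an empty one, and no images, sounds or other objects are ever downloaded.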

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 43% of the total text.



This flowchart summarises the main logic behind the robot.

Main Flowchart

[Pages 2 through 12 contain the flowchart diagrams as suppressed images.]

This is also a simple design. The user selects the appropriate IP address from a grid on the screen; a database search locates all images for that IP and displays the first one by loading its full address into a Web Browser Control located on the form. Navigation buttons on the screen control movement through the image records. Other buttons set a status field on each image record to indicate whether it should be included in the report.
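A minimal sketch of this review step, assuming a hypothetical `images` table with `ip_address`, `image_url` and `status` columns (the robot's actual schema is not described):

```python
import sqlite3

# In-memory stand-in for the robot's database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (ip_address TEXT, image_url TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, 'unreviewed')",
    [("111.222.255.255", "/pics/a.jpg"), ("111.222.255.255", "/pics/b.jpg")],
)

def images_for_ip(ip):
    # Selecting an IP in the grid maps to a simple lookup by address.
    return [url for (url,) in conn.execute(
        "SELECT image_url FROM images WHERE ip_address = ?", (ip,))]

def flag_image(ip, url, status):
    # The Report/Ignore buttons set the status field on the record.
    conn.execute(
        "UPDATE images SET status = ? WHERE ip_address = ? AND image_url = ?",
        (status, ip, url),
    )
```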

(3) Producing the picture report. This involves locating the registration details and writing them to a text file with appropriate HTML tags. The image records which have been flagged are written to the same text file, also within appropriate HTML tags. The file is given a ".html" extension so that it can be viewed by an appropriate program (MS Word, Netscape, Internet Explorer etc).

(4) Producing the Javascript report.

This is similar to the generation of the picture report. The details stored in the database during the processing are now displayed in an easily readable format.

A sample is attached here.


Harmful Javascript Report. Run date 22/05/2003

The following IP addresses contained javascript which has been determined to be harmful.

IP Address                    Script Key   Description
111.222.255.255               Abcdefghi    Deletes essential system files
111.222.255.255/mypage.html   Xyzkkkkk     Corrupts registry
                              Abcdefghi    Deletes essential system files

(5) Updating the Javascript fragment table.

This is a very simple procedure: the user is presented with a spreadsheet-like grid which enables them to view existing entries and, if necessary, select one for modification. The user is also able to add new entries or delete existing ones. Each fragment is accompanied by a description, e.g. "corrupts registry"; these descriptions make the javascript report (above) more readable.

User Manual

Basic Usage of the Robot

Part A Registering a site

(1) Obtain the registration details for a site that has been reported, using www.amnesi.com or some other tool.
(2) Copy the registration details onto the clipboard.
(3) Start the application and the main screen will be displayed (see below). Click the Add button and paste these details into the registration field. This is a scrollable fiel...