Publication Date: 2011-Aug-17
Disclosed is a method for generating a site map by crawling a website with any existing crawling technique. For each page, the generator reports which web technologies are in use: it scans the page source for tags and for indications of the content generation methods, type of programming used, database queries, multimedia representation, etc.

A Method for Site Map Generation with Deep Content Analysis

To perform website work for a customer, developers and consultants must first gather information to estimate and size the work. Current site maps and generators only produce a tree of links for the website. No known site map tool performs content analysis that tells a developer what type of data the site uses. Today, developers perform the content analysis manually: visiting every page in the site, following every link, and inspecting the page source. This manual process can at times introduce a high rate of human error. In addition, the consultant must at times determine which pages are practically the same and will require only minimal customization.
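The judgment about which pages are "practically the same" could plausibly be approximated by comparing page structure rather than page text. The following is a minimal sketch under that assumption, not the disclosed method: the names `structure_signature` and `group_similar_pages` are hypothetical, and requiring an identical sequence of opening tags is a deliberately strict stand-in for whatever similarity measure a real tool would use.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records the order of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(tag)


def structure_signature(source):
    """Hash the sequence of opening tags in an HTML string (hypothetical helper)."""
    parser = TagSequence()
    parser.feed(source)
    parser.close()
    joined = "/".join(parser.sequence)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def group_similar_pages(pages):
    """Group URLs whose pages share the same tag structure.

    pages: dict mapping URL -> HTML source.
    Returns a list of URL lists, one per distinct structure.
    """
    groups = defaultdict(list)
    for url, source in pages.items():
        groups[structure_signature(source)].append(url)
    return list(groups.values())
```

Two product pages built from the same template would fall into one group even though their text differs, while a page with a different layout would form its own group. A looser measure would be needed to catch pages that differ by a few tags.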

For instance, to build an eCommerce product for a customer, developers must examine the customer's current website and estimate its level of complexity and how many hours of labor it will take to migrate it to the eCommerce product. If documentation on the website design exists, it may be outdated and not nearly extensive enough to help a consultant arrive at an accurate estimate.

An automated method is needed to gather this type of data and help the consultant offer a more realistic estimate of what the job entails and the number of work hours it requires.
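As an illustration of the kind of data such an automated method might gather for a single page, the sketch below scans page source for tags and reports a few technology indicators. It is an assumption-laden example, not the disclosed implementation: the `analyze_page` function, the `SERVER_HINTS` extension mapping, and the `MEDIA_TAGS` set are all hypothetical, and guessing server-side technology from the URL extension is only a rough heuristic.

```python
import posixpath
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative heuristics only; these mappings are assumptions.
SERVER_HINTS = {".php": "PHP", ".jsp": "JSP", ".aspx": "ASP.NET"}
MEDIA_TAGS = {"img", "video", "audio", "object", "embed"}


class PageScanner(HTMLParser):
    """Counts opening tags and collects link targets for one page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)


def analyze_page(url, source):
    """Return a per-page report of technology indicators (hypothetical format)."""
    scanner = PageScanner()
    scanner.feed(source)
    scanner.close()
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    counts = scanner.tag_counts
    return {
        "url": url,
        "server_technology": SERVER_HINTS.get(extension, "unknown"),
        "uses_scripts": counts["script"] > 0,
        "form_count": counts["form"],
        "media_count": sum(counts[tag] for tag in MEDIA_TAGS),
        "links": scanner.links,  # candidates for the crawler to follow next
    }
```

A crawler would call `analyze_page` on each fetched page and attach the resulting report to that page's node in the site map, so the link tree carries content information as well as structure.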

Known solutions provide only a skeleton of the website, reporting how many links are present in a site but nothing about the actual content or the technology behind it. Searches have not produced any prior art for the analysis of website content. There are examples of patents for the graphical representation of links in a website and for the statistical analysis of usage and web page link frequency.

The idea is unique in that it...