
SYSTEM AND METHOD TO AUTOMATICALLY DETECT THE NATIVE LANGUAGE OF TEXT-BASED DOCUMENTS

IP.com Disclosure Number: IPCOM000014493D
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2003-Jun-19

Publishing Venue

IBM

Abstract

The system described in this article relates to Internet search engine technology. Internet search engines usually include an information collection component, also called a "gatherer" or "crawler". This crawler component actively searches the World Wide Web (WWW) for available documents by recursively following the hyperlinks (URLs) found in web-based documents. When a document is found, it can be analyzed and indexed, and its summary information (metadata) is usually stored in a database system. The database system can then be queried by people who are searching for particular information. Note that this description gives only a brief overview of how current Internet search engines work. In most cases the whole process is more complicated and involves more steps in order to improve the index quality. This document provides an overview of the system and comprises the following sections: 1. Problem Statement
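
The crawl, index, and query steps outlined in the abstract can be illustrated with a minimal sketch. It is not the system disclosed here. The in-memory FAKE_WEB table, the example URLs, and the names fetch, crawl, and query are all hypothetical, chosen so the example runs without network access. The stored metadata (a word count and a set of terms) is likewise an assumption, and an explicit queue stands in for the recursive link following.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (document text, outgoing hyperlinks).
# A real crawler would fetch pages over HTTP; this table is an assumption
# made so the sketch is self-contained.
FAKE_WEB = {
    "http://a.example/": (
        "welcome to the search engine demo",
        ["http://a.example/b", "http://a.example/c"],
    ),
    "http://a.example/b": (
        "language detection for text documents",
        ["http://a.example/"],
    ),
    "http://a.example/c": (
        "crawler follows hyperlinks",
        ["http://a.example/b", "http://a.example/missing"],
    ),
}


def fetch(url):
    """Return (text, links) for a URL, or None if the document is unavailable."""
    return FAKE_WEB.get(url)


def crawl(seed):
    """Follow hyperlinks starting at seed and collect per-document metadata."""
    metadata = {}          # stands in for the database system
    seen = {seed}          # avoids visiting the same URL twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue       # dead link: nothing to analyze or index
        text, links = page
        words = text.split()
        # "Analyze and index": keep only summary information per document.
        metadata[url] = {"word_count": len(words), "terms": set(words)}
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return metadata


def query(metadata, term):
    """Return the URLs of all indexed documents that contain the term."""
    return sorted(url for url, meta in metadata.items() if term in meta["terms"])


if __name__ == "__main__":
    db = crawl("http://a.example/")
    print(query(db, "crawler"))
```

A detected-language field would be a natural addition to the per-document metadata in such a pipeline, but how that value is computed is not covered by this sketch.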