SYSTEM AND METHOD TO AUTOMATICALLY DETECT THE NATIVE LANGUAGE OF TEXT-BASED DOCUMENTS

IP.com Disclosure Number: IPCOM000014493D
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2003-Jun-19
Document File: 5 page(s) / 221K

Publishing Venue

IBM

Abstract

System and method to automatically detect the native language of text-based documents

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 40% of the total text.

    The system described in this article relates to Internet search engine technology. Internet search engines usually include an information collection component, also called a "gatherer" or "crawler". This crawler component actively searches the World Wide Web (WWW) for available documents by recursively following the hyperlinks (URLs) in web-based documents. When a document is found, it can be analyzed and indexed, and its summary information (metadata) is usually stored in a database system. The database system can then be queried by people who are searching for particular information. Note that this description gives only a brief overview of how current Internet search engines work. In most cases the whole process is more complicated and involves additional steps that improve index quality.
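The gather, index, and query cycle described above can be illustrated with a minimal sketch. It is not the jCentral implementation: the in-memory `PAGES` table stands in for the Web, the URLs are invented, the function names `crawl` and `search` are hypothetical, and an explicit queue replaces recursion for following links.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse HTML instead.
PAGES = {
    "http://example.org/": ("Home page",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("Page A about applets", ["http://example.org/b"]),
    "http://example.org/b": ("Page B about beans", ["http://example.org/"]),
}


def crawl(start_url, pages):
    """Follow links from start_url and return {url: metadata}."""
    index = {}
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue            # unreachable document, skip it
        text, links = pages[url]
        # Store summary information (metadata) for later queries.
        index[url] = {"length": len(text), "words": text.lower().split()}
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, term):
    """Return the URLs whose stored words contain the query term."""
    term = term.lower()
    return sorted(url for url, meta in index.items() if term in meta["words"])


index = crawl("http://example.org/", PAGES)
print(search(index, "applets"))
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, which is the main practical concern when following hyperlinks on a cyclic link graph.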

This document provides an overview of the system and is organized into the following sections:

1. Problem Statement
2. Proposed Solution
3. Benefits and Advantages

1. Problem Statement

    The main problem the system addresses is that web-based data sources and documents are written in many different languages, but they are all deposited into repositories and indexed using a single default language (mostly English). As a result, a search of the repositories yields results that are far from accurate for non-English text documents.

    To illustrate the problem, consider the following example using jCentral (http://www.ibm.com/developer/java). jCentral is a Java-specific Internet search engine and web crawler that allows software developers to search for Java Applets, Java Source Code, Java Beans, Java-related newsgroup articles, and other Java-related resources.

    When the jCentral gatherer finds a Java Applet, it analyzes the HTML document in which the Java Applet is embedded to collect additional information. The jCentral system uses English as the default language. If the document to be indexed is written in English, there is no problem: the indexing system has knowledge of English syntax, grammar, and semantics and can properly create an automatic index for the document. Thus, in our example, indexing an extracted sentence like "This Java Applet will help you to sort and organize your data using the quick sort algorithm" could lead to the keywords "Java, Applet, sort, organize, data, quick, sort, algorithm". Note that so-called "noise words" are eliminated. The indexing system can eliminate these because it knows the grammar and language structure. In English, for instance, the word "and" is a conjunction and does not necessarily need to be stored in an index. The problem we address appears when the document is not written in English. In our example, the same se...
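The English indexing example in this section can be reproduced with a small sketch. The source says the indexing system removes noise words using its knowledge of grammar; the sketch below substitutes a much simpler technique, a fixed noise-word list. The list contents and the function name `extract_keywords` are assumptions chosen for this one sentence and are not taken from jCentral.

```python
import re

# Assumed English noise-word list; the source does not specify one.
NOISE_WORDS = {"this", "will", "help", "you", "to", "and",
               "your", "using", "the"}


def extract_keywords(sentence, noise_words=NOISE_WORDS):
    """Return the words of the sentence, in order, minus noise words."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words if w.lower() not in noise_words]


sentence = ("This Java Applet will help you to sort and organize "
            "your data using the quick sort algorithm")
keywords = extract_keywords(sentence)
print(", ".join(keywords))
# prints: Java, Applet, sort, organize, data, quick, sort, algorithm
```

Because the noise-word list is English-only, none of the words in a non-English sentence would match it, so every word would be kept as a keyword. That is the kind of degraded index the problem statement describes.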