
Configurable Method for Detection of Data Encoding
Disclosure Number: IPCOM000016118D
Original Publication Date: 2002-Oct-20
Included in the Prior Art Database: 2003-Jun-21
Document File: 2 page(s) / 41K

Publishing Venue



Disclosed is a configurable method of detecting the encoding of a datastream.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.


As businesses become more global, the applications and data used in those businesses become less likely to be in English and more likely to use different encodings. Detecting the encoding from HTML tags has become increasingly unreliable. For instance, users in a non-English locale create documents in English with tools (MS FrontPage etc.) that insert language and locale tags from that locale into the HTML markup. For example, a Japanese user creates an English web page with Western encoding, though the document's HTML tags specify Japan and Japanese.

Some web browsers test these tags, and some applications implement a limited number of tests to detect the encoding and display the content, though none makes extensive use of encoding sampling. Disclosed is a system that employs configurations to test known encodings and samples expected data formats to identify different encodings using heuristic probability tables. The system learns to match encodings with greater accuracy over time.
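As a rough illustration of testing a configured list of known encodings, the sketch below tries a strict decode under each candidate and keeps the ones that do not fail. The function name plausible_encodings and the candidate list are assumptions made for this example; they are not part of the disclosed tool.

```python
def plausible_encodings(data: bytes, candidates):
    """Return the candidate encodings under which data decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


# Hypothetical configuration: the encodings to test, in order.
configured = ["ascii", "utf-8", "shift_jis", "latin-1"]
raw = "日本語".encode("shift_jis")
result = plausible_encodings(raw, configured)
print(result)  # ['shift_jis', 'latin-1']
```

Note that latin-1 survives because every byte value is valid in it. Validity checks alone therefore cannot separate the remaining candidates, which is where sampling of expected character frequencies becomes useful.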

The csdetect tool relies on internal tables of expected and unexpected characters for each supported language. These tables reside in data_(locale).h header files (e.g. data_zh_CN.h). The gendata tool generates these tables from UTF-8 text in the language, so you can "train" the tool for a particular language by using gendata to generate the corresponding header file. A more dynamic approach would store this information in external files instead of header files.
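The training step can be pictured with a minimal sketch that counts the characters in UTF-8 sample text and renders the counts as C header text. This is not the actual gendata tool: the function names, the array name freq_(locale), and the header layout are all hypothetical, and skipping ASCII is a simplification made here.

```python
from collections import Counter


def count_characters(utf8_bytes: bytes) -> Counter:
    """Count the non-ASCII characters in UTF-8 training text."""
    text = utf8_bytes.decode("utf-8")
    return Counter(ch for ch in text if ord(ch) > 0x7F)


def emit_header(locale: str, counts: Counter) -> str:
    """Render the counts as C header text (hypothetical layout)."""
    lines = [
        f"/* data_{locale}.h: generated table, hypothetical layout */",
        f"static const unsigned int freq_{locale}[][2] = {{",
    ]
    # Most frequent characters first; ties broken by code point.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"    {{0x{ord(ch):04X}, {n}}},")
    lines.append("};")
    return "\n".join(lines)


sample = "中文文本".encode("utf-8")
header = emit_header("zh_CN", count_characters(sample))
print(header)
```

Writing the same counts to an external data file instead of a header would be one way to realize the more dynamic approach mentioned above.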

Unexpected characters are those that were never seen in the input text. Expected characters are those whose frequency of occurrence is significant. ASCII characters and surrogates are ignored in all of the...
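The definitions above can be turned into a small scoring sketch. The significance threshold (min_share), the +1/-1 weighting, and every name below are assumptions chosen for illustration; the source does not specify how csdetect weighs expected and unexpected characters.

```python
from collections import Counter


def is_counted(ch: str) -> bool:
    """ASCII characters and surrogate code points are ignored."""
    cp = ord(ch)
    return cp > 0x7F and not (0xD800 <= cp <= 0xDFFF)


def build_tables(training_text: str, min_share: float = 0.01):
    """Return (expected, seen) character sets from training text.

    min_share is an assumed cutoff for a "significant" frequency.
    """
    counts = Counter(ch for ch in training_text if is_counted(ch))
    total = sum(counts.values())
    expected = {ch for ch, n in counts.items() if n / total >= min_share}
    return expected, set(counts)


def score(text: str, expected: set, seen: set) -> int:
    """Add 1 per expected character, subtract 1 per never-seen character."""
    total = 0
    for ch in text:
        if not is_counted(ch):
            continue
        if ch in expected:
            total += 1
        elif ch not in seen:
            total -= 1
    return total


expected, seen = build_tables("日本語の文章です。日本語")
raw = "日本語".encode("shift_jis")
good = score(raw.decode("shift_jis"), expected, seen)
bad = score(raw.decode("latin-1"), expected, seen)
print(good, bad)
```

Under these assumptions the correct decoding scores higher than the wrong one, because the wrong decoding produces only characters that never appeared in the training text.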