Configurable Method for Detection of Data Encoding
Original Publication Date: 2002-Oct-20
Included in the Prior Art Database: 2003-Jun-21
Disclosed is a configurable method of detecting the encoding of a datastream. As businesses become more global, the applications and data they use become less likely to be in English and more likely to be encoded in a variety of ways. Detecting the encoding from HTML tags has become increasingly unreliable. For instance, users in a non-English locale create documents in English with tools (MS FrontPage, etc.) that insert language and locale tags from that locale into the HTML markup. For example, a Japanese user creates an English web page with Western encoding, though the document's HTML tags indicate Japan and Japanese. Some web browsers test these tags, and some applications implement a limited number of tests to detect the encoding for display, though none make extensive use of encoding sampling.

Disclosed is a system that employs configurations to test known encodings and samples expected data formats to identify different encodings using heuristic probability tables. This system learns to match encodings with greater accuracy over time.

The csdetect tool relies on internal tables of expected and unexpected characters for each supported language. These tables are in data_(locale).h header files (e.g. data_zh_CN.h). The gendata tool generates these tables from UTF-8 text in the language, so you can "train" the tool for a particular language by using gendata to generate the corresponding header file. A better approach would be more dynamic, storing the information in external files.
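
The table-driven approach described above can be illustrated with a minimal sketch. It is not the csdetect or gendata source. The function names (build_table, score, detect), the choice to count only non-ASCII characters, and the scoring rule (the fraction of decoded non-ASCII characters found in the expected-character table, with everything else treated as unexpected) are all assumptions made for illustration. Python is used for brevity, and the table is held in memory rather than written to a header file.

```python
from collections import Counter

def build_table(sample_text, top_n=64):
    """Hypothetical gendata-like step: collect the most frequent
    non-ASCII characters from training text in the target language."""
    counts = Counter(ch for ch in sample_text if ord(ch) > 127)
    return {ch for ch, _ in counts.most_common(top_n)}

def score(data, encoding, expected):
    """Decode data with one candidate encoding and return the fraction of
    non-ASCII characters that appear in the expected table.
    Returns None if the bytes are not valid in that encoding."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 0.0  # pure ASCII carries no evidence for any table
    hits = sum(1 for ch in non_ascii if ch in expected)
    return hits / len(non_ascii)

def detect(data, candidates):
    """candidates is a list of (encoding, expected_table) pairs.
    Returns the best-scoring (encoding, score); ties keep the earlier entry.
    Returns (None, -1.0) if no candidate can decode the data."""
    best, best_score = None, -1.0
    for encoding, expected in candidates:
        s = score(data, encoding, expected)
        if s is not None and s > best_score:
            best, best_score = encoding, s
    return best, best_score

if __name__ == "__main__":
    # "Train" a table from a small Japanese sample, then test a byte stream
    # whose real encoding is Shift_JIS against several candidates.
    table = build_table("これは日本語のテキストです。文字コードを検出します。")
    data = "日本語のテキスト".encode("shift_jis")
    candidates = [("utf-8", table), ("euc_jp", table),
                  ("latin-1", table), ("shift_jis", table)]
    print(detect(data, candidates))
```

In this sketch a candidate is rejected outright when the bytes cannot be decoded, and a permissive encoding such as latin-1, which accepts any byte sequence, is ruled out only because the characters it produces are absent from the table. Storing the tables in external files, as suggested above, would amount to loading each expected-character set at run time instead of compiling it in.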