
Method of Context Aware Codeset Detection Service for Enhanced Performance and Accuracy

IP.com Disclosure Number: IPCOM000244411D
Publication Date: 2015-Dec-09
Document File: 4 page(s) / 80K

Publishing Venue

The IP.com Prior Art Database

Abstract

Described is a rule-based codeset detection method that enhances existing codeset detection services. The illustrative embodiment generates an iconv codeset detection management framework that manages a set of context-aware modules, detecting any codeset based on predefined codeset detection rules and user/application profiles.




A code set (character set, charset, codepage) represents a repertoire of characters through some encoding system. Choosing the right codeset is essential for supporting multilingual features in modern information management. Especially in an information exchange network (such as cross-platform information exchange and network computing), using the right codeset and encoding is the first step in information exchange, security authentication, data transfer, and database access. Efficiently and accurately determining a codeset is critical for real-time information exchange. To prevent data damage and loss, the Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.

    More than 130 codesets are in use across regions, industries, and products because, for historic reasons, many applications and platforms do not use the recommended codeset and encoding (for example, legacy codesets and standards in existing storage for banks and libraries; national standards such as GB18030, JIS X 0213, and ISCII; and technical factors such as efficiency/cost and platform limitations). Using the wrong codeset or an improper encoding setting is a major cause of data corruption in information management, so codeset/encoding must be determined and verified on inbound and outbound data streams. For instance, ICU and Java* provide codeset detection APIs to address the problem, and some major cloud service providers offer codeset detection services. However, these codeset determination methods are either inaccurate or inefficient. There are two major codeset detection methods:

Metadata-based codeset determination is used by the HTTP protocol. It is simple and fast for detecting a codeset, but it requires correct metadata from the request side; if that metadata is missing or inaccurate, users' data will be corrupted. For instance, per the HTTP protocol, the HTML head section can include the meta charset attribute "charset=UTF-8". If the meta charset attribute is missing or wrong, the web content is rendered as garbage characters.
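As an illustration of the metadata-based approach, the following minimal Java sketch (not part of the disclosure) pulls a declared charset from an HTTP Content-Type header or an HTML meta charset attribute; the regular expressions, class name, and the UTF-8 fallback are assumptions for illustration only.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MetadataCharsetSniffer {
        // charset=... inside a Content-Type header value
        private static final Pattern HEADER_CHARSET =
                Pattern.compile("charset=\"?([\\w.:-]+)\"?", Pattern.CASE_INSENSITIVE);
        // <meta charset="..."> inside an HTML head section
        private static final Pattern META_CHARSET =
                Pattern.compile("<meta[^>]+charset=[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

        /** Returns the declared charset, or UTF-8 when no usable metadata exists. */
        public static Charset sniff(String contentTypeHeader, String htmlHead) {
            Charset fromHeader = match(HEADER_CHARSET, contentTypeHeader);
            if (fromHeader != null) return fromHeader;
            Charset fromMeta = match(META_CHARSET, htmlHead);
            if (fromMeta != null) return fromMeta;
            return StandardCharsets.UTF_8; // fallback mirrors the W3C recommendation
        }

        private static Charset match(Pattern p, String text) {
            if (text == null) return null;
            Matcher m = p.matcher(text);
            if (!m.find()) return null;
            try {
                return Charset.isSupported(m.group(1)) ? Charset.forName(m.group(1)) : null;
            } catch (IllegalArgumentException e) {
                return null; // illegal charset name in the metadata
            }
        }

        public static void main(String[] args) {
            System.out.println(sniff("text/html; charset=ISO-8859-1", null)); // ISO-8859-1
            System.out.println(sniff(null, "<meta charset=\"UTF-8\">"));      // UTF-8
        }
    }

As the text notes, this path is fast but only as trustworthy as the metadata it is handed.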

Statistic-based codeset detection is used in many applications and software libraries when the metadata attributes are unavailable. But it is expensive, it lacks statistical modules for every codeset (e.g., the Java API, the file command), and it is less accurate (it cannot reach 100% confidence on the given sample strings). A modern web browser is a good example: it chooses certain sample strings and analyzes encoding patterns byte by byte, an expensive method with no proof of correctness.
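The following minimal Java sketch illustrates statistic-based detection with ICU4J's CharsetDetector, one of the ICU detection APIs the text alludes to; the confidence threshold of 50 is an illustrative assumption, not part of the disclosure.

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.nio.charset.StandardCharsets;

    public class StatisticDetectExample {
        /** Returns ICU's best guess, or null when no match clears the threshold. */
        public static String detect(byte[] sample) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(sample);
            CharsetMatch best = detector.detect();       // highest-confidence match
            if (best == null || best.getConfidence() < 50) {
                return null;                             // no trustworthy match
            }
            return best.getName();                       // e.g. "UTF-8", "Shift_JIS"
        }

        public static void main(String[] args) {
            byte[] utf8 = "こんにちは、世界".getBytes(StandardCharsets.UTF_8);
            System.out.println(detect(utf8));            // likely prints UTF-8
        }
    }

Note that ICU reports confidence on a 0-100 scale rather than a guarantee, which is exactly the "not 100% confidence" limitation described above.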

    Since both current codeset detection methods have significant limitations in covering all codesets, most web browser vendors have added manual codeset selection options that allow users to override an incorrect codeset setting as needed. Therefore, it is necessar...
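Although the text is abbreviated at this point, the abstract outlines the proposed remedy: a management framework that selects context-aware detection modules using predefined rules and user/application profiles. The following Java sketch is speculative scaffolding of that idea only; every name in it (CodesetModule, Profile, DetectionRule, DetectionFramework) is hypothetical and not taken from the disclosure.

    import java.util.List;
    import java.util.Optional;

    // Hypothetical types illustrating the abstract's rule-based dispatch;
    // not the disclosure's actual design.
    interface CodesetModule {
        Optional<String> detect(byte[] data);            // returns a codeset name
    }

    record Profile(String locale, String application) {}

    record DetectionRule(String locale, CodesetModule module) {
        boolean matches(Profile p) { return locale.equals(p.locale()); }
    }

    class DetectionFramework {
        private final List<DetectionRule> rules;
        private final CodesetModule statisticalFallback;

        DetectionFramework(List<DetectionRule> rules, CodesetModule fallback) {
            this.rules = rules;
            this.statisticalFallback = fallback;
        }

        /** Consult profile-matched, context-aware modules first; fall back last. */
        String detect(byte[] data, Profile profile) {
            for (DetectionRule rule : rules) {
                if (rule.matches(profile)) {
                    Optional<String> hit = rule.module().detect(data);
                    if (hit.isPresent()) return hit.get();
                }
            }
            return statisticalFallback.detect(data).orElse("UTF-8");
        }
    }

Under such a scheme, a profile like (locale "ja_JP", application "mail") could route data directly to a module that weighs only Shift_JIS, EUC-JP, and UTF-8 candidates instead of running a full byte-by-byte statistical scan; narrowing the candidate set this way is the kind of performance and accuracy gain the title claims.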