A way to identify character variants defined in each language column in ISO/IEC 10646 and Unicode CJK unified ideographs

IP.com Disclosure Number: IPCOM000013869D
Original Publication Date: 2001-May-26
Included in the Prior Art Database: 2003-Jun-18
Document File: 4 page(s) / 90K

Publishing Venue

IBM

Abstract

Disclosed is an architecture to modify the UCS-4 (Universal Multiple-Octet Coded Character Set ISO/IEC

Disclosed is an architecture to modify the UCS-4 (Universal Multiple-Octet Coded Character Set ISO/IEC 10646) encoding scheme to differentiate the languages: Japanese, Korean, Simplified Chinese and Traditional Chinese.

In the original UCS-4 architecture defined in ISO/IEC 10646, a character is represented by 4 octets (32 bits = 4 bytes). In the disclosed architecture a character is still represented by 4 octets, but the first octet (byte) is used as a character variant identifier. With the first octet serving as a variant identifier, every character encoded in 4 octets can be examined to determine whether it is intended for Japanese, Korean, Simplified Chinese or Traditional Chinese.

For example, the code point X'9AA8' defines the character meaning "bone" in the original UCS-4 (ISO/IEC 10646) architecture. Under the character glyph unification rules, the following four glyph shapes were unified at the code point X'00 00 9AA8'.

[Figure not reproduced: four glyph shapes of the character, labeled S, T, K and J]

Here S, T, K and J denote Simplified Chinese, Traditional Chinese, Korean and Japanese, respectively. Note: The actual glyph shapes were slightly retouched to exaggerate the differences.

Because the code point of the character for bone is represented as X'00 00 9AA8' in every case, it is not possible to tell whether a given occurrence is the form used in Simplified Chinese, Traditional Chinese, Korean or Japanese.
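The ambiguity can be seen directly in the encoded bytes. The following is a minimal Python sketch, assuming Python's built-in utf-32-be codec as a stand-in for the 4-octet big-endian form of UCS-4. The variable names are illustrative only.

```python
# Encode the "bone" ideograph (code point X'9AA8') as four big-endian octets.
# Assumption: the utf-32-be codec stands in for the UCS-4 form discussed here.
bone = chr(0x9AA8)
octets = bone.encode("utf-32-be")

# Show the four octets in hexadecimal.
print(octets.hex(" ").upper())  # 00 00 9A A8

# The first octet is X'00' whichever language the text was written in,
# so the encoded form carries no language information.
first_octet = octets[0]
```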

In the new architecture, for example, X'11' indicates Simplified Chinese, X'22' Traditional Chinese, X'33' Korean and X'44' Japanese. The identifying values can be any from X'00' through X'FF', as long as no two of the four languages share a value.
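The two constraints on the identifier values (each fits in one octet, and no two languages share a value) can be checked mechanically. The sketch below is illustrative only: the function name validate_variant_ids and the table name VARIANT_IDS are hypothetical and do not come from the disclosure, and the values are the example values given above.

```python
def validate_variant_ids(ids):
    """Check that each identifier fits in one octet and that none is shared.

    Hypothetical helper for illustration; not part of the disclosure.
    """
    for language, value in ids.items():
        if not 0x00 <= value <= 0xFF:
            raise ValueError(f"{language}: identifier {value} does not fit in one octet")
    if len(set(ids.values())) != len(ids):
        raise ValueError("two languages share the same identifier")
    return ids


# Example assignment using the values named in the text.
VARIANT_IDS = validate_variant_ids({
    "Simplified Chinese": 0x11,
    "Traditional Chinese": 0x22,
    "Korean": 0x33,
    "Japanese": 0x44,
})
```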

When the first octet X'00' of X'00 00 9AA8' is replaced with the language identifier, X'11 00 9AA8' explicitly represents the Simplified Chinese form, X'22 00 9AA8' the Traditional Chinese form, X'33 00 9AA8' the Korean form, and X'44 00 9AA8' the Japanese form. The language can therefore be identified by checking the value of the first byte, and the data can be presented differently according to that byte.
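The replacement and the first-byte check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the disclosed implementation: the function names tag and identify are hypothetical, characters are handled as 32-bit integers, and the identifier values are the example values from the text.

```python
# Example identifier values from the text; any distinct one-octet values would do.
VARIANT_IDS = {
    "Simplified Chinese": 0x11,
    "Traditional Chinese": 0x22,
    "Korean": 0x33,
    "Japanese": 0x44,
}
LANGUAGE_BY_ID = {value: name for name, value in VARIANT_IDS.items()}


def tag(code_point, language):
    """Replace the first octet of a 4-octet value with the language identifier."""
    return (VARIANT_IDS[language] << 24) | (code_point & 0x00FFFFFF)


def identify(value):
    """Return (language, code point with the first octet cleared).

    The language is None when the first octet is not a known identifier.
    """
    return LANGUAGE_BY_ID.get(value >> 24), value & 0x00FFFFFF


tagged = tag(0x9AA8, "Japanese")
print(f"{tagged:08X}")  # 44009AA8

language, code_point = identify(tagged)
print(language, f"{code_point:04X}")  # Japanese 9AA8
```

Clearing the first octet in identify recovers the original code point, so a renderer could look up the character as before and use the language only to choose the glyph shape.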

Multilingual processing thus becomes easier than under the existing architecture.

The rationale to use the first octet for any special purposes su...