Method of Converting UTF-8 to/from EBCDIC Using an "Escape Character"

IP.com Disclosure Number: IPCOM000123958D
Original Publication Date: 1999-Aug-01
Included in the Prior Art Database: 2005-Apr-05
Document File: 4 page(s) / 216K

Publishing Venue

IBM

Related People

Hahn, TJ: AUTHOR

Abstract

The EBCDIC-based string format presented below allows for a non-lossy, one-to-one and onto conversion of data in UTF-8 to a "portable" EBCDIC-based string (and back to UTF-8). This conversion, while not optimized for string lengths, allows programs which do string operations against EBCDIC-based character sets to operate on UTF-8 data. Systems which require data in EBCDIC-based character sets can be used to process UTF-8 data that has been passed through this conversion. Since the conversion to the EBCDIC-based character set is non-lossy, the original UTF-8 string can be reconstructed without loss of data.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 32% of the total text.

   Support for international character sets in programs has
required support for locale-based operations.  This has
usually entailed setting the locale in effect for the program using
the setlocale() function call, separating all message texts into
message catalogs, and using different message catalogs based on the
locale in effect on the target system.  This usually covers the needs
of the program for supplying message information to users.  Another
aspect of handling international characters is that of handling
incoming requests from other computer systems in a global network
where client systems are, in general, running in a different locale
than the server system they are contacting.  In these cases, the
typical solution is for the client and server programs to agree on a
character set that both can support, and for each to convert data to
this common character set before sending it and back to its own
character set after receiving it.  A problem occurs when the only
mutually agreeable character
sets cannot express all characters of the respective target operating
environments.  In this case, conversions will be "lossy", meaning
that the conversion between character sets is not one-to-one and
onto.
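
   The following is a minimal sketch of what a "lossy" conversion
looks like in practice.  It is an illustration only: the helper name
round_trips and the choice of Python's "cp037" codec (one EBCDIC code
page) are assumptions made here, not taken from the text above.

```python
def round_trips(text: str, charset: str) -> bool:
    """Return True if text survives a conversion to charset and back."""
    # errors="replace" substitutes '?' for any character the charset
    # lacks, so the original character cannot be recovered afterwards.
    converted = text.encode(charset, errors="replace")
    return converted.decode(charset) == text


# "cp037" (an EBCDIC code page) is an illustrative choice only.
print(round_trips("HELLO", "cp037"))         # all characters are representable
print(round_trips("\u03a9 HELLO", "cp037"))  # U+03A9 (Greek omega) is not
```

   A conversion that is one-to-one and onto would return True for
every input string; the second call shows a case where it does not.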

   A common solution to this problem is to define "wire
format" data in terms of Unicode (UCS-2) characters, a 2-byte
character definition that covers all glyphs (visual characters)
known on Earth.  Since Unicode characters can contain embedded zero
bytes (whenever the UCS code is less than 256, for example), the
encoding does not lend itself to processing in terms of C and C++
language operations, which expect to operate on NULL-terminated
character strings.  As a solution to this problem, an encoding of
UCS-2 has been defined which both optimizes the length of strings of
UCS-2 characters and removes embedded NULL characters from
UCS-2 strings.  The encoding, known as UTF-8, is a variable-length
encoding of UCS-2.  UTF-8 has an additional feature that for
characters in the range 0x00 - 0x7F, the encoding is exactly the
same as 7-bit ASCII.
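
   The difference between the two encodings can be seen in a short
sketch.  Python has no codec named UCS-2, so "utf-16-be" is used here
as a stand-in; it is assumed to produce the same bytes as big-endian
UCS-2 for the characters shown.  The sample string is illustrative.

```python
text = "Az\u20ac"  # 'A', 'z', and the euro sign (U+20AC)

ucs2 = text.encode("utf-16-be")  # stand-in for big-endian UCS-2
utf8 = text.encode("utf-8")

print(ucs2.hex(" "))  # two bytes per character; 'A' and 'z' carry a zero byte
print(utf8.hex(" "))  # one byte each for 'A' and 'z', three for the euro sign

# A zero byte would end a C string early; UTF-8 output contains none here.
print(0 in ucs2, 0 in utf8)

# For characters in the range 0x00 - 0x7F, UTF-8 matches 7-bit ASCII.
print(utf8[:2] == "Az".encode("ascii"))
```

   The zero bytes in the first form are what C string functions
would treat as a terminator, which is the problem UTF-8 avoids.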

   The affinity of UTF-8 data to ASCII is very useful on
platforms whose character sets are ASCII or are built from the
basic ASCII character set.  However, on EBCDIC-based
systems, UCS-2 and UTF-8 handling can...