
UTF-9 and UTF-18 Efficient Transformation Formats of Unicode (RFC4042)

IP.com Disclosure Number: IPCOM000117115D
Original Publication Date: 2005-Apr-01
Included in the Prior Art Database: 2019-Feb-12
Document File: 9 page(s) / 14K

Publishing Venue

Internet Society Requests For Comment (RFCs)

Related People

M. Crispin: AUTHOR

Related Documents

10.17487/RFC4042: DOI

Abstract

ISO-10646 defines a large character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. The same set of codepoints is defined by Unicode, which further defines additional character properties and other implementation details. By policy of the relevant standardization committees, changes to Unicode and amendments and additions to ISO/IEC 10646 track each other, so that the character repertoires and code point assignments remain in synchronization. The current representation formats for Unicode (UTF-7, UTF-8, UTF-16) are not storage and computation efficient on platforms that utilize the 9 bit nonet as a natural storage unit instead of the 8 bit octet. This document describes a transformation format of Unicode that takes advantage of the nonet so that the format will be storage and computation efficient. This memo provides information for the Internet community.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 18% of the total text.

Network Working Group                                         M. Crispin
Request for Comments: 4042                             Panda Programming
Category: Informational                                     1 April 2005

UTF-9 and UTF-18 Efficient Transformation Formats of Unicode

Status of This Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

ISO-10646 defines a large character set called the Universal Character Set (UCS), which encompasses most of the world’s writing systems. The same set of codepoints is defined by Unicode, which further defines additional character properties and other implementation details. By policy of the relevant standardization committees, changes to Unicode and amendments and additions to ISO/IEC 10646 track each other, so that the character repertoires and code point assignments remain in synchronization.

The current representation formats for Unicode (UTF-7, UTF-8, UTF-16) are not storage and computation efficient on platforms that utilize the 9 bit nonet as a natural storage unit instead of the 8 bit octet.

This document describes a transformation format of Unicode that takes advantage of the nonet so that the format will be storage and computation efficient.

1. Introduction

A number of Internet sites utilize platforms that are not based upon the traditional 8-bit byte or octet. One such platform is the PDP-10, which is based upon a 36-bit word. On these platforms, it is wasteful to represent data in octets, since 4 bits are left unused in each word. The 9-bit nonet is a much more sensible representation.
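
The packing arithmetic behind this claim can be checked with a short sketch. It is an illustration only, not part of the memo; the names WORD_BITS, units_per_word, and unused_bits are hypothetical, and the only figures taken from the text are the 36-bit word, the 8-bit octet, and the 9-bit nonet.

```python
# Illustrative sketch: how many whole storage units fit in one word,
# and how many bits are left over. Names here are hypothetical.

WORD_BITS = 36  # PDP-10 word size, as stated in the text


def units_per_word(word_bits: int, unit_bits: int) -> int:
    """Number of whole units of unit_bits that fit in one word."""
    return word_bits // unit_bits


def unused_bits(word_bits: int, unit_bits: int) -> int:
    """Bits left unused after packing as many whole units as fit."""
    return word_bits % unit_bits


if __name__ == "__main__":
    for name, unit_bits in (("octet", 8), ("nonet", 9)):
        print(
            f"{name}: {units_per_word(WORD_BITS, unit_bits)} per word, "
            f"{unused_bits(WORD_BITS, unit_bits)} bits unused"
        )
    # octet: 4 per word, 4 bits unused
    # nonet: 4 per word, 0 bits unused
```

Both units fit four to a word, so the difference lies entirely in the four bits that octets leave idle.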

Although these platforms support IETF standards, many of these platforms still utilize a text representation based upon the septet, which is only suitable for [US-ASCII] (although it has been used for various ISO 10646 national variants).

To maximize international and multi-lingual interoperability, the IAB has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default coded character set.

Although other transformation formats of [UNICODE] exist, and conceivably can be used on nonet-oriented machines (most notably [UTF-8]), they suffer significant disadvantages:

[UTF-8] requires one to three octets to represent codepoints in the Basic Multilingual Plane (BMP), four octets to represent [UNICODE] codepoints outside the BMP, and six octets to represent non-[UNICODE] codepoints. When stored in nonets, this results in as many as four wasted bits per [UNICODE] character.

[UTF-16] requires a hexadecet to represent codepoints in the BMP, and two hexadecets to represent [UNICODE] codepoints outside the BMP. When stored in nonet pairs, this results in as many as four wasted bits per [UNICODE] character. This transformation format requires complex surrogates to represent codepoints outside the BMP, and can not represent non-[UNICODE] codepoints at all.

[UTF-7] requires one to five...
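
The wasted-bit figures given above for [UTF-8] and [UTF-16] can be reproduced with a small sketch. It is an illustration under stated assumptions, not part of the memo: it assumes each octet is stored in its own nonet and each hexadecet in its own nonet pair, and the function names are hypothetical. It relies on Python's built-in codecs, which cover only the Unicode range, so the six-octet non-Unicode forms of UTF-8 are not exercised.

```python
# Illustrative sketch: bits wasted per character when UTF-8 octets and
# UTF-16 hexadecets are stored in 9-bit nonets. Assumes one octet per
# nonet and one hexadecet per nonet pair; names are hypothetical.

NONET_BITS = 9


def utf8_wasted_bits(codepoint: int) -> int:
    """Each octet occupies one nonet, leaving 1 bit unused per octet."""
    octets = len(chr(codepoint).encode("utf-8"))
    return octets * (NONET_BITS - 8)


def utf16_wasted_bits(codepoint: int) -> int:
    """Each hexadecet occupies a nonet pair, leaving 2 bits unused."""
    hexadecets = len(chr(codepoint).encode("utf-16-be")) // 2
    return hexadecets * (2 * NONET_BITS - 16)


if __name__ == "__main__":
    # Sample codepoints: ASCII, Latin-1, another BMP character, non-BMP.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}: UTF-8 wastes {utf8_wasted_bits(cp)} bits, "
              f"UTF-16 wastes {utf16_wasted_bits(cp)} bits")
    # U+0041: UTF-8 wastes 1 bits, UTF-16 wastes 2 bits
    # U+00E9: UTF-8 wastes 2 bits, UTF-16 wastes 2 bits
    # U+20AC: UTF-8 wastes 3 bits, UTF-16 wastes 2 bits
    # U+1F600: UTF-8 wastes 4 bits, UTF-16 wastes 4 bits
```

In both formats the worst case for a Unicode character is a codepoint outside the BMP, which matches the "as many as four wasted bits" stated for each.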
