
UTF-16, an encoding of ISO 10646 (RFC2781)

IP.com Disclosure Number: IPCOM000003380D
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2000-Sep-13
Document File: 11 page(s) / 28K

Publishing Venue

Internet Society Requests For Comment (RFCs)

Related People

P. Hoffman: AUTHOR [+2]

Abstract

This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet, discusses MIME charset naming as described in [CHARSET-REG], and contains the registration for three MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.

This text was extracted from an ASCII text document.
This is the abbreviated version, containing approximately 11% of the total text.

Network Working Group                                         P. Hoffman
Request for Comments: 2781                      Internet Mail Consortium
Category: Informational                                       F. Yergeau
                                                       Alis Technologies
                                                           February 2000

UTF-16, an encoding of ISO 10646

Status of this Memo

This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2000). All Rights Reserved.

1. Introduction

This document describes the UTF-16 encoding of Unicode/ISO-10646,
addresses the issues of serializing UTF-16 as an octet stream for
transmission over the Internet, discusses MIME charset naming as
described in [CHARSET-REG], and contains the registration for three
MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
(little-endian), and UTF-16.
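
As a rough illustration of what serializing UTF-16 as an octet stream means, the sketch below encodes one character in each byte order. It assumes Python's built-in codec names "utf-16-be", "utf-16-le", and "utf-16" as stand-ins for the three charset labels; those names and the behavior of the "utf-16" codec belong to Python, not to this document.

```python
# Illustrative sketch: Python codec names stand in for the three charset labels.
text = "A"  # the character with value 0x0041

big_endian = text.encode("utf-16-be")     # most significant octet first
little_endian = text.encode("utf-16-le")  # least significant octet first
with_bom = text.encode("utf-16")          # Python's codec prepends a byte order mark

print(big_endian.hex())     # 0041
print(little_endian.hex())  # 4100
print(with_bom[:2].hex())   # feff or fffe, depending on the platform's byte order
```

The same character yields different octet sequences in the two byte orders, which is why a receiver needs either a label or a byte order mark to interpret the stream.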

1.1 Background and motivation

The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [WORKSHOP].
UTF-16, the object of this specification, is one of the standard ways
of encoding Unicode character data; it has the characteristics of
encoding all currently defined characters (in plane 0, the BMP) in
exactly two octets and of being able to encode all other characters
likely to be defined (the next 16 planes) in exactly four octets.
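
The two-octet and four-octet sizes can be checked with a minimal sketch. The helper name utf16_octet_count is hypothetical, and the sketch relies on Python's "utf-16-be" codec, which adds no byte order mark.

```python
def utf16_octet_count(ch: str) -> int:
    """Return the number of octets used to encode one character in UTF-16."""
    # Hypothetical helper; the big-endian codec emits no byte order mark,
    # so the length reflects the character alone.
    return len(ch.encode("utf-16-be"))

bmp_char = "\u013C"           # a character in plane 0 (the BMP)
first_beyond_bmp = "\U00010000"  # first code point of plane 1
last_code_point = "\U0010FFFF"   # last code point of plane 16

print(utf16_octet_count(bmp_char))          # 2
print(utf16_octet_count(first_beyond_bmp))  # 4
print(utf16_octet_count(last_code_point))   # 4
```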

The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up
to the present time, changes in Unicode and amendments to ISO/IEC
10646 have tracked each other, so that the character repertoires and
code point assignments have remained in sync. The relevant
standardization committees have committed to maintain this very
useful synchronism, as well as not to assign characters outside of
the 17 planes accessible to UTF-16.

The IETF policy on character sets and languages [CHARPOLICY] says
that IETF protocols MUST be able to use the UTF-8 character encoding
scheme [UTF-8]. Some products and network standards already specify
UTF-16, making it an important encoding for the Internet. This
document is not an update to the [CHARPOLICY] document, only a
description of the UTF-16 encoding.

1.2 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" denotes the character assigned the
integer value 316 (decimal) in the CCS.
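
The correspondence between the hexadecimal notation and the decimal value can be verified with a short sketch. The variable names are illustrative, and the character name is looked up in Python's unicodedata module rather than taken from this document.

```python
import unicodedata

value = 0x013C                 # hexadecimal notation as used in this document
notation = f"0x{value:04X}"    # format the integer back into that notation

print(value)     # 316
print(notation)  # 0x013C
print(unicodedata.name(chr(value)))  # name of the character assigned this value
```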

2. UTF-16 definition

UTF-16 is d...