UTF-16, an encoding of ISO 10646 (RFC2781)
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2000-Sep-13
Internet Society Requests For Comment (RFCs)
P. Hoffman: AUTHOR [+2]
This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet, discusses MIME charset naming as described in [CHARSET-REG], and contains the registration for three MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.
Network Working Group P. Hoffman
Request for Comments: 2781 Internet Mail Consortium
Category: Informational F. Yergeau
UTF-16, an encoding of ISO 10646
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright (C) The Internet Society (2000). All Rights Reserved.
1.1 Background and motivation
The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [WORKSHOP].
UTF-16, the object of this specification, is one of the standard ways
of encoding Unicode character data. It encodes every currently defined
character (all in plane 0, the BMP) in exactly two octets and can
encode all other characters likely to be defined (the next 16 planes)
in exactly four octets.
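The two-octet and four-octet property can be checked with a short Python sketch. It is not part of this specification: it assumes Python's built-in "utf-16-be" codec as the encoder, and the names utf16_octet_count, PLANE_SIZE, and LAST_CODE_POINT are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only; relies on Python's built-in "utf-16-be" codec.
# All names below are hypothetical helpers, not defined by this document.

PLANE_SIZE = 0x10000                    # 65,536 code points per plane
LAST_CODE_POINT = 17 * PLANE_SIZE - 1   # last code point in the 17 planes


def utf16_octet_count(code_point: int) -> int:
    """Return how many octets the UTF-16BE codec emits for one code point.

    Surrogate code points (0xD800-0xDFFF) are not characters and are
    rejected by the codec, so they must not be passed in.
    """
    return len(chr(code_point).encode("utf-16-be"))


# Two values in plane 0 (the BMP) and two in the higher planes.
for cp in (0x013C, 0xFFFD, 0x10000, LAST_CODE_POINT):
    print(f"0x{cp:04X}: {utf16_octet_count(cp)} octets")
```

The first two values lie in the BMP and the last two in the higher planes, so the loop should report two octets for the former and four for the latter.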
The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up
to the present time, changes in Unicode and amendments to ISO/IEC
10646 have tracked each other, so that the character repertoires and
code point assignments have remained in sync. The relevant
standardization committees have committed to maintaining this
synchronization, and to not assigning characters outside of
the 17 planes accessible to UTF-16.
The IETF policy on character sets and languages [CHARPOLICY] says
that IETF protocols MUST be able to use the UTF-8 character encoding
scheme [UTF-8]. Some products and network standards already specify
UTF-16, making it an important encoding for the Internet. This
document is not an update to the [CHARPOLICY] document, only a
description of the UTF-16 encoding.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD].
Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" is the character assigned the integer
value 316 (decimal) in the CCS.
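The notation can be confirmed with a few lines of Python, given here only as an illustrative check and not as part of this specification. The variable name value is arbitrary.

```python
# Illustrative check of the hexadecimal notation used in this document.
value = int("013C", 16)          # parse the hexadecimal digits

# 0x013C = 1*256 + 3*16 + 12 = 316
assert value == 0x013C == 316

print(f"0x{value:04X} = {value} (decimal)")   # prints: 0x013C = 316 (decimal)
```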
2. UTF-16 definition
UTF-16 is d...