UTF-8, a transformation format of ISO 10646 (RFC2279)

IP.com Disclosure Number: IPCOM000002840D
Original Publication Date: 1998-Jan-01
Included in the Prior Art Database: 2000-Sep-13
Document File: 8 page(s) / 20K

Publishing Venue

Internet Society Requests For Comment (RFCs)

Related People

F. Yergeau: AUTHOR

This text was extracted from an ASCII document.
This is the abbreviated version, containing approximately 14% of the total text.

Network Working Group                                         F. Yergeau
Request for Comments: 2279                             Alis Technologies
Obsoletes: 2044                                             January 1998
Category: Standards Track

UTF-8, a transformation format of ISO 10646

Status of this Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1998). All Rights Reserved.

Abstract

ISO/IEC 10646-1 defines a multi-octet character set called the
Universal Character Set (UCS) which encompasses most of the world's
writing systems. Multi-octet characters, however, are not compatible
with many current applications and protocols, and this has led to the
development of a few so-called UCS transformation formats (UTF), each
with different characteristics. UTF-8, the object of this memo, has
the characteristic of preserving the full US-ASCII range, providing
compatibility with file systems, parsers and other software that rely
on US-ASCII values but are transparent to other values. This memo
updates and replaces RFC 2044, in particular addressing the question
of versions of the relevant standards.
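
The US-ASCII preservation property described above can be illustrated with a
short Python sketch. It is not part of the memo; the sample strings and
variable names are made up for illustration, and the sketch relies on
Python's built-in "utf-8" codec rather than on any encoder defined here.

```python
# Illustrative sketch only: sample strings and names are assumptions.

# A pure US-ASCII string encodes to exactly the same octets in UTF-8.
ascii_text = "/usr/local/bin"
utf8_bytes = ascii_text.encode("utf-8")
assert utf8_bytes == ascii_text.encode("ascii")

# A non-ASCII character (here U+00E9) is encoded using only octets with
# the high bit set, so software scanning for a US-ASCII value such as
# "/" sees exactly one match in the encoded form.
mixed = "caf\u00e9/menu"
encoded = mixed.encode("utf-8")
ascii_octets = bytes(b for b in encoded if b < 0x80)

print(encoded.hex(" "))
print(ascii_octets)
```

Filtering on octets below 0x80 recovers only the US-ASCII characters of the
input, which is the behaviour that software relying on US-ASCII values but
transparent to other values depends on.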

1. Introduction

ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
called the Universal Character Set (UCS), which encompasses most of
the world's writing systems. Two multi-octet encodings are defined,
a four-octet per character encoding called UCS-4 and a two-octet per
character encoding called UCS-2, able to address only the first 64K
characters of the UCS (the Basic Multilingual Plane, BMP), outside of
which there are currently no assignments.
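
As a rough illustration of the two fixed-width encodings just described, the
following Python sketch packs a character's code point into four octets
(UCS-4) or two octets (UCS-2). The function names are hypothetical, and
big-endian octet order is an assumption made for the example, not something
stated in this section.

```python
import struct

# Hypothetical helper names; big-endian octet order is assumed.

def ucs4_octets(ch: str) -> bytes:
    """Return the four-octet form of a single character."""
    return struct.pack(">I", ord(ch))

def ucs2_octets(ch: str) -> bytes:
    """Return the two-octet form; only BMP code points (<= 0xFFFF) fit."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError(f"U+{code_point:04X} is outside the BMP")
    return struct.pack(">H", code_point)

print(ucs4_octets("A").hex())       # 00000041
print(ucs2_octets("A").hex())       # 0041
print(ucs2_octets("\u00e9").hex())  # 00e9
```

Even a plain US-ASCII letter such as "A" occupies two or four octets here,
most of them zero, and a code point above 0xFFFF cannot be represented in
the two-octet form at all.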

It is noteworthy that the same set of characters is defined by the
Unicode standard [UNICODE], which further defines additional
character properties and other application details of great interest
to implementors, but does not have the UCS-4 encoding. Up to the
present time, changes in Unicode and amendments to ISO/IEC 10646 have
tracked each other, so that the character repertoires and code point
assignments have remained in sync. The relevant standardization
committees have committed to maintain this very useful synchronism.

The UCS-2 and UCS-4 encodings, however, are hard to use in many
current applications and protocols that assume 8 or even 7 bit
c...