
Method to Make Double-Byte and Single-Byte Identifiers URL Addressable

IP.com Disclosure Number: IPCOM000031664D
Original Publication Date: 2004-Oct-04
Included in the Prior Art Database: 2004-Oct-04
Document File: 3 page(s) / 36K

Publishing Venue

IBM

Abstract

The HTTP 1.1 Specification (RFC 2616) specifies that only a subset of 8-bit ASCII characters is valid in a URL. This restriction becomes a problem when a URI containing double-byte characters (Japanese characters, Chinese characters, etc.) needs to be addressable. For instance, suppose a web-based document management system has two files, english.html and japanese.html (assume the characters in 'japanese.html' are actually Japanese characters). Accessing the files using the URLs http://documentmanagementsystem.com/english.html and http://documentmanagementsystem.com/japanese.html would violate the HTTP 1.1 spec because the Japanese characters in the second URL are not 8-bit ASCII characters. This article proposes a method that replaces the path and filename part of the URL with a GUID identifier in such a way that any embedded links (e.g. relative links) resolved by the client program (browser) from within the document result in a correct and manageable URL to the linked documents.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 42% of the total text.


The HTTP 1.1 Specification (RFC 2616) specifies that only a subset of 8-bit ASCII characters is valid in a URL. This restriction becomes a problem when a URI containing double-byte characters (Japanese characters, Chinese characters, etc.) needs to be addressable. For instance, suppose a web-based document management system has two files, english.html and japanese.html (assume the characters in 'japanese.html' are actually Japanese characters). Accessing the files using the URLs http://documentmanagementsystem.com/english.html and http://documentmanagementsystem.com/japanese.html would violate the HTTP 1.1 spec because the Japanese characters in the second URL are not 8-bit ASCII characters.
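The restriction can be illustrated with a minimal sketch. The helper name is_ascii_only and the file name 日本語.html are assumptions introduced here as a stand-in for the Japanese file name; the check is only a necessary condition, since a valid URL is further limited to a subset of ASCII.

```python
def is_ascii_only(url: str) -> bool:
    """Return True if every character in the URL can be encoded as ASCII."""
    try:
        url.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


BASE = "http://documentmanagementsystem.com/"

english_url = BASE + "english.html"
# Hypothetical stand-in for the real Japanese file name (an assumption).
japanese_url = BASE + "日本語.html"

print(is_ascii_only(english_url))
print(is_ascii_only(japanese_url))
```

Under these assumptions the first URL passes the check and the second does not, which is the situation the disclosure sets out to solve.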

The initial solution to this problem is to use another valid identifier for the documents. Assume that in our document management system every document has a system identifier that consists of ASCII characters, and that the system identifiers are '1' for 'english.html' and '2' for 'japanese.html'. The files can now be addressed using the valid URLs http://documentmanagementsystem.com/1 and http://documentmanagementsystem.com/2. The system uses the id to resolve the actual file and deliver its content.
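The id-based lookup might be sketched as follows. This is an illustration only, not the disclosed implementation: the table DOCUMENT_PATHS, the function resolve_request, and the storage paths are assumptions, and 'japanese.html' again stands in for a file name made of Japanese characters.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed id-to-path table; the storage paths are illustrative.
DOCUMENT_PATHS = {
    "1": "/projects/proj-1357/documents/english.html",
    "2": "/projects/proj-1357/documents/japanese.html",
}


def resolve_request(url: str) -> Optional[str]:
    """Map a URL such as http://host/1 to the stored file path, if known."""
    # The URL path is "/<system id>"; strip the leading slash to get the id.
    system_id = urlsplit(url).path.lstrip("/")
    return DOCUMENT_PATHS.get(system_id)


print(resolve_request("http://documentmanagementsystem.com/1"))
print(resolve_request("http://documentmanagementsystem.com/2"))
```

The URL carries only ASCII characters, and the real file name never has to appear in it.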

The next scenario to consider is when the files themselves refer to other content. Let's assume that both english.html and japanese.html refer to the image 'some_image.jpg'. The contents of english.html and japanese.html are shown below:

english.html:

    <html>
    <head><title>My English Document</title></head>
    <body>
    <img src="some_image.jpg"/>
    </body>
    </html>

japanese.html:

    <html>
    <head><title>My Japanese Document</title></head>
    <body>
    <img src="some_image.jpg"/>
    </body>
    </html>

The structure of the document management system is as follows:

    ....
    projects/
        proj-1000/
        ....
        proj-1357/
            documents/
                english.html (id - 1)
                japanese.html (id - 2)
                some_image.jpg (id - 3)
            images/
                myimage.jpg (id - 4)

This solution breaks down because the browser detects the reference to 'some_image.jpg', replaces the system identifier in the URL with 'some_image.jpg' (http://documentmanagementsystem.com/some_image.jpg), and requests the contents of that URL. The document management system cannot find the image because the system identifier (for example, '2') has been lost from the URL, so it cannot look up the path of the referring resource. Nor can it simply serve some_image.jpg by name, because the file is not in the root directory: it actually resides under /projects/proj-1357/documents, and there is no way to determine that from the URL.
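The browser's behavior can be reproduced with Python's standard urllib.parse.urljoin, which applies the usual relative-reference resolution rules. The wrapper name resolve_reference is an assumption added for illustration.

```python
from urllib.parse import urljoin


def resolve_reference(document_url: str, reference: str) -> str:
    """Resolve a relative reference against the URL of the containing document."""
    return urljoin(document_url, reference)


# The document was fetched by system identifier, so the last path segment
# of the base URL is the id, and resolution replaces it with the file name.
image_url = resolve_reference(
    "http://documentmanagementsystem.com/2", "some_image.jpg"
)
print(image_url)  # http://documentmanagementsystem.com/some_image.jpg
```

The resolved URL no longer contains the identifier '2', so the server has nothing left from which to recover the document's directory.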

The approach breaks down again when other relative references are requested. Let's change the contents of the HTML files to the following:

english.html:

    <html>
    <head><title>My English Document</title></head>
    <body>
    <img src="../images/myimage.jpg"/>
    </body>
    </html>

japanese.html: <html>

<head><title...