Efficient and convenient scheme for storing string values in a Dynamic Language implementation

IP.com Disclosure Number: IPCOM000180302D
Original Publication Date: 2009-Mar-06
Included in the Prior Art Database: 2009-Mar-06
Document File: 4 page(s) / 86K

Publishing Venue

IBM

Abstract

Programming languages need to have a scheme to store string values such as a sequence of ASCII characters. Many dynamic language implementations store their representation of a string as a simple array of bytes without any defined encoding scheme. In these languages it is the responsibility of the application programmer to keep track of the encoding. This article discusses a method of efficiently handling strings which might be either raw binary data, or encoded characters.

This is the abbreviated version, containing approximately 36% of the total text.

Many dynamic language implementations store a string as a simple array of bytes with no defined encoding scheme. In these languages the application programmer is responsible for keeping track of the encoding. For example, the programmer might know that a particular variable holds a string encoded in UTF-8 and manipulate it in a way that is consistent with that knowledge. Other string variables may hold strings in other encodings, or data that is not a string value at all, such as an image encoded in the JPEG format.
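
The dependence on programmer knowledge can be illustrated with a short Java sketch. It is not part of the original disclosure; the class and method names are hypothetical. The same two bytes yield one character under a UTF-8 assumption and two characters under an ISO-8859-1 assumption, so an untagged byte array has no single string value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: one byte sequence, two possible readings.
final class EncodingDemo {

    // Interpret the raw bytes on the assumption that they are UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Interpret the same raw bytes on the assumption that they are ISO-8859-1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        System.out.println(decodeAsUtf8(raw).length());   // prints 1
        System.out.println(decodeAsLatin1(raw).length()); // prints 2
    }
}
```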

    Specifically, two such languages are PHP and Ruby. Each of these languages has a reference implementation which is coded in C as a bespoke Virtual Machine designed to run that specific language. This article applies to PHP but the concepts are generalisable to other similar languages.

    There are a number of ongoing projects to provide implementations of dynamic languages which execute in a managed runtime environment. One example of such an environment is the Java* Virtual Machine. This disclosure is written from the perspective of the JVM but the concepts may be generalised to other managed environments such as the Microsoft** CLR.

    One of the key motivations behind implementing dynamic languages on the JVM is interoperability with other JVM languages and in particular Java.

    Java stores string values as instances of the java.lang.String class. These strings use the UTF-16 encoding of the Unicode standard. Since Unicode is designed to represent the characters of essentially all writing systems, the program does not need to track any other encoding information. Java stores arbitrary byte data as a byte array, abbreviated to byte[].
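
The difference between the two Java representations can be sketched as follows. This example is an illustration only, with hypothetical class and method names. It shows that a java.lang.String counts UTF-16 code units, and that arbitrary binary data (here the first four bytes of a typical JFIF JPEG file) does not survive being decoded into a String and encoded back, which is why Java keeps such data in a byte[].

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of java.lang.String versus byte[].
final class JavaStringVsBytes {

    // Returns true if decoding the bytes as UTF-8 and encoding the result
    // back to UTF-8 reproduces exactly the original bytes.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
        // so UTF-16 stores it as a surrogate pair of two code units.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                          // prints 2
        System.out.println(clef.codePointCount(0, clef.length()));  // prints 1

        // Binary data is not valid UTF-8; malformed bytes are replaced
        // during decoding, so the original bytes are lost.
        byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(survivesUtf8RoundTrip(jpegStart));       // prints false

        // Plain ASCII text is valid UTF-8 and round-trips unchanged.
        byte[] cat = "cat".getBytes(StandardCharsets.US_ASCII);
        System.out.println(survivesUtf8RoundTrip(cat));             // prints true
    }
}
```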

    The problem that this article addresses is related to the way that PHP String quantities are stored in a JVM implementation of PHP. The constraints that apply are:
1. Must be able to store any quantity that can be stored in a PHP string: character data in any encoding and raw binary data such as a JPEG
2. Must be able to interoperate with Java code so that Java code can pass a Java string to a PHP program or extension and Java code can receive a PHP string as a Java string. This is important since there is a significant amount of useful functionality already implemented in Java and operating on Java Strings.
3. Must be able to interoperate with C language based code that uses an array of bytes in some arbitrary encoding to represent strings and also to represent binary objects such as JPEG data.
4. It must be the case that when two strings are compared, if they represent the same string value the equality test returns true irrespective of the route through which the string entered the runtime. Thus the string "cat" which is placed into a variable from some C code as a sequence of byte values must be...
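
The equality requirement in constraint 4 can be illustrated with a minimal Java sketch. It only demonstrates the problem and one naive comparison, assuming the encoding of the C-side bytes is known; it is not the scheme this disclosure goes on to describe, and all names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of comparing strings that arrived by different routes.
final class RouteEquality {

    // Naive comparison of the C-side bytes with the Java string's own
    // UTF-16 (big-endian) code units. Fails even for identical text.
    static boolean rawBytesEqual(byte[] fromC, String fromJava) {
        return Arrays.equals(fromC, fromJava.getBytes(StandardCharsets.UTF_16BE));
    }

    // Comparison after bringing both values to a common representation.
    // Assumes the caller knows which charset the C-side bytes use.
    static boolean sameStringValue(byte[] fromC, Charset assumedCharset, String fromJava) {
        return new String(fromC, assumedCharset).equals(fromJava);
    }

    public static void main(String[] args) {
        byte[] catFromC = {0x63, 0x61, 0x74}; // 'c', 'a', 't' as ASCII bytes
        String catFromJava = "cat";

        System.out.println(rawBytesEqual(catFromC, catFromJava)); // prints false
        System.out.println(
            sameStringValue(catFromC, StandardCharsets.US_ASCII, catFromJava)); // prints true
    }
}
```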