Character Conversion data length calculations.
Original Publication Date: 2004-Nov-05
Included in the Prior Art Database: 2004-Nov-05
A range of encoding schemas is used to represent character data within computer systems. Some of these use different numbers of bytes to represent individual characters. When data is converted from one encoding to another, the size of the storage area that will receive the transformed data usually has to be estimated from the amount of source data and the encoding schemas used in the conversion. When either or both encoding schemas represent variable-size characters, the storage area can be sized from the maximum character size and the number of characters. This method generally leaves unused space in the target buffer and, when the amount of data is large, this unused storage can become significant. This disclosure outlines a method of accurately calculating the size of this storage area prior to the conversion operation.
Character data, often referred to as text, is not represented in the same way across all software products or platforms. It is stored as a series of numbers in memory or on disk, and transported as a stream of bytes across networks. These sequences are made up of integer values which can vary in size and can be interpreted differently from one system to another. A number of standards have evolved to describe character data, such as ASCII, EBCDIC, and Unicode. In ASCII each character is represented by a seven-bit binary value, whereas in EBCDIC eight bits are used.
The number of bits limits the number of different characters that can be held in a character set. To overcome this limitation, some character sets use two bytes of data per character, while others use a variable number of bytes. An example of this is UTF-8, one of the encoding forms of Unicode, which makes use of between one and four bytes of data per character; the high-order bits of each byte indicate whether the next byte in the sequence forms part of the current character. A more detailed description of the character encoding standards can be found in the paper entitled 'A brief introduction to code pages and Unicode', which is available from the IBM* developerWorks web site.
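The lead-byte convention described above means the length of each UTF-8 sequence can be read directly from the high-order bits of its first byte. The following Python sketch (the function names are illustrative, not from the disclosure) shows how a byte stream can be walked character by character on that basis:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies,
    determined from the high-order bits of its lead byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: single byte (ASCII range)
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

def utf8_char_count(data: bytes) -> int:
    """Count the characters in a UTF-8 byte stream without decoding it,
    by skipping from one lead byte to the next."""
    i = count = 0
    while i < len(data):
        i += utf8_sequence_length(data[i])
        count += 1
    return count
```

Because the continuation bytes never match a lead-byte pattern, a scanner that lands mid-character can also resynchronise by searching for the next lead byte.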
Character data conversion is a process used to transform characters from one encoding schema to another. Typically, source data generated on one platform is transferred to another platform which does not support its encoding directly and requires the data to be transformed before it can be interpreted. The process that carries out this conversion often requires a separate storage area to transform the data into. Calculating the size of this target area is difficult when one or both character set encodings make use of multiple bytes to represent individual characters.
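One way to obtain an exact target size is a measuring pre-pass that encodes the source character by character and sums the byte counts, discarding the encoded output. This Python sketch uses the standard codecs module and is an illustration of the idea, not necessarily the specific method of this disclosure:

```python
import codecs

def exact_target_size(source: str, target_encoding: str) -> int:
    """Measuring pre-pass: encode one character at a time and total
    the byte counts, without retaining any of the encoded output."""
    # An incremental encoder preserves any encoder state between calls.
    encoder = codecs.getincrementalencoder(target_encoding)()
    return sum(len(encoder.encode(ch)) for ch in source)
```

The cost is a second pass over the source data, traded against a target buffer that is exactly the size required.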
When the size of the data is relatively small, or when a system is not excessively constrained for memory, one solution to this problem is to take a worst-case view. This estimates the target size as the size of the source buffer multiplied by a factor representing the largest number of bytes that can make up a single character. So, for instance, if the source data is made up of single-byte characters and the target encoding uses one to four bytes per character, the target buffer can be estimated as four times the size of the source data. However, adopting this approach leaves some wasted space in the target buffer, and in many instances this may form a significant part of the buffer. The problem becomes much worse when the source data is very large, and an alternative approach may be desirable.
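The worst-case calculation and the waste it produces can be sketched in a few lines of Python. The sample text and the factor of four (UTF-8's maximum sequence length) are illustrative assumptions:

```python
def worst_case_target_size(char_count: int, max_bytes_per_char: int) -> int:
    """Worst-case sizing: assume every source character expands to the
    largest number of bytes the target encoding allows."""
    return char_count * max_bytes_per_char

source = "plain ASCII text"             # 16 single-byte characters
estimate = worst_case_target_size(len(source), 4)
actual = len(source.encode("utf-8"))    # ASCII stays one byte per character
wasted = estimate - actual              # three quarters of the buffer unused
```

For purely single-byte source text converted to UTF-8 the entire expansion factor is wasted, which is why the overhead grows in direct proportion to the source size.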
Another technique to resolve this issue is for the requestor of the conversion to provide a buffer into which the target data is constructed. If this proves not to be large enough the requestor can then get another buffer a...