
String-Searching Algorithm for Mixed Single-Byte Character Set/Double-Byte Character Set Data Stream

IP.com Disclosure Number: IPCOM000034658D
Original Publication Date: 1989-Mar-01
Included in the Prior Art Database: 2005-Jan-27
Document File: 2 page(s) / 48K

Publishing Venue

IBM

Related People

Liu, JM: AUTHOR

Abstract

This algorithm searches for any occurrence of a character string in a data stream that mixes single- and double-byte character sets. Many algorithms have been designed to search for any occurrence of a character string in a text. However, those algorithms assume that the text uses a single coded character set, e.g., EBCDIC. They may fail when applied to a data stream with mixed single- and double-byte character sets. A mixed SBCS/DBCS (Single-Byte Character Set/Double-Byte Character Set) data stream, as described here, is a data stream that uses SO/SI (shift out/shift in) control characters to separate SBCS from DBCS. The SO control character indicates a shift from SBCS to DBCS characters, and the SI control character indicates a shift back to SBCS characters.
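The failure mode described above can be illustrated with a short sketch. It assumes the conventional code points 0x0E for SO and 0x0F for SI, which the text does not state, and the helper name `split_segments` is hypothetical. The example shows a plain byte search reporting a match that straddles two double-byte characters, and then separates the stream into its SBCS and DBCS runs.

```python
SO = 0x0E  # shift out: SBCS -> DBCS (assumed code point, not given in the text)
SI = 0x0F  # shift in:  DBCS -> SBCS (assumed code point, not given in the text)


def split_segments(stream: bytes):
    """Split a mixed stream into ("SBCS" | "DBCS", payload) runs.

    Hypothetical helper for illustration. SO/SI bytes are dropped, and SI is
    only recognised on a double-byte character boundary.
    """
    segments = []
    mode = "SBCS"
    start = 0
    for i, b in enumerate(stream):
        if mode == "SBCS" and b == SO:
            if i > start:
                segments.append(("SBCS", stream[start:i]))
            mode, start = "DBCS", i + 1
        elif mode == "DBCS" and (i - start) % 2 == 0 and b == SI:
            if i > start:
                segments.append(("DBCS", stream[start:i]))
            mode, start = "SBCS", i + 1
    if start < len(stream):
        # Covers a final SO that has no matching SI.
        segments.append((mode, stream[start:]))
    return segments


# One SBCS byte, two DBCS characters (41C1 and C241), then SBCS bytes C1 C2.
stream = bytes([0xC4, SO, 0x41, 0xC1, 0xC2, 0x41, SI, 0xC1, 0xC2])
pattern = bytes([0xC1, 0xC2])  # a two-character SBCS search string

# A plain byte search matches the second byte of one DBCS character followed
# by the first byte of the next: a false hit inside the DBCS run.
naive_index = stream.find(pattern)   # 0-based index 3, inside the DBCS run
segments = split_segments(stream)
```

Only the last segment, an SBCS run, holds a genuine occurrence of the pattern; the earlier hit exists only at the byte level.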

This is the abbreviated version, containing approximately 52% of the total text.



Each SO control character is paired with an SI control character, except for the very last SO control character. All double-byte substrings contain an even number of bytes. The algorithm to search for the Mth (M > 0) occurrence of a string S in a mixed data stream T consists of the following steps (see the figure):

- Initialize the position P (= 1) and the occurrence count (= 0).
- Search for the first character of S in T, starting from position P. Three cases are considered:
  - If the first byte of S is SO, the SO control character is ignored. The next two bytes following the SO control character is...
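Because the remaining steps are cut off in this abbreviated text, the following is not the disclosed procedure. It is a minimal sketch of a different, simpler technique (tokenize both strings into characters, then compare character by character) that respects the same constraints: SO/SI are ignored for matching, double-byte characters are compared as two-byte units, and positions are 1-based as with P above. The names `tokenize` and `find_mth`, the 0x0E/0x0F code points, the return value 0 for "not found", the counting of overlapping occurrences, and the choice to report the position of the first matched character rather than of a preceding SO are all assumptions.

```python
SO = 0x0E  # shift out: SBCS -> DBCS (assumed code point)
SI = 0x0F  # shift in:  DBCS -> SBCS (assumed code point)


def tokenize(stream: bytes):
    """Return a list of (position, is_dbcs, char_bytes) tuples.

    Positions are 1-based byte offsets. SO/SI control characters are consumed
    to track the shift state but produce no token.
    """
    tokens = []
    dbcs = False
    i = 0
    while i < len(stream):
        b = stream[i]
        if not dbcs and b == SO:
            dbcs = True
            i += 1
        elif dbcs and b == SI:
            dbcs = False
            i += 1
        elif dbcs:
            tokens.append((i + 1, True, stream[i:i + 2]))   # two-byte character
            i += 2
        else:
            tokens.append((i + 1, False, stream[i:i + 1]))  # one-byte character
            i += 1
    return tokens


def find_mth(text: bytes, pattern: bytes, m: int) -> int:
    """1-based position of the m-th occurrence of pattern in text, or 0."""
    if m <= 0:
        raise ValueError("m must be greater than 0")
    t = tokenize(text)
    s = [(is_dbcs, ch) for _, is_dbcs, ch in tokenize(pattern)]
    if not s:
        return 0
    count = 0
    for k in range(len(t) - len(s) + 1):
        # Compare shift state and bytes, so an SBCS character never matches
        # part of a DBCS character.
        if all(t[k + j][1:] == s[j] for j in range(len(s))):
            count += 1
            if count == m:
                return t[k][0]
    return 0


text = bytes([0xC4, SO, 0x41, 0xC1, 0xC2, 0x41, SI, 0xC1, 0xC2, 0xC1, 0xC2])
first = find_mth(text, bytes([0xC1, 0xC2]), 1)             # SBCS pattern
second = find_mth(text, bytes([0xC1, 0xC2]), 2)
dbcs_hit = find_mth(text, bytes([SO, 0xC2, 0x41, SI]), 1)  # DBCS pattern
```

In this example the SBCS pattern is found at positions 8 and 10, skipping the byte-level coincidence at position 4 inside the DBCS run, and the DBCS pattern is found at position 5. Tokenizing the whole stream up front costs extra memory; a single-pass scan that tracks the shift state would avoid that.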