Validation of Double-Byte Character Sets Text for Tokenization in a Language Sensitive Editing System

IP.com Disclosure Number: IPCOM000105162D
Original Publication Date: 1993-Jun-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 4 page(s) / 102K

Publishing Venue

IBM

Related People

Storisteanu, A: AUTHOR

Abstract

Disclosed is a method and algorithm to carry out the validation of text for tokenization purposes in a programmable workstation language sensitive editing system that supports double-byte character sets (DBCS), with minimal system performance impact.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      A typical problem in writing software for systems that support
DBCS is to avoid splitting the pairs of DBCS character bytes in
operations that are carried out on the text (from text display and
cursor positioning, to more complex ones like tokenization in
language sensitive editors).

      This is particularly a concern for large amounts of text, where
a great deal of processing must be completed in a short time so as
not to tie up the system; any performance impact is critical to the
perceived response and usability of the system.  This disclosure is
concerned primarily with the validity of splitting the text for
tokenization purposes.

      Tokenization here means any splitting of the text into tokens
of information, such as setting up text ranges in order to associate
language sensitive information with the tokens, as is done in the
IBM SAA AD/Cycle CoOperative Development Environment
(CODE).  The edit control setting up the text ranges (associating
visual style and other language sensitive information to the text in
the edit buffer) must receive valid insertion points for the
operation, i.e., on correct DBCS character boundaries.

      This method applies to any text that must be tokenized in a
language sensitive editor, particularly to source files of
fixed-format column-sensitive programming languages, such as RPG.

      The validation function is called when it is established, at
run-time, that the system environment is DBCS.  It scans the line of
text and checks the location of every DBCS character found there, as
defined by the active code page, against a vector of 'hot' character
locations for the line.  This vector describes the line tokens in
terms of positions that are invalid for the first byte of a DBCS
character.  Refer to the example in the Figure.
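
      The scan described above might be sketched as follows.  This is
an illustration only, not the disclosed implementation: the function
names are hypothetical, byte offsets are 0-based, and the lead-byte
test assumes Shift-JIS-style ranges (0x81-0x9F and 0xE0-0xFC), whereas
a real system would query the active code page.

```python
from typing import List, Sequence


def is_lead_byte(b: int) -> bool:
    # Assumption: Shift-JIS-style lead-byte ranges; a real editor would
    # ask the system about the active code page instead.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_split_characters(line: bytes, hot: Sequence[bool]) -> List[int]:
    """Return the offsets of DBCS first bytes that sit on 'hot' positions.

    hot[i] is True when offset i must not hold the first byte of a DBCS
    character for the line's tokens to be valid.
    """
    bad = []
    i = 0
    while i < len(line):
        if is_lead_byte(line[i]) and i + 1 < len(line):
            if i < len(hot) and hot[i]:
                bad.append(i)
            i += 2  # skip the trail byte so it is never read as a lead byte
        else:
            i += 1
    return bad
```

      For the six-byte line b"AB\x8a\xbfCD", where offsets 2 and 3 hold
one DBCS character, marking offset 2 as hot reports that character as
split, while marking offset 1 (the single-byte 'B') reports nothing.
An empty result means the token boundaries are safe to use.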

      Validation Vector - The validation vector marks the positions
in the line where no first byte of a DBCS character may appear if the
required tokens are to be valid.

      The table with the tokens for the line (starting position,
length), used for the tokenization of the text, is also used for
setting the validation vector.  This table may be part...
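
      One way the validation vector could be derived from such a token
table is sketched below.  The interpretation is an assumption: a
position is treated as 'hot' when it is the byte immediately before a
token boundary, since a DBCS first byte there would leave its second
byte on the other side of the boundary.  The function name and the
0-based (start, length) pairs are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_validation_vector(tokens: Sequence[Tuple[int, int]],
                            line_length: int) -> List[bool]:
    """Mark the offsets that must not hold the first byte of a DBCS character.

    tokens is a table of (start, length) pairs with 0-based byte offsets.
    """
    hot = [False] * line_length
    for start, length in tokens:
        end = start + length
        for boundary in (start, end):
            # A first byte at boundary - 1 would straddle the boundary.
            # Boundaries at the very start or end of the line split nothing.
            if 0 < boundary < line_length:
                hot[boundary - 1] = True
    return hot
```

      As a usage example with a made-up fixed-format layout, an
80-byte line tokenized as [(0, 5), (5, 1), (6, 74)] yields a vector
in which only offsets 4 and 5 are hot.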