Automated Determination of a Text File's National Language Support Code Page

IP.com Disclosure Number: IPCOM000122867D
Original Publication Date: 1998-Jan-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 4 page(s) / 183K

Publishing Venue

IBM

Related People

Wilder, JF: AUTHOR

Abstract

The following disclosure applies to Personal Computer (PC) software products that implement National Language Support (NLS) for presentation of textual information to humans. Disclosed is an automated method for determining, for a given language, whether a PC text file contains national characters from an ASCII or ANSI code page. Throughout this disclosure, references to the code page of a file or text refer to the type of code page--ASCII or ANSI--used to encode the national characters in a text file.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.

      This method applies to PC files containing text characters from
Single-Byte Character Set (SBCS) languages.  Each character of an SBCS
language is encoded in a PC text file as one of 256 eight-bit code
points.  The code points and the characters assigned to them are
defined in both an ASCII and an ANSI code page pertaining to that
language.

      The code-point assignments of the national characters differ
between ASCII and ANSI code pages.  For example, the Latin small
letter "a" with grave is encoded as code point 85 (hexadecimal) in
the Latin-1 ASCII code page 850 and as code point E0 (hexadecimal) in
the corresponding Latin-1 ANSI code page 1252.
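
      The mapping in this example can be checked with a short Python
sketch.  It is an illustration only and is not part of the original
disclosure; "cp850" and "cp1252" are the names Python's standard
codecs use for code pages 850 and 1252.

```python
# Encode the same character under both code pages and compare the bytes.
ch = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE

ascii_cp_byte = ch.encode("cp850")   # Latin-1 ASCII code page 850
ansi_cp_byte = ch.encode("cp1252")   # Latin-1 ANSI code page 1252

print(ascii_cp_byte.hex().upper())   # 85
print(ansi_cp_byte.hex().upper())    # E0
```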

      A PC operating system may require that files containing text be
in only one code page--ASCII or ANSI--for correct presentation of the
text.  Such is the case for the IBM* Operating System/2* (OS/2) and
Microsoft Windows** operating systems.  OS/2 correctly displays or
prints text when the source file for the text is in an ASCII code
page; Windows does so when the source file is in an ANSI code page.

      When international software that contains SBCS text
information, such as on-line help or PC dialog panels, is developed,
the information is translated from the original development language,
such as English, to other languages.  Moreover, a software product
may be developed so that it can be built to run on different
operating systems, such as OS/2 and Windows, from the same set of
translated source text files.  Each source text file exists in only
one code page, either ASCII or ANSI.

      A source text file may need to be converted from one code page
to the other, for example during the build process.  This occurs when
the code page of the source text file differs from what the target
operating system requires.  Specifically, each source text file in an
ASCII code page must be converted to the corresponding ANSI code page
when building the product to run on Windows.
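
      The conversion step described above might look like the
following Python sketch.  The function name ascii_to_ansi and the
default code pages 850 and 1252 are assumptions chosen for
illustration; the disclosure does not specify a conversion routine.

```python
def ascii_to_ansi(data: bytes,
                  ascii_cp: str = "cp850",
                  ansi_cp: str = "cp1252") -> bytes:
    """Re-encode SBCS text bytes from an ASCII code page to an ANSI one."""
    text = data.decode(ascii_cp)
    # Characters with no ANSI equivalent (for example, the box-drawing
    # characters of code page 850) are replaced with "?".
    return text.encode(ansi_cp, errors="replace")


# Code point 85 (a with grave in code page 850) becomes E0 in 1252.
converted = ascii_to_ansi(b"voil\x85")
```

      Replacing unmappable characters is one possible policy; a build
process could instead stop with an error so that such files are
reviewed by hand.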

      The existing code page of each source file must, therefore, be
known or determined in order to differentiate which files need to be
converted.  Unless the code page of each source text file is
predefined as part of the software development process, an automated
means for determining the code page of the files is needed.
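
      To make the problem concrete, one simple heuristic is sketched
below in Python.  It is not the disclosed algorithm, whose details are
not reproduced in this abbreviated text.  It assumes the caller
supplies the national characters expected for the language; the
function name guess_code_page and its parameters are hypothetical.

```python
def guess_code_page(data: bytes, national_chars: str,
                    ascii_cp: str = "cp850",
                    ansi_cp: str = "cp1252") -> str:
    """Return "ASCII", "ANSI", or "unknown" for SBCS text bytes."""
    expected = set(national_chars)
    # Code points 00-7F are identical in both code pages used here,
    # so only the bytes from 80 upward carry any evidence.
    high_bytes = bytes(b for b in data if b >= 0x80)

    def score(codec: str) -> int:
        # errors="replace" covers code points a codec leaves undefined.
        decoded = high_bytes.decode(codec, errors="replace")
        return sum(1 for ch in decoded if ch in expected)

    ascii_score = score(ascii_cp)
    ansi_score = score(ansi_cp)
    if ascii_score > ansi_score:
        return "ASCII"
    if ansi_score > ascii_score:
        return "ANSI"
    return "unknown"


# Hypothetical character list for French; a real list would be
# defined per language.
FRENCH = "\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc"
print(guess_code_page(b"voil\x85 l'\x82t\x82", FRENCH))
```

      A file with no bytes above 7F yields "unknown", which is
harmless because such a file reads the same in either code page and
needs no conversion.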

      Disclosed is an algorithm that automatically determines whether
a text file...