Method for Extracting Text String from Binary File

IP.com Disclosure Number: IPCOM000113926D
Original Publication Date: 1994-Oct-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 51K

Publishing Venue

IBM

Related People

Watanabe, H: AUTHOR

Abstract

This article describes a method for extracting text elements from a binary file that may contain DBCS (double-byte character set) characters. Many data formats are now in use on computer systems, and there is a need to search documents written in a variety of formats. The method can be applied in such cases even when the data format is unknown.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 70% of the total text.

      This article describes a method for extracting text elements
from a binary file that may contain DBCS (double-byte character set)
characters.  Many data formats are now in use on computer systems,
and there is a need to search documents written in a variety of
formats.  The method can be applied in such cases even when the data
format is unknown.

      A binary file handled by this method is assumed to satisfy the
following conditions:

o   Data is stored in 1-byte units.
o   The file is not compressed.

      For files containing only SBCS (single-byte character set)
characters, extracting text elements is easy: it is enough to find
sequences of valid SBCS characters.  For files containing DBCS
characters, however, there is a difficulty: a byte may be a valid
first byte of a DBCS codepoint and yet actually be a control code of
the file format.  The algorithm is therefore as follows:
 1.  Find sequences of valid SBCS characters.
 2.  Find sequences of valid DBCS characters.
 3.  For each DBCS sequence found in step 2, check whether the
     sequence starting from its second byte is also a sequence of
     valid DBCS characters.

     Many noise text elements still remain, such as single
     characters.  Such noise text elements are eliminated by the
     following procedures:
 4.  Eliminate any text consisting of only 1 character.
 5.  Eliminate any text in which the same character is repeated more
     than 2 times.
 6.  ...
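
      Steps 1 through 5 can be sketched in Python as follows.  This
is only an illustrative sketch under several assumptions that are not
stated in the article: the DBCS encoding is taken to be Shift-JIS,
SBCS characters are limited to printable ASCII, step 3 is resolved by
keeping whichever byte alignment gives the longer run, and "more than
2 times" in step 5 is read as three or more consecutive occurrences.
All function names are hypothetical, and the elided step 6 is not
implemented.

```python
from itertools import groupby


def is_sbcs(b: int) -> bool:
    # Assumption: SBCS means printable ASCII only.
    return 0x20 <= b <= 0x7E


def is_dbcs_lead(b: int) -> bool:
    # Assumption: Shift-JIS lead-byte ranges.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_dbcs_trail(b: int) -> bool:
    # Assumption: Shift-JIS trail-byte ranges.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def dbcs_run_end(data: bytes, start: int) -> int:
    """Return the offset just past the run of valid DBCS pairs at start."""
    i = start
    while (i + 1 < len(data)
           and is_dbcs_lead(data[i]) and is_dbcs_trail(data[i + 1])):
        i += 2
    return i


def extract(data: bytes) -> list[str]:
    """Steps 1-3: collect SBCS runs and DBCS runs from raw bytes."""
    texts = []
    i = 0
    while i < len(data):
        if is_sbcs(data[i]):
            # Step 1: a run of valid SBCS characters.
            j = i
            while j < len(data) and is_sbcs(data[j]):
                j += 1
            texts.append(data[i:j].decode("ascii"))
            i = j
        elif is_dbcs_lead(data[i]):
            # Step 2: a run of valid DBCS characters starting here.
            end = dbcs_run_end(data, i)
            # Step 3: also try the run starting from the second byte,
            # in case the first byte is really a control code.
            alt_end = dbcs_run_end(data, i + 1)
            if alt_end - (i + 1) > end - i:  # assumption: longer run wins
                i, end = i + 1, alt_end
            if end > i:
                # Byte pairs with no assigned character are replaced.
                texts.append(data[i:end].decode("shift_jis", errors="replace"))
                i = end
            else:
                i += 1
        else:
            i += 1
    return texts


def is_noise(text: str) -> bool:
    """Steps 4-5: single characters and runs of 3+ identical characters."""
    if len(text) <= 1:
        return True
    return any(len(list(group)) > 2 for _, group in groupby(text))


def extract_text(data: bytes) -> list[str]:
    """Run the extraction, then drop the noise text elements."""
    return [t for t in extract(data) if not is_noise(t)]
```

      Calling extract_text() on the raw bytes of a file returns the
surviving text elements in file order.  A different DBCS encoding
would need different byte ranges in the three predicate functions and
a different codec name in the decode call.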