Database Search and Compression Technique

IP.com Disclosure Number: IPCOM000121607D
Original Publication Date: 1991-Sep-01
Included in the Prior Art Database: 2005-Apr-03
Document File: 4 page(s) / 111K

Publishing Venue

IBM

Related People

Grumbar, LC: AUTHOR

Abstract

Described is a computer software database search and compression facility that minimizes vocabulary searching time and provides high compression with minimum decompression time. The compression technique is particularly useful for text databases because the text can be tokenized. The facility provides instant rejection of tokens that are not found and a fast literal searching method using wildcard characters.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      The database search technique organizes data in tables whose
entries are arranged in ascending order.  An identification (ID) is
associated with each token in the database.  If a match for a token
is found in the table, the rest of the data is searched using the
token's numeric ID instead of literally searching for the token
itself.
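
      The lookup-then-search-by-ID idea can be sketched as follows.
This is a minimal illustration, not the disclosed implementation: it
assumes that a token's ID is simply its position in the sorted table,
and the function names and the two sample lines are hypothetical.

```python
from bisect import bisect_left

# Sorted vocabulary table. Assumption: a token's ID is its table position.
VOCABULARY = ["cell", "contains", "copy", "following", "is", "line", "lines",
              "number", "occur", "of", "one", "text", "that", "the", "then",
              "this", "two"]


def token_id(token):
    """Binary-search the sorted table; return the ID or None if absent."""
    pos = bisect_left(VOCABULARY, token)
    if pos < len(VOCABULARY) and VOCABULARY[pos] == token:
        return pos
    return None


def encode(text):
    """Replace every word of a text by its numeric ID."""
    return [token_id(word) for word in text.lower().split()]


def find_token(token, encoded_lines):
    """Return indices of the lines containing the token, comparing IDs only."""
    tid = token_id(token)
    if tid is None:
        return []  # not in the vocabulary: reject without scanning the data
    return [i for i, ids in enumerate(encoded_lines) if tid in ids]


# Hypothetical sample data, stored as ID sequences rather than as text.
lines = [encode("the cell contains two lines"),
         encode("copy the following text")]
```

A call such as find_token("text", lines) compares integers only, and a
word missing from the vocabulary is rejected after the single table
lookup, before any stored data is touched.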

      The data structures used by the document display processor are
as follows:
      -   Cell library vocabulary buffer - byte buffer
      -   Cell library vocabulary offset table - array of words
      -   Table of contents array - array of structures
      -   Cell offset table - array of structures
      -   Index array - array of structures
      -   Cell vocabulary table array - array of words
      -   Cell display information array - array of bytes
      -   Master document file header structure
      -   Cell library file header structure

      The figures show how a sample document vocabulary is stored in
the data structures.  Fig. 1a shows the cell library vocabulary table
(CLVT) buffer and how information is stored in the data structure
buffers.  The sample document vocabulary consists of the words:
cell, contains, copy, following, is, line, lines, number, occur, of,
one, text, that, the, then, this, two.  Fig. 1b shows the CLVT offset
table.  The CLVT buffer and the CLVT offset table are read into
memory during the display processor's initialization.  The CLVT can
hold at most 64K words, whereas an average document may contain
3,000 to 4,000 unique words.  Assuming an average word size of seven
characters, the average size of the CLVT buffer will be approximately
30K bytes.
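
      One way a byte buffer and an offset table could hold the sample
vocabulary is sketched below.  The figures are not reproduced here,
so the layout is an assumption rather than the layout of Fig. 1a and
Fig. 1b: words are packed back to back in ascending order, each
offset is a 16-bit value marking where a word starts, and one extra
offset marks the end of the buffer.  All function names are
hypothetical.

```python
from array import array

SAMPLE_WORDS = ["cell", "contains", "copy", "following", "is", "line",
                "lines", "number", "occur", "of", "one", "text", "that",
                "the", "then", "this", "two"]


def build_clvt(words):
    """Pack sorted words into a byte buffer plus a table of start offsets."""
    buffer = bytearray()
    offsets = array("H")  # unsigned 16-bit entries (assumed "array of words")
    for word in sorted(words):
        offsets.append(len(buffer))
        buffer += word.encode("ascii")
    offsets.append(len(buffer))  # assumed sentinel: end of the last word
    return bytes(buffer), offsets


def word_for_id(buffer, offsets, token_id):
    """Recover the text of a word from its ID (its offset-table position)."""
    return buffer[offsets[token_id]:offsets[token_id + 1]].decode("ascii")


def id_for_word(buffer, offsets, word):
    """Binary-search the offset table; return the ID or None if absent."""
    target = word.encode("ascii")
    lo, hi = 0, len(offsets) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if buffer[offsets[mid]:offsets[mid + 1]] < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(offsets) - 1 and buffer[offsets[lo]:offsets[lo + 1]] == target:
        return lo
    return None


clvt_buffer, clvt_offsets = build_clvt(SAMPLE_WORDS)
```

Under this assumed layout, 4,000 words averaging seven characters
would occupy about 28,000 bytes, which is in line with the estimate
of roughly 30K and still fits within 16-bit offsets.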

      Fig. 2a shows the cell vocabulary table (CVT) and Fig. 2b shows
the cell display information (CDI) array.  In this case, the cell
contains lines:  "This is line number one" and...