Browse Prior Art Database

Efficient Search Mechanism by Swath-Wise Decomposition in Text Files

IP.com Disclosure Number: IPCOM000113596D
Original Publication Date: 1994-Sep-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 75K

Publishing Venue

IBM

Related People

Viswanathan, M: AUTHOR

Abstract

Searching for keywords (based on user input) in a database of compressed text files is an inefficient task for search engines since the files to be searched have to be completely decompressed before any string matching can be performed. It is, therefore, more cost-effective to decompress the file in swaths or blocks than to decompress the whole file before word or phrase comparisons are made.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Efficient Search Mechanism by Swath-Wise Decomposition in Text Files

      Searching for keywords (based on user input) in a database of
compressed text files is an inefficient task for search engines since
the files to be searched have to be completely decompressed before
any string matching can be performed.  It is, therefore, more
cost-effective to decompress the file in swaths or blocks than to
decompress the whole file before word or phrase comparisons are made.

      Searching for keywords in text files in databases is an
expensive task compounded by the fact that files are usually stored
in compressed form.  The traditional approach to searching involves
looking for a word or phrase provided by a user in a file or set of
files by decompressing all of the files one-by-one until a match is
found.  If these files are in compressed form, they must first be
decompressed, searched, and then compressed and stored again.  The
performance of such a search engine can be improved significantly by
decompressing each file only to the extent needed.

      In compressing a file, a fixed-size block (or swath) of the
file is scanned for redundancy and replaced creating the decompressed
file.  The size of the file block after compression is usually
written out to the compressed file for easy decompression.  The
scheme outlined here assumes that we know the compression scheme used
and that the size of the compressed file buffers are also known.  The
algorithm is as follows:
      For each input word
           Decompress swath one at a time;
            Search for word in decompressed fraction of file;
            If no match is found, continue;
            Else report successful match.
      end

      Analysis: The worst-case performance of this algorithm (when
all the swaths in the file have to be decompressed to find a
particular match) is equal to the average-case performance of the
traditional approach (s...