Method and System for a Dictionary-based Memory Reduction to Improve Compression Performance of Lossless Narrative Text

IP.com Disclosure Number: IPCOM000252145D
Publication Date: 2017-Dec-18
Document File: 5 page(s) / 85K

Publishing Venue

The IP.com Prior Art Database

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 44% of the total text.

Disclosed is a method and system for a dictionary-based memory reduction that improves lossless compression performance for narrative text. The method and system creates an efficient dictionary that is used to improve the performance of commonly used compression methods. The method and system converts alphabetic words into shorter representations, which are converted back to the original words using the dictionary. Further, the method and system provides an enhanced ability to compress and decompress all types of information available in a textual representation.

The method and system also improves the compression performance of all known lossless text compression algorithms. It is multilingual and can be applied to any language, including right-to-left languages such as, but not limited to, Hebrew. It also improves the performance of existing natural language processing methods.

In accordance with the method and system, the shorter representation of a word is based entirely on the original expression and not on an arbitrary selection of a series of characters. Because the same English word is used many times with exactly the same number, order, and type of characters, the method and system creates an efficient shorter representation of each word while also reducing the number of collisions in hash tables.

For example, consider the word “Washington”, which has appeared, been stored, and been displayed digitally billions of times in books, news sites, and emails. The method and system stores “Washington” as, for example, “Wa6on”, keeping the first and last pairs of letters (“Wa” and “on” respectively). The content between the two pairs is replaced with the number of removed characters (“shingt” is converted to the number “6”). Therefore, instead of storing “Washington” as a 10-byte expression, the shorter 5-byte expression “Wa6on” is stored, saving 50% of the storage. The tradeoff is the computational time required to compress and decompress the word. Further, the method and system relies on the fact that many (not all) words in any language are unique in this respect, so that certain characters may be removed and easily retrieved. That is, no English word other than “Washington” starts with “Wa”, ends with “on”, and has 6 characters between the two pairs.
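The shortening rule in this example can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names (abbreviate, build_dictionary, compress, decompress), the rule of shortening only purely alphabetic words of six or more letters, and the policy of leaving ambiguous words unshortened are all assumptions made for illustration. The sketch also assumes the input text does not already contain letter-digit-letter tokens such as “Wa6on”, since those would be expanded on decompression.

```python
import re
from collections import defaultdict

LETTER = r"[^\W\d_]"                                      # one Unicode letter
WORD = re.compile(LETTER + r"+")                          # a run of letters
TOKEN = re.compile(LETTER + r"{2}\d+" + LETTER + r"{2}")  # e.g. "Wa6on"


def abbreviate(word: str) -> str:
    """Keep the first and last two letters; replace the middle with its length."""
    if len(word) < 6:  # assumption: shorter words would not shrink, so skip them
        return word
    return f"{word[:2]}{len(word) - 4}{word[-2:]}"


def build_dictionary(vocabulary):
    """Map each short form to its word, keeping only unambiguous short forms."""
    candidates = defaultdict(set)
    for word in vocabulary:
        if word.isalpha():
            short = abbreviate(word)
            if short != word:
                candidates[short].add(word)
    # Assumption: a short form shared by several words is dropped entirely.
    return {s: next(iter(ws)) for s, ws in candidates.items() if len(ws) == 1}


def compress(text: str, dictionary) -> str:
    """Replace every word whose short form maps back to exactly that word."""
    def shorten(match):
        word = match.group(0)
        short = abbreviate(word)
        return short if dictionary.get(short) == word else word
    return WORD.sub(shorten, text)


def decompress(text: str, dictionary) -> str:
    """Expand every short form that is present in the dictionary."""
    return TOKEN.sub(lambda m: dictionary.get(m.group(0), m.group(0)), text)


vocabulary = ["Washington", "compression", "competition", "text"]
dictionary = build_dictionary(vocabulary)
packed = compress("Washington favors compression.", dictionary)
print(packed)  # Wa6on favors compression.
print(decompress(packed, dictionary))
```

In this toy vocabulary, “compression” and “competition” both shorten to “co7on”, so neither is placed in the dictionary and both are stored unchanged. This mirrors the observation that many, but not all, words can be shortened without ambiguity.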

In accordance with an embodiment, the method and system performs text compression and text decompression using a predefined dictionary as follows.

FIG. 1 illustrates a flowchart of a process for creating the dictionary in accordance with an embodiment of the method and system.

Figure 1

As illustrated in FIG. 1, in order to create the dictionary, the method and system receives a large collection of textual data such as, but not limited to, books and websites, and extracts a...
