
Technique for condensing the size of text documents for storage and transfer

IP.com Disclosure Number: IPCOM000130445D
Publication Date: 2005-Nov-09
Document File: 4 page(s) / 47K

Publishing Venue

The IP.com Prior Art Database

Abstract

Text documents can be stored on or transferred between computing devices. Much of a text document consists of repeated terms, phrases, text strings and word endings that are sent or stored in their original form for the end user. An opportunity exists to reduce the disk storage space and file transfer size of a text document by identifying repeated text strings in the document and replacing each with a single-character symbolic code.

Compression techniques to reduce the file size of computing documents and pictures have been used in the industry for some time. ZIP files are a familiar example of binary file compression. It would be beneficial to reduce file size without binary file compression, opening the door to a new text compression technique that could also be combined with binary compression to further reduce file size.

Spoken and written languages have common words, grammar rules and syntax that enable people to compose communications using known rules. Some rules and words are more commonly used than others, thus providing the opportunity for the text compression technique in this patent application.

The common theme is that a software substitution tool can be applied to any text file: it searches the file for commonly used character strings and replaces each instance found with an ASCII character symbol. In an ASCII-encoded text file, each character occupies one byte. More than 100 ASCII characters are available to substitute into a text file to replace known text strings. Reversing the substitution procedure restores the text document to its original form, provided the chosen symbols do not already appear in the text.

A sample ASCII symbol table is available at http://www.asciitable.com/

Solution:

The proposed solution is to develop a simple substitution program that could be applied to reduce text file size on mobile devices and the host data servers. Savings can come from reduced file storage, smaller packet size and faster network access.
There are several options to implement this solution. These options include:

Option 1: ASCII Substitution

Consider that in any language there are popular words and phrases. What if each popular word, together with the space that follows it, is replaced with an ASCII symbol? The word and its trailing space are then represented by a single character that takes the same amount of document space as the space did originally.

Consider the top 100 or top 1000 English words (e.g., http://esl.about.com/library/vocabulary/bl1000_list1.htm). The program would search the document and replace each such word, together with the space that follows it, with a symbol representing the text string. Restricted to the top 100 words, the program would be quick to run.

Example 1A:

The software would search a document for the string “the ” (the word “the” followed by a space) and replace it with the character %.

This is why the fox is the fastest animal….

Becomes

This is why %fox is %fastest animal…

The savings are 6 characters, or 6 bytes.

Option 2: Multi-Lingual Substitution

Consider the same example as Option 1, but with a different language. French possibilities would be “les ”, “le ”, etc. Masculine and feminine expressions in the French language are handled differently by the grammatical rules but also offer a good opportunity to substitute text strings.

Option 3: Morphemes

Morphemes are used in some languages to modify a core word's meaning. The word “try” can have its meaning modified by adding affixes such as “s”, “ed” and “ing” to the end of the word. The substitution program in this case would search documents and substitute ASCII characters for common morphemes. Alternatively, the best return on substitution may come from a mixture of common words and common morphemes.

Example 3A:

Replace “the ” and “ing ” with ASCII characters (here % and &).

Where is the book I was trying to sell….

Becomes

Where is %book I was try&to sell….

A savings of 6 bytes.
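Example 3A can be reproduced with a two-entry table that mixes one common word and one common word ending. The sketch below is illustrative only: the table contents are taken from the example, while the helper name `condense` is a hypothetical choice, and the code assumes that % and & do not already occur in the text.

```python
# Table taken from Example 3A: one common word and one common word ending.
TABLE = {"the ": "%", "ing ": "&"}


def condense(text: str, table: dict[str, str]) -> str:
    """Replace each table string with its symbol (assumes symbols are unused in text)."""
    for phrase, symbol in table.items():
        text = text.replace(phrase, symbol)
    return text


original = "Where is the book I was trying to sell"
condensed = condense(original, TABLE)
print(condensed)                       # Where is %book I was try&to sell
print(len(original) - len(condensed))  # 6
```

Each replacement swaps a four-character string for one character, so the two matches save three bytes each.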
Option 4: Context-Specific Substitution

Consider substituting the character strings that are most commonly used within a particular document or corporate database. The software tool would need to analyze the document and determine which strings would yield the greatest reduction in file size when substituted. In a more elaborate program, common phrases could also be found and substituted to further reduce file size. A case study has been completed on the book “War and Peace” to give an idea of the opportunity.
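The analysis step for a context-specific table might look something like the sketch below. It is a rough illustration, not the disclosure's method: the function name `rank_candidates` is hypothetical, only single words followed by a space are considered (phrases are left out), and the scoring rule (occurrences multiplied by bytes saved per occurrence, ignoring the cost of storing the table) is an assumption made here.

```python
import re
from collections import Counter


def rank_candidates(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Rank word-plus-space strings by the bytes saved if each became one symbol."""
    # Count every run of letters that is followed by a single space.
    counts = Counter(re.findall(r"[A-Za-z]+ ", text))
    # Each occurrence shrinks from len(phrase) bytes to 1 byte.
    savings = {phrase: n * (len(phrase) - 1) for phrase, n in counts.items()}
    # Highest saving first; ties are broken alphabetically for a stable order.
    return sorted(savings.items(), key=lambda item: (-item[1], item[0]))[:top_n]


sample = "the cat saw the dog and the bird "
print(rank_candidates(sample, top_n=2))  # [('the ', 9), ('bird ', 4)]
```

Ranking by bytes saved rather than by raw frequency means a longer string that appears less often can still outrank a short, frequent one.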

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 28% of the total text.

TEXT CONDENSING METHOD

Technique for condensing the size of text documents for storage and transfer

Disclosed Anonymously

Text documents can be stored on or transferred between computing devices. Much of a text document consists of repeated terms, phrases, text strings and word endings that are sent or stored in their original form for the end user.

An opportunity exists to reduce the disk storage space and file transfer size of a text document by identifying repeated text strings in the document and replacing each with a single-character symbolic code.

Compression techniques to reduce the file size of computing documents and pictures have been used in the industry for some time. ZIP files are a familiar example of binary file compression. It would be beneficial to reduce file size without binary file compression, opening the door to a new text compression technique that could also be combined with binary compression to further reduce file size.

Spoken and written languages have common words, grammar rules and syntax that enable people to compose communications using known rules. Some rules and words are more commonly used than others thus providing the opportunity for the text compression technique in this patent application.

The common theme is that a software substitution tool can be applied to any text file: it searches the file for commonly used character strings and replaces each instance found with an ASCII character symbol. In an ASCII-encoded text file, each character occupies one byte. More than 100 ASCII characters are available to substitute into a text file to replace known text strings. Reversing the substitution procedure restores the text document to its original form, provided the chosen symbols do not already appear in the text.
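The substitute-and-restore round trip can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosure's implementation: the names `encode` and `decode` and the dictionary layout of the table are assumptions, and the checks that reject unsafe symbols are an added safeguard, since the reversal is exact only when no symbol occurs in the original text.

```python
def encode(text: str, table: dict[str, str]) -> str:
    """Replace every occurrence of each table string with its one-character symbol."""
    symbols = set(table.values())
    if len(symbols) != len(table) or any(len(s) != 1 for s in symbols):
        raise ValueError("each string needs its own single-character symbol")
    if any(s in text for s in symbols):
        raise ValueError("a symbol already occurs in the text")
    if any(s in phrase for s in symbols for phrase in table):
        raise ValueError("table strings must not contain symbols")
    # Longest strings first, so a long string is not broken up by a shorter one.
    for phrase in sorted(table, key=len, reverse=True):
        text = text.replace(phrase, table[phrase])
    return text


def decode(text: str, table: dict[str, str]) -> str:
    """Reverse encode() by expanding each symbol back to its string."""
    for phrase, symbol in table.items():
        text = text.replace(symbol, phrase)
    return text


table = {"the ": "%", "and ": "&"}  # hypothetical two-entry table
original = "the cat and the dog"
packed = encode(original, table)
assert decode(packed, table) == original
print(packed, len(original) - len(packed))
```

A production tool would need some escape mechanism for documents that already use the symbols; this sketch simply refuses to encode them.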

A sample ASCII symbol table is available at http://www.asciitable.com/

Solution:

The proposed solution is to develop a simple substitution program that could be applied to reduce text file size on mobile devices and the host data servers. Savings can come from reduced file storage, smaller packet size and faster network access.

There are several options to implement this solution. These options include:

Option 1: ASCII Substitution

Consider that in any language there are popular words and phrases. What if each popular word, together with the space that follows it, is replaced with an ASCII symbol?

This allows the word and space following to be represented by a character that takes the same amount of document space as the space did originally.

Consider the top 100 or top 1000 English words (e.g., http://esl.about.com/library/vocabulary/bl1000_list1.htm). The program would search the document and replace each such word, together with the space that follows it, with a symbol representing the text string. Restricted to the top 100 words, the program would be quick to run.
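One way to pair a ranked word list with symbols is sketched below. It is only an illustration under stated assumptions: the five-word list is a placeholder for a real top-100 list, the function name `build_table` is hypothetical, and drawing symbols from Python's `string.punctuation` (32 characters) is a choice made here, not something the disclosure specifies.

```python
import string


def build_table(ranked_words: list[str], text: str) -> dict[str, str]:
    """Map each word plus its trailing space to an ASCII symbol not used in text."""
    # Only symbols absent from the document are safe to use as codes.
    free_symbols = [c for c in string.punctuation if c not in text]
    # zip() stops at the shorter input, so words beyond the symbol pool are skipped.
    return {word + " ": symbol for word, symbol in zip(ranked_words, free_symbols)}


# Placeholder list; a real run would load the chosen top-100 word list.
COMMON_WORDS = ["the", "of", "and", "to", "in"]

sample = "The name of the ship and the port it sailed to are lost."
table = build_table(COMMON_WORDS, sample)
packed = sample
for phrase, symbol in table.items():
    packed = packed.replace(phrase, symbol)
print(table)
print(packed)
```

The match is case-sensitive, so the capitalized "The " at the start of the sample is left alone; handling capitalized forms would need extra table entries.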

Example 1A:

The software would search a document for the string “the ” (the word “the” followed by a space) and replace it with the character %.

This is wh...