Browse Prior Art Database

Algorithm for efficient archiving and extraction of sparse files Disclosure Number: IPCOM000238428D
Publication Date: 2014-Aug-26
Document File: 4 page(s) / 67K

Publishing Venue

The Prior Art Database


Method for efficient archiving and extraction of sparse files.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.

Page 01 of 4

Algorithm for efficient archiving and extraction of sparse files

Background of the Problem:
How to archive and extract sparse files efficiently, so that the resulting archive consumes same amount of storage space, as consumed by a sparse file on the disk.

The sparse file uses lesser number of blocks of storage space then indicated by there size. When an application tries to archive a sparse file, it will need the amount of space indicated by its size, hence wasting lots of space on the backup medium. For example, if we have to create an archive of five sparse files, each of size 1 GB, we would need at least 5 GB of space on the backup medium, even though the sparse files actually occupy much lesser disk space. A sparse file may be of size 1 GB, but it might consume only 4096 bytes on disk, then amount of space needed to archive it on the backup medium should be 4096 bytes only, same as what is consumed on the disk.

Existing Solution:

There are not many solutions currently to this problem, only the GNU tar provides this capability. However, it has a serious drawback. In order to determine if the file is sparse, GNU tar has to read the file before trying to archive it, so in total the file is read twice. So, always the time needed to process all files with the "sparse" option is roughly twice the time needed to archive them without the "sparse" option. So even if there are no sparse files among the files to be archived, the current solution is going to take twice the time.

Reference for the drawback:

This algorithm will provide a method to efficiently archive sparse file, and use the same amount of space on the backup medium, as actually consumed by sparse file on the disk. Also it will provide a method to restore from the archive, so that finally we create the exact sparse file on the target medium, when we are extracting from the archive.

The solution is to achieved by using two entities first a sparse header and secondly a sparse map to store information about sparse file.

The sparse header, will contain information about how to read the archive, and meta-data of the sparse map.

The sparse map will contain the information that which block of the file contains valid data, and which block contains zeros indicating a hole in the file. The sparse map is dynamically created when we read the file. As we are reading data from file we will be simultaneously writing it on the archive, and creating the map as well. This will save an extra read of the file.

The sparse map will be a bit wise map, where each bit will contain info about one block of the file, thus a sparse_map of only one byte in size is sufficient for files up to 32768 bytes in size, saving the storage space needed on the backup medium.

Once we are done with reading of the file, and map is created, we will place the map in archive at a designated place, just ahead the data of the file we have written earlier.

To ext...