Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Optimization of RSync Differential Algorithm for Document Management Software

IP.com Disclosure Number: IPCOM000199124D
Publication Date: 2010-Aug-26
Document File: 6 page(s) / 236K

Publishing Venue

The IP.com Prior Art Database

Abstract

Any time a client program sends an updated version of a file to a server, uploading the modified version can take an unwarranted amount of time if either the bandwidth of the connection is low or if the file itself is very large. Currently, RSync is an algorithm which can take a copy of the original file and the new file and determine the minimal amount of differences between the two files and send only those changes to the server. In its present form, RSync does not take into consideration the bandwidth of the current connection, the potential for block collisions and data loss, or any concern of the type of file being updated.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.

Page 1 of 6

Optimization of RSync Differential Algorithm for Document Management Software

Running RSync, especially for large files, is very computationally expensive and in high bandwidth scenarios the cost of running the algorithm is sometimes outweighed by simply uploading the file in its entirety. In addition, any time hashes of bytes are being compared to one another the potential exists for collisions between those hashes. Because RSync relies on comparing hashes of blocks, there is a risk, albeit extremely minute, that the newly generated document on the server does not match the client copy. Lastly, because of the nature of compressed files (where a small change in the file will recompress to an almost completely different file) RSync performs very poorly on compressed files and takes no consideration to avoid these files, essentially wasting valuable processing time.

The majority of the base differential algorithm is heavily based if not an exact copy of RSync. The algorithm used is detailed in the flow charts below. The three critical improvements are those that have been listed above. All improvements are designated by grey boxes in the flow charts below, and each improvement is detailed under its individual section after the flow charts.

Step 1: Downloading the file from the server.

Figure 1. Downloading the file from the server and computing the needed hashblocks.

Step 2: Computing the Delta Encoding to send back to the server. (After modification of the file)

1

[This page contains 1 picture or other non-text object]

Page 2 of 6

Figure 2. Computing the delta that will be sent to the server.

Step 3: Reconstruct the file on the server.

2

[This page contains 1 picture or other non-text object]

Page 3 of 6

Figure 3. Reconstructing the file on the server (also used to reconstruct locally for data integrity).

Bandwidth Efficiency:

Varying the algorithm based on bandwidth is a very concrete way to ensure that running the algorithm is the most efficient way to send up the file. The trickiest part about the process is keeping a record of the run times of the algorithm on files of a certain type and of a certain size. Once a substantial list of run times is compiled the algorithm can then perform a simple lookup to approximate the time it would take to run on a file of a certain size. Additionally the algorithm

3

[This page contains 1 picture or other non-text object]

Page 4 of 6

can simpl

        divide the bandwidth into the entire file size, compare the two time values, and decide whether it is optimal to run the algorithm. Even if there is some fluctuation in run times, over time the numbers will average out. Additionally the decision will have to take into account that if the two times are close or exactly the same sending the entire...