
Intra-document diff tool to identify and compare cut-and-pasted text

IP.com Disclosure Number: IPCOM000178880D
Original Publication Date: 2009-Jan-28
Included in the Prior Art Database: 2009-Jan-28
Document File: 3 page(s) / 30K

Publishing Venue

IBM

Abstract

Disclosed is a system, method, and apparatus for comparing two or more proper subsections within one or more files. The user supplies at least one file (by name, handle, content, etc.). The invention analyzes the file and extracts pairs of similar chunks. It compares these chunks as if they were solitary files, presenting the differences to the user in any of the formats of which existing comparison programs are capable.

This is the abbreviated version, containing approximately 55% of the total text.


Intra-document diff tool to identify and compare cut-and-pasted text

Developers frequently copy and paste code blocks, both within and across source files. (For the sake of this writeup, assume that the option of modularizing that code into its own function has been analyzed and rejected.) The original and/or the copies may subsequently be changed. For various reasons, one often wants to compare two blocks of code that share the same origin. However, no known tool currently makes this feasible. As a workaround, one would have to copy and paste each chunk into its own file and then use existing tools to compare those two files. (If the two code blocks are in separate files, one might be able to run an existing diff program against the two files in their entirety and simply ignore the visual noise generated by comparing the remainders of the files. However, existing tools are just as likely to yield nothing but noise in this case.) What is needed is an integrated way to compare subsections of text within the same or different files.

Disclosed is a system, method, and apparatus for comparing two or more proper subsections within one or more files. The user supplies at least one file (by name, handle, content, etc.). The invention analyzes the file and extracts pairs of similar chunks. It compares these chunks as if they were solitary files, presenting the differences to the user in any of the formats of which existing comparison programs are capable.
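The idea of comparing chunks "as if they were solitary files" can be illustrated with a minimal Python sketch. It assumes the two line ranges are already known; the function name `diff_ranges` and the sample text are hypothetical and do not come from the disclosure. Python's standard `difflib.unified_diff` stands in for an existing comparison program.

```python
import difflib

def diff_ranges(lines, first, second):
    """Diff two line ranges of one text as if each were its own file.

    `first` and `second` are (start, end) pairs, 0-based, end exclusive.
    This helper is a hypothetical illustration, not the disclosed program.
    """
    a = lines[first[0]:first[1]]
    b = lines[second[0]:second[1]]
    return list(difflib.unified_diff(
        a, b,
        fromfile=f"lines {first[0] + 1}-{first[1]}",
        tofile=f"lines {second[0] + 1}-{second[1]}",
        lineterm="",
    ))

# Sample input: one "file" containing a block and its edited copy.
text = """\
def area(w, h):
    check(w)
    return w * h

def volume(w, h, d):
    check(w)
    return w * h * d
"""
source = text.splitlines()

# Compare the body of area() with the body of volume().
for line in diff_ranges(source, (1, 3), (5, 7)):
    print(line)
```

Only the two selected ranges reach the diff routine, so the rest of the file contributes no noise to the output.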

The invention comprises a program capable of:
(1) Comparing two strings and generating a "similarity coefficient". (In the proof-of-concept program, this coefficient is the length of the longest common subsequence (LCS), a well-known string comparison metric, divided by the average length of the two strings being compared.)
(2) Recursively applying the comparison algorithm (1) to parallel pairs of subsequences of the input text.
(3) Isolating parallel pairs of text whose similarity coefficient exceeds a predetermined, tunable "similarity threshold" and feeding the two bodies of text into a comparison program. (This portion of the invention would be expected to "plug in" seamlessly to any existing diff program.)
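The three capabilities above can be sketched in Python under several stated assumptions. The disclosure does not specify how the recursive search in (2) is carried out, so this sketch substitutes a simpler fixed-size sliding window over lines. The names `lcs_length`, `similarity`, and `similar_blocks`, the window size, and the threshold value are all hypothetical, and `difflib.unified_diff` stands in for the plugged-in diff program.

```python
import difflib

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Capability (1): LCS length divided by the average string length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)

def similar_blocks(lines, size, threshold):
    """Capabilities (2) and (3), simplified: yield (i, j, score) for every
    pair of non-overlapping `size`-line windows whose score exceeds
    `threshold`. A fixed window is an assumption, not the disclosed method.
    """
    blocks = ["\n".join(lines[i:i + size]) for i in range(len(lines) - size + 1)]
    for i in range(len(blocks)):
        for j in range(i + size, len(blocks)):
            score = similarity(blocks[i], blocks[j])
            if score > threshold:
                yield i, j, score

text = """\
def area(w, h):
    check(w)
    return w * h

def volume(w, h, d):
    check(w)
    return w * h * d
"""
source = text.splitlines()
size = 2

for i, j, score in similar_blocks(source, size, threshold=0.9):
    print(f"lines {i + 1}-{i + size} vs lines {j + 1}-{j + size}: {score:.2f}")
    # Hand the two chunks to a diff routine as if they were whole files.
    print("\n".join(difflib.unified_diff(
        source[i:i + size], source[j:j + size], lineterm="")))
```

Comparing every window pair with a character-level LCS is quadratic in both the number of windows and the window length, so this sketch is suitable only for small inputs. Lowering the threshold reports more loosely related pairs, at the cost of more false matches.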

It would be considered an obvious...