
Control Statement-Driven utility for comparison of "physically different" but "logically similar" datasets

IP.com Disclosure Number: IPCOM000020174D
Original Publication Date: 2003-Oct-29
Included in the Prior Art Database: 2003-Oct-29
Document File: 3 page(s) / 75K

Publishing Venue

IBM

Abstract

A generalized data set comparison is performed which identifies "logical matching" in datasets which may not "physically match". Control statements are utilized by which the algorithms for what constitutes a logical match can be defined. Ignoring the expected differences allows focus on unknown differences, or confirmation of a logical match between records. A comparison of the net differences between two files can be achieved by ignoring or bypassing expected differences, where these expected differences are:
o Predictable
o Can be qualified and described by control statements

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 32% of the total text.


The ability to compare two files to determine whether their contents match is a very common function. Tools such as SUPERC are included with ISPF on z/OS systems, and are very good at determining whether two files match and at identifying the line at which a mismatch occurs. However, while some tools may offer function beyond the following, the existing tools are tailored to be most useful on datasets with the following characteristics:
a) Files contain Character data
b) Data layouts where the data to be compared begins in the same relative column in the files to be compared
c) Data layouts where the entire records are to be compared
d) Data layouts where every record is to be compared

In actual practice, datasets may "logically match" for certain purposes even where they do not exactly "physically match". Tools such as SUPERC work well when the two files being compared are similarly formatted and assumed to be relatively close in content; SUPERC will then find the differences between the two files. However, if the two files are formatted completely differently, SUPERC is not as good at recognizing the similarities. Some examples of format differences which would be flagged as mismatches include:
a) Data in one file may begin in a different column than data in another file
b) Character strings which are not expected to match, such as creation date/timestamps
c) Perhaps only selected character strings within a record are expected to match, and the remaining data on the record is insignificant for the comparison
d) There are extraneous records in a file which are not to be included in the comparison
e) Perhaps the majority of the records in a file are to be considered extraneous, and only a small subset of the records are expected to match.

The current invention is a generalized data set comparison which identifies "logical matching" in datasets which may not "physically match". The invention utilizes control statements by which the user can define the algorithms for what constitutes a logical match. It allows a user to get a comparison of the net differences between two files by ignoring or bypassing expected differences, where these expected differences are:
o Predictable
o Can be qualified and described by control statements

Ignoring the expected differences allows a user to focus on unknown differences, or to confirm a logical match between records. Another main use of the invention would be to compare datasets which would otherwise be difficult to compare because of their size and because their contents are not easily readable. Examples of the use include the following:
o Comparison of image copy datasets, IMS logs, or data extractions where data is predominantly hexadecimal.
o These datasets have very long record lengths. Further, these datasets often contain date/time stamps, sequence numbers...
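
The control statement syntax is not shown in this abbreviated text, so the following is only a minimal Python sketch of the general idea of a rule-driven "logical" comparison, under stated assumptions. The names FieldRule, logical_diff, keep_a and keep_b, and all of the sample records, are hypothetical and invented for illustration; they are not the utility's actual control statements. The sketch assumes that records remaining after filtering are paired in order, and that each rule names a byte range in each file, so that data beginning in different columns can be compared and everything outside the named ranges (for example date stamps) is ignored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical stand-in for one control statement: compare `length`
    bytes starting at `offset_a` in file A with `length` bytes starting
    at `offset_b` in file B."""
    offset_a: int
    offset_b: int
    length: int


def logical_diff(
    records_a: Sequence[bytes],
    records_b: Sequence[bytes],
    rules: Sequence[FieldRule],
    keep_a: Callable[[bytes], bool] = lambda r: True,
    keep_b: Callable[[bytes], bool] = lambda r: True,
) -> List[Tuple[int, Optional[FieldRule]]]:
    """Return (pair_index, rule) for every rule that fails to match.

    Records rejected by keep_a / keep_b are treated as extraneous and
    dropped before pairing. A final (index, None) entry is added when the
    two files have a different number of remaining records.
    """
    a = [r for r in records_a if keep_a(r)]
    b = [r for r in records_b if keep_b(r)]
    mismatches: List[Tuple[int, Optional[FieldRule]]] = []
    for index, (rec_a, rec_b) in enumerate(zip(a, b)):
        for rule in rules:
            field_a = rec_a[rule.offset_a:rule.offset_a + rule.length]
            field_b = rec_b[rule.offset_b:rule.offset_b + rule.length]
            if field_a != field_b:
                mismatches.append((index, rule))
    if len(a) != len(b):
        mismatches.append((min(len(a), len(b)), None))
    return mismatches


# Invented sample data. File A has a header record, a one-byte record type,
# a 10-byte date stamp, an 8-byte key and a 2-byte binary value.
file_a = [
    b"H header written 2003-10-29",
    b"D2003-10-29ACCT0001\x00\x2a",
    b"D2003-10-29ACCT0002\x00\x2b",
]
# File B stores the key first, then the value, then a different date stamp.
file_b = [
    b"ACCT0001\x00\x2a2003-10-30",
    b"ACCT0002\x00\xff2003-10-30",
]

rules = [
    FieldRule(offset_a=11, offset_b=0, length=8),  # key
    FieldRule(offset_a=19, offset_b=8, length=2),  # binary value
]

result = logical_diff(
    file_a, file_b, rules, keep_a=lambda r: r.startswith(b"D")
)
print(result)
```

In this invented example the header record and the differing date stamps are expected differences and are bypassed, so the only reported mismatch is the binary value in the second pair of records. A byte-for-byte comparison of the same two files would report every record as different.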