Control Statement-Driven Utility for Comparison of "Physically Different" but "Logically Similar" Datasets

IP.com Disclosure Number: IPCOM000049740D
Original Publication Date: 2005-Feb-09
Included in the Prior Art Database: 2005-Feb-09
Document File: 3 page(s) / 26K

Publishing Venue

IBM

Abstract

A generalized data set comparison is performed which identifies "logical matching" in datasets which may not "physically match". Control statements are utilized by which the algorithms for what constitutes a logical match can be defined. Ignoring the expected differences allows focus on unknown differences, or confirmation of a logical match between records. A comparison of the net differences between two files can be achieved by ignoring or bypassing expected differences, where these expected differences are:
o Predictable
o Can be qualified and described by control statements

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 33% of the total text.

    The ability to compare two files to determine whether the contents match is a very common function. Tools such as IBM SUPERC, which is included with IBM ISPF on z/OS systems, are very good at determining whether two files match and at identifying the line at which a mismatch occurs. However, while some tools may have some function beyond the following, the existing tools are tailored to be most useful on datasets with these characteristics:
a) Files contain character data,
b) Data layouts where the data to be compared begins in the same relative column in the files to be compared,
c) Data layouts where the entire records are to be compared, or
d) Data layouts where every record is to be compared.
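
    For contrast with what follows, a purely "physical" comparison of the kind characterized above can be sketched in a few lines. This is only an illustration of comparing whole records position by position; it is not SUPERC's algorithm, and the function name and sample records are assumptions made for the example.

```python
from itertools import zip_longest


def first_physical_mismatch(records_a, records_b):
    """Return the 1-based number of the first record that differs, or None."""
    pairs = zip_longest(records_a, records_b)  # a missing record becomes None
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:  # whole record compared, column for column
            return number
    return None


# Hypothetical sample records, invented for this sketch.
old = ["ACCT0001 100.00", "ACCT0002 250.00"]
new = ["ACCT0001 100.00", "ACCT0002 275.00"]
print(first_physical_mismatch(old, new))
```

    Any difference at all, including one that the user expects and does not care about, stops such a comparison at that record.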

    In actual practice, datasets may "logically match" for certain purposes even where they do not exactly "physically match". Tools such as SUPERC work well when the two files being compared are similarly formatted and are assumed to be relatively close in terms of the contained data; in that case SUPERC will find the differences between the two files. However, if the two files are formatted completely differently, SUPERC is not as good at recognizing the similarities. Some examples of format differences that would be reported as mismatches include:
a) Data in one file begins in a different column than data in another file.
b) Certain character strings, such as creation date/time stamps, are not expected to match.
c) Only selected character strings within a record are expected to match, and the remaining data on the record is insignificant for the comparison.
d) There are extraneous records in a file which are not to be included in the comparison.
e) The majority of the records in a file are to be considered extraneous, and only a small subset of the records are expected to match.
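
    The cases listed above can be made concrete with a small sketch. It assumes that each file's layout is described by (start column, length) pairs for the fields that matter and that a predicate selects the records to compare; these names, the record layouts, and the sample data are all hypothetical and are not taken from the disclosure.

```python
def logical_key(record, fields):
    """Extract the compared fields from one record.

    `fields` is a list of (start, length) pairs with 1-based start columns.
    """
    return tuple(
        record[start - 1:start - 1 + length].strip() for start, length in fields
    )


def logical_compare(records_a, fields_a, records_b, fields_b,
                    keep=lambda record: True):
    """Return (only_in_a, only_in_b) as lists of logical keys, in file order."""
    keys_a = [logical_key(r, fields_a) for r in records_a if keep(r)]
    keys_b = [logical_key(r, fields_b) for r in records_b if keep(r)]
    only_in_a = [k for k in keys_a if k not in keys_b]
    only_in_b = [k for k in keys_b if k not in keys_a]
    return only_in_a, only_in_b


# Hypothetical files: same account/amount data, different column layouts,
# different creation dates, and a header record that is not to be compared.
file_a = ["H created 2005-02-09",
          "D ACCT0001 2005-02-09 100.00",
          "D ACCT0002 2005-02-09 250.00"]
file_b = ["H created 2005-02-10",
          "D 100.00 ACCT0001 2005-02-10",
          "D 275.00 ACCT0002 2005-02-10"]

fields_a = [(3, 8), (23, 6)]   # account, amount as laid out in file A
fields_b = [(10, 8), (3, 6)]   # account, amount as laid out in file B

only_a, only_b = logical_compare(
    file_a, fields_a, file_b, fields_b,
    keep=lambda record: record.startswith("D"),  # bypass extraneous records
)
print(only_a, only_b)
```

    In this sketch the date fields are simply never extracted, so they cannot cause a mismatch, and only the changed amount for the second account is reported.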

    The current invention is a generalized data set comparison which identifies "logical matching" in datasets which may not "physically match". The invention utilizes control statements by which the user can define the algorithms for what constitutes a logical match. It allows a user to get a comparison of the net differences between two files by ignoring or bypassing expected differences, where these expected differences are:
o Predictable
o Able to be qualified and described by control statements
Ignoring the expected differences allows a user to focus on unknown differences, or to confirm a logical match between records.
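
    The abbreviated text does not show the utility's actual control statement syntax, so the following is only a sketch under an invented syntax. It assumes two hypothetical statements, `IGNORE COLS first-last` and `SKIP PREFIX text`, plus `*` comment lines, and shows how such statements could reduce a comparison to its net differences.

```python
from itertools import zip_longest


def parse_control_statements(text):
    """Parse the hypothetical statements into (ignored_ranges, skipped_prefixes)."""
    ignored, skipped = [], []
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("*"):
            continue  # blank line or comment
        if words[:2] == ["IGNORE", "COLS"] and len(words) == 3:
            first, last = (int(n) for n in words[2].split("-"))
            ignored.append((first, last))
        elif words[:2] == ["SKIP", "PREFIX"] and len(words) == 3:
            skipped.append(words[2])
        else:
            raise ValueError(f"unrecognized control statement: {line!r}")
    return ignored, skipped


def normalize(records, ignored, skipped):
    """Drop skipped records and blank out ignored columns (1-based, inclusive)."""
    result = []
    for record in records:
        if any(record.startswith(prefix) for prefix in skipped):
            continue
        chars = list(record)
        for first, last in ignored:
            for i in range(first - 1, min(last, len(chars))):
                chars[i] = " "
        result.append("".join(chars))
    return result


def net_differences(records_a, records_b, control_text):
    """Return the record pairs that still differ after expected differences are removed."""
    ignored, skipped = parse_control_statements(control_text)
    a = normalize(records_a, ignored, skipped)
    b = normalize(records_b, ignored, skipped)
    return [(x, y) for x, y in zip_longest(a, b) if x != y]


# Hypothetical control statements and sample records, invented for this sketch.
controls = """\
* creation dates and header records are expected to differ
IGNORE COLS 12-21
SKIP PREFIX H
"""
file_a = ["H created 2005-02-09",
          "D ACCT0001 2005-02-09 100.00",
          "D ACCT0002 2005-02-09 250.00"]
file_b = ["H created 2005-02-10",
          "D ACCT0001 2005-02-10 100.00",
          "D ACCT0002 2005-02-10 275.00"]

for a, b in net_differences(file_a, file_b, controls):
    print(a, "|", b)
```

    With no control statements every record pair in this sample differs; with the two statements above only the record whose amount changed remains, which is the kind of net difference the paragraph describes.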

    Another main use of the invention would be to compare datasets which would otherwise be difficult to compare because of their size and because their contents are not easily readable. Examples of such use include the following:
o Comparison of image copy datasets, IMS logs, or data extractions where data is predominantly hexadecimal.
o Datasets having very long record lengths. Further, these datasets often contain date/time stamp...