A method for identifying binary versions using hashcodes

IP.com Disclosure Number: IPCOM000211627D
Publication Date: 2011-Oct-14
Document File: 3 page(s) / 64K

Publishing Venue

The IP.com Prior Art Database

Abstract

The version of a software product is commonly identified by a hard-coded version string. However, a product is often composed of multiple binary files, some of which might be updated individually. The version number then no longer matches the product content, so the reported product version is not completely accurate. This article describes a system for: 1) identifying the installed binary files in a reliable manner, such that the stated version can be believed; 2) determining what has been changed within a product if the installation has been modified. At compile time, the compiler hashes the source that it is compiling and inserts hash codes for each compiled unit into its output binary file. Programmatically examining the hash codes in the binary files then allows the build version to be verified against the known set of hash values reported for a given build version. Hash codes that have changed identify parts of the binary installation that have been recompiled from revised source code.

This is the abbreviated version, containing approximately 41% of the total text.

The version of a product is commonly identified by a hard-coded version string. However, a product is often composed of multiple binary files, some of which may be recompiled by the user or someone else. For example, in Java a product may comprise many jar files, each of which typically contains many compiled .class files. If a user changes a jar file (which may have been provided as a patch by the product's service team), the product's reported version number is not affected. This disconnect between version number and product content means the service team or customer cannot trust the reported product version value. They may have forgotten not only that the original, distributed product has been changed, but also in what way it was changed.

    Prior art exists for determining whether a binary has been tampered with. For example, the MD5 checksum for a binary can be computed and compared with the value stated by the product's distributor. If the computed and expected values fail to match, then the binary is not as expected. However, a checksum mismatch does not reveal what has been changed in the binary. A further drawback is that the checksum is not included with the binary, so the product's distributor must be sure to update the reported value for the checksum every time they update the binary.
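The whole-file checksum comparison described above can be sketched as follows. This is a minimal illustration, not part of the disclosure: the class name ChecksumCheck and both method names are hypothetical, and the published value is assumed to be a hexadecimal string.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChecksumCheck {
    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK; rethrown unchecked to keep callers simple.
            throw new IllegalStateException(e);
        }
    }

    /** True if the binary's checksum equals the distributor's published value (hypothetical helper). */
    static boolean matchesPublished(byte[] binary, String publishedHex) {
        return md5Hex(binary).equalsIgnoreCase(publishedHex.trim());
    }
}
```

The result is a single yes/no answer for the whole file, which illustrates the limitation noted above: a mismatch shows that something changed, but not what.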

What is needed is a system for:

    1) Identifying a build in a reliable manner, such that the stated version or build number can be believed.

    2) Determining what has been changed within a product if the binary has been modified.

    At compile time, the compiler hashes the source that it is compiling and inserts hash codes for each compiled unit into its output binary file. The compiled unit in this context could be a method, a class or some larger unit.

    Depending on the granularity of the hashing, multiple hashcodes may be aggregated and inserted into the binary. This would likely form a tree structure of hashcodes. For example, in Java the compiler might hash each class's source at the method level, then at the class level, and accumulate those hashes before storing them in the .class binary file. When the hashcodes are inspected later, matching class-level hashcodes show that the class matches the expected source. If they do not match, the individual method hashes can be checked to find exactly where the class differs.
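A two-level hash tree of this kind might look like the sketch below. It rests on several assumptions that the disclosure does not specify: SHA-256 as the hash function, method source supplied as plain strings keyed by method name, and a class-level hash derived from the sorted method hashes. The class HashTreeSketch and its methods are hypothetical and stand in for work a real compiler would do internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

class HashTreeSketch {
    /** SHA-256 of the text as lowercase hex (hash choice is an assumption). */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Leaf level: one hash per method, keyed by method name. */
    static Map<String, String> methodHashes(Map<String, String> methodSources) {
        Map<String, String> hashes = new TreeMap<>();
        methodSources.forEach((name, source) -> hashes.put(name, sha256Hex(source)));
        return hashes;
    }

    /** Class level: hash over the method hashes in sorted name order. */
    static String classHash(Map<String, String> methodHashes) {
        StringBuilder sb = new StringBuilder();
        new TreeMap<>(methodHashes).forEach((name, hash) ->
                sb.append(name).append('=').append(hash).append('\n'));
        return sha256Hex(sb.toString());
    }

    /** Compares class hashes first; descends to method hashes only on a mismatch. */
    static List<String> changedMethods(Map<String, String> expected, Map<String, String> actual) {
        List<String> changed = new ArrayList<>();
        if (classHash(expected).equals(classHash(actual))) {
            return changed; // class-level match: nothing below needs checking
        }
        TreeSet<String> names = new TreeSet<>(expected.keySet());
        names.addAll(actual.keySet());
        for (String name : names) {
            if (!Objects.equals(expected.get(name), actual.get(name))) {
                changed.add(name); // added, removed, or modified method
            }
        }
        return changed;
    }
}
```

Given the hashes for two versions of a class that differ only in the body of one method, changedMethods returns just that method's name. The same pattern could be extended upward to jar or product level.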

    Programmatically examining the hash codes in the binary files then allows the build version to be verified against the known set of hash values reported for a given build version.
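Verification against the published hash values could be sketched as below. The disclosure does not define how the known hash sets are stored, so this assumes a simple map from build version to per-unit hashes. BuildVerifier, mismatches and identifyBuild are hypothetical names, and reading the hashes out of the binaries is left out.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

class BuildVerifier {
    /** Units whose installed hash is missing, extra, or different from the manifest. */
    static Set<String> mismatches(Map<String, String> manifest, Map<String, String> installed) {
        Set<String> names = new TreeSet<>(manifest.keySet());
        names.addAll(installed.keySet());
        names.removeIf(n -> Objects.equals(manifest.get(n), installed.get(n)));
        return names;
    }

    /** Finds a build version whose published hash set matches the installation exactly. */
    static Optional<String> identifyBuild(Map<String, Map<String, String>> manifestsByVersion,
                                          Map<String, String> installed) {
        return manifestsByVersion.entrySet().stream()
                .filter(e -> e.getValue().equals(installed))
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

If identifyBuild returns an empty result, the installation matches no known build. Calling mismatches against the manifest of the claimed version then lists the units that were changed.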

Advantages:

    1) Build identification at a fine-grained level - provides a non-manual versioning system that is reliable and can link changes in binaries to specific changes in source code. The more granular the hashing, the more precisely the code that generated the binaries can be pinpointed.

    2) Information is consistent between compilations - hashing from source means that changing compiler optimisation levels, or...