Automatic Pedigree Checks for Source Code

IP.com Disclosure Number: IPCOM000186168D
Original Publication Date: 2009-Aug-12
Included in the Prior Art Database: 2009-Aug-12
Document File: 2 page(s) / 77K

Publishing Venue

IBM

Abstract

The field of this disclosure is open source software. The solution describes a way to avoid inadvertently copying open source software into a commercial product. This matters for commercial software because open source licenses (for example, the GPL) are not always suitable for commercial development. The disclosed solution addresses the problem by analysing the source code of open source projects and comparing it with the commercial product.

This is the abbreviated version, containing approximately 52% of the total text.

Open source is an approach to design, development, and distribution offering practical accessibility to a product's source (goods and knowledge). Some consider open source as one of various possible design approaches, while others consider it a critical strategic element of their operations. Before open source became widely adopted, developers and producers used a variety of phrases to describe the concept; the term open source gained popularity with the rise of the Internet, which provided access to diverse production models, communication paths, and interactive communities.

    The open source model of operation and decision-making allows concurrent input of different agendas, approaches and priorities, and differs from the more closed, centralized models of development. The principles and practices are commonly applied to the peer production development of source code for software that is made available for public collaboration. The result of this peer-based collaboration is usually released as open-source software; however, open source methods are increasingly being applied in other fields of endeavour, such as biotechnology.

    More information about open source software is available here: http://en.wikipedia.org/wiki/Open_source

    There are many pitfalls in using open source software in commercial products. The most widely acknowledged issue concerns the GNU General Public License (GPL). Software derived from GPL code automatically acquires the same licensing requirements, such as free distribution: http://en.wikipedia.org/wiki/Open-source_license

    The solution proposes a system in which code committed to a product source code repository is automatically validated. Validation checks that the committed code has no similarities with existing software in a collection of pre-indexed open source projects. This provides some additional assurance that software in a commercial product has no pedigree in open source software released under licenses such as the GPL.
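    As a rough illustration of the validation idea, the following Python sketch checks committed files against a pre-built index. It is only a sketch under stated assumptions: the names validate_commit and line_fingerprints, the dictionary-based index, and the line-level SHA-1 fingerprint are all hypothetical stand-ins and are not taken from the disclosure.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed index shape: fingerprint -> open source files it was seen in.
Index = Dict[str, List[str]]


def line_fingerprints(source: str) -> Iterable[str]:
    """Stand-in fingerprint: hash each non-empty line with whitespace removed."""
    for line in source.splitlines():
        stripped = "".join(line.split())
        if stripped:
            yield hashlib.sha1(stripped.encode("utf-8")).hexdigest()


def validate_commit(
    committed_files: Dict[str, str],
    index: Index,
    fingerprints: Callable[[str], Iterable[str]],
) -> List[Tuple[str, str]]:
    """Return (committed file, matching open source file) pairs.

    An empty list means no similarity was found in the index.
    """
    matches: List[Tuple[str, str]] = []
    for path, source in committed_files.items():
        for fp in fingerprints(source):
            for origin in index.get(fp, []):
                if (path, origin) not in matches:
                    matches.append((path, origin))
    return matches
```

    A commit hook could reject the commit, or flag it for review, whenever the returned list is non-empty. The fingerprint function is passed in as a parameter so that a stronger scheme can replace the line-based stand-in.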

    There are two parts to this solution: the indexing engine and the validation runtime.

Indexing Engine

    The indexing engine runs on a central server and automatically pulls down the latest code from various open source projects. These are stored and indexed ready for fast lookup. This indexing is not regular keyword indexing such as might be used to index web pages. Instead it collects groups of tokens from the source code, reduces the tokens and hashes the resulting string. Each hash is stored along with the original source code metadata (including the file name and location). The indexing algorithm works with a moving window over the source code stream as follows:
First the original Java* source code:

java_import('java.lang.String');
$signature = new JavaSignature(JAVA_BYTE | JAVA_ARRAY);
$string = new String($signature, "Hello World!");

The first stage is white space removal:

java_import('jav...
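
The indexing approach described in this section (groups of tokens taken with a moving window, reduced, hashed, and stored with file metadata) might be sketched in Python as follows. This is an illustrative sketch, not the disclosed implementation: the tokenizer regular expression, the window size of 8 tokens, the use of SHA-1, and the names tokenize, window_hashes and index_file are all assumptions. The only reduction applied here is whitespace removal; any further stages are not reproduced.

```python
import hashlib
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters, or single punctuation characters.
# Whitespace never becomes a token, so it is dropped entirely.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
WINDOW = 8  # assumed number of tokens per window


def tokenize(source):
    """Split source text into tokens, discarding all whitespace."""
    return TOKEN_RE.findall(source)


def window_hashes(source, window=WINDOW):
    """Yield (token offset, hex digest) for each window position."""
    tokens = tokenize(source)
    if not tokens:
        return
    last_start = max(len(tokens) - window, 0)
    for start in range(last_start + 1):
        text = "".join(tokens[start:start + window])
        yield start, hashlib.sha1(text.encode("utf-8")).hexdigest()


def index_file(index, file_name, location, source):
    """Store every window hash with the metadata of the file it came from."""
    for offset, digest in window_hashes(source):
        index[digest].append(
            {"file": file_name, "location": location, "token_offset": offset}
        )


# Usage: index one line of the sample shown earlier (file name is made up).
index = defaultdict(list)
sample = '$string = new String($signature, "Hello World!");'
index_file(index, "hello.php", "example-project/src", sample)
```

Because whitespace is discarded before hashing, two copies of the same code that differ only in spacing or line breaks produce identical hashes. Advancing the window one token at a time means a copied fragment can still be matched even when the surrounding code differs.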