
Discovering Code Cloning in Software Source Code by Fingerprinting

IP.com Disclosure Number: IPCOM000239208D
Original Publication Date: 2014-Oct-20
Included in the Prior Art Database: 2014-Oct-20
Document File: 3 page(s) / 45K

Publishing Venue

Linux Defenders

Related People

Armijn Hemel: AUTHOR

Abstract

This document describes a simple code clone detection method that extracts string constants and other identifiers (function names, method names, variable names, etc.) from source code and compares them to a knowledgebase of identifiers extracted from publicly available open source software, with the aim of finding copied code.

This is the abbreviated version, containing approximately 52% of the total text.



Introduction

Source code, such as code released under an open source license, is often reused, either unmodified or with (slight) modifications. Unmodified, verbatim reuse can easily be detected by computing checksums of files and comparing these to a list of known checksums from a knowledgebase. Modified source code reuse can be detected using a number of techniques often referred to as "code clone detection" in the scientific literature.
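The checksum comparison for verbatim reuse can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the choice of SHA-256, the in-memory dictionary standing in for the knowledgebase, and the function names (sha256_of, build_checksum_index, find_verbatim_copies) are all hypothetical and are not taken from this disclosure.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def build_checksum_index(known_files):
    """Build a checksum -> [(package, path), ...] mapping.

    known_files is an iterable of (package, path, content) tuples,
    where content is the raw bytes of a known open source file.
    The dictionary is a stand-in for a real knowledgebase.
    """
    index = {}
    for package, path, content in known_files:
        index.setdefault(sha256_of(content), []).append((package, path))
    return index


def find_verbatim_copies(candidate: bytes, index):
    """Return the known files whose contents are byte-identical."""
    return index.get(sha256_of(candidate), [])
```

Any change to a file, even a single byte, produces a different checksum, which is why this approach only finds unmodified copies.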

This document describes a simple code clone detection method that extracts string constants and other identifiers (function names, method names, variable names, etc.) from source code and compares them to a knowledgebase of identifiers extracted from publicly available open source software, with the aim of finding copied code.

Tags

Linux, FreeBSD, *BSD, Unix, Android, Java, software, software development, open source, code clone detection

Detailed description: extracting identifiers from source code, creating a knowledgebase and fingerprinting source code

Identifiers (string constants, variable names, function names, method names and so on) can be used to fingerprint a source code file: comparing them to a knowledgebase of identifiers and meta-information shows where the code could have originated. Although the method is easily defeated by changing identifiers, it has two advantages: it is easy to understand and very fast.
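As a rough illustration of the idea, the following Python sketch extracts string constants and names from C-like source text with regular expressions, stores them in an in-memory knowledgebase, and counts how many identifiers of a candidate file are known per package. Everything here is an assumption made for the example: the regular expressions, the deliberately incomplete keyword list, the plain match count used as a score, and the function names are hypothetical, and a real implementation would more likely use a proper lexer or parser.

```python
import re
from collections import Counter, defaultdict

# Double-quoted string literals, allowing backslash escapes.
STRING_RE = re.compile(r'"((?:[^"\\\n]|\\.)*)"')
# C-style names: a letter or underscore, then letters, digits, underscores.
IDENT_RE = re.compile(r'\b[A-Za-z_][A-Za-z0-9_]*\b')
# Small, incomplete keyword list (assumption, for illustration only).
KEYWORDS = {"int", "char", "void", "return", "if", "else", "for", "while",
            "static", "const", "struct", "unsigned", "long", "short"}


def extract_identifiers(source: str):
    """Return a set of ("string", text) and ("name", text) pairs."""
    found = {("string", s) for s in STRING_RE.findall(source) if s}
    # Blank out string literals so their contents are not read as names.
    stripped = STRING_RE.sub('""', source)
    found |= {("name", n) for n in IDENT_RE.findall(stripped)
              if n not in KEYWORDS}
    return found


def build_knowledgebase(packages):
    """Map each identifier to the set of packages it was seen in.

    packages maps a package name to its source text.
    """
    kb = defaultdict(set)
    for package, source in packages.items():
        for ident in extract_identifiers(source):
            kb[ident].add(package)
    return kb


def fingerprint(source: str, kb):
    """Count, per package, how many identifiers of source are known."""
    hits = Counter()
    for ident in extract_identifiers(source):
        for package in kb.get(ident, ()):
            hits[package] += 1
    return hits
```

Very common names such as puts will match many packages, so a plain count like this is noisy. Renaming identifiers and rewording strings removes the matches entirely, which is the weakness noted above.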

A similar method for detecting code provenance in binary files was described at the Mining Software Repositories 2011 conference in the paper "Finding Software License Violations Through Binary Code Clone Detection" [1]. It has been implemented in various tools, for example the Binary Analysis Tool, and documented in other publications[...