Lightweight Scientific Workflow and Metadata Tracking

IP.com Disclosure Number: IPCOM000244334D
Publication Date: 2015-Dec-02
Document File: 3 page(s) / 49K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a design for a lightweight, configuration-free system that automatically tracks the exploratory work performed by researchers. The core novelty of the system is a monitoring program interposed between the user and the operating system, which automatically tracks and records the user's interactions with the underlying computer system, both directly and through processes (executing programs) invoked by the user.

This is the abbreviated version, containing approximately 39% of the total text.

Much bioinformatics and other computational scientific work involves manually running a variety of applications over previously captured biological data. Initial investigation of a data set is often performed in an ad hoc, exploratory manner: researchers frequently do not know exactly what to expect and spend time simply "playing" with the data to develop an intuition for it. This typically means running the data through a variety of command-line tools while tuning the relevant parameters.

One practical problem is that this process produces a large number of directories and files in various formats, and manually tracking the provenance of these artifacts requires significant effort. Many workflow systems are available that support the repeated application of complex scientific pipelines and automatically track the associated products. These systems, however, necessarily require users to define workflows ahead of time, which runs counter to the needs of researchers still operating at the earliest stages of an investigation, or of individuals developing tools in parallel with an analysis.

The novel contribution is a design for a lightweight, configuration-free system that automatically tracks the exploratory work performed by researchers. The system allows users to run existing, arbitrary programs from the command line, in a familiar fashion, without requiring any integration or instrumentation of those programs. Furthermore, this system provides the means for users to interrogate it about any of the processes and products that have been generated, and to generalize the captured information into reusable pipelines.

The core novelty of the system is a monitoring program interposed between the user and the operating system, which automatically tracks and records the user's interactions with the underlying computer system, both directly and through processes (executing programs) invoked by the user. Automatic process tracking applies transitively: it includes processes invoked by other monitored processes. By collecting these interactions, the monitor can maintain a record of which processes have been run, including the corresponding binaries and command-line arguments, and which files and directories were read, written, and otherwise manipulated.
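The kind of record described above can be illustrated with a minimal sketch. The disclosure does not specify a schema or an interception mechanism, so every name here (ProcessRecord, ProvenanceLog, record_exec, descendants) is hypothetical, and the events are assumed to be delivered by some monitor that is not shown. The sketch also assumes process IDs are unique within a session.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ProcessRecord:
    """One monitored process invocation (hypothetical schema)."""
    pid: int
    parent_pid: Optional[int]  # None for commands typed directly by the user
    binary: str
    argv: List[str]
    files_read: Set[str] = field(default_factory=set)
    files_written: Set[str] = field(default_factory=set)


class ProvenanceLog:
    """In-memory record of processes and the files they touched."""

    def __init__(self) -> None:
        self.records: Dict[int, ProcessRecord] = {}

    def record_exec(self, pid: int, parent_pid: Optional[int],
                    binary: str, argv: List[str]) -> ProcessRecord:
        # Called when the (assumed) monitor observes a program being started.
        rec = ProcessRecord(pid, parent_pid, binary, list(argv))
        self.records[pid] = rec
        return rec

    def record_read(self, pid: int, path: str) -> None:
        self.records[pid].files_read.add(path)

    def record_write(self, pid: int, path: str) -> None:
        self.records[pid].files_written.add(path)

    def descendants(self, pid: int) -> List[int]:
        """PIDs of all processes invoked, directly or transitively, by pid."""
        found: List[int] = []
        frontier = [pid]
        while frontier:
            current = frontier.pop()
            children = [r.pid for r in self.records.values()
                        if r.parent_pid == current]
            found.extend(children)
            frontier.extend(children)
        return sorted(found)
```

In this sketch, a shell script that starts sort, which in turn starts gzip, yields three records, and descendants() of the script's PID returns both child PIDs, which is the transitive tracking the text describes.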

The proposed monitoring system allows the user to query the history of process execution, as well as individual files and directories, revealing the recorded associations between programs and files. This relieves the user of tedious bookkeeping. Furthermore, because all program invocations are recorded, the monitor is able to produce a succinct log of a set of related program invocations (e.g., all of the programs that read from a particular file), which can take the form of an executable script. Thus, the monitoring system is able to automatically produce repeatable, executable work...
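As a rough illustration of the query-and-export idea, the sketch below selects every recorded invocation that read a given file and renders the selection as a shell script. The Invocation type and the function names are hypothetical and not taken from the disclosure. The sketch assumes the history is already available as an ordered list, and it replays only command-line arguments (shell redirections and environment variables are not captured here).

```python
import shlex
from typing import FrozenSet, List, NamedTuple, Tuple


class Invocation(NamedTuple):
    """A recorded program invocation (hypothetical, simplified)."""
    argv: Tuple[str, ...]
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def invocations_reading(history: List[Invocation], path: str) -> List[Invocation]:
    """Return, in original order, the invocations that read `path`."""
    return [inv for inv in history if path in inv.reads]


def to_script(invocations: List[Invocation]) -> str:
    """Render invocations as a POSIX shell script, quoting each argument."""
    lines = ["#!/bin/sh", "set -e"]
    lines += [shlex.join(inv.argv) for inv in invocations]
    return "\n".join(lines) + "\n"


history = [
    Invocation(("grep", "-c", ">", "reads.fa"), frozenset({"reads.fa"}), frozenset()),
    Invocation(("sort", "other.txt"), frozenset({"other.txt"}), frozenset()),
    Invocation(("wc", "-l", "reads.fa"), frozenset({"reads.fa"}), frozenset()),
]
script = to_script(invocations_reading(history, "reads.fa"))
```

Here script contains the grep and wc commands but not the unrelated sort. Using shlex.join keeps arguments such as ">" quoted, so the shell passes them to the program as literal arguments instead of interpreting them as redirections.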