
Automatically Identifying Defunct Sources, Consumers, and Transformations in Data Processing Systems

IP.com Disclosure Number: IPCOM000242174D
Publication Date: 2015-Jun-23
Document File: 3 page(s) / 29K

Publishing Venue

The IP.com Prior Art Database

Abstract

Described is a method for automatically identifying stale data, redundant outputs, and redundant processing steps in data processing systems.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 46% of the total text.


Automatically Identifying Defunct Sources, Consumers, and Transformations in Data Processing Systems

An enterprise continuously adds data sources, consumers, and transformations to its systems, but does not know when these assets become outdated and should be removed. This has a number of undesirable consequences:

1. A stale data source might be supplying data that is no longer of interest, or invalid data.

2. Government and international regulations (e.g. the Basel Accords) may require an enterprise to demonstrate veracity in its financial reporting, e.g. by proving that all data sources are current.

3. A data sink may no longer be required, e.g. an unused application, or a BI report that is never viewed. Regulations may be breached if protected data is processed or stored for no purpose.

4. Resources may be wasted on processing stale and unneeded data.

Described is a method for automatically identifying stale data, redundant outputs, and redundant processing steps in data processing systems.

The key novelties of the invention compared to existing solutions are:

1. Use of multiple data sources and fuzzy logic to automatically identify and classify problematic assets, selecting the appropriate data sources for each type of asset.

2. Automatic classification of all types of data processing asset.

3. Use of lineage to recursively identify problematic assets across data processing systems.

Terminology

1. A lineage graph is a data flow diagram describing the flow of data through and across data processing systems. Lineage nodes may be classified by the direction of data flow:

1. a source of data,

2. a destination for data (sink),

3. or a data transformer.

2. Nodes may also be associated with metadata. This could be defined by data governance products (such as business terms or rules), by the data sources themselves (such as database schema), or custom properties to guide the automatic classification. Custom properties may be managed by an ETL tool, governance tools, or some other means.

3. "Defunct" is used as a generic term for any node identified as being problematic.

Implementation Details / Claims
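A minimal sketch of the lineage graph described under Terminology, assuming a simple in-memory representation. The names `LineageGraph`, `Role`, `add_flow`, and `role` are hypothetical and not part of the disclosure. The sketch infers a node's role purely from the direction of data flow (which edges enter or leave it), and stores arbitrary metadata alongside each node.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    SOURCE = "source"            # data only flows out of the node
    SINK = "sink"                # data only flows into the node
    TRANSFORMER = "transformer"  # data flows both in and out


@dataclass
class LineageGraph:
    # Node name -> metadata (business terms, schema, custom properties, ...).
    metadata: dict = field(default_factory=dict)
    # Node name -> names of nodes that receive data from it.
    downstream: dict = field(default_factory=lambda: defaultdict(set))
    # Node name -> names of nodes that supply data to it.
    upstream: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, name, **metadata):
        self.metadata.setdefault(name, {}).update(metadata)

    def add_flow(self, src, dst):
        """Record that data flows from src to dst."""
        self.add_node(src)
        self.add_node(dst)
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def role(self, name):
        """Classify a node by direction of data flow (assumption: edges only)."""
        has_in = bool(self.upstream.get(name))
        has_out = bool(self.downstream.get(name))
        if has_in and has_out:
            return Role.TRANSFORMER
        if has_out:
            return Role.SOURCE
        if has_in:
            return Role.SINK
        return None  # isolated node: no data flow recorded


g = LineageGraph()
g.add_node("orders_db", schema="sales")  # metadata from the data source itself
g.add_flow("orders_db", "nightly_etl")
g.add_flow("nightly_etl", "revenue_report")

for name in ("orders_db", "nightly_etl", "revenue_report"):
    print(name, g.role(name).value)
```

Keeping both `upstream` and `downstream` maps makes it cheap to walk the graph in either direction, which is what any lineage-based analysis would need.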

1. A lineage graph is generated covering the data processing systems to be examined, including ETL projects.

2. Nodes are classified as active or passive.

1. An active node is one that:


• pushes fresh data into the graph if it is a producer,
• retrieves data as required if it is a consumer,
• or is a transformer.

All other nodes, e.g. files and databases, are passive.

2. The type of an edge node is not always sufficient for determining its classification. For example, a report may be a passive file, or it may be an application that retrieves data on demand. Some nodes may exhibit characteristics of both, such as a reporting application which retrieves data the first time a report is viewed, then caches it for future views.

3. An active node is considered defunct if it has n...
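The active/passive classification from step 2 could be sketched as follows. This is only an illustration under stated assumptions: the `Node` fields, the `MIXED` category for nodes showing both behaviours, and the `"activity"` custom property used as a manual override are all hypothetical names, not taken from the disclosure. The sketch covers classification only and does not attempt to decide whether a node is defunct.

```python
from dataclasses import dataclass, field
from enum import Enum


class Activity(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"
    MIXED = "mixed"  # assumption: label for nodes with both characteristics


@dataclass
class Node:
    name: str
    role: str                      # "source", "sink", or "transformer"
    pushes_data: bool = False      # producer pushes fresh data into the graph
    pulls_on_demand: bool = False  # consumer retrieves data as required
    caches_results: bool = False   # hypothetical: serves cached data after first retrieval
    custom: dict = field(default_factory=dict)  # custom properties guiding classification


def classify(node):
    """Classify a node as active, passive, or mixed."""
    # A custom property (hypothetical key "activity") overrides inference,
    # since the node type alone is not always sufficient.
    override = node.custom.get("activity")
    if override is not None:
        return Activity(override)
    if node.role == "transformer":
        return Activity.ACTIVE
    if node.role == "source" and node.pushes_data:
        return Activity.ACTIVE
    if node.role == "sink" and node.pulls_on_demand:
        return Activity.MIXED if node.caches_results else Activity.ACTIVE
    # All other nodes, e.g. files and databases, are passive.
    return Activity.PASSIVE


nodes = [
    Node("market_feed", "source", pushes_data=True),
    Node("nightly_etl", "transformer"),
    Node("export.csv", "sink"),
    Node("bi_report", "sink", pulls_on_demand=True, caches_results=True),
    Node("legacy_app", "sink", pulls_on_demand=True, custom={"activity": "passive"}),
]
for n in nodes:
    print(n.name, classify(n).value)
```

Checking the override first reflects the point that custom properties may guide the automatic classification when a node's type is ambiguous.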