De-identify sensitive information from ETL Job design and diagnostic information

IP.com Disclosure Number: IPCOM000238009D
Publication Date: 2014-Jul-25
Document File: 6 page(s) / 71K

Publishing Venue

The IP.com Prior Art Database

Abstract

Extract-Transform-Load (ETL) job designs and diagnostic information such as run-time logs may contain sensitive information: the name of the developer who created the job, IP addresses of source and target database machines, user names and passwords (in encrypted form), the OS user who created the job, the host name of the machine on which the job was created, and so on. The following is an example of how to collect the sensitive information from the ETL job design and diagnostic information and de-identify it. To keep the job design and the related run-time logs consistent and understandable, every occurrence of a given sensitive value is replaced with the same de-identified value. The identified sensitive information is also persisted in a dynamic data dictionary for organization-wide reuse.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 29% of the total text.

Background:

An Extract-Transform-Load (ETL) tool provides a mechanism to create and compile ETL jobs, and an environment in which to run the compiled jobs. ETL jobs extract data from one or more databases, transform the extracted data according to business logic, and finally load the data into one or more databases. After design, the ETL jobs are compiled and the compiled binaries are executed in a production environment. The ETL job design may contain sensitive information such as:

    - Name and address of the developer who created the job
    - When the job was created and its revision history
    - Email addresses and phone numbers
    - IP addresses of source and target database machines
    - User names and passwords of the machines, the OS user who created the job, design host names, etc.

The job run-time diagnostic information, such as run-time logs, may also contain the sensitive information listed above.

Problem Description:

When jobs and logs are shared, the sensitive information embedded deep within the job design and diagnostic information leaves the secure corporate boundary and may be exposed to unauthorized parties. This results in confidentiality and network security issues.

Many existing solutions mask the data with a fixed character (such as * or #). With that approach, it becomes difficult to establish relationships between the files and the corresponding resources. For example, in a Grid or MPP network, if sensitive information such as host names is masked, it is hard to determine how many hosts are involved, and debugging becomes difficult. So instead of masking the host names, it is more prudent to de-identify them.
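The difference between masking and de-identifying can be illustrated with a minimal Python sketch. The log lines, the `etl-prod-NN` host naming convention, and the `hostN` replacement scheme are all assumptions made for illustration; they are not taken from the disclosure.

```python
import re

# Hypothetical log lines from a grid run; the host names are invented.
LOG_LINES = [
    "node=etl-prod-01 stage=extract status=ok",
    "node=etl-prod-02 stage=transform status=ok",
    "node=etl-prod-01 stage=load status=failed",
]

# Assumed host naming convention, used only for this example.
HOST_PATTERN = re.compile(r"\betl-prod-\d+\b")


def mask_hosts(line):
    """Masking: every host collapses to the same run of '*' characters."""
    return HOST_PATTERN.sub(lambda m: "*" * len(m.group()), line)


def make_deidentifier():
    """De-identifying: each distinct host gets its own stable alias."""
    mapping = {}

    def replace(match):
        host = match.group()
        if host not in mapping:
            mapping[host] = f"host{len(mapping) + 1}"
        return mapping[host]

    def deidentify(line):
        return HOST_PATTERN.sub(replace, line)

    return deidentify, mapping


masked = [mask_hosts(line) for line in LOG_LINES]
deidentify, mapping = make_deidentifier()
deidentified = [deidentify(line) for line in LOG_LINES]

for line in masked + deidentified:
    print(line)
```

In the masked output the two hosts are indistinguishable, so a reader cannot tell how many nodes took part. In the de-identified output the failed load stage can still be traced to the same node that ran the extract stage, without revealing the real host name.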

One approach is to de-identify the sensitive information in the ETL job design and diagnostic information. Design a generic tool that takes as input the ETL job design, the diagnostic information, and a centralized dynamic data dictionary for identifying sensitive information, and produces the job design and diagnostic information with de-identified data:

    - Parse the ETL job design and diagnostic information and identify the sensitive information based on standard criteria as well as user-provided criteria supplied through a centralized data dictionary. Then either remove the sensitive information or de-identify it by replacing it with unrelated values. At the same time, update the centralized dictionary with the identified sensitive information for reuse.

    - To keep the job design and related run-time logs consistent and understandable, replace every occurrence of a given sensitive value with the same de-identified value wherever it appears.
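The two steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool described in the disclosure: the `DataDictionary` class is a hypothetical in-memory stand-in for the centralized dictionary (persistence is omitted), the "standard criteria" are reduced to two simple regular expressions, and the `CATEGORY_N` replacement format and sample inputs are invented.

```python
import re

# Assumed "standard criteria": simple patterns for IPv4 and email addresses.
STANDARD_PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


class DataDictionary:
    """Hypothetical in-memory stand-in for the centralized data dictionary."""

    def __init__(self):
        self.entries = {}   # sensitive value -> de-identified replacement
        self.counters = {}  # category -> number of values seen so far

    def replacement_for(self, category, value):
        # Reuse the existing replacement so every occurrence maps consistently.
        if value not in self.entries:
            count = self.counters.get(category, 0) + 1
            self.counters[category] = count
            self.entries[value] = f"{category.upper()}_{count}"
        return self.entries[value]


def deidentify(text, dictionary, user_terms=None):
    """Identify sensitive values, record them, and replace them consistently."""
    # Step 1: identify values using the standard and user-provided criteria.
    for category, pattern in STANDARD_PATTERNS.items():
        for value in pattern.findall(text):
            dictionary.replacement_for(category, value)
    for category, terms in (user_terms or {}).items():
        for term in terms:
            if term in text:
                dictionary.replacement_for(category, term)
    # Step 2: replace longest values first so that a short value (a user
    # name) does not break up a longer one (an email containing that name).
    for value in sorted(dictionary.entries, key=len, reverse=True):
        text = text.replace(value, dictionary.entries[value])
    return text


# Invented sample inputs: one job design fragment and one run-time log line.
shared = DataDictionary()
user_terms = {"user": ["asmith"]}
job_design = 'developer="asmith" contact="asmith@example.com" source="10.0.0.5"'
run_log = "connected to 10.0.0.5 as asmith; target 10.0.0.9"

clean_design = deidentify(job_design, shared, user_terms)
clean_log = deidentify(run_log, shared, user_terms)
print(clean_design)
print(clean_log)
```

Because both calls share one dictionary, the source address and the developer name receive the same replacement in the job design and in the log, so the two artifacts can still be correlated after de-identification.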


Solution:

Design a generic tool that takes the ETL job design, diagnostic information and a centralized dynamic data dictionary for identifying sensitive information as input and produces the job design and...