Automated Discovery And Masking Of Sensitive Data
Disclosure Number: IPCOM000240280D
Publication Date: 2015-Jan-21
Document File: 5 page(s) / 44K

Publishing Venue

The Prior Art Database


Disclosed is a solution to fully automate the masking of sensitive fields in structured data stores so that no user input is required beyond the location of the input data to mask.

This is the abbreviated version, containing approximately 22% of the total text.

Automated Discovery And Masking Of Sensitive Data

Masking (or de-identifying) sensitive data is an important aspect of managing data stores. Service providers or data storage solutions must be able to ensure that the customer's sensitive data is protected. However, to continue development, these organizations still need to provide test data that has the properties of "real" data.

To mask structured, sensitive data, a data management storage system must know the format of the input file(s), the schema, the fields to mask, and the masking rules. Users who are required to satisfy these requirements must supply all of this information manually. Doing so is time-consuming and error-prone, and it demands a familiarity with product details that can inhibit adoption.

The system disclosed herein accelerates and streamlines the process of data masking.

The disclosed system automates the masking of sensitive data within structured files. The system analyzes input data to discover the schema(s) of the input data, and then classifies the data to uncover fields containing sensitive data. In response, the system generates a masking plan by selecting appropriate masking rules, so that sensitive data is masked in a way that preserves as much authenticity as possible. For example, credit card numbers are replaced with pseudo-accurate credit card numbers.
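The credit card example can be illustrated with a minimal sketch. The disclosure does not define "pseudo-accurate"; the sketch assumes it means a replacement that keeps the original length and separator layout and still passes the Luhn checksum. The function names, and the choice to keep the first digit, are hypothetical and not taken from the disclosure.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the Luhn check digit for a digit string that lacks its final digit."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """True if the full digit string passes the Luhn checksum."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(original: str, rng: random.Random) -> str:
    """Replace a card number with random digits of the same length and layout."""
    digits = [c for c in original if c.isdigit()]
    if len(digits) < 2:
        return original  # nothing sensible to mask
    # Assumption: keep the first digit, randomize the rest, recompute the check digit.
    body = digits[0] + "".join(str(rng.randrange(10)) for _ in digits[1:-1])
    new_digits = iter(body + luhn_check_digit(body))
    # Put the new digits back where the old ones were, leaving separators untouched.
    return "".join(next(new_digits) if c.isdigit() else c for c in original)
```

Calling `mask_card_number("4111-1111-1111-1111", random.Random())` yields a value with the same dashes and digit count that still validates under `luhn_valid`, so downstream test code that checks card numbers continues to work on masked data.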

More specifically, the disclosed system processes a set of input files containing records that may have sensitive data. It determines the file type of the input data and the structure/schema of the input data. The system may then locate sensitive data within the input files and optionally apply contextual analysis to that data, so that the characteristics it identifies can be preserved when masking. For nested file types such as Extensible Markup Language (XML), the system can optionally locate the primary record structure and process only this data, ignoring other structures in the file. The primary record structure is generally identified as the path to the most significant repeating structure within the set of input files.
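One way to picture the primary record structure is the sketch below. The disclosure does not say how "most significant" is measured; the sketch assumes a simple heuristic in which the element path that occurs most often, among elements that themselves contain child elements, is treated as the record path. The function name and the tie-breaking rule are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Optional


def primary_record_path(xml_text: str) -> Optional[str]:
    """Return the path of the most frequently repeated container element, if any."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, path):
        # Count only elements with children: these look like records, not leaf fields.
        if len(elem) > 0:
            counts[path] += 1
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    repeating = {p: n for p, n in counts.items() if n > 1}
    if not repeating:
        return None
    # Assumed heuristic: most occurrences wins; ties go to the shallower path.
    return min(repeating, key=lambda p: (-repeating[p], p.count("/")))
```

For a file whose root `customers` element holds one `meta` block and many `customer` elements, this returns `/customers/customer`, so only those records would be passed on for classification and masking.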

The disclosed system generates a sensible masking plan that provides appropriate masking algorithms/rules to employ on specific fields. This masking plan, in turn, can be consumed by a masking engine in order to perform the actual masking of sensitive data.
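The split between plan generation and the masking engine can be sketched as follows. Everything here is an assumption made for illustration: the detector patterns, the rule names, the 0.8 match threshold, and the dictionary form of the plan are hypothetical, and the masking rules are deliberately trivial placeholders rather than the algorithms the disclosed system would select.

```python
import re

# Hypothetical detectors: rule name -> pattern that a field value must fully match.
DETECTORS = {
    "credit_card": re.compile(r"\d{4}([- ]?\d{4}){3}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

# Hypothetical placeholder rules; a real engine would use format-preserving algorithms.
RULES = {
    "credit_card": lambda v: re.sub(r"\d", "9", v),
    "email": lambda v: "user@example.com",
}


def build_masking_plan(records, threshold=0.8):
    """Map each field to a rule when enough of its sampled values match a detector."""
    plan = {}
    fields = records[0].keys() if records else []
    for field in fields:
        values = [str(r[field]) for r in records if r.get(field)]
        for rule, pattern in DETECTORS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if values and hits / len(values) >= threshold:
                plan[field] = rule
                break
    return plan


def apply_plan(records, plan):
    """Minimal masking engine: apply the planned rule to each listed field."""
    return [
        {f: RULES[plan[f]](v) if f in plan and v else v for f, v in r.items()}
        for r in records
    ]
```

Because the plan is plain data (field name mapped to rule name), it could be written out as a configuration file and reviewed or edited before any masking engine consumes it.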

The system disclosed herein analyzes input data and automatically generates the configuration file and bootstrapping necessary to fully automate the masking process.

Inputs to the System

The disclosed system for automatically masking record/field-based sensitive data receives input that includes a set of files containing structured data. In embodiments of the disclosed system, the input data may be homogeneous (i.e., sharing the same schema, structure, and format). The set of input files can be described with a wildcard-style file-name pattern, for example:
(1) customer*.xml or (2) log*2014.csv. Additionally, a user may supply arguments for...