Method for Mining Metadata in Data Lake / Data Repository to Enable Self-driven Analytics

IP.com Disclosure Number: IPCOM000247087D
Publication Date: 2016-Aug-03
Document File: 5 page(s) / 91K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to analyze metadata as a precursor to a comprehensive analysis of large data content, in order to derive a better analysis and to understand the significant data element types and their relationships.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 26% of the total text.

Most data warehouses are created for siloed sets of data and applications. One of the main objectives of a data lake is to provide a common platform for disparate data warehouses with rich data cataloging, tagging, lineage, and governance principles. However, even as value is evident in combining different data sets, problems are inherent in the process.

Managing large amounts of data is a major task for organizations, and understanding that data is an even bigger problem. The metadata repository contains data that describes the data residing in the repository stores. Typical metadata has two important attributes: it has a smaller footprint than the actual data content repositories, and it contains descriptions of the actual data repository.

These descriptions are not accessible in the actual repositories. These metadata repositories are populated following a curation process. Expert data stewards and subject matter experts (SMEs) spend enormous amounts of time registering data into the data lake following an important step of describing the content. All of this content is maintained in the ecosystem in a repository called the metadata repository. This metadata actually contains natural language sentences, making it easier to interpret, data mine, and harvest new insights.

Large data sets from different sources often pose another challenge along with the opportunity. Even when rich metadata is present, it is often very difficult to understand and interpret thousands of data fields in data sets, let alone resolve the ambiguity of human language. People use different terminology, and understanding the contents requires strong domain knowledge and business vision. Even with an understanding of the data, an analyst must also have deep analytical skills to match the data types to potential analysis types. This process is often very challenging and might not scale up with any manual process.

None of the existing practices and approaches analyzes data to generate insights about data types. Typically, data is used by human and machine-enabled data scientists with help from domain business experts. This often results in an incomplete and incorrect analysis process and outcomes.

To address these issues, a method is needed for an automated, multi-disciplinary, knowledge-supported analysis, which also reviews the data types. Such a method can add a new dimension to analytics as a precursor to the analysis of the real data content.
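As a rough illustration of reviewing data types from metadata before touching the data content, the sketch below scans natural language field descriptions for keyword cues, assigns each field a data element type, and lists candidate analysis types for it. The disclosure does not specify an algorithm in this excerpt; the keyword matching, the sample records, and every name here (METADATA, TYPE_CUES, ANALYSES, infer_type, suggest_analyses) are assumptions made purely for illustration.

```python
import re

# Hypothetical metadata records: a field name plus a steward-written description.
METADATA = [
    {"field": "cust_dob", "description": "Date of birth of the customer"},
    {"field": "order_total", "description": "Total amount of the order in USD"},
    {"field": "store_region", "description": "Sales region category of the store"},
    {"field": "review_text", "description": "Free text comment left by the customer"},
]

# Assumed keyword cues per data element type (illustrative, not exhaustive).
TYPE_CUES = {
    "temporal": {"date", "time", "timestamp", "year"},
    "numeric": {"amount", "total", "count", "price", "quantity"},
    "categorical": {"category", "type", "status", "code"},
    "text": {"text", "comment", "note", "narrative"},
}

# Assumed mapping from data element type to candidate analysis types.
ANALYSES = {
    "temporal": ["trend analysis", "seasonality analysis"],
    "numeric": ["descriptive statistics", "regression"],
    "categorical": ["frequency analysis", "segmentation"],
    "text": ["topic modeling", "sentiment analysis"],
}


def infer_type(description):
    """Return the type whose cues overlap most with the description words."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    scores = {t: len(words & cues) for t, cues in TYPE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def suggest_analyses(metadata):
    """Map each field name to (inferred type, candidate analysis types)."""
    suggestions = {}
    for record in metadata:
        inferred = infer_type(record["description"])
        suggestions[record["field"]] = (inferred, ANALYSES.get(inferred, []))
    return suggestions


if __name__ == "__main__":
    for field, (inferred, analyses) in suggest_analyses(METADATA).items():
        print(field, inferred, analyses)
```

Only the descriptions are read, never the underlying data, which reflects the smaller footprint of metadata noted earlier. A real system would likely replace the keyword sets with a domain vocabulary or a language model to cope with the terminology differences described above.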

The novel contribution is a method to analyze metadata as a precursor to a large, comprehensive analysis of large data content, in order to derive a better analysis and to understand significant data element types and the relationships. This method of mining/analyzing the metadata helps determine the type(s) of data analysis that i...