Self-Service Tool For Unstructured Data Quality

IP.com Disclosure Number: IPCOM000237654D
Publication Date: 2014-Jul-01
Document File: 4 page(s) / 36K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a tool to determine the quality of unstructured data for analysis by extending the application of an existing self-service data integration concept to data quality. Using the new system and method, the analyst can point to a source and obtain a data quality report in a few clicks, without depending on information technology personnel to assist with the sorting of Big Data.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 31% of the total text.

The disclosed system and method bring information extraction from big data sources into the same environment as the application of quality measurement processes.

Information architects, data scientists, business analysts, and others use information from various sources to produce better analyses and business insights. The better the quality of the information in the sources, the better the results. Understanding the quality and richness of the information available in each source therefore helps analysts select the best resources. The challenge is that these sources are sometimes big data sources, and evaluating the quality of that data is difficult.

One way of evaluating the quality of information from a big data source (e.g., an email repository) is to extract the required data, load it into a conventional structured database, and run further quality processing from there. This requires many resources and may not be seamless, because information extraction from big data sources and the quality analysis process run in different, disparate environments. These processes must first be made compatible and then converged, which is a complex and resource-intensive endeavor.

The disclosed system and method enable business users to perform data quality assessment on unstructured data sources in a simplified manner (i.e., with just a few clicks). The system uses an unstructured data extraction stage in an Extract, Transform, Load (ETL) tool, which helps extract useful data (e.g., phone numbers and credit card numbers) as well as key phrases from unstructured documents. In an embodiment of the present invention, an existing tool for self-service data integration is extended to data quality. This feature allows users to obtain needed information with a few simple clicks that automate a series of steps contained within the information server. The tool allows business users to provision data in a simplified and governed manner without depending on IT teams. Using the disclosed system and method, the analyst can point to a source and obtain the data quality report in a few clicks.
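The extraction stage described above can be illustrated with a minimal sketch. The disclosure does not specify how the ETL stage recognizes phone or credit card numbers, so the regular expressions, the function name extract_entities, and the sample text below are all assumptions chosen for illustration, not the disclosed implementation.

```python
import re

# Hypothetical patterns (assumptions for illustration only); a production
# extraction stage would use far more robust annotators.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")


def extract_entities(text):
    """Return phone-number and card-number candidates found in free text."""
    return {
        "phone_numbers": PHONE_RE.findall(text),
        "card_numbers": CARD_RE.findall(text),
    }


# Made-up sample document standing in for one unstructured record.
sample = "Call 555-123-4567 about card 4111 1111 1111 1111."
entities = extract_entities(sample)
print(entities)
```

Each extracted value becomes a structured field that downstream quality checks can consume, which is the role the disclosure assigns to the extraction stage.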

The disclosed system feeds data extracted from unstructured sources to data quality tools and applies predefined domain-specific quality rules to that data. IT users can define and group annotators to create annotation sets relevant to a domain or sub-domain and then make those sets available to business users. Business users can select an unstructured data source, the data type, and the domain, and then generate the quality report based on the predefined quality rules. The entire process works in a simplified, automated manner, allowing a business user to quickly infer the quality of an unstructured data source.
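One possible reading of this workflow is sketched below. The disclosure does not define the form of an annotation set or a quality rule, so everything here is hypothetical: the ANNOTATION_SETS and QUALITY_RULES tables, the "banking" domain entries, and the choice of a coverage threshold (the share of documents in which an annotator finds at least one match) as the quality rule.

```python
import re

# Hypothetical annotation sets that an IT user might publish, keyed by domain.
ANNOTATION_SETS = {
    "banking": {
        "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    },
}

# Hypothetical quality rules: minimum share of documents that must
# contain at least one match for each annotator.
QUALITY_RULES = {"banking": {"phone": 0.5, "email": 0.5}}


def quality_report(documents, domain):
    """Apply the domain's annotation set and rules to a list of documents."""
    annotators = ANNOTATION_SETS[domain]
    rules = QUALITY_RULES[domain]
    report = {}
    for name, pattern in annotators.items():
        hits = sum(1 for doc in documents if pattern.search(doc))
        coverage = hits / len(documents) if documents else 0.0
        report[name] = {"coverage": coverage, "passed": coverage >= rules[name]}
    return report


# Made-up documents standing in for the selected unstructured source.
docs = [
    "Reach me at 555-123-4567 or jane@example.com.",
    "No contact details in this note.",
    "Office line: 555-987-6543.",
]
report = quality_report(docs, "banking")
for name, result in report.items():
    print(name, result)
```

In this sketch the business user supplies only the source and the domain; the annotators and thresholds are looked up from what the IT user defined in advance, mirroring the division of roles described above.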

In an exemplary embodiment of the present invention, a business analyst in a bank wants to perform analytics on h...