
Integrated Text and Data Analysis in a Warehouse

IP.com Disclosure Number: IPCOM000125961D
Original Publication Date: 2005-Jun-24
Included in the Prior Art Database: 2005-Jun-24
Document File: 7 page(s) / 216K

Publishing Venue

IBM

Abstract

Today there exist systems for the analysis of data or text independently, but not both in an integrated fashion. This disclosure describes novel processes and technology for the integrated analysis of text and data in data and document warehouses. By building on top of existing data mining techniques proven for business intelligence, this system provides a means for deeper understanding and more complex insights from both the text and the data in a broad set of business environments, such as financials, customer relationship management (CRM), and life sciences. The use of data cubes to analyze and identify interesting characteristics within business data is well established. We introduce the notion of a document cube, the systematic linkage and analysis of the data and document cubes through shared dimensions, and the ability to add new dimensions to the document cube dynamically. Together, these capabilities enable the integrated analysis of text and data within a systematic, principled framework.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 25% of the total text.


Step 0: Definitions of Star Schema (and variants), Fact Table, and Dimension Table

First, we need to give a basic understanding of how the data and text are organized and stored in the database, also referred to generically as the data model. Our data model is implemented using a star schema [1]; however, it should be noted that our invention would work equally well with variants of the star schema, such as a snowflake schema. A basic star schema consists of a fact table at its center and a corresponding set of dimension tables. A fact table is a normalized table that consists of a set of measures or facts and a set of attributes represented by foreign keys into a set of dimension tables. The measures are typically numeric and additive (or at least semi-additive). Because fact tables can have a very large number of rows, great effort is made to keep the columns as concise as possible. A dimension table is a highly de-normalized table that contains the unique descriptive attributes of each fact table entry. These attributes can consist of multiple hierarchies as well as simple attributes. Shown below, in Figure 1, are example dimension tables for Product, Geography, and Date, along with a sample fact table containing foreign keys into those dimension tables and two measures, revenue and units.

Figure 1. Sample dimension tables and fact table.
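The definitions above can be made concrete with a small runnable sketch. Since the figure itself is suppressed in this extraction, the table and column names below (prod_key, geo_key, date_key, revenue, units) are illustrative assumptions rather than the disclosure's exact schema; the structure follows the text: three de-normalized dimension tables, a fact table of foreign keys, and two additive measures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: de-normalized, one row per unique descriptive entry.
cur.execute("""CREATE TABLE product  (prod_key INTEGER PRIMARY KEY,
                                      brand TEXT, category TEXT)""")
cur.execute("""CREATE TABLE geography(geo_key  INTEGER PRIMARY KEY,
                                      city TEXT, country TEXT)""")
cur.execute("""CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY,
                                      day TEXT, month TEXT, year INTEGER)""")

# Fact table: concise columns -- foreign keys into each dimension
# plus the additive measures (revenue, units).
cur.execute("""CREATE TABLE sales_fact(
                   prod_key INTEGER REFERENCES product(prod_key),
                   geo_key  INTEGER REFERENCES geography(geo_key),
                   date_key INTEGER REFERENCES date_dim(date_key),
                   revenue  REAL,
                   units    INTEGER)""")

cur.execute("INSERT INTO product   VALUES (1, 'Acme', 'Widgets')")
cur.execute("INSERT INTO geography VALUES (1, 'Austin', 'USA')")
cur.execute("INSERT INTO date_dim  VALUES (1, '24', 'Jun', 2005)")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 100.0, 5), (1, 1, 1, 50.0, 2)])

# Because the measures are additive, they can be rolled up along any
# dimension with a join and GROUP BY.
total = cur.execute("""SELECT SUM(f.revenue), SUM(f.units)
                       FROM sales_fact f
                       JOIN product p ON f.prod_key = p.prod_key
                       GROUP BY p.brand""").fetchone()
print(total)  # -> (150.0, 7)
```

Aggregating by brand rather than prod_key illustrates the hierarchy point: a dimension table can carry multiple attribute levels, and the fact table's measures roll up along any of them.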

Step 1: Define the application-specific data model
a) Identify data sources, dimensions and facts
b) Identify document sources, dimensions and facts
c) Identify 'shared dimensions' for both data and documents
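The three sub-steps above can be sketched as a declarative model. The source names, dimension names, and dict layout below are hypothetical placeholders (the disclosure does not specify a format); the point is that step (c) falls out mechanically once (a) and (b) are recorded, since the shared dimensions are simply those common to both cubes.

```python
# Hypothetical output of steps (a) and (b): the names and structure
# here are illustrative, not the disclosure's actual format.
data_model = {
    "data": {
        "sources":    ["sales_db"],
        "dimensions": ["Product", "Geography", "Date"],
        "facts":      ["revenue", "units"],
    },
    "documents": {
        "sources":    ["support_emails"],
        "dimensions": ["Product", "Date", "Topic"],
        "facts":      ["doc_count"],
    },
}

# Step (c): shared dimensions are those appearing in both the data cube
# and the document cube; they provide the linkage for integrated analysis.
shared = sorted(set(data_model["data"]["dimensions"])
                & set(data_model["documents"]["dimensions"]))
print(shared)  # -> ['Date', 'Product']
```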

For each instantiation of this invention, the data model must be defined. The data and text each come from one or more source systems. For both the data and the text, the information to be analyzed must be identified within the source system and modeled as either a fact or a dimension. We have provided examples in Figures 1 and 2. The data can be handled using standard data warehousing techniques, which usually involve identifying the appropriate colum...