Method and System for Directly Loading Data from Hadoop Distributed File System (HDFS) to Oracle Database

IP.com Disclosure Number: IPCOM000238487D
Publication Date: 2014-Aug-28
Document File: 6 page(s) / 409K

Publishing Venue

The IP.com Prior Art Database

Related People

Apun Hiran: INVENTOR (and 3 others)

Abstract

A method and system is disclosed for directly loading data from Hadoop* Distributed File System (HDFS) to an Oracle** database. The method and system reads multiple HDFS files concurrently and loads them into the Oracle database.

Description

Currently, the data from HDFS is pulled into a Network File System (NFS), which acts as a staging area.  Thereafter, the data is loaded into an Oracle database via one or more external tables.

In prior art techniques, the process of loading data from HDFS to an Oracle database is quite complicated.  It requires an Extract-Transform-Load (ETL) host, one or more launcher hosts, and one or more database hosts.  Further, the process includes one or more serialization points, so only one feed can be downloaded at a time.  Moreover, the logs of the process are scattered across different systems.  The process also requires additional NFS-based disk space as a staging area to store the output generated in HDFS before the data is loaded into the Oracle database.

Disclosed is a method and system for directly loading data from HDFS to an Oracle database.  The method and system reads multiple HDFS files concurrently and loads them into the Oracle database.

With Oracle 10g***, the method and system specifies a preprocessor script in the external table definition using Oracle 10g’s Data Pump Access Driver (DPAD).  The DPAD moves data from one or more external flat files into the Oracle database.  The method and system uses only selected commands to create the external table for moving the data.  Also, the method and system allows Data Manipulation Language (DML) operations to be applied during the data import, as opposed to applying them after the data is loaded into the Oracle database.
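
As an illustration only, such an external table might be declared as sketched below.  All names (the scott/tiger credentials, the data_dir and exec_dir directory objects, the hdfs_read.sh script, and the column list) are hypothetical, and the PREPROCESSOR clause is written in the syntax Oracle documents for the ORACLE_LOADER access driver:

#!/bin/sh
# Sketch: create an external table whose input rows are produced by a
# preprocessor script rather than read directly from a flat file.
sqlplus -s scott/tiger <<'EOF'
CREATE TABLE hdfs_feed_ext (
  user_id   NUMBER,
  user_name VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'hdfs_read.sh' -- its stdout is parsed as rows
    FIELDS TERMINATED BY ','
  )
  LOCATION ('part-r-00000')
)
REJECT LIMIT UNLIMITED;
EOF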

The method and system utilizes an executable script to transform the data residing in the one or more flat files before loading the data into the Oracle database.  The script must write its output to the Operating System (OS) standard output so that the Oracle database can read it.
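
A minimal sketch of such a script is given below, assuming the hypothetical hdfs_read.sh named above.  Oracle invokes a preprocessor with the full path of the LOCATION file as its single argument, and the access driver parses whatever the script writes to standard output:

#!/bin/sh
# Hypothetical preprocessor (hdfs_read.sh).  $1 is the LOCATION file;
# everything written to stdout becomes input rows for the external
# table.  Example transform: keep the first two comma-separated fields
# and upper-case the second.
awk -F',' 'NF >= 2 { print $1 "," toupper($2) }' "$1"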

FIG. 1 illustrates accessing Hadoop output via Oracle’s external table in accordance with the method and system disclosed herein.

Figure 1

With reference to FIG. 1, the method and system utilizes a shell script that runs a curl command to access one or more files (part-r-00000-n) on HDFS and redirects their contents to standard output on the Oracle database host.  The method and system then loads the part-r-00000-n files using CREATE TABLE AS SELECT (CTAS).
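
A sketch of this flow is shown below, assuming WebHDFS is enabled on the cluster; the namenode host and port, the HDFS path, the number of part files, and all table names are made up for illustration:

#!/bin/sh
# Hypothetical preprocessor variant of hdfs_read.sh: instead of reading
# the LOCATION file ($1, here only a dummy placeholder), stream the
# Hadoop reducer output files straight from HDFS.  curl -s -L follows
# the namenode's redirect to the datanode that serves each file.
NN="http://namenode.example.com:50070"
for n in 0 1 2; do
  /usr/bin/curl -s -L "$NN/webhdfs/v1/user/etl/output/part-r-0000$n?op=OPEN"
done

With that script registered as the preprocessor of the external table sketched earlier, a single statement then copies the streamed rows into an ordinary table:

#!/bin/sh
# One-statement load: the CTAS reads the external table, which in turn
# runs the preprocessor above.  Credentials and names are hypothetical.
sqlplus -s scott/tiger <<'EOF'
CREATE TABLE hdfs_feed AS SELECT * FROM hdfs_feed_ext;
EOF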

Consider a scenario where a single file is loaded in compressed or uncompressed form from HDFS to the Oracle database.  FIG. 2 illustrates the Data Definition Language (DDL) for compressed and uncompressed data in this scenario.

Figure 2
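
On the script side, a single preprocessor can serve both cases, as sketched below; the file naming convention is hypothetical, and the utility paths (which vary by system) are written in full because preprocessor scripts run with a minimal environment:

#!/bin/sh
# Hypothetical preprocessor serving compressed and uncompressed feeds
# alike: gzipped LOCATION files are decompressed to stdout, anything
# else is passed through unchanged.
case "$1" in
  *.gz) /usr/bin/zcat "$1" ;;
  *)    /bin/cat "$1" ;;
esac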

As shown...