Detecting Types Of Data In A Big Data Environment

IP.com Disclosure Number: IPCOM000237653D
Publication Date: 2014-Jul-01
Document File: 5 page(s) / 93K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system are disclosed for detecting data types in a big data environment. The method and system include inferring data types by sampling a subset of the data and providing a schema along with each data file.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 38% of the total text.

Disclosed are a method and system for detecting data types in a big data environment. The method and system include inferring data types by sampling a subset of the data and providing a schema along with each data file.

In an embodiment of the present invention, a data file is read and the first N rows of the data file are sampled to determine a schema. The schema is then used to read the rest of the rows of the data file. For example, consider a data file with the following data, where the first 10 rows after the header row are sampled.

CUSTOMERID,LOYALTYSCORE,FIRSTNAME,LASTNAME,GENDER,CITY,STATE,ZIP
107031,0.103,Sandra,Rodriguez,F,Hillcrest,CA,92103
107054,0.228,Judy,Taylor,F,,,
107347,0.344,James,Gunderson,M,Bayside,CA,95524
107552,0.103,Ben,Seito,M,Truckee,CA,96161
107788,0.356,Mary,Romero,F,Sebastopol,CA,95472
107915,0.404,Derek,Johnson,M,Irvine,CA,92604
108371,0.247,Margaret,Murphy,F,Glendale,CA,91201
108803,0.286,Xavier,Sanchez,M,Huntington Beach,CA,92648
108960,0.608,Jing,Su,F,San Diego,CA,92114
109004,0.552,Kim,Nguyen,F,San Diego,CA,92123
109065 - xn,0.317,Nancy,Sofara,F,San Luis Obispo,CA,93406
109111,0.267,Amy,Johnson,F
109685,0.097,Wayne,Reitz,M,"","",""
110253,NULL,Edwin,Cho,M,San Jose,CA,95113
110667,0.423,Angel,Henandez,M,Los Angeles,CA,90021

The sample may yield the following schema:

CUSTOMERID : Integer
LOYALTYSCORE : Double
FIRSTNAME : String
LASTNAME : String
GENDER : String
CITY : String
STATE : String
ZIP : Integer
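
The sampling step can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names infer_type, infer_schema, RANK and SAMPLE are hypothetical, and the sketch assumes that each column takes the widest type observed in the sample (Integer, then Double, then String), that empty or NULL cells carry no type information, and that a column with no usable sample values falls back to String.

```python
import csv
import io

# Widening order (an assumption): a column takes the widest type seen in the sample.
RANK = {"Integer": 0, "Double": 1, "String": 2}

def infer_type(value):
    """Return the narrowest type name for one cell, or None if the cell is empty."""
    text = value.strip()
    if text == "" or text.upper() == "NULL":
        return None  # missing values carry no type information
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Double"
    except ValueError:
        return "String"

def infer_schema(csv_text, sample_size):
    """Infer a column-name -> type-name mapping from the first sample_size data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    schema = {name: None for name in header}
    for index, row in enumerate(reader):
        if index >= sample_size:
            break
        for name, cell in zip(header, row):
            observed = infer_type(cell)
            if observed is None:
                continue
            current = schema[name]
            if current is None or RANK[observed] > RANK[current]:
                schema[name] = observed
    # Columns with no usable sample values fall back to String (an assumption).
    return {name: kind or "String" for name, kind in schema.items()}

# Header and the first three data rows of the example file.
SAMPLE = (
    "CUSTOMERID,LOYALTYSCORE,FIRSTNAME,LASTNAME,GENDER,CITY,STATE,ZIP\n"
    "107031,0.103,Sandra,Rodriguez,F,Hillcrest,CA,92103\n"
    "107054,0.228,Judy,Taylor,F,,,\n"
    "107347,0.344,James,Gunderson,M,Bayside,CA,95524\n"
)

if __name__ == "__main__":
    for name, kind in infer_schema(SAMPLE, 10).items():
        print(name, ":", kind)
```

Under these assumptions, the empty CITY, STATE and ZIP cells in the second row do not affect the result, so ZIP is still inferred as Integer.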

This schema is used to read the rest of the rows. Rows following the initial sampling subset have data types assigned according to the rules outlined in the pseudocode below: if a row contains a column value that is not of the expected data type, an attempt is made to convert the value to the nearest acceptable type, and if no such conversion exists, the null value is assigned.
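
As a rough illustration of the convert-or-null fallback, the Python sketch below applies the example schema to the rows that follow the 10-row sample. The names SCHEMA, coerce and read_row are hypothetical. The sketch reduces the conversion attempt to a direct parse, and it assumes that empty cells, quoted empty strings, the literal NULL and columns missing from a short row are all treated as null. The disclosure does not state how such rows are handled, so these are assumptions.

```python
import csv
import io

# The example schema, with Python converters standing in for Integer, Double and String.
SCHEMA = [
    ("CUSTOMERID", int),
    ("LOYALTYSCORE", float),
    ("FIRSTNAME", str),
    ("LASTNAME", str),
    ("GENDER", str),
    ("CITY", str),
    ("STATE", str),
    ("ZIP", int),
]

def coerce(cell, converter):
    """Convert one cell to the column type, or return None when no conversion exists."""
    text = cell.strip()
    if text == "" or text.upper() == "NULL":
        return None
    try:
        return converter(text)
    except ValueError:
        return None  # value is not of the expected type and cannot be parsed

def read_row(line):
    """Parse one CSV line against SCHEMA, padding short rows with empty cells."""
    cells = next(csv.reader(io.StringIO(line)))
    cells += [""] * (len(SCHEMA) - len(cells))
    return {name: coerce(cell, conv) for (name, conv), cell in zip(SCHEMA, cells)}

if __name__ == "__main__":
    print(read_row("109065 - xn,0.317,Nancy,Sofara,F,San Luis Obispo,CA,93406"))
    print(read_row("109111,0.267,Amy,Johnson,F"))
```

In this sketch the value "109065 - xn" cannot be parsed as an Integer, so CUSTOMERID becomes null while the rest of the row is kept.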


Pseudocode for processing rows after the schema has been discovered:

for: every row after the sample
    for: every column in the row
        if: the column type is one of the whole number numeric types: INTEGER, LONG, BIGINTEGER

            if: the observed value is determined to be NULL or a STRING of only whitespace characters
            then: return NULL

            if: the observed value is determined to be the same type as the pre-determined column type (e.g., column type is INTEGER and value is also INTEGER)
            then: do nothing. Keep the incoming value unchanged.

            if: the observed value is determined to be any of the whole number numeric types but NOT the pre-determined column type (e.g., column type is INTEGER but value is LONG)
            then: convert the value to the pre-determined column type if it can be represented. If converting from a larger type to a smaller type and the value will not fit, then assign NULL to the datum.

            if: the observed value is determined to be any of the decimal numeric types
            then: truncate the decimal portion to yield the nearest whole number. If the whole number can be represented...
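
The whole-number rules above can be sketched in Python under several assumptions. The names RANGES, fits and coerce_whole are hypothetical. INTEGER is assumed to be a 32-bit signed type, LONG a 64-bit signed type and BIGINTEGER unbounded. Because Python integers have no fixed width, the "same type" and "different whole number type" rules collapse into a single range check. The rule for decimal values is cut off in this abbreviated text, so the sketch assumes the truncated value is subject to the same range check, and that non-finite values become NULL. Value kinds not covered by the visible rules are left unimplemented.

```python
import math

# Assumed ranges: 32-bit signed INTEGER, 64-bit signed LONG, unbounded BIGINTEGER.
RANGES = {
    "INTEGER": (-2**31, 2**31 - 1),
    "LONG": (-2**63, 2**63 - 1),
    "BIGINTEGER": (None, None),
}

def fits(value, column_type):
    """Return True if the whole number can be represented in the column type."""
    low, high = RANGES[column_type]
    return low is None or low <= value <= high

def coerce_whole(value, column_type):
    """Apply the whole-number column rules to one observed value."""
    # NULL or a string of only whitespace characters becomes NULL.
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return None
    # Whole number: keep it if it fits the column type, otherwise NULL.
    if isinstance(value, int) and not isinstance(value, bool):
        return value if fits(value, column_type) else None
    # Decimal number: truncate the decimal portion, then range-check (assumed).
    if isinstance(value, float):
        if not math.isfinite(value):
            return None  # assumption: NaN and infinity have no whole-number form
        whole = math.trunc(value)
        return whole if fits(whole, column_type) else None
    # Remaining cases are not visible in the abbreviated text.
    raise NotImplementedError("rule not covered by the abbreviated text")

if __name__ == "__main__":
    print(coerce_whole(12.9, "INTEGER"))
    print(coerce_whole(2**40, "INTEGER"))
    print(coerce_whole(2**40, "LONG"))
```

Truncation here means dropping the fractional part, so 12.9 yields 12 and -12.9 yields -12.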