Method and System for Quarantining Data in a Real Time Batch Processing Environment

IP.com Disclosure Number: IPCOM000197951D
Publication Date: 2010-Jul-23
Document File: 3 page(s) / 41K

Publishing Venue

The IP.com Prior Art Database

Related People

Aravind Srinivasan: INVENTOR [+2]

Abstract

A method and system for quarantining data in a real time batch processing environment is disclosed. The method involves quarantining suspicious or unknown data in order to obtain high-quality data. Reconciliation of the quarantined data is then performed once the quality determination of the data is complete.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 52% of the total text.

Description

Disclosed is a method and system for quarantining data in a real time batch processing environment. The method involves quarantining suspicious or unknown data in order to obtain high-quality data. Thereafter, reconciliation of the quarantined data is performed once the quality determination of the data is complete.

In one scenario, the data to be quarantined flows through a data pipeline. The data pipeline is a collection of closely related processing stages that takes raw events as input and produces output data feeds. The events are records, each of which captures one discrete user activity. The first stage of the data pipeline is an input adapter. In order to quarantine data in the data pipeline, the input adapter stage maintains a White List (WL) and a Black List (BL). The WL is a list of machines authorized to send events, and the events sent by such machines are valid events. Events determined to be valid are referred to as live events and are marked by setting the flag isLive = 1. The BL is a list of machines not authorized to send events, and the events sent by such machines are invalid events, that is, events determined to be not valid. The data pipeline is required to quarantine the invalid events.
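
As a minimal sketch of the WL/BL bookkeeping described above, the input adapter can be modeled as a stage that holds two sets of machine names and reports which list a machine falls on. The names InputAdapter, MachineStatus and machine_status, the example machine names, and the choice to check the WL before the BL are illustrative assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from enum import Enum


class MachineStatus(Enum):
    AUTHORIZED = "authorized"      # machine is on the White List (WL)
    UNAUTHORIZED = "unauthorized"  # machine is on the Black List (BL)
    UNKNOWN = "unknown"            # machine is on neither list


@dataclass
class InputAdapter:
    """Hypothetical first pipeline stage that maintains the WL and the BL."""
    white_list: set = field(default_factory=set)
    black_list: set = field(default_factory=set)

    def machine_status(self, machine: str) -> MachineStatus:
        # Assumption: the WL is consulted first if a name is on both lists.
        if machine in self.white_list:
            return MachineStatus.AUTHORIZED
        if machine in self.black_list:
            return MachineStatus.UNAUTHORIZED
        return MachineStatus.UNKNOWN


# Example machine names are made up for illustration.
adapter = InputAdapter(white_list={"web-01"}, black_list={"rogue-07"})
```

Under these assumptions, events from a machine reported as AUTHORIZED would be treated as valid, and events from a machine reported as UNAUTHORIZED would be candidates for quarantine.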

With the WL and the BL in place, event emitters determine whether events are valid. If the event emitters determine that an event is valid, the isLive flag is set to 1 (isLive = 1). If an event is not determined to be valid, in other words, if its isLive flag is not set, the event is detected as an unknown/suspect event. Suspect events are events that cannot be determined to be valid, and they carry the flag isLive = 0. However, a suspect event is not necessarily an invalid event: it may be either a live event or an invalid event. The input adapter processes these suspect events.
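
The isLive rule above can be sketched as a small partitioning step over a batch of events. Representing an event as a dictionary, the "id" field, and the function names is_suspect and split_batch are assumptions made for illustration only; the disclosure specifies only the isLive flag and its two values.

```python
def is_suspect(event: dict) -> bool:
    # An event is live only when isLive is explicitly 1.
    # A missing flag or isLive = 0 marks the event as unknown/suspect.
    return event.get("isLive", 0) != 1


def split_batch(events):
    """Partition a batch into live events and suspect events."""
    live, suspect = [], []
    for event in events:
        (suspect if is_suspect(event) else live).append(event)
    return live, suspect


# Hypothetical batch: one live event, one flagged 0, one with no flag set.
batch = [
    {"id": 1, "isLive": 1},
    {"id": 2, "isLive": 0},
    {"id": 3},
]
live, suspect = split_batch(batch)
```

In this sketch the suspect list is only set aside for further processing by the input adapter; it is not treated as invalid, since a suspect event may still turn out to be a live event.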

Further, the hostname/machine name of the machine that generated a suspect event can be extracted from the event header. In case the hostname/machine name is present i...