System and method for maintaining consistency in a continuous ingestion environment with different ingestion profiles

IP.com Disclosure Number: IPCOM000236139D
Publication Date: 2014-Apr-08
Document File: 2 page(s) / 39K

Publishing Venue

The IP.com Prior Art Database

Abstract

Most natural language processing systems run a batch 'ingestion' process over a large body of data and treat all documents as roughly equal. These ingestion systems may run on demand or continuously, but they generally do not prioritize certain ingestions over others. Nor do they segment their ingestions into atomic units of work that can safely be ingested in parallel. The system and method described below create safe transaction boundaries for ingestion and a set of priority queues, allowing multiple ingestions to run in parallel.

This is the abbreviated version, containing approximately 52% of the total text.

Review of problem

A natural language processing system may be used in a domain that requires continuous ingestion. Within that domain there may be different "ingestion profiles" that all compete for resources but cannot safely process data at the same time. The profiles may have different quality-of-service or priority demands, and they may introduce data inconsistency errors if they are not properly isolated.

In a specific example, consider the ingestion system for a hypothetical medical solution. There are two modes of continuous ingestion (periodical, a nightly "breadth" process, and discovery, an hourly "depth" process) and one mode of on-demand ingestion (where a doctor creates a new document to be considered immediately). Each of these ingestions performs structured and natural language processing (NLP) ingestion, plus additional calculations that rely on a single ingestion "transaction".

Previous solutions carefully scheduled when each kind of ingestion profile was allowed to run. However, there is a need to be able to run any of the profiles at any time and be assured that a) the most important profiles run first and b) data consistency is maintained.

Summary of invention


The core idea is to define coherent "chunk identifiers" that can be ingested independently of one another, and to set up a system of priority queues that runs the ingestion profiles in the correct priority order while maintaining consistency. These ingestion profiles can vary from batch and schedule/event-driven to manual and user/event-driven.
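
The core idea can be illustrated with a minimal sketch. It is not the disclosed implementation: the class and method names (IngestionScheduler, submit, acquire, release), the numeric priorities, and the ordering of the three profiles are all assumptions made for illustration. The sketch keeps one priority queue of work items keyed by chunk identifier and refuses to hand out a chunk that is already being ingested, so that each chunk acts as a transaction boundary.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Assumed priorities (lower runs first); the source does not give an ordering.
PRIORITY = {"on_demand": 0, "discovery": 1, "periodical": 2}


@dataclass(order=True)
class WorkItem:
    priority: int
    seq: int  # submission order, used as a tie-breaker within a priority
    profile: str = field(compare=False)
    chunk_id: str = field(compare=False)


class IngestionScheduler:
    """Hypothetical scheduler: priority order plus per-chunk isolation."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()
        self._in_flight = set()  # chunk ids currently being ingested

    def submit(self, profile, chunk_id):
        """Queue one chunk of work for the given ingestion profile."""
        item = WorkItem(PRIORITY[profile], next(self._seq), profile, chunk_id)
        heapq.heappush(self._heap, item)

    def acquire(self):
        """Return the highest-priority item whose chunk is idle, or None."""
        skipped = []
        found = None
        while self._heap:
            item = heapq.heappop(self._heap)
            if item.chunk_id in self._in_flight:
                skipped.append(item)  # chunk busy: keep it for later
            else:
                found = item
                break
        for item in skipped:
            heapq.heappush(self._heap, item)
        if found is not None:
            self._in_flight.add(found.chunk_id)
        return found

    def release(self, item):
        """Mark the item's chunk as idle once its ingestion has finished."""
        self._in_flight.discard(item.chunk_id)
```

In this sketch, a worker calls acquire(), ingests the returned chunk as one unit of work, and then calls release(). If a periodical item and an on-demand item are both queued for the same chunk, the on-demand item is handed out first and the periodical item stays queued until the chunk is released, while items for other chunks remain free to run in parallel. The sketch is single-threaded; a multi-worker version would additionally need a lock around the queue and the in-flight set.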

In the hypothetical medical example, the chunk identifier is the patient's Medical Record...